method would generate nearly 20 billion dyads with only around 60,000 realized citations.
In addition, this approach raises concerns about network autocorrelation and about
non-independence in the error structure, since the same patents appear repeatedly across
many observations.
Instead, our analysis follows Sorenson and Stuart (2001) in adopting a case-control
approach to analyzing the formation of ties (see Sorenson and Fleming, 2004, for an
earlier application to patents). The case-control sampling procedure works as follows.
We begin by including all cases of future patents, from July 1990 to June 1996, that cite
any of our 17,268 focal patents: 60,999 in total. Since these citations occur, the dependent
variable $Cite_{ij}$ takes a value of '1' for these cases to denote a realized citation. In
addition, we pair each focal patent with four future patents that do not cite it (but that could
have). 6 We set $Cite_{ij}$ to zero for these control cases. Though this generates a data set of
130,055 dyads, our analysis restricts the sample used for estimation to the 72,801 cases
where both inventors reside in the US. 7 To address the fact that focal patents enter the
data more than once, we report robust standard errors estimated without the assumption
of independence across repeated observations of the same focal patent.
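
The sampling procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_case_control_dyads`, its arguments, and the rule that any future patent not citing the focal patent is an eligible control are all hypothetical, since the criteria for which patents "could have" cited a focal patent are not spelled out in this passage.

```python
import random
from collections import defaultdict


def build_case_control_dyads(focal_patents, future_patents, citations,
                             n_controls=4, seed=0):
    """Return (citing, focal, cite) triples for a case-control sample.

    citations: iterable of (citing_patent, focal_patent) pairs.
    Assumption: every non-citing future patent is an eligible control.
    """
    rng = random.Random(seed)

    # Index the realized citations by focal patent.
    cited_by = defaultdict(set)
    for citing, focal in citations:
        cited_by[focal].add(citing)

    dyads = []
    for focal in focal_patents:
        # Cases: every realized citation enters with cite = 1.
        for citing in sorted(cited_by[focal]):
            dyads.append((citing, focal, 1))

        # Controls: up to n_controls non-citing future patents, cite = 0.
        non_citing = [p for p in future_patents if p not in cited_by[focal]]
        k = min(n_controls, len(non_citing))
        for control in rng.sample(non_citing, k):
            dyads.append((control, focal, 0))
    return dyads
```

Each focal patent contributes all of its realized citations plus a fixed number of controls, so a focal patent appears in several rows. That repetition is the reason the standard errors need to allow for dependence across observations of the same focal patent.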
The use of a matched sample introduces one new problem. Logistic regression can
yield biased estimates when the proportion of positive outcomes in the sample does not
match the proportion of citations in the population (Prentice and Pyke, 1979; Scott and
Wild, 1997). In particular, uncorrected logistic regression using a matched sample tends
to produce underestimates of the factors that predict a positive outcome (King and Zeng,
2001). Large samples do not necessarily alleviate this problem.
We adjust the coefficient estimates using the method proposed by King and Zeng
(2001) for the logistic regression of rare events (cf. Manski and Lerman, 1977). The
traditional logistic regression model considers the dichotomous outcome variable a Bernoulli
probability function that takes a value 1 with the probability $p$:
\[
p_i = \frac{1}{1 + e^{-X_i \beta}},
\]
where $X$ represents a vector of covariates and $\beta$ denotes a vector of parameters.
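
As a small illustration of the logistic form above, the following Python sketch evaluates the probability for each row of a covariate matrix. The function name `logistic_probability` and the use of NumPy are assumptions made for the example; `beta` stands for the parameter vector.

```python
import numpy as np


def logistic_probability(X, beta):
    """Return p_i = 1 / (1 + exp(-X_i beta)) for each row X_i of X."""
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return 1.0 / (1.0 + np.exp(-(X @ beta)))
```

A row whose linear index X_i beta equals zero receives a probability of one half, and larger values of the index push the probability toward one.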
Researchers typically use maximum likelihood methods to estimate the parameter vector
$\beta$. King and Zeng (2001) prove that the following weighted least squares expression
estimates the bias in $\hat{\beta}$ generated by oversampling rare events:
\[
\mathrm{bias}(\hat{\beta}) = (X'WX)^{-1} X'W\xi,
\]
where $\xi_i = 0.5\,Q_{ii}\,[(1 + w_1)\hat{p}_i - w_1]$, the $Q_{ii}$ are the diagonal elements of
$Q = X(X'WX)^{-1}X'$, $W = \mathrm{diag}\{\hat{p}_i(1 - \hat{p}_i)w_i\}$, and $w_1$ represents the
fraction of ones (citations) in the sample relative to the fraction in the population. At an
intuitive level, one regresses the independent variables on the residuals using $W$ as the
weighting factor. Tomz (1999) implements this method in the relogit Stata procedure.
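
The bias expression can be computed directly once fitted probabilities are available. The Python sketch below is a minimal illustration, not the relogit implementation: the function name `rare_event_bias` and its arguments are hypothetical, and the per-observation weights `w` and the scalar `w1` are taken as inputs because their construction is not fully specified in this passage.

```python
import numpy as np


def rare_event_bias(X, p_hat, w, w1):
    """Evaluate bias = (X'WX)^{-1} X'W xi for a fitted logit.

    X     : (n, k) covariate matrix
    p_hat : (n,) fitted probabilities
    w     : (n,) observation weights w_i (supplied by the caller)
    w1    : scalar weight attached to the ones (supplied by the caller)
    """
    X = np.asarray(X, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    w = np.asarray(w, dtype=float)

    # Diagonal of W = diag{p_hat_i (1 - p_hat_i) w_i}.
    W = p_hat * (1.0 - p_hat) * w

    # (X'WX)^{-1}, needed both for Q and for the final regression.
    XtWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))

    # Q_ii: diagonal elements of Q = X (X'WX)^{-1} X'.
    Q_diag = np.einsum("ij,jk,ik->i", X, XtWX_inv, X)

    # xi_i = 0.5 Q_ii [(1 + w1) p_hat_i - w1].
    xi = 0.5 * Q_diag * ((1.0 + w1) * p_hat - w1)

    # Weighted least squares of xi on X with weights W.
    return XtWX_inv @ (X.T @ (W * xi))
```

Following King and Zeng (2001), the adjusted coefficients would then be obtained by subtracting this vector from the maximum likelihood estimates. In the unweighted, intercept-only case the expression reduces to (p - 0.5) / (n p (1 - p)), which is negative whenever events are rarer than one half.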
This case-control approach offers two principal advantages over the count models
employed in most patent research. First, this method permits far more fine-grained
controls for heterogeneity in citing patents. Count models preclude the possibility of
controlling for detailed features of a citing patent. The ability to account for the attributes
of the potential citing patents proves critical, however, to testing our hypotheses, which