STATISTICAL INFERENCE

Making an inference involves drawing a general conclusion from specific observations. People do this every day. Upon arising in the morning, one observes that the sun is shining and infers that the day will be nice. The news reports the arrest of a military veteran for child abuse, and a listener infers that military veterans have special adjustment problems. Statistical inference formalizes this process of drawing general conclusions from limited information: it uses probability theory to state the degree of confidence one has in making an inference. Statistically based research allows people to move beyond speculation.

Suppose a sociologist interviews two husbands. Josh, whose wife is employed, does 50 percent of the household chores; Frank, whose wife does not work for pay, does 10 percent. Should the sociologist infer that husbands do more housework when their wives are employed? No. This difference could happen by chance with only two cases. However, what if 500 randomly selected husbands with employed wives average 50 percent of the chores and randomly selected husbands with nonemployed wives average 10 percent? Since this difference is not likely to occur by chance, the sociologist infers that husbands do more housework when their wives are employed for pay.

Researchers perform statistical inference in three different ways. Assume that 60 percent of the respondents to a survey say they will vote for Marie Chavez. The traditional hypothesis-testing approach infers that Chavez will win the election if chance processes would account for the result (60 percent support in this survey) less often than some a priori specified statistical significance level. For example, if random chance could account for the result fewer than five times in a hundred, one would say the results are statistically significant. Statistical significance levels are called alpha (e.g., α = .05 for the 5 percent level). If Chavez would get 60 percent support in a sample of the size selected less than 5 percent of the time by chance, one would infer that she will win. The researcher picked the 5 percent level of significance before doing the survey. (The test, including the α level, must be planned before one looks at the findings.) If one would get this result 6 percent of the time by chance, there is no inference. Note that not making the inference means just that: One does not infer that Chavez’s opponent will win.

A second strategy involves stating the likelihood of the result occurring by chance without an a priori level of significance. This strategy reports the result (60 percent of the sample supported Chavez) and the probability of getting that result by chance, say, .042. This gives readers the freedom to make their inferences using whatever level of significance they wish. Sam Jones, using the .01 level (α = .01) in the traditional approach, would see that the results do not meet his criterion. He would not conclude that Chavez will win. Mara Jabar, using the .05 level, would conclude that Chavez would win.

The third strategy places a confidence interval around a result. For example, a researcher may be 95 percent confident that Chavez will get between 55 percent and 65 percent of the votes. Since the entire interval, 55 percent to 65 percent, is enough for a victory (that is, it is greater than 50 percent), one infers that Chavez will win.

Each approach has an element of risk attached to the inference. That risk is the probability of getting the result by chance alone. Sociologists tend to pick low probabilities (e.g., .05, .01, and even .001), because they do not want to conclude that something is true when it is at all likely to have occurred by chance.

TRADITIONAL TESTS OF SIGNIFICANCE

Traditional tests of significance involve six steps. Three examples are used here to illustrate these steps: (1) A candidate will win an election, (2) mothers with at least one daughter will have different views on abortion than will mothers with only sons, and (3) the greater a person’s internal political efficacy is, the more likely that person is to vote.

Step 1: State a hypothesis (H1) in terms of statistical parameters (characteristics such as means, correlations, and proportions) of the population:

H1: P(vote for the candidate) > .50. [Read: The proportion of the population voting for the candidate is greater than .50.]

H2: μ(mothers with daughters) ≠ μ(mothers with sons). [Read: The mean for mothers with daughters is not equal to the mean for mothers with sons.]

H3: ρ > 0. [Read: The population correlation ρ (rho) between internal political efficacy and voting is greater than zero.]

H2 says that the means are different but does not specify the direction of the difference. This is a two-tail hypothesis, meaning that it can be significant in either direction. In contrast, H1 and H3 specify the direction of the difference and are called one-tail hypotheses.

These three hypotheses are not directly testable because each involves a range of values. Step 2 states a null hypothesis, which the researcher usually wishes to reject, that has a specific value.

H10: P(vote for the candidate) = .50.

H20: μ(mothers with daughters) = μ(mothers with sons).

H30: ρ = 0.

An important difference between one-tail and two-tail tests may have crossed the reader’s mind. Consider H10. If 40 percent of the sample supported the candidate, one fails to reject H10 because the result was in the direction opposite that of the one-tail hypothesis. In contrast, whether mothers with daughters have a higher or lower mean attitude toward abortion than do mothers with sons, one proceeds to test H20 because a difference in either direction could be significant.

Step 3 states the a priori level of significance. Sociologists usually use the .05 level. With large samples, they sometimes use the .01 or .001 level. This article uses the .05 level (α = .05). If the result would occur in fewer than 5 percent of samples (corresponding to the .05 level) if the null hypothesis were true in the population, the null hypothesis is rejected in favor of the main hypothesis.

Suppose the sample correlation between internal political efficacy and voting is .56 and a correlation this large would occur in fewer than 5 percent of samples this size if the population correlation were 0 (as specified in H30). One rejects the null hypothesis, H30, and accepts the main hypothesis, H3, that the variables are correlated in the population. What if the sample correlation were .13 and a correlation this large would occur in 25 percent of samples from a population in which the true correlation was 0? Because 25 percent exceeds the a priori significance level of 5 percent, the null hypothesis is not rejected. One cannot infer that the variables are correlated in the population. At the same time, the results do not prove that the population correlation is .00, simply that it could be that value.

Step 4 selects a test statistic and its critical value. Common test statistics include z, t, F, and χ2 (chi-square). The critical value is the value the test statistic must exceed to be significant at the level specified in step 3. For example, using a one-tail hypothesis, a z must exceed 1.645 to be significant at the .05 level. Using a two-tail hypothesis, a z must exceed 1.96 to be significant at the .05 level. For t, F, and χ2, determining the critical value is more complicated because one needs to know the degrees of freedom. A formal understanding of degrees of freedom is beyond the scope of this article, but an example will give the reader an intuitive idea. If the mean of five cases is 4 and four of the cases have values of 1, 4, 5, and 2, the last case must have a value of 8 (it is the only value for the fifth case that will give a mean of 4, since 1 + 4 + 5 + 2 + * = 20 only if * = 8, and 20/5 = 4). Thus, there are n − 1 degrees of freedom. Most test statistics have different distributions for each number of degrees of freedom.

Figure 1 illustrates the z distribution. Under the z distribution, an absolute value of greater than 1.96 will occur by chance only 5 percent of the time. By chance a z > 1.96 occurs 2.5 percent of the time and a z < – 1.96 occurs 2.5 percent of the time. Thus, 1.96 is the critical z-score for a two-tail .05 level test. The critical z-score for a one-tail test at the .05 level is 1.645 or – 1.645, depending on the direction specified in the main hypothesis.
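These critical values can be reproduced with any statistical package. A minimal sketch in Python (assuming the scipy library is available):

from scipy.stats import norm

# One-tail .05 test: the z-score that cuts off the top 5 percent
print(norm.ppf(0.95))     # 1.645

# Two-tail .05 test: the z-score that cuts off 2.5 percent in each tail
print(norm.ppf(0.975))    # 1.960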

Step 5 computes the test statistic. An example appears below.

Step 6 decides whether to reject or fail to reject the null hypothesis. If the computed test statistic exceeds the critical value, one rejects the null hypothesis and makes the inference to accept the main hypothesis. If the computed test statistic does not exceed the critical value, one fails to reject the null hypothesis and makes no inference.

Example of Six Steps Applied to H1. A random sample of 200 voters shows 60 percent of them supporting the candidate. Having stated the main hypothesis (step 1) and the null hypothesis (step 2), step 3 selects an a priori significance level of α = .05, since this is the conventional level. Step 4 selects the test statistic and its critical value. To test a single percentage, a z test is used (standard textbooks on social statistics discuss how to select the appropriate test statistics; see Agresti and Finlay 1996; Loether and McTavish 1993; Raymondo 1999; Vaughan 1997). Since the hypothesis is one-tail, the critical value is 1.645 (see Figure 1).

The fifth step computes the formula for the test statistic:

z = (p − P0) / √(P0(1 − P0)/n)

where p is the sample proportion, P0 is the proportion specified in the null hypothesis, and n is the sample size.

Thus,

z = (.60 − .50) / √((.50)(.50)/200) = .10/.0354 = 2.828

The sixth step makes the decision to reject the null hypothesis, since the difference is in the predicted direction and 2.828 > 1.645. The statistical inference is that the candidate will win the election.
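The same computation can be scripted. A minimal sketch in Python, using the numbers from the example above:

from math import sqrt

# 200 voters, 60 percent support, null value P0 = .50,
# one-tail critical z = 1.645 at the .05 level
p, P0, n = 0.60, 0.50, 200
z = (p - P0) / sqrt(P0 * (1 - P0) / n)
print(round(z, 3))    # 2.828
print(z > 1.645)      # True: reject the null hypothesis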

REPORTING THE PROBABILITY LEVEL

Many sociological researchers do not use the traditional null hypothesis model. Instead, they report the probability of the result. This way, a reader knows the probability (say, .042 or .058) rather than the significant versus not significant status. Reporting the probability level removes the “magic of the level of significance.” A result that is significant at the .058 level is not categorically different from one that is significant at the .042 level. Where the traditional null hypothesis approach says that the first of these results is not significant and the second is, reporting the probability tells the reader that there is only a small difference in the degree of confidence attached to the two results. Critics of this strategy argue that the reader may adjust the significance level post hoc; that is, the reader may raise or lower the level of significance after seeing the results. It also is argued that it is the researcher, not the reader, who is the person testing the hypotheses; therefore, the researcher is responsible for selecting an a priori level of significance.


Figure 1. Normal deviate (z) distribution.

The strategy of reporting the probability is illustrated for H1. Using the tabled values or functions in standard statistical packages, the one-tail probability of a z = 2.828 is .002. The researcher reports that the candidate had 60 percent of the vote in the sample and that the probability of getting that much support by chance is .002. This provides more information than does simply saying that it is significant at the .05 level. Results that could happen only twice in 1,000 times by chance (.002) are more compelling than are results that could happen five times in 100 (.05).
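Under the same assumptions, the probability can be read from a package rather than a table. A minimal sketch in Python with scipy:

from scipy.stats import norm

# One-tail probability of z = 2.828 under the null hypothesis
print(norm.sf(2.828))    # about .0023, which rounds to the .002 reported above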

Since journal editors want to keep papers short and studies often include many tests of significance, reporting probabilities is far more efficient than going through the six-step process outlined above. The researcher must still go through these steps, but the paper merely reports the probability for each test and places an asterisk alongside those that are significant at the .05 level. Some researchers place a single asterisk for results significant at the .05 level, two asterisks for results significant at the .01 level, and three asterisks for results significant at the .001 level.

CONFIDENCE INTERVALS

Rather than reporting the significance of a result, this approach puts a confidence interval around the result. This provides additional information in terms of the width of the confidence interval.

Using a confidence interval, a person constructs a range of values such that he or she is 95 percent confident (some use a 99 percent confidence interval) that the range contains the population parameter. The confidence interval uses a two-tail approach on the assumption that the population value can be either above or below the sample value.

For the election example, H1, the confidence interval is

p ± 1.96 √(P(1 − P)/n) = .60 ± 1.96 √((.50)(.50)/200) = .60 ± .069 = .531 to .669

The researcher is 95 percent confident that the interval, .531 to .669, contains the true population proportion. The focus is on the confidence level (.95) for a result rather than the low likelihood of the null hypothesis (.05) used in the traditional null hypothesis testing approach.

The confidence interval has more information value than do the first two approaches. Since the value specified in the null hypothesis (H0: P = .50) is not in the confidence interval, the result is statistically significant at the .05 level. Note that a 95 percent confidence level corresponds to a .05 level of significance and that a 99 percent confidence interval corresponds to a .01 level of significance. Whenever the value specified by the null hypothesis is not in the confidence interval, the result is statistically significant. More important, the confidence interval provides an estimate of the range of possible values for the population. With 200 cases and 60 percent support, there is confidence that the candidate will win, although it may be a close election with the lower limit indicating 53.1 percent of the vote or a landslide with the upper limit indicating 66.9 percent of the vote. If the sample were four times as large, n = 800, the confidence interval would be half as wide (.565-.635) and would give a better fix on the outcome.
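A minimal sketch in Python reproducing both intervals. To match the figures above, the standard error is computed at the null value P = .50 rather than at the sample proportion, a conservative choice since P(1 − P) is largest at .50:

from math import sqrt

def ci(p, n, z=1.96, P=0.50):
    # Half-width of the interval: z times the standard error sqrt(P(1-P)/n)
    half = z * sqrt(P * (1 - P) / n)
    return (round(p - half, 3), round(p + half, 3))

print(ci(0.60, 200))    # (0.531, 0.669)
print(ci(0.60, 800))    # (0.565, 0.635)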

COMPUTATION OF TESTS AND CONFIDENCE INTERVALS

Table 1 presents formulas for some common tests of significance and their corresponding confidence intervals where appropriate. These are only a sample of the tests that are commonly used, but they cover means, differences of means, proportions, differences of proportions, contingency tables, and correlations. Not included are a variety of multivariate tests for analysis of variance, regression, path analysis, and structural equation models. The formulas shown in Table 1 are elaborated in most standard statistics textbooks (Agresti and Finlay 1996; Blalock 1979; Bohrnstedt and Knoke 1998; Loether and McTavish 1993; Raymondo 1999; Vaughan 1997).

LOGIC OF STATISTICAL INFERENCE

A formal treatment of the logic of statistical inference is beyond the scope of this article; the following is a simplified description. Suppose one wants to know whether a telephone survey can be thought of as a random sample. From current census information, suppose the mean income, μ, of the community is $31,800 and the standard deviation, σ, is $12,000. A graph of the complete census enumeration appears in Panel A of Figure 2. The fact that there are a few very wealthy people skews the distribution.

A telephone survey included interviews with 1,000 households. If it is random, its sample mean and standard deviation should be close to the population parameters, μ and σ, respectively. Assume that the sample has a mean of $33,200 and a standard deviation of $10,500. To distinguish these sample statistics from the population parameters, call them M and s. The sample distribution appears in Panel B of Figure 2. Note that it is similar to the population distribution but is not as smooth.

One cannot decide whether the sample could be random by looking at Panels A and B. The distributions are different, but this difference might have occurred by chance. Statistical inference is accomplished by introducing two theoretical distributions: the sampling distribution of the mean and the z-distribution of the normal deviate. A theoretical distribution is different from the population and sample distributions in that a theoretical distribution is mathematically derived; it is not observed directly.

Table 1: Common Tests of Significance Formulas

Single mean against the value μ0 specified in H0:
H1: μ ≠ μ0. H0: μ = μ0.
Test statistic: z = (M − μ0) / (s/√n).
Large-sample confidence interval: M ± z(s/√n).

Single proportion against the value P0 specified in H0:
H1: P ≠ P0. H0: P = P0.
Test statistic: z = (p − P0) / √(P0(1 − P0)/n).
Large-sample confidence interval: p ± z√(p(1 − p)/n).

Difference between two means:
H1: μ1 ≠ μ2. H0: μ1 = μ2.
Test statistic: z = (M1 − M2) / √(s1²/n1 + s2²/n2).
Large-sample confidence interval: (M1 − M2) ± z√(s1²/n1 + s2²/n2).

Difference between two proportions:
H1: P1 ≠ P2. H0: P1 = P2.
Test statistic: z = (p1 − p2) / √(p(1 − p)(1/n1 + 1/n2)), where p is the pooled sample proportion.
Large-sample confidence interval: (p1 − p2) ± z√(p1(1 − p1)/n1 + p2(1 − p2)/n2).

Significance of contingency table:
H1: the variables are associated. H0: the variables are independent.
Test statistic: χ2 = Σ(fo − fe)²/fe, where fo and fe are the observed and expected frequencies.

Single correlation:
H1: ρ ≠ 0. H0: ρ = 0.
Test statistic: t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom.

Sampling Distribution of the Mean. Suppose that instead of taking a single random sample of 1,000 people, one took two such samples and determined the mean of each one. With 1,000 cases, it is likely that the two samples would have means that were close together but not the same. For instance, the mean of the second sample might be $30,200. These means, $33,200 and $30,200, are pretty close to each other. For a sample to have a mean of, say, $11,000, it would have to include a greatly disproportionate share of poor families; this is not likely by chance with a random sample of n = 1,000. For a sample to have a mean of, say, $115,000, it would have to have a greatly disproportionate share of rich families. In contrast, with a sample of just two individuals, one would not be surprised if the first person had an income of $11,000 and the second had an income of $115,000.

The larger the samples are, the more stable the mean is from one sample to the next. With only 20 people in the first and second samples, the means may vary a lot, but with 100,000 people in both samples, the means should be almost identical. Mathematically, it is possible to derive a distribution of the means of all possible samples of a given n even though only a single sample is observed. It can be shown that the mean of the sampling distribution of means is the population mean and that the standard deviation of the sampling distribution of the means is the population standard deviation divided by the square root of the sample size. The standard deviation of the mean is called the standard error of the mean:


Figure 2. Four distributions used in statistical inference: (A) population distribution; (B) sample distribution; (C) sampling distributions for n=100 and n=1,000; and (D) normal deviate (z) distribution

 

σM = σ / √n

This is an important derivation in statistical theory. Panel C shows the sampling distribution of the mean when the sample size is n = 1,000. It also shows the sampling distribution of the mean for n = 100. A remarkable property of the sampling distribution of the mean is that with a large sample size, it will be normally distributed even though the population and sample distributions are skewed.
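This property can be checked by simulation. A minimal sketch in Python with numpy (the skewed population here is illustrative, not the census data in the text):

import numpy as np

rng = np.random.default_rng(0)

# A right-skewed "population" of incomes, roughly like Panel A
population = rng.lognormal(mean=10.3, sigma=0.35, size=1_000_000)

# Draw 10,000 random samples of n = 1,000 and record each sample mean
means = rng.choice(population, size=(10_000, 1_000)).mean(axis=1)

# The sample means center on the population mean, their standard deviation
# approaches sigma / sqrt(n), and their distribution is close to normal
print(population.mean(), means.mean())
print(population.std() / np.sqrt(1_000), means.std())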

One gets a general idea of how the sample did by seeing where the sample mean falls along the sampling distribution of the mean. Using Panel C for n = 1,000, the sample M = $33,200 is a long way from the population mean. Very few samples with n = 1,000 would have means this far away from the population mean. Thus, one infers that the sample mean probably is based on a nonrandom sample.

Using the distribution in Panel C for the smaller sample size, n = 100, the sample M = $33,200 is not so unusual. With 100 cases, one should not be surprised to get a sample mean this far from the population mean.

Being able to compare the sample mean to the population mean by using the sampling distribution is remarkable, but statistical theory allows more precision. One can transform the values in the sampling distribution of the mean to a distribution of a test statistic. The appropriate test statistic is the distribution of the normal deviate, or z-distribution. It can be shown that

z = (M − μ) / σM = (M − μ) / (σ/√n)

If the z-value were computed for the mean of all possible samples taken at random from the population, it would be distributed as shown in Panel D of Figure 2. It will be normal, have a mean of zero, and have a variance of 1.

Where is M = $33,200 under the distribution of the normal deviate using the sample size of n = 1,000? Its z-score using the above formula is

z = (33,200 − 31,800) / (12,000/√1,000) = 1,400/379.5 = 3.69

Using tabled values for the normal deviate, the probability of a random sample of 1,000 cases from a population with a mean of $31,800 having a sample mean of $33,200 is less than .001. Thus, it is extremely unlikely that the sample is purely random.

With the same sample mean but with a sample of only 100 people,

z = (33,200 − 31,800) / (12,000/√100) = 1,400/1,200 = 1.17

Using tabled values for a two-tail test, the probability of getting the sample mean this far from the population mean with a sample of 100 people is .250. One should not infer that the sample is nonrandom, since these results could happen 25 percent of the time by chance.

The four distributions can be described for any sample statistic one wants to test (means, differences of means, proportions, differences of proportions, correlations, etc.). While many of the calculations will be more complex, their logic is identical.

MULTIPLE TESTS OF SIGNIFICANCE

The logic of statistical inference applies to testing a single hypothesis. Since most studies include multiple tests, interpreting results can become extremely complex. If a researcher conducts 100 tests, about 5 of them should yield results that are statistically significant at the .05 level by chance alone. Therefore, a study that includes many tests may find some “interesting” results that appear statistically significant but that really are an artifact of the number of tests conducted.

Sociologists pay less attention to “adjusting the error rate” than do those in most other scientific fields. A conservative approach is to divide the Type I error rate by the number of tests conducted. This is known as the Dunn multiple-comparison test, based on the Bonferroni inequality. For example, instead of doing nine tests at the .05 level, each test is done at the .05/9 = .006 level. To be viewed as statistically significant at the .05 level, each specific test must be significant at the .006 level.
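A minimal sketch of the adjustment in Python (the p-values listed are hypothetical):

# Bonferroni adjustment: to hold the overall Type I error rate at .05
# across nine tests, evaluate each individual test at .05 / 9
alpha, n_tests = 0.05, 9
per_test = alpha / n_tests
print(round(per_test, 3))    # 0.006

p_values = [0.001, 0.004, 0.020, 0.048]     # hypothetical results
print([p < per_test for p in p_values])     # only the first two are significant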

There are many specialized multiple comparison procedures, depending on whether the tests are planned before the study starts or after the results are known. Brown and Melamed (1990) describe these procedures.

POWER AND TYPE I AND TYPE II ERRORS

To this point, only one type of probability has been considered. Sociologists use statistical inference to minimize the chance of accepting a main hypothesis that is false in the population. They reject the null hypothesis only if the chances of its being true in the population are very small, say, α = .05. Still, by minimizing the chances of this error, sociologists increase the chance of failing to reject the null hypothesis when it should be rejected. Table 2 illustrates these two types of error.

Type I, or α, error is the probability of rejecting H0 falsely, that is, the error of deciding that H1 is right when H0 is true in the population. If one were testing whether a new program reduced drug abuse among pregnant women, H1 would be that the program did so and H0 would be that the program was no better than the existing one. Type I error should be minimized because it would be wrong to change programs when the new program is no better than the existing one. Type I error has been described as “the chances of discovering things that aren’t so” (Cohen 1990, p. 1304). The focus on Type I error reflects a conservative view among scientists. Controlling Type I error guards against doing something new (as specified by H1) when it is not going to be helpful.

Table 2: Type I (α) and Type II (β) Errors

                                 True Situation in the Population
Decision Made by the Researcher  H0, the null hypothesis, is true   H1, the main hypothesis, is true
Fail to reject H0                Correct decision                   Type II (β) error
Reject H0 (accept H1)            Type I (α) error                   Correct decision

Type II, or β, error is the probability of failing to reject H0 when H1 is true in the population. If one failed to reject the null hypothesis that the new program was no better (H0) when it was truly better (H1), one would put newborn children at needless risk. Type II error is the chance of missing something new (as specified by H1) when it really would be helpful.

Power is 1 − β. Power measures the likelihood of rejecting the null hypothesis when the alternative hypothesis is true. Thus, if there is a real effect in the population, a study that has a power of .80 can reject the null hypothesis with a likelihood of .80. The power of a statistical test is measured by how likely it is to do what one usually wants to do: demonstrate support for the main hypothesis when the main hypothesis is true. Using the example of a treatment for drug abuse among pregnant women, the power of a test is the ability to demonstrate that the program is effective if this is really true.

Power can be increased. First, get a larger sample. The larger the sample, the more power to find results that exist in the population. Second, increase the α level. Rather than using the .01 level of significance, a researcher can pick the .05 or even the .10 level. The larger α is, the more powerful the test is in its ability to reject the null hypothesis when the alternative is true.

There are problems with both approaches. Increasing sample size makes the study more costly. If there are risks to the subjects who participate, adding cases exposes additional people to that risk. An example of this would be a study that exposed subjects to a new drug treatment program that might create more problems than it solved. A larger sample will expose more people to these risks.

Since Type I and Type II errors are inversely related, raising α reduces β, thus increasing the power of the test. However, sociologists are hesitant to raise α, since doing so increases the chance of deciding something is important when it is not. With a small sample, using a small α level such as .001 means there is a great risk of β error. Many small-scale studies have a Type II error of over .50. This is common in research areas that rely on small samples. For example, a review of one volume of the Journal of Abnormal Psychology (this journal includes many small-sample studies) found that those studies averaged a Type II error of .56 (Cohen 1990). This means the psychologists had inadequate power to reject the null hypothesis when H1 was true. When H1 was true, the chance of rejecting H0 (i.e., power) was worse than that of flipping a coin.

Some areas that rely on small samples because of the cost of gathering data or to minimize the potential risk to subjects require researchers to plan their sample sizes to balance α, power, sample size, and the minimum size of effect that is theoretically important. For example, if a correlation of .1 is substantively significant, a power of .80 is important, and α = .01 is desired, a very large sample is required. If a correlation is substantively and theoretically important only if it is over .5, a much smaller sample is adequate. Procedures for doing a power analysis are available in Cohen (1988); see also Murphy and Myors (1998).
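A minimal sketch of such a calculation in Python, using the Fisher r-to-z approximation for the sample size needed to detect a population correlation (one standard approximation, not necessarily the exact procedure in Cohen 1988; scipy assumed):

from math import atanh
from scipy.stats import norm

def n_for_correlation(rho, alpha=0.01, power=0.80, tails=2):
    # Fisher approximation: n = ((z_alpha + z_power) / atanh(rho))^2 + 3
    z_alpha = norm.ppf(1 - alpha / tails)
    z_power = norm.ppf(power)
    return ((z_alpha + z_power) / atanh(rho)) ** 2 + 3

# Detecting a correlation of .1 with alpha = .01 and power = .80
print(round(n_for_correlation(0.1)))    # about 1,163 cases: a very large sample
# Detecting a correlation of .5 under the same criteria
print(round(n_for_correlation(0.5)))    # about 42 cases: a much smaller sample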

Power analysis is less important for many sociological studies that have large samples. With a large sample, it is possible to use a conservative α error rate and still have sufficient power to reject the null hypothesis when H1 is true. Therefore, sociologists pay less attention to β error and power than do researchers in fields such as medicine and psychology. When a sociologist has a sample of 10,000 cases, the power is over .90 that he or she will detect a very small effect as statistically significant. When tests are extremely powerful at detecting small effects, researchers must focus on the substantive significance of the effects. A correlation of .07 may be significant at the .05 level with 10,000 cases, but that correlation is substantively trivial.

STATISTICAL AND SUBSTANTIVE SIGNIFICANCE

Some researchers and many readers confuse statistical significance with substantive significance. Statistical inference does not ensure substantive significance, that is, ensure that the result is important. A correlation of .1 shows a weak relationship between two variables whether it is statistically significant or not. With a sample of 100 cases, this correlation will not be statistically significant; with a sample of 10,000 cases, it will be statistically significant. The smaller sample shows a weak relationship that might be a zero relationship in the population. The larger sample shows a weak relationship that is all but certainly a weak relationship in the population, although it is not zero. In this case, the statistical significance allows one to be confident that the relationship in the population is substantively weak.

Whenever a person reads that a result is statistically significant, he or she is confident that there is some relationship. The next step is to decide whether it is substantively significant or substantively weak. Power analysis is one way to make this decision. One can illustrate this process by testing the significance of a correlation. A population correlation of .1 is considered weak, a population correlation of .3 is considered moderate, and a population correlation of .5 or more is considered strong. In other words, if a correlation is statistically significant but .1 or lower, one has to recognize that this is a weak relationship—it is statistically significant but substantively weak. It is just as important to explain to the readers that the relationship is substantively weak as it is to report that it is statistically significant. By contrast, if a sample correlation is .5 and is statistically significant, one can say the relationship is both statistically and substantively significant.

Figure 3 shows power curves for testing the significance of a correlation. These curves illustrate the need to be sensitive to both statistical significance and substantive significance. The curve on the extreme left shows the power of a test to show that a sample correlation, r, is statistically significant when the population correlation, ρ (rho), is .5. With a sample size of around 100, the power of a test to show statistical significance approaches 1.0, or 100 percent. This means that any correlation that is this strong in the population can be shown to be statistically significant with a small sample.

What happens when the correlation in the population is weak? Suppose the true correlation in the population is .2. A sample with 500 cases almost certainly will produce a sample correlation that is statistically significant, since the power approaches 1.0. Many sociological studies have 500 or more cases and produce results showing that substantively weak relationships, ρ = .2, are statistically significant. Figure 3 shows that even if the population correlation is just .1, a sample of 1,000 cases has the power to show a sample result that is statistically significant. Thus, any time a sample is 1,000 or larger, one has to be especially careful to avoid confusing statistical and substantive significance.

The guidelines for distinguishing between statistical and substantive significance are direct but often are ignored by researchers:

1. If a result is not statistically significant, regardless of its size in the sample, one should be reluctant to generalize it to the population.

2. If a result is statistically significant in the sample, this means that one can generalize it to the population but does not indicate whether it is a weak or a strong relationship.

3. If a result is statistically significant and strong in the sample, one can both generalize it to the population and assert that it is substantively significant.

4. If a result is statistically significant and weak in the sample, one can both generalize it to the population and assert that it is substantively weak in the population.

This reasoning applies to any test of significance. If a researcher found that girls have an average score of 100.2 on verbal skills and boys have an average score of 99.8, with both girls and boys having a standard deviation of 10, one would think of this as a very weak relationship. If one constructed a histogram for girls and for boys, one would find them almost identical. This difference is not substantively significant. However, if there were a sufficiently large sample of girls and boys, say, n = 10,000, it could be shown that the difference is statistically significant. The statistical significance means that there is some difference, that the means for girls and boys are not identical. It is necessary to use judgment, however, to determine that the difference is substantively trivial. An abuse of statistical inference that can be committed by sociologists who do large-scale research is to confuse statistical and substantive significance.
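A minimal sketch of this comparison in Python (assuming, for concreteness, 10,000 girls and 10,000 boys; scipy assumed):

from math import sqrt
from scipy.stats import norm

# Girls M = 100.2, boys M = 99.8, both with SD = 10, n = 10,000 per group
diff = 100.2 - 99.8
se = sqrt(10**2 / 10_000 + 10**2 / 10_000)
z = diff / se
print(round(z, 2))                  # 2.83: statistically significant
print(round(2 * norm.sf(z), 3))     # two-tail p of about .005
print(round(diff / 10, 2))          # effect size of .04 standard deviations: trivial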


Figure 3. Power of test of r, α = .05

NONRANDOM SAMPLES AND STATISTICAL INFERENCE

Very few researchers use true random samples. Sometimes researchers use convenience sampling. An example is a social psychologist who has every student in a class participate in an experiment. The students in this class are not a random sample of the general population or even of students in a university. Should statistical inference be used here?

Other researchers may use the entire population. If one wants to know if male faculty members are paid more than female faculty members at a particular university, one may check the payroll for every faculty member. There is no sample—one has the entire population. What is the role of statistical inference in this instance?

Many researchers would use a test of significance in both cases, although the formal logic of statistical inference is violated. They are taking a “what if” approach. If the results they find could have occurred by a random process, they are less confident in their results than they would be if the results were statistically significant. Economists and demographers often report statistical inference results when they have the entire population. For example, if one examines the unemployment rates of blacks and whites over a ten-year period, one may find that the black rate is about twice the white rate. If one does a test of significance, it is unclear what the population is to which one wants to generalize. A ten-year period is not a random selection of all years. The rationale for doing statistical inference with population data and nonprobability samples is to see whether the results could be attributed to a chance process.

A related problem is that most surveys use complex sample designs rather than strictly random designs. A stratified sample or a clustered sample may be used to increase efficiency or reduce the cost of a survey. For example, a study might take a random sample of 20 high schools from a state and then interview 100 students from each of those schools. This survey will have 2,000 students but will not be a random sample because the 100 students from each school will be more similar to each other than to 100 randomly selected students. For instance, the 100 students from a school in a ghetto may mostly have minority status and mostly be from families that have a low income in a population with a high proportion of single-parent families. By contrast, 100 students from a school in an affluent suburb may be disproportionately white and middle class.

The standard statistical inference procedures discussed here, which are used in most introductory statistics texts and in computer programs such as SAS and SPSS, assume random sampling. When a different sampling design is used, such as a cluster design, a stratified sample, or a longitudinal design, the test of significance will be biased. In most cases, the test of significance will underestimate the standard errors and thus overestimate the test statistic (z, t, F). The extent to which this occurs is known as the “design effect.” The most typical design effect is greater than 1.0, meaning that the computed test statistic is larger than it should be. Specialized programs allow researchers to estimate design effects and incorporate them in the computation of the test statistics. The most widely used of these procedures are WesVar, which is available from SPSS, and SUDAAN, a stand-alone program. Neither program has been widely used by sociologists, but their use should increase in the future.
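The arithmetic of the adjustment is simple even though estimating the design effect itself is not. A minimal sketch in Python (the standard error and design effect here are hypothetical values, not output from WesVar or SUDAAN):

from math import sqrt

def adjusted_se(se_srs, deff):
    # The complex-design standard error is the simple-random-sampling
    # standard error inflated by the square root of the design effect
    return se_srs * sqrt(deff)

se_srs = 0.011    # hypothetical SE computed as if the sample were random
deff = 2.0        # hypothetical design effect for a clustered school sample
print(adjusted_se(se_srs, deff))    # the corrected, larger standard error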
