Hypothesis testing or significance testing is undoubtedly one of the most widely used quantitative methodologies in empirical research in the social sciences. It is one viable way to use statistics to examine a hypothesis in light of observations or sample information. The starting point of hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Then a test statistic is chosen to summarize the sample information, and its value is taken as an indication of the strength of sample evidence against the null hypothesis.
Modern hypothesis testing dates to the 1920s and the work of Ronald Aylmer Fisher (1890-1962) on the one hand, and Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980) on the other. Fisher (1925) refers to hypothesis testing as significance testing (this entry does not distinguish between the two terms). In the Fisherian approach, the observed test statistic is converted to the P-value, which is the probability, computed under the null model, of obtaining a value of the test statistic at least as extreme as the one observed; the smaller the P-value, the stronger the sample evidence against the null hypothesis. An early example of significance testing in this sense was conducted in 1735 by the Swiss mathematicians John Bernoulli (1667-1748) and his son Daniel Bernoulli (1700-1782). They tested whether the inclinations of the planetary orbits are randomly (uniformly) distributed. A detailed discussion of their original results and subsequent modifications can be found in Anders Hald (1998).
In the Neyman and Pearsonian (1928, 1933) approach, an alternative hypothesis is specified and the null hypothesis is tested against this alternative hypothesis. The specification of an alternative hypothesis allows the computation of the probabilities of two types of error: Type I error (the error of falsely rejecting a null hypothesis) and Type II error (the error of incorrectly accepting a null hypothesis).
The probability of Type I error is also referred to as the significance level of the test, and one minus the probability of Type II error as the power of the test. Given that the two error probabilities cannot be minimized simultaneously, the common practice is to specify the level of significance and then use a test that maximizes power subject to the given significance level. In the Fisherian approach, the P-value is reported without necessarily announcing the rejection or nonrejection of the null hypothesis, whereas in the Neyman and Pearsonian approach, the null hypothesis is either rejected in favor of the alternative hypothesis or not rejected at the given significance level. E. L. Lehmann (1993) provides a more detailed comparison of the two approaches.
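The trade-off between the two error probabilities can be illustrated with a minimal sketch. It assumes a setting that is not taken from this entry: a one-sided z-test of a zero mean against a positive mean, with a known standard deviation. The function name `z_test_power` and all numerical values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha):
    """Power of a one-sided z-test of H0: mu = 0 against H1: mu = effect > 0.

    Assumes a normal population with known standard deviation sigma
    and a random sample of size n (illustrative assumptions).
    """
    std_normal = NormalDist()
    # Reject H0 when the z-statistic exceeds this critical value,
    # which fixes the Type I error probability at alpha.
    critical = std_normal.inv_cdf(1 - alpha)
    # Under H1 the z-statistic is normal with this mean and unit variance.
    shift = effect * n ** 0.5 / sigma
    # Type II error probability: failing to reject although H1 is true.
    type_ii = std_normal.cdf(critical - shift)
    return 1 - type_ii


# Lowering alpha (fewer false rejections) lowers the power as well.
for alpha in (0.01, 0.05, 0.10):
    power = z_test_power(effect=0.5, sigma=1.0, n=25, alpha=alpha)
    print(f"alpha = {alpha:.2f}  power = {power:.3f}")
```

When the effect is set to zero, the returned value equals the significance level, since rejecting a true null hypothesis is exactly the Type I error.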
In empirical research, a mixture of the two approaches is typically adopted. Consider the linear regression model:

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{K-1} X_{K-1} + \varepsilon, \qquad (1)$$

where $Y$ is the dependent variable, $X_1, \ldots, X_{K-1}$ are the explanatory variables, and $\varepsilon$ is the error term. The coefficients $\beta_1, \ldots, \beta_{K-1}$ measure the effects of the explanatory variables on the dependent variable. The significance of these effects is routinely tested by the t-tests and the F-test. The t-test was discovered by William Sealy Gosset (1876-1937) for the mean of a normal population and extended by Fisher in 1925 to other contexts, including regression coefficients. Gosset's result was published in Biometrika under the pseudonym "Student" in 1908. The F-test was originally developed by Fisher in the context of testing the ratio of two variances. Fisher pointed out many other applications of the F-test, including the significance of the complete regression model.
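The t-statistic for the explanatory variable $X_i$ is the ratio of the estimated coefficient to its standard error, $t_i = \hat{\beta}_i / \mathrm{se}(\hat{\beta}_i)$, and it tests the null hypothesis $H_{0i}\colon \beta_i = 0$, where $\beta_i$ is the coefficient of $X_i$ in (1). Under this null hypothesis, $X_i$ is absent from the regression model (1) and thus considered to be insignificant in explaining the dependent variable given the presence of the other explanatory variables. This is why t-tests are referred to as tests for the significance of individual variables, as opposed to the F-test, which tests for the significance of the complete regression. The null hypothesis for the F-test is

$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_{K-1} = 0.$$

There are several equivalent formulas for computing the F-statistic, one of which is

$$F = \frac{R^2/(K-1)}{(1-R^2)/(n-K)},$$

where $R^2$ is the coefficient of determination, $n$ is the sample size, and $K$ is the number of coefficients in (1), including the intercept. Since under $H_0$ all the explanatory variables can be dropped from (1), the F-test is a test for the significance of the complete regression.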
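As a rough illustration of how the two statistics are computed, the sketch below fits a regression with a single explanatory variable by ordinary least squares. It assumes the standard textbook formulas: the t-statistic is the slope estimate divided by its standard error, and the F-statistic is (R²/(K − 1)) / ((1 − R²)/(n − K)), with n the sample size and K = 2 coefficients (intercept and slope). The function name `simple_ols` and the data are hypothetical.

```python
def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x, with t- and F-statistics."""
    n = len(x)
    K = 2  # number of coefficients: intercept and one slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares give the coefficient of determination.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - rss / tss
    # Standard error of the slope under the classical assumptions.
    se_slope = (rss / (n - K) / sxx) ** 0.5
    t_stat = slope / se_slope
    f_stat = (r2 / (K - 1)) / ((1 - r2) / (n - K))
    return {"slope": slope, "intercept": intercept, "R2": r2,
            "t": t_stat, "F": f_stat}


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]
result = simple_ols(x, y)
print(result)
```

With only one explanatory variable, the significance of that variable and the significance of the complete regression coincide, so the F-statistic equals the square of the t-statistic. This offers a quick consistency check on the two formulas.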
Most statistical software packages routinely calculate the t-statistics and the F-statistic. For a given sample, the observed value of $t_i$ ($F$) summarizes the sample evidence on the significance of the explanatory variable $X_i$ (the significance of the regression (1)). To either convert the observed value of $t_i$ ($F$) to the P-value or make a binary decision on the rejection or nonrejection of the null hypothesis $H_{0i}$ ($H_0$) at a given significance level, the distribution of $t_i$ ($F$) under the corresponding null hypothesis is required. If the null hypothesis is true and further assumptions on the nature of the sample and on the normality of the error in (1) hold, the distribution of $t_i$ is known to be Student's t with $n - K$ degrees of freedom and that of $F$ is known to be the F distribution with $K - 1$ and $n - K$ degrees of freedom, where $n$ is the sample size and $K$ is the number of coefficients in (1). Knowing the distribution of the test statistic under the null hypothesis allows the computation of the P-value or the computation of the appropriate critical value at a prespecified significance level with which the observed test statistic can be compared.
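The two uses of the null distribution, converting an observed statistic to a P-value and finding a critical value, can be sketched as follows. To stay within the Python standard library, the sketch approximates Student's t distribution by simulation rather than using exact formulas; in practice a statistical library would normally supply exact values. The function names, the degrees of freedom, and the observed value 2.5 are hypothetical.

```python
import random


def simulate_t_null(df, n_draws=50_000, seed=0):
    """Simulate draws from Student's t with df degrees of freedom.

    Uses the representation Z / sqrt(chi2 / df), with Z standard normal
    and chi2 an independent chi-square variable with df degrees of freedom.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        # A chi-square(df) variable is a gamma variable with shape df/2, scale 2.
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        draws.append(z / (chi2 / df) ** 0.5)
    return draws


def two_sided_p_value(t_observed, null_draws):
    """Share of null draws at least as extreme as the observed statistic."""
    extreme = sum(1 for t in null_draws if abs(t) >= abs(t_observed))
    return extreme / len(null_draws)


def two_sided_critical_value(alpha, null_draws):
    """Approximate (1 - alpha) quantile of |t| under the null."""
    abs_sorted = sorted(abs(t) for t in null_draws)
    index = min(int((1 - alpha) * len(abs_sorted)), len(abs_sorted) - 1)
    return abs_sorted[index]


null_draws = simulate_t_null(df=10)
t_observed = 2.5  # hypothetical observed t-statistic
p_value = two_sided_p_value(t_observed, null_draws)
critical = two_sided_critical_value(0.05, null_draws)
print(f"P-value: {p_value:.3f}")
print(f"5% critical value: {critical:.3f}")
print("reject" if abs(t_observed) > critical else "do not reject")
```

Reporting `p_value` corresponds to the Fisherian use of the test statistic, while comparing `t_observed` with `critical` corresponds to the Neyman and Pearsonian binary decision.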
Like the t-tests and the F-test, standard tests rely on further assumptions in addition to the truth of the null hypothesis, such as the assumption of a random sample and the normality of the error term. These further assumptions may not be met in typical applications in the social sciences, and tests designed on the basis of these assumptions must then be modified. For example, when normality of the error term is not met, the exact distributions of the t-statistics and the F-statistic under the null hypothesis are generally no longer Student's t and F. Their asymptotic distributions, however, are known under general conditions and may be used to perform these tests. Alternatively, resampling techniques, such as the bootstrap and subsampling, may be used to approximate the distributions of the test statistics under the null hypothesis (see Efron and Tibshirani and Politis et al. for excellent introductions to these methods).
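A minimal sketch of the bootstrap idea follows. It uses a simpler setting than the regression model, a test that a population mean equals a hypothesized value, and it imposes the null hypothesis by recentering the sample before resampling. This is one common textbook variant and is not necessarily the procedure described in the works cited above; the function names and the data are hypothetical.

```python
import random
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """One-sample t-statistic for H0: population mean equals mu0."""
    se = stdev(sample) / len(sample) ** 0.5
    if se == 0.0:
        # Degenerate resample with no variation: treat as maximally extreme.
        return float("inf")
    return (mean(sample) - mu0) / se


def bootstrap_p_value(sample, mu0, n_boot=2000, seed=0):
    """Approximate a two-sided P-value without assuming normality."""
    rng = random.Random(seed)
    observed = t_statistic(sample, mu0)
    # Impose the null hypothesis: shift the data so their mean equals mu0.
    center = mean(sample)
    shifted = [x - center + mu0 for x in sample]
    extreme = 0
    for _ in range(n_boot):
        # Resample with replacement from the recentered data.
        resample = rng.choices(shifted, k=len(shifted))
        if abs(t_statistic(resample, mu0)) >= abs(observed):
            extreme += 1
    return extreme / n_boot


# Made-up, visibly skewed data, for illustration only.
sample = [0.2, 0.5, 0.1, 1.9, 0.4, 0.3, 3.2, 0.6, 0.8, 0.2, 1.1, 0.7]
print(bootstrap_p_value(sample, mu0=0.5))
```

The resampled statistics play the role that Student's t distribution plays under normality: they approximate the distribution of the test statistic under the null hypothesis.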
The issue that has generated the most debate in hypothesis testing from the beginning is the choice of significance level (Henkel 1976). Given any value of the test statistic, one can always force nonrejection by specifying a low enough significance level or force rejection by choosing a high enough significance level. Although reporting the P-value partly alleviates this arbitrariness in setting the significance level, it is desirable to report estimates of the parameters of interest and their standard errors or confidence intervals so that the likely values of the unknown parameters and the precision of their estimates can be assessed.
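The dependence of the decision on the chosen significance level, and the additional information carried by a confidence interval, can be seen in a short sketch. It assumes a large-sample normal approximation for the estimator, which is an assumption of the example and not a claim of this entry; the function name `summarize_estimate`, the estimate, and the standard error are hypothetical.

```python
from statistics import NormalDist


def summarize_estimate(estimate, std_error, level=0.95):
    """Return the z-statistic, two-sided P-value, and confidence interval.

    Tests H0: parameter = 0, assuming the estimator is approximately normal.
    """
    std_normal = NormalDist()
    z = estimate / std_error
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    half_width = std_normal.inv_cdf(0.5 + level / 2) * std_error
    return z, p_value, (estimate - half_width, estimate + half_width)


z, p_value, ci = summarize_estimate(estimate=1.0, std_error=0.5)
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")

# The same evidence leads to different decisions at different levels.
for alpha in (0.01, 0.05, 0.10):
    decision = "reject" if p_value < alpha else "do not reject"
    print(f"alpha = {alpha:.2f}: {decision} H0")
```

The interval shows both the range of plausible parameter values and the precision of the estimate, neither of which can be read from a reject or do-not-reject decision alone.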