G2 To Guy, William Augustus (1810-1885) (Statistics)

G2:

Symbol for the goodness-of-fit test statistic based on the likelihood ratio, often used when fitting log-linear models. Specifically given by

\[ G^2 = 2 \sum O \ln\left(\frac{O}{E}\right) \]

where O and E denote observed and expected frequencies. Also used more generally to denote deviance.
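
As an illustration, the statistic can be computed directly from observed and expected frequencies. The sketch below assumes the usual likelihood-ratio form, G2 = 2 Σ O ln(O/E), with empty cells (O = 0) contributing zero; the function name and the example counts are hypothetical.

```python
import math

def g_squared(observed, expected):
    """Likelihood-ratio goodness-of-fit statistic, 2 * sum(O * ln(O / E))."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # a cell with O = 0 contributes nothing (limit of O * ln O)
            total += o * math.log(o / e)
    return 2.0 * total

# Hypothetical counts: 60 observations, equal expected frequencies in three cells.
observed = [18, 22, 20]
expected = [20.0, 20.0, 20.0]
print(round(g_squared(observed, expected), 4))
```

When observed and expected frequencies agree exactly the statistic is zero, and it grows as the two diverge.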

Gabor regression:

An approach to the modelling of time-frequency surfaces that consists of a Bayesian regularization scheme in which prior distributions over the time-frequency coefficients are constructed to favour both smoothness of the estimated function and sparseness of the coefficient representation. [Journal of the Royal Statistical Society, Series B, 2004, 66, 575-89.]

Gain:

Synonym for power transfer function.

Galbraith plot:

A graphical method for identifying outliers in a meta-analysis. The standardized effect size is plotted against precision (the reciprocal of the standard error). If the studies are homogeneous, they should be distributed within ±2 standard errors of the regression line through the origin.

Galton, Sir Francis (1822-1911):

Born in Birmingham, Galton studied medicine at London and Cambridge, but achieved no great distinction. Upon receiving his inheritance he eventually abandoned his studies to travel in North and South Africa in the period 1850-1852 and was given the gold medal of the Royal Geographical Society in 1853 in recognition of his achievements in exploring the then unknown area of Central South West Africa and establishing the existence of anticyclones. In the early 1860s he turned to meteorology where the first signs of his statistical interests and abilities emerged. His later interests ranged over psychology, anthropology, sociology, education and fingerprints but he remains best known for his studies of heredity and intelligence which eventually led to the controversial field he referred to as eugenics, the evolutionary doctrine that the condition of the human species could most effectively be improved through a scientifically directed process of controlled breeding. His first major work was Hereditary Genius, published in 1869, in which he argued that mental characteristics are inherited in the same way as physical characteristics. This line of thought led, in 1876, to the very first behavioural study of twins in an endeavour to distinguish between genetic and environmental influences. Galton applied somewhat naive regression methods to the heights of brothers in his book Natural Inheritance and in 1888 proposed the index of co-relation, later elaborated by his student Karl Pearson into the correlation coefficient. Galton was a cousin of Charles Darwin and was knighted in 1909. He died on 17 January 1911 in Surrey.


Galton-Watson process:

A commonly used name for what is more properly called the Bienaymé-Galton-Watson process.

GAM:

Abbreviation for geographical analysis machine.

Gambler’s fallacy:

The belief that if an event has not happened for a long time it is bound to occur soon.

Gambler’s ruin problem:

A term applied to a game in which a player wins or loses a fixed amount with probabilities $p$ and $q$ ($p \ne q$). The player’s initial capital is $z$ and he plays an adversary with capital $a - z$. The game continues until the player’s capital is reduced to zero or increased to $a$, i.e. until one of the two players is ruined. The probability of ruin for the player starting with capital $z$ is $q_z$ given by

\[ q_z = \frac{(q/p)^a - (q/p)^z}{(q/p)^a - 1} \]
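
The ruin probability is easy to evaluate numerically. The following sketch assumes the classical closed form $q_z = [(q/p)^a - (q/p)^z]/[(q/p)^a - 1]$ for $p \ne q$, and falls back on the limiting value $1 - z/a$ for a fair game ($p = q$), a case the entry itself does not cover; the function name is hypothetical.

```python
def ruin_probability(z, a, p):
    """Probability that a player starting with capital z is ruined before
    reaching total capital a, when each round is won with probability p."""
    if a <= 0 or not (0 <= z <= a):
        raise ValueError("need a > 0 and 0 <= z <= a")
    if not (0.0 < p < 1.0):
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        # Fair game: limiting form of the formula as p approaches q.
        return 1.0 - z / a
    ratio = q / p
    return (ratio ** a - ratio ** z) / (ratio ** a - 1.0)

# A player with capital 10 facing an adversary with capital 90,
# with a slightly unfavourable game (p = 0.49).
print(ruin_probability(10, 100, 0.49))
```

Even a small disadvantage per round makes ruin very likely when the adversary's capital is much larger than the player's.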

Game theory:

The branch of mathematics that deals with the theory of contests between two or more players under specified sets of rules. The subject assumes a statistical aspect when part of the game proceeds under a chance scheme.

Gamma distribution:

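The probability distribution, $f(x)$, given by

\[ f(x) = \frac{1}{\beta\,\Gamma(\gamma)} \left(\frac{x}{\beta}\right)^{\gamma-1} \exp\left(-\frac{x}{\beta}\right), \quad 0 \le x < \infty,\ \beta > 0,\ \gamma > 0 \]

$\beta$ is a scale parameter and $\gamma$ a shape parameter. Examples of the distribution are shown in Fig. 69. The mean, variance, skewness and kurtosis of the distribution are as follows.

\[ \text{mean} = \beta\gamma, \quad \text{variance} = \beta^2\gamma, \quad \text{skewness} = 2\gamma^{-1/2}, \quad \text{kurtosis} = 3 + \frac{6}{\gamma} \]

The distribution of $u = x/\beta$ is the standard gamma distribution with corresponding density function given by

\[ f(u) = \frac{u^{\gamma-1} e^{-u}}{\Gamma(\gamma)}, \quad u \ge 0 \]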

[STD Chapter 18.]

Gamma function:

The function $\Gamma$ defined by

\[ \Gamma(r) = \int_0^\infty t^{r-1} e^{-t}\,dt \]

where $r > 0$ ($r$ need not be an integer). The function is recursive, satisfying the relationship

\[ \Gamma(r+1) = r\,\Gamma(r) \]

The integral

\[ \int_0^T t^{r-1} e^{-t}\,dt \]

is known as the incomplete gamma function.
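
A short numerical check may make the two properties concrete. The sketch below uses `math.gamma` from the Python standard library to confirm the recursion $\Gamma(r+1) = r\,\Gamma(r)$, and approximates the incomplete gamma integral $\int_0^T t^{r-1}e^{-t}\,dt$ with a simple midpoint rule. The helper name and the choice of integration rule are assumptions made for illustration, and the approximation is only intended for $r \ge 1$.

```python
import math

def incomplete_gamma(r, upper, steps=100_000):
    """Midpoint-rule approximation to the integral of t**(r-1) * exp(-t)
    over [0, upper]; intended for r >= 1, where the integrand is bounded."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h          # midpoint of the i-th subinterval
        total += t ** (r - 1) * math.exp(-t)
    return h * total

r = 3.5
# Recursion: Gamma(r + 1) should equal r * Gamma(r).
print(math.isclose(math.gamma(r + 1), r * math.gamma(r)))
# With a large upper limit the incomplete integral approaches Gamma(r).
print(incomplete_gamma(r, 60.0), math.gamma(r))
```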

Gap statistic:

A statistic for estimating the number of clusters in applications of cluster analysis. Applicable to virtually any clustering method, but in terms of K-means cluster analysis, the statistic is defined specifically as

\[ \mathrm{Gap}_n(k) = E_n^*[\log(W_k)] - \log(W_k) \]

where $W_k$ is the pooled within-cluster sum of squares around the cluster means and $E_n^*$ denotes expectation under a sample of size $n$ from the reference distribution. The estimate of the number of clusters, $\hat{k}$, is the value of $k$ maximizing $\mathrm{Gap}_n(k)$ after the sampling distribution has been accounted for.

Fig. 69 Gamma distributions for a number of parameter values.

Gap-straggler test:

A procedure for partitioning of treatment means in a one-way design under the usual normal theory assumptions.

Gap time:

The time between two successive events in longitudinal studies in which each individual subject can potentially experience a series of events. An example is the time from the development of AIDS to death.

Garbage in garbage out:

A term that draws attention to the fact that sensible output only follows from sensible input. Specifically, if the data are of dubious quality to begin with, then so too will be the results.

Gardner, Martin (1940-1993):

Gardner read mathematics at Durham University followed by a diploma in statistics at Cambridge. In 1971 he became Senior Lecturer in Medical Statistics in the Medical School of Southampton University. Gardner was one of the founders of the Medical Research Council’s Environmental Epidemiology Unit. Worked on the geographical distribution of disease, and, in particular, on investigating possible links between radiation and the risk of childhood leukaemia. Gardner died on 22 January 1993 in Southampton.

GAUSS:

A high level programming language with extensive facilities for the manipulation of matrices. [Aptech Systems, P.O. Box 250, Black Diamond, WA 98010, USA. Timberlake Consulting, Unit B3, Broomsley Business Park, Worsley Bridge Road, London SE26 5BN, UK.]

Gauss, Karl Friedrich (1777-1855):

Born in Brunswick, Germany, Gauss was educated at the Universities of Göttingen and Helmstedt where he received a doctorate in 1799. He was a prodigy in mental calculation who made numerous contributions in mathematics and statistics. He wrote the first modern book on number theory and pioneered the application of mathematics to such areas as gravitation, magnetism and electricity; the unit of magnetic induction was named after him. In statistics Gauss’ greatest contribution was the development of least squares estimation under the label ‘the combination of observations’. He also applied the technique to the analysis of observational data, much of which he himself collected. The normal curve is also often attributed to Gauss and sometimes referred to as the Gaussian curve, but there is some doubt as to whether this is appropriate since there is considerable evidence that it is more properly due to de Moivre. Gauss died on 23 February 1855 in Göttingen, Germany.

Gaussian distribution:

Synonym for normal distribution.

Gaussian quadrature:

A procedure for performing numerical integration (or quadrature) using a series expansion of the form

\[ \int f(x)\,dx \approx \sum_{m=1}^{M} w_m f(x_m) \]

where $x_m$ are the Gaussian quadrature points and $w_m$ the associated weights, both of which are available from tables.
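
As a concrete instance, the sketch below applies a three-point Gauss-Legendre rule, one member of the Gaussian quadrature family, in which an integral is approximated by a weighted sum of function values at tabulated points. The points $\pm\sqrt{3/5}, 0$ and weights $5/9, 8/9, 5/9$ are the standard tabulated values for the interval $[-1, 1]$; the function name and the rescaling to a general interval $[a, b]$ are illustrative choices.

```python
import math

# Three-point Gauss-Legendre rule on [-1, 1]: tabulated points and weights.
NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)

def gauss_legendre_3(f, a, b):
    """Approximate the integral of f over [a, b] by a weighted sum of
    three function values, after mapping [-1, 1] onto [a, b]."""
    half = 0.5 * (b - a)
    mid = 0.5 * (a + b)
    return half * sum(w * f(mid + half * x) for x, w in zip(NODES, WEIGHTS))

# The rule is exact for polynomials up to degree five.
print(gauss_legendre_3(lambda x: x ** 4, -1.0, 1.0))
# For a smooth non-polynomial integrand it is a close approximation.
print(gauss_legendre_3(math.exp, 0.0, 1.0), math.e - 1.0)
```

Only three function evaluations are needed, which is the attraction of the method when the integrand is expensive to compute.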

Gauss-Markov theorem:

A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

Geary’s ratio:

A test of normality, in which the test statistic is

\[ G = \frac{\text{mean deviation}}{\text{standard deviation}} \]

In samples from a normal distribution, $G$ tends to $\sqrt{2/\pi}$ as $n$ tends to infinity. Aims to detect departures from a mesokurtic curve in the parent population. [Biometrika, 1947, 34, 209-42.]
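
The ratio is simple to compute from a sample. The sketch below takes G as the mean absolute deviation about the sample mean divided by the standard deviation, and assumes the standard deviation uses divisor n (an assumption; other conventions use n - 1). The function name is hypothetical.

```python
import math

def geary_ratio(values):
    """Mean absolute deviation divided by the standard deviation
    (both about the sample mean, both with divisor n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    mean_deviation = sum(abs(x - mean) for x in values) / n
    std_deviation = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if std_deviation == 0.0:
        raise ValueError("all observations are equal")
    return mean_deviation / std_deviation

sample = [1, 2, 3, 4, 5]
print(geary_ratio(sample))        # sample value of G
print(math.sqrt(2.0 / math.pi))   # limiting value under normality
```

Values well above or below $\sqrt{2/\pi} \approx 0.798$ suggest a parent distribution that is flatter or more peaked than the normal.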

GEE:

Abbreviation for generalized estimating equations.

Gehan’s generalized Wilcoxon test:

A distribution free method for comparing the survival times of two groups of individuals. See also Cox-Mantel test and log-rank test.

Geisser, Seymour (1929-2004):

Born in New York City, Geisser graduated from the City College of New York in 1950. From New York he moved to the University of North Carolina to undertake his doctoral studies under the direction of Harold Hotelling. From 1955 to 1965 Geisser worked at the US National Institutes of Health as a statistician, and from 1960 to 1965 was also a Professorial Lecturer at George Washington University. He made important contributions to multivariate analysis and prediction. Geisser died on 11 March 2004.

Gene:

A DNA sequence that performs a defined function, usually by coding for an amino acid sequence that forms a protein molecule.

Gene-environment interaction:

An effect that arises when the joint effect of a genetic and an environmental factor is different from the sum of their individual effects.

Gene frequency estimation:

The estimation of the frequency of an allele in a population from the genotypes of a sample of individuals.

Gene mapping:

The placing of genes onto their positions on chromosomes. It includes both the construction of marker maps and the localization of genes that confer susceptibility to disease.

General Household Survey:

A survey carried out in Great Britain on a continuous basis since 1971. Approximately 100000 households are included in the sample each year. The main aim of the survey is to collect data on a range of topics including household and family information, vehicle ownership, employment and education. The information is used by government departments and other organizations for planning, policy and monitoring purposes.

General location model:

A model for data containing both continuous and categorical variables. The categorical data are summarized by a contingency table and their marginal distribution by a multinomial distribution. The continuous variables are assumed to have a multivariate normal distribution in which the means of the variables are allowed to vary from cell to cell of the contingency table, but with the variance-covariance matrix of the variables being common to all cells. When there is a single categorical variable with two categories the model becomes that assumed by Fisher’s linear discriminant analysis.

Generalizability theory:

A theory of measurement that recognizes that in any measurement situation there are multiple (in fact infinite) sources of variation (called facets in the theory), and that an important goal of measurement is to attempt to identify and measure variance components which are contributing error to an estimate. Strategies can then be implemented to reduce the influence of these sources on the measurement.

Generalized additive mixed models (GAMM):

A class of models that uses additive nonparametric functions, for example, splines, to model covariate effects while accounting for overdispersion and correlation by adding random effects to the additive predictor.

Generalized additive models:

Models which use smoothing techniques such as locally weighted regression to identify and represent possible non-linear relationships between the explanatory and response variables as an alternative to considering polynomial terms or searching for the appropriate transformations of both response and explanatory variables. With these models, the link function of the expected value of the response variable is modelled as the sum of a number of smooth functions of the explanatory variables rather than in terms of the explanatory variables themselves. See also generalized linear models and smoothing.

Generalized estimating equations (GEE):

Technically the multivariate analogue of quasi-likelihood with the same feature that it leads to consistent inferences about mean responses without requiring specific assumptions to be made about second and higher order moments. Most often used for likelihood-based inference on longitudinal data where the response variable cannot be assumed to be normally distributed. Simple models are used for within-subject correlation and a working correlation matrix is introduced into the model specification to accommodate these correlations. The procedure provides consistent estimates for the mean parameters even if the covariance structure is incorrectly specified. The method assumes that missing data are missing completely at random, otherwise the resulting parameter estimates are biased. An amended approach, weighted generalized estimating equations, is available which produces unbiased parameter estimates under the less stringent assumption that missing data are missing at random. See also sandwich estimator. [Analysis of Longitudinal Data, 2nd edition, 2002, P.J. Diggle, K.-Y. Liang and S. Zeger, Oxford Science Publications, Oxford.]

Generalized gamma distribution:

Synonym for Creedy and Martin generalized gamma distribution.

Generalized linear mixed models (GLMM):

Generalized linear models extended to include random effects in the linear predictor.

Generalized linear models:

A class of models that arise from a natural generalization of ordinary linear models. Here some function (the link function) of the expected value of the response variable is modelled as a linear combination of the explanatory variables, x1, x2,…, xq, i.e.

\[ f(E(y)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q \]

where $f$ is the link function. The other components of such models are a specification of the form of the variance of the response variable and of its probability distribution (some member of the exponential family). Particular types of model arise from this general formulation by specifying the appropriate link function, variance and distribution. For example, multiple regression corresponds to an identity link function, constant variance and a normal distribution. Logistic regression arises from a logit link function and a binomial distribution; here the variance of the response is related to its mean as variance = mean(1 − mean/n), where n is the number of observations. A dispersion parameter (often also known as a scale factor) can also be introduced to allow for a phenomenon such as overdispersion. For example, if the variance is greater than would be expected from a binomial distribution then it could be specified as $\phi\,$mean(1 − mean/n). In most applications of such models the scaling factor, $\phi$, will be one. Estimates of the parameters in such models are generally found by maximum likelihood estimation. See also GLIM, generalized additive models and generalized estimating equations.
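
The three components (linear predictor, link function and variance function) can be illustrated for the logistic case. The sketch below is not a fitting routine: the coefficients are hypothetical values chosen for illustration, and the function names are assumptions. It shows how a logit link maps a linear combination of explanatory variables to an expected count out of n, and how the binomial variance, optionally inflated by a dispersion parameter, follows from that mean.

```python
import math

def logit(probability):
    """Logit link: log odds of a probability."""
    return math.log(probability / (1.0 - probability))

def inverse_logit(eta):
    """Inverse of the logit link, mapping a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def expected_count(x, coefficients, n):
    """Expected number of successes out of n: the linear predictor is
    coefficients[0] + sum(coefficients[j] * x[j-1]), passed through the inverse link."""
    eta = coefficients[0] + sum(b * xi for b, xi in zip(coefficients[1:], x))
    return n * inverse_logit(eta)

def binomial_variance(mean, n, dispersion=1.0):
    """Variance function dispersion * mean * (1 - mean / n)."""
    return dispersion * mean * (1.0 - mean / n)

# Hypothetical coefficients (intercept and one slope), not estimates from data.
coefficients = [-1.0, 0.5]
mean = expected_count([2.0], coefficients, n=10)
print(mean, binomial_variance(mean, 10), binomial_variance(mean, 10, dispersion=1.8))
```

A dispersion value above one, as in the last call, represents overdispersion relative to the binomial distribution.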

Generalized mixed models (GMM):

Generalized linear models that incorporate a random effect vector in the linear predictor. An example is mixed effects logistic regression.

Generalized multinomial distribution:

The joint distribution of n discrete variables $x_1, x_2, \ldots, x_n$ each having the same marginal distribution $\Pr(x = j) = p_j$ ($j = 0, 1, 2, \ldots, k$) and such that the correlation between two different xs has a specified value $\rho$.

Generalized odds ratio:

Synonym for Agresti’s $\alpha$.

Generalized Poisson distribution:

A probability distribution defined as follows:

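\[ \Pr(X = 0) = e^{-\lambda}(1 + \lambda\theta), \quad \Pr(X = 1) = \lambda e^{-\lambda}(1 - \theta), \quad \Pr(X = x) = \frac{e^{-\lambda}\lambda^x}{x!} \quad (x \ge 2) \]

The distribution corresponds to a situation in which values of a random variable with a Poisson distribution are recorded correctly, except when the true value is unity, when there is a non-zero probability, $\theta$, that it will be recorded as zero.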

Generalized principal components analysis:

A non-linear version of principal components analysis in which the aim is to determine the non-linear coordinate system that is most in agreement with the data configuration. For example, for bivariate data, y1, y2, if a quadratic coordinate system is sought, then as a first step, a variable z is defined as follows:

\[ z = a y_1 + b y_2 + c y_1 y_2 + d y_1^2 + e y_2^2 \]

with the coefficients being found so that the variance of z is a maximum among all such quadratic functions of y1 and y2.

Generalized P values:

A procedure introduced to deal with those situations where it is difficult or impossible to derive a significance test because of the presence of nuisance parameters.

Generalized variance:

An analogue of the variance for use with multivariate data. Given by the determinant of the variance-covariance matrix of the observations.

Genetic algorithms:

Optimization procedures motivated by biological analogies. The primary idea is to try to mimic the ‘survival of the fittest’ rule of genetic mutation in the development of optimization algorithms. The process begins with a population of potential solutions to a problem and a way of measuring the fitness or value of each solution. A new generation of solutions is then produced by allowing existing solutions to ‘mutate’ (change a little) or cross over (two solutions combine to produce a new solution with aspects of both). The aim is to produce new generations of solutions that have higher values. [IMA Journal of Mathematics Applied in Business and Industry, 1997, 8, 323-46.]
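
The steps described above (a population of candidate solutions, a fitness measure, mutation and cross over) can be sketched in a few lines. The example below is a toy illustration under stated assumptions rather than any published algorithm: solutions are bit strings, fitness is the number of ones, the fitter half of each generation supplies the parents, and the best solution is carried over unchanged (elitism). All names and parameter values are hypothetical.

```python
import random

def one_max(bits):
    """Toy fitness measure: the number of 1s in the bit string."""
    return sum(bits)

def genetic_search(fitness, length=20, pop_size=30, generations=40,
                   mutation_rate=0.02, seed=0):
    """Return the best solution found and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        parents = population[: pop_size // 2]   # fitter half supplies parents
        children = [population[0][:]]           # elitism: keep the current best
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)      # one-point cross over
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    best_history.append(fitness(best))
    return best, best_history

best, history = genetic_search(one_max)
print(history[0], history[-1])  # best fitness at the start and at the end
```

Because the best solution is always retained, the recorded best fitness can never decrease from one generation to the next.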

Genetic epidemiology:

A science that deals with etiology, distribution, and control of disease in groups of relatives and with inherited causes of disease in populations.

Genetic heritability:

The proportion of the trait variance that is due to genetic variation in a population.

Genomics:

The study of the structure, function and evolution of the deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequences that comprise the genome of living organisms. Genomics is closely related (and almost synonymous) to genetics; the former is more directly concerned with DNA structure, function and evolution whereas the latter emphasizes the consequences of genetic transmission for the distribution of heritable traits in families and in populations.

Genotype:

The set of alleles present at one or more loci in an individual.

GENSTAT:

A general purpose piece of statistical software for the management and analysis of data. The package incorporates a wide variety of data handling procedures and a wide range of statistical techniques including regression analysis, cluster analysis, and principal components analysis. Its use as a sophisticated statistical programming language enables non-standard methods of analysis to be implemented relatively easily.

Geographical analysis machine:

A procedure designed to detect clusters of rare diseases in a particular region. Circles of fixed radii are created at each point of a square grid covering the study region. Neighbouring circles are allowed to overlap to some fixed extent and the number of cases of the disease within each circle counted. Significance tests are then performed based on the total number of cases and on the number of individuals at risk, both in total and in the circle in question, during a particular census year. See also scan statistic.

Geographical correlations:

The correlations between variables measured as averages over geographical units. See also ecological fallacy.

Geographical information system (GIS):

Software and hardware configurations through which digital georeferences are processed and displayed. Used to identify the geographic or spatial location of any known disease outbreak and, over time, follow its movements as well as changes in incidence and prevalence.

Geometric distribution:

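The probability distribution of the number of trials (N) up to and including the first success in a sequence of Bernoulli trials. Specifically the distribution is given by

\[ \Pr(N = n) = p(1 - p)^{n-1}, \quad n = 1, 2, \ldots \]

where p is the probability of a success on each trial. The mean, variance, skewness and kurtosis of the distribution are as follows:

\[ \text{mean} = \frac{1}{p}, \quad \text{variance} = \frac{1 - p}{p^2}, \quad \text{skewness} = \frac{2 - p}{(1 - p)^{1/2}}, \quad \text{kurtosis} = 9 + \frac{p^2}{1 - p} \]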
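
The moments can be checked numerically against the probability function. The sketch below assumes the parametrization in which N counts trials up to and including the first success, so that Pr(N = n) = p(1 − p)^(n−1) for n = 1, 2, …, with mean 1/p and variance (1 − p)/p²; the function names are hypothetical.

```python
def geometric_pmf(n, p):
    """Pr(N = n) when N counts trials up to and including the first success."""
    if n < 1:
        return 0.0
    return p * (1.0 - p) ** (n - 1)

def geometric_moments(p):
    """Closed-form mean, variance, skewness and kurtosis."""
    q = 1.0 - p
    return {
        "mean": 1.0 / p,
        "variance": q / p ** 2,
        "skewness": (2.0 - p) / q ** 0.5,
        "kurtosis": 9.0 + p ** 2 / q,
    }

p = 0.3
support = range(1, 400)  # the tail beyond n = 400 is negligible for p = 0.3
mean = sum(n * geometric_pmf(n, p) for n in support)
variance = sum((n - mean) ** 2 * geometric_pmf(n, p) for n in support)
print(mean, variance)
print(geometric_moments(p))
```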

Geometric mean:

A measure of location, $g$, calculated from a set of observations $x_1, x_2, \ldots, x_n$ as

\[ g = \left( \prod_{j=1}^{n} x_j \right)^{1/n} \]

The geometric mean is always less than or equal to the arithmetic mean.
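
In practice the product is usually computed on the log scale to avoid overflow. The sketch below assumes strictly positive observations and uses the identity that the geometric mean is the exponential of the mean of the logarithms; the function name is hypothetical.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive observations,
    computed as exp(mean(log(x))) for numerical stability."""
    if not values:
        raise ValueError("need at least one observation")
    if any(x <= 0 for x in values):
        raise ValueError("observations must be strictly positive")
    return math.exp(sum(math.log(x) for x in values) / len(values))

data = [1.0, 4.0, 16.0]
arithmetic_mean = sum(data) / len(data)
# The geometric mean (4) does not exceed the arithmetic mean (7).
print(geometric_mean(data), arithmetic_mean)
```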

Geostatistics:

A body of methods useful for understanding and modelling the spatial variability in a process of interest. Central to these methods is the idea that measurements taken at locations close together tend to be more alike than values observed at locations farther apart. See also kriging and variogram.

Gini concentration:

A measure of spread, $V_G(X)$, for a variable $X$ taking $k$ categorical values and defined as

\[ V_G(X) = \sum_{i=1}^{k} \pi_i (1 - \pi_i) \]

where $\pi_i = \Pr(X = i)$, $i = 1, \ldots, k$. The statistic takes its minimum value of zero when $X$ is least spread, i.e., when $\Pr(X = j) = 1$ for some category $j$, and its maximum value $(k - 1)/k$ when $X$ is most spread, i.e., when $\Pr(X = i) = 1/k$ for all $i$.
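
The two extreme cases are easy to verify. The sketch below assumes the form $V_G(X) = \sum_i \pi_i(1 - \pi_i)$, equivalently $1 - \sum_i \pi_i^2$, where $\pi_i$ is the probability of category $i$; the function name is hypothetical.

```python
def gini_concentration(probabilities):
    """Sum of pi * (1 - pi) over categories, i.e. 1 - sum(pi ** 2)."""
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    return sum(p * (1.0 - p) for p in probabilities)

print(gini_concentration([1.0, 0.0, 0.0, 0.0]))      # least spread: 0
print(gini_concentration([0.25, 0.25, 0.25, 0.25]))  # most spread: (k - 1) / k
print(gini_concentration([0.7, 0.2, 0.1]))           # an intermediate case
```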

Gini, Corrado (1884-1965):

Born in a small town in the region of Veneto, Italy, Gini studied jurisprudence at the University of Bologna before becoming interested in statistics. His thesis for his degree became, in 1908, the first of his eighty books, Il sesso dal punto di vista statistico. At the age of 26 Gini was already a university professor and throughout his life held chairs in the Universities of Cagliari, Padua and Rome. Founded two journals, Genus and Metron, and wrote over a thousand scientific papers in the areas of probability, demography and biometry. Elected an Honorary Fellow of the Royal Statistical Society in 1920, Gini was also a member of the Accademia dei Lincei. He died on 13 March 1965.

GIS:

Abbreviation for geographical information system.

Gittins indices:

Synonym for dynamic allocation indices.

Glejser test:

A test for heteroscedasticity in the error terms of a regression analysis that involves regressing the absolute values of regression residuals for the sample on the values of the independent variable thought to covary with the variance of the error terms. See also Goldfeld-Quandt test.

GLIM:

A software package particularly suited for fitting generalized linear models (the acronym stands for Generalized Linear Interactive Modelling), including log-linear models, logistic models, and models based on the complementary log-log transformation. A large number of GLIM macros are now available that can be used for a variety of non-standard statistical analyses.

GLLAMM:

A program that estimates generalized linear latent and mixed models by maximum likelihood. The models that can be fitted include multi-level models, structural equation models and latent class models. The response variables can be of mixed types including continuous, counts, survival times, dichotomous, and categorical.

GLM:

Abbreviation for generalized linear model.

GLMM:

Abbreviation for generalized linear mixed models.

Glyphs:

A graphical representation of multivariate data in which each observation is represented by a circle, with rays of different lengths indicating the values of the observed variables. Some examples are shown in Fig. 70. See also Andrews’ plots and Chernoff’s faces.

GMM:

Abbreviation for generalized mixed models.

Gnedenko, Boris (1912-1995):

Born on 1 January 1912 in Simbirsk, a town on the River Volga, Gnedenko studied at the University of Saratov. In 1934 he joined the Institute of Mathematics at Moscow State University and studied under Khinchin and later Kolmogorov. In 1938 he became associate professor in the Department of Mechanics and Mathematics. Gnedenko’s main work was on various aspects of theoretical statistics, particularly the limiting distribution of maxima of independent and identically distributed random variables. The first edition of what is widely regarded as his most important published work, Limit Distributions for Sums of Independent Random Variables, appeared in 1949. In 1951 Gnedenko published The Theory of Probability, which remained a popular account of the topic for students for over a decade. Later in his career Gnedenko took an interest in reliability theory and quality control procedures and played a role in the modernization of Soviet industry including the space programme. He also devoted much time to popularizing mathematics. Gnedenko died in Moscow on 27 December 1995.

Fig. 70 Examples of glyphs.

Goelles, Josef (1929-2000):

Goelles studied mathematics, physics and psychology at the University of Graz, and received a Ph.D in mathematics in 1964. He began his scientific career at the Technical University of Graz and in 1985 founded the Institute of Applied Statistics and Systems Analysis at Joanneum Research and chaired it until shortly before his death. He collaborated in a large number of joint projects with clinicians and biologists and tried hard to convince other colleagues to incorporate statistical reasoning into their own discipline.

Golden-Thompson inequality:

An inequality relating to the matrix exponential transformation and given by trace[exp(A) exp(B)] ≥ trace[exp(A + B)] with equality if and only if A and B commute.

Goldfeld-Quandt test:

A test for heteroscedasticity in the error terms of a regression analysis that involves examining the monotonic relationship between an explanatory variable and the variance of the error term.

Gold standard trials:

A term usually reserved for those clinical trials in which there is random allocation to treatments, a control group and double-blinding.

Gompertz curve:

A curve used to describe the size of a population (y) as a function of time (t), where relative growth rate declines at a constant rate. Explicitly given by

\[ y = a e^{-b e^{-ct}}, \quad a, b, c > 0 \]

Goodman and Kruskal measures of association:

Measures of associations that are useful in the situation where two categorical variables cannot be assumed to be derived from perhaps unobservable continuous variables and where there is no natural ordering of interest. The rationale behind the measures is the question, ‘how much does knowledge of the classification of one of the variables improve the ability to predict the classification on the other variable’.

Goodness-of-fit statistics:

Measures of the agreement between a set of sample observations and the corresponding values predicted from some model of interest. Many such measures have been suggested; see chi-squared statistic, deviance, likelihood ratio, G2 and X2.

Good’s method:

A procedure for combining independent tests of hypotheses.

Gosset, William Sealy (1876-1937):

Born in Canterbury, England, Gosset obtained a degree in chemistry at Oxford before joining Guinness breweries in Dublin in 1899. He continued to work for Guinness for the next three decades. Practical problems in his work led him to seek exact error probabilities of statistics from small samples, a previously unresearched area. Spent the academic year 1906-1907 studying under Karl Pearson at University College, London. Writing under the pseudonym of ‘Student’ (as required by Guinness) his paper essentially introducing the Student’s t-distribution was published in Biometrika in 1908. Gosset died in London on 16 October 1937.

Gower’s similarity coefficient:

A similarity coefficient particularly suitable when the measurements contain both continuous variables and categorical variables.

Grade of membership model:

A general distribution free method for the clustering of multivariate data in which only categorical variables are involved. The model assumes that individuals can exhibit characteristics of more than one cluster, and that the state of an individual can be represented by a set of numerical quantities, each one corresponding to one of the clusters, that measure the ‘strength’ or grade of membership of the individual for the cluster. Estimation of these quantities and the other parameters in the model is undertaken by maximum likelihood estimation. See also latent class analysis and fuzzy set theory.

Graduation:

A term employed most often in the application of actuarial statistics to denote procedures by which a set of observed probabilities is adjusted to provide a suitable basis for inferences and further practical calculations to be made.

Graeco-Latin square:

An extension of a Latin square that allows for three extraneous sources of variation in an experiment. A three-by-three example of such a square is

Aα Bβ Cγ
Bγ Cα Aβ
Cβ Aγ Bα
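
The defining property of such a square is that the Latin letters and the Greek letters each form a Latin square, and that every Latin-Greek pair occurs exactly once. The sketch below checks this for the three-by-three example, written as two separate squares (Latin letters A, B, C and Greek letters α, β, γ); the function names are hypothetical.

```python
def is_latin(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok

def is_graeco_latin(latin, greek):
    """True if both squares are Latin and, superimposed, give n * n distinct pairs."""
    n = len(latin)
    if len(greek) != n or not (is_latin(latin) and is_latin(greek)):
        return False
    pairs = {(latin[i][j], greek[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n

latin = ["ABC", "BCA", "CAB"]   # Latin letters of the example, row by row
greek = ["αβγ", "γαβ", "βγα"]   # Greek letters of the example, row by row
print(is_graeco_latin(latin, greek))
```

Superimposing a Latin square on a copy of itself fails the check, because only n distinct pairs then occur.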

Gram-Charlier Type A series:

An expansion of a probability distribution, f (x) in terms of Chebyshev-Hermite polynomials, Hr(x). Given explicitly by

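\[ f(x) = \alpha(x)\left\{ 1 + \tfrac{1}{2}(\mu_2 - 1)H_2(x) + \tfrac{1}{6}\mu_3 H_3(x) + \tfrac{1}{24}(\mu_4 - 6\mu_2 + 3)H_4(x) + \cdots \right\} \]

where $\mu_r$ is the $r$th moment of the distribution (taken to have mean zero) and

\[ \alpha(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \]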

Gramian matrix:

A symmetric matrix, A, whose elements are real numbers and for which there exists a matrix B also consisting of real numbers, such that BB’=A or B’B=A. An example is a correlation matrix.

Grand mean:

Mean of all the values in a grouped data set irrespective of groups.

Graphical methods:

A generic term for those techniques in which the results are given in the form of a graph, diagram or some other form of visual display. Examples are Andrews’ plots, Chernoff’s faces and coplots. [MV1 Chapter 3.]

Graph theory:

A branch of mathematics concerned with the properties of sets of points (vertices or nodes) some of which are connected by lines known as edges. A directed graph is one in which direction is associated with the edges and an undirected graph is one in which no direction is involved in the connections between points. A graph may be represented as an adjacency matrix. See also conditional independence graph.

Graunt, John (1620-1674):

The son of a city tradesman, Graunt is generally regarded as having laid the foundations of demography as a science with the publication of his Natural and Political Observations Made Upon the Bills of Mortality published in 1662. His most important contribution was the introduction of a rudimentary life table. Graunt died on 18 April, 1674 in London.

Greatest characteristic root test:

Synonym for Roy’s largest root criterion.

Greenberg, Bernard George (1919-1985):

Born in New York City, Greenberg obtained a degree in mathematics from the City College of New York in 1939. Ten years later after a period of military service he obtained a Ph.D. from the North Carolina State College where he studied under Hotelling. Greenberg founded the Department of Biostatistics at the University of North Carolina and was a pioneer in the field of public health and medical research. He died on 24 November 1985 in Chapel Hill.

Greenhouse-Geisser correction:

A method of adjusting the degrees of freedom of the within-subject F-tests in the analysis of variance of longitudinal data so as to allow for possible departures of the variance-covariance matrix of the measurements from the assumption of sphericity. If this condition holds for the data then the correction factor is one and the simple F-tests are valid. Departures from sphericity result in an estimated correction factor less than one, thus reducing the degrees of freedom of the relevant F-tests.

Greenhouse, Samuel (1918-2000):

Greenhouse began his career at the National Institutes of Health in the National Cancer Institute. Later he became Chief of the Theoretical Statistics and Mathematics Section in the National Institute of Mental Health. After 1974 Greenhouse undertook a full time academic career at George Washington University. He was influential in the early development of the theory and practice of clinical trials and his work also included the evaluation of diagnostic tests and the analysis of repeated measure designs. Greenhouse died on 28 September 2000 in Rockville, MD, USA.

Greenwood, Major (1880-1949):

Born in the East End of London, Greenwood studied medicine at University College, London and the London Hospital, but shortly after qualifying forsook clinical medicine and following a period of study with Karl Pearson was, in 1910, appointed statistician to the Lister Institute. Here he carried out statistical investigations into such diverse topics as the fatality of fractures and pneumonia in hospital practice, the epidemiology of plague and factors influencing rates of infant mortality. In 1919 he became Head of Medical Statistics in the newly created Ministry of Health where he remained until 1928, when he was appointed to the chair of Vital Statistics and Epidemiology at the London School of Hygiene and Tropical Medicine. Here he remained until his retirement in 1945. Greenwood was President of the Royal Statistical Society from 1934 to 1936 and was awarded their Guy medal in gold in 1945. He died on 5 October 1949.

Greenwood’s formula:

A formula giving the variance of the product limit estimator of a survival function, namely

$$\widehat{\mathrm{var}}[\hat{S}(t)] = [\hat{S}(t)]^2 \sum_{j:\, t_{(j)} \le t} \frac{d_j}{r_j (r_j - d_j)}$$

where $\hat{S}(t)$ is the estimated survival function at time $t$, $t_{(1)} < t_{(2)} < \cdots < t_{(n)}$ are the ordered, observed survival times, $r_j$ is the number of individuals at risk at time $t_{(j)}$ and $d_j$ is the number who experience the event of interest at time $t_{(j)}$. (Individuals censored at $t_{(j)}$ are included in $r_j$.)
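As an illustration, the following minimal sketch computes the product limit estimate together with its Greenwood variance, assuming the standard form of the formula: the squared survival estimate multiplied by the running sum of d_j / (r_j (r_j - d_j)) over event times up to t. The function name and the small data set are hypothetical.

```python
def kaplan_meier_greenwood(times, events):
    """Return a list of (time, survival estimate, Greenwood variance).

    times  : observed times (event or censoring)
    events : 1 if the event occurred at that time, 0 if censored
    One tuple is returned per distinct time at which an event occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    running_sum = 0.0  # sum of d_j / (r_j * (r_j - d_j)) so far
    results = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0  # events at time t
        leaving = 0  # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        r = at_risk  # individuals censored at t are still counted at risk
        if d > 0:
            surv *= 1.0 - d / r
            if r > d:
                running_sum += d / (r * (r - d))
                var = surv ** 2 * running_sum
            else:
                # Everyone at risk fails: the formula is undefined here.
                var = float("nan")
            results.append((t, surv, var))
        at_risk -= leaving
    return results


# Hypothetical survival times; events = 0 marks a censored observation.
times = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 0]
for t, s, v in kaplan_meier_greenwood(times, events):
    print(t, round(s, 4), round(v, 4))
```

The square root of each variance gives the standard error commonly used to build pointwise confidence intervals for the survival curve.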

Gripenberg estimator:

A distribution-free estimator for the partial correlation between two variables, X and Y, conditional on a third variable, Z.

Group average clustering:

Synonym for average linkage clustering.

Group divisible design:

An arrangement of v = mn treatments in b blocks such that:

• each block contains k distinct treatments, k < v;

• each treatment is replicated r times;

• the treatments can be divided into m groups of n treatments each, any two treatments occurring together in λ1 blocks if they belong to the same group and in λ2 blocks if they belong to different groups. [Biometrika, 1976, 63, 555-8.]

Grouped binary data:

Observations on a binary variable tabulated in terms of the proportion of one of the two possible outcomes amongst patients or subjects who have, for example, the same diagnosis or are of the same sex, etc. [SORT, 2004, 28, 125-60.]

Grouped data:

Data recorded as frequencies of observations in particular intervals.

Growth charts:

Synonym for centile reference charts.

Growth curve analysis:

A general term for methods dealing with the development of individuals over time. A classic example involves recordings made on a group of children, say, of height or weight at particular ages. A plot of height or weight against age for each child gives the individual's growth curve. Traditionally low-degree polynomials are fitted to such curves, and the resulting parameter estimates used for inferences such as comparisons between boys and girls.
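The traditional approach can be sketched as follows, using the simplest low-degree polynomial (a straight line) fitted to each child by least squares, with the fitted slopes then averaged within each group. The heights, ages and child labels are hypothetical, and a real analysis would usually use higher-degree polynomials and a formal test rather than a comparison of means.

```python
def fit_line(ages, heights):
    """Least-squares fit of height = intercept + slope * age for one child."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_height = sum(heights) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (h - mean_height) for a, h in zip(ages, heights))
    slope = sxy / sxx
    intercept = mean_height - slope * mean_age
    return intercept, slope


# Hypothetical heights (cm) recorded at ages 8 to 11 years.
ages = [8, 9, 10, 11]
children = {
    "boy_1": [126.0, 131.0, 136.0, 141.0],
    "boy_2": [128.0, 134.0, 140.0, 146.0],
    "girl_1": [125.0, 131.0, 137.0, 143.0],
    "girl_2": [127.0, 134.0, 141.0, 148.0],
}

# One growth curve (here a line) per child.
slopes = {name: fit_line(ages, h)[1] for name, h in children.items()}

# Compare the average growth rate (cm per year) between the two groups.
boy_slopes = [s for name, s in slopes.items() if name.startswith("boy")]
girl_slopes = [s for name, s in slopes.items() if name.startswith("girl")]
mean_boy_slope = sum(boy_slopes) / len(boy_slopes)
mean_girl_slope = sum(girl_slopes) / len(girl_slopes)
print(mean_boy_slope, mean_girl_slope)
```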

Growth rate:

A measure of population growth calculated as

tmpCD-347_thumb

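Grubbs' estimators:

Estimators of the measuring precisions when two instruments or techniques are used to measure the same quantity. For example, if the two measurements are denoted by $x_i$ and $y_i$ for $i = 1, \ldots, n$, it is assumed that

$$x_i = \tau_i + \epsilon_i, \qquad y_i = \tau_i + \delta_i$$

where $\tau_i$ is the correct unknown value of the ith quantity and $\epsilon_i$ and $\delta_i$ are measurement errors assumed to be independent. Grubbs' estimators are then

$$\hat{V}(\tau_i) = s_{xy}, \qquad \hat{V}(\epsilon_i) = s_x^2 - s_{xy}, \qquad \hat{V}(\delta_i) = s_y^2 - s_{xy}$$

where $s_x^2$ and $s_y^2$ are the sample variances of the two sets of measurements and $s_{xy}$ is their sample covariance.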
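A minimal sketch of the calculation is given below. It assumes the standard form of the estimators: the sample covariance of the two sets of measurements estimates the variance of the true values, and each instrument's error variance is estimated by its sample variance minus that covariance. The function name, dictionary keys and data are hypothetical.

```python
def grubbs_estimators(x, y):
    """Estimate the true-value variance and the two error variances.

    x, y : paired measurements of the same n quantities by two instruments.
    Uses sample variances and covariance with divisor n - 1.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return {
        "var_true": cov_xy,            # variance of the true values
        "var_error_x": var_x - cov_xy,  # error variance of instrument 1
        "var_error_y": var_y - cov_xy,  # error variance of instrument 2
    }


# Hypothetical paired measurements from two instruments.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 1.5, 3.5, 3.5]
estimates = grubbs_estimators(x, y)
print(estimates)
```

Because the error variances are obtained by subtraction, they can come out negative in small samples, which is usually taken as a sign that the data carry little information about the precision concerned.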

GT distribution:

A probability distribution, f(x), related to Student's t-distribution and given by

$$f(x) = \frac{p}{2 \sigma q^{1/p} B(1/p, q) \left[1 + \dfrac{|x|^p}{q \sigma^p}\right]^{q + 1/p}}$$

where B is the beta function.

In f(x), σ is a scale parameter while p and q control the shape of the distribution. Larger values of p and q are associated with lighter tails for the distribution. When p = 2 and σ = √2 α, this becomes Student's t-distribution with 2q degrees of freedom, scaled by α.
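The reduction to Student's t-distribution can be checked numerically. The sketch below assumes the usual generalized t density, f(x) = p / (2 s q^(1/p) B(1/p, q) [1 + |x|^p / (q s^p)]^(q + 1/p)) with scale s, and compares it with the standard t density for p = 2 and s equal to the square root of 2. The function names are hypothetical.

```python
import math


def beta_fn(a, b):
    """Beta function expressed through gamma functions."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def gt_pdf(x, scale, p, q):
    """Generalized t density (assumed form, see the text above)."""
    norm = p / (2.0 * scale * q ** (1.0 / p) * beta_fn(1.0 / p, q))
    return norm * (1.0 + abs(x) ** p / (q * scale ** p)) ** (-(q + 1.0 / p))


def student_t_pdf(x, nu):
    """Standard Student's t density with nu degrees of freedom."""
    const = math.gamma((nu + 1) / 2.0) / (
        math.sqrt(nu * math.pi) * math.gamma(nu / 2.0)
    )
    return const * (1.0 + x * x / nu) ** (-(nu + 1) / 2.0)


# With p = 2 and scale = sqrt(2), q = 2.5 should match t with 2q = 5 d.f.
q = 2.5
max_diff = max(
    abs(gt_pdf(x, math.sqrt(2.0), 2, q) - student_t_pdf(x, 2 * q))
    for x in [-3.0, -1.0, 0.0, 0.7, 2.0]
)
print(max_diff)
```

The printed difference should be at the level of floating-point rounding error.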

Gumbel distribution:

Synonym for extreme value distribution.

Gumbel, Emil Julius (1891-1966):

Born in Munich, Germany, Gumbel obtained a Ph.D. in economics and mathematical statistics from the University of Munich in 1914. Between 1933 and 1940 he worked in France, first at the Institut Henri Poincaré in Paris, later at the University of Lyon. Gumbel made important contributions to the theory of extreme values.

Gupta’s selection procedure:

A method which, for given samples of size n from each of k normal populations, selects a subset of the k populations which contains the population with the largest mean with some minimum probability.

Guttman scale:

A scale based on a set of binary variables which measure a one-dimensional latent variable. See also Cronbach’s alpha.

Guy, William Augustus (1810-1885):

Born in Chichester, Guy studied medicine at both Christ’s Hospital and Guy’s Hospital, London. In 1831 he was awarded the Fothergillian medal of the Medical Society of London for the best paper on asthma. In 1838 Guy was appointed to the Chair of Forensic Medicine at King’s College, London. Guy, like Farr, was strongly of the opinion that statistics was seriously needed for the study of medical problems, and his contribution to statistics rests primarily on the compilation of bodies of material relating to public health. Guy was a very active member of the Statistical Society of London and because of his work on behalf of the Society, the Royal Statistical Society voted in 1891 to establish in his honour the Guy medal. He was President of the Royal Statistical Society from 1873 to 1875. Guy died on 10 September 1885, in London.
