“Personality” is an ambiguous term derived from natural language, and not necessarily a scientific concept. Consensus among interested scientists as to its precise meaning has been fairly modest. One major cleavage is between views of personality as core dynamic processes inherent in all people and views that emphasize characteristics on which individuals differ. Because the second sort of view defines personality in such a way as to make it far more conducive to measurement, this article focuses on personality as certain potentially measurable characteristics of individuals.

Which measurable characteristics? There are many kinds of characteristics, and useful ways of categorizing characteristics have been developed (Norman 1967; Angleitner et al. 1990). Characteristics that fall into many of the categories do not fit common definitions of personality. Descriptors of physical characteristics (e.g., short, muscular) lack sufficient reference to psychological (behavioral, affective, cognitive) features. Descriptors invoking social roles (e.g., motherly, professional) and social effects (e.g., famous, neglected) involve social-contextualization and relativization too heavily to give inferences about an individual’s personality attributes. Descriptors of emotions (e.g., elated, afraid) and many motivational and intentional states (e.g., hungry, reluctant, inspired) are too prone to reference relatively transient characteristics. And some descriptors (e.g., awful, impressive) are so purely evaluative that they provide insufficient specificity with respect to psychological features.

Among descriptors that refer to presumably more internal and enduring psychological attributes, three categories stand out. Abilities or talents (e.g., skillful, creative, athletic) refer to maximum rather than typical levels of performance on tasks. Beliefs and attitudes (e.g., religious, racist, environmentalist) concern affectively tinged habits of mind pertaining to specific objects and concepts. Although personality models have frequently contained some ability- and attitude-related content, at their core are traits (e.g., daring, patient) that are more directly related to typical behavioral patterns. Because they are expressed behaviorally, enduring motivational patterns (e.g., need for achievement) might also be easily fit within definitions of personality that emphasize typical behavior patterns. ”Temperament” usually denotes the more clearly inborn and genetically derived aspects of personality, whereas ”character” is often used to denote acquired moral qualities. But these terms are otherwise synonymous with personality, which can be defined as consistencies in patterns of behavior—where behavior is defined broadly to include affect, cognition, and motivation—on which individuals differ. Overall, personality is a lay concept of sufficient importance and usefulness to have been taken up and refined by scientists.

In natural languages, personality descriptors are alternately represented as adjectives (e.g., adventurous), attribute nouns (e.g., adventurousness), or type nouns (e.g., adventurer). Because adjectives differentiate properties, personality adjectives are inherently central to personality description, although some languages lack an adjective class and carry on this adjective function in other ways (Dixon 1977). Psychologists move easily between adjectives and attribute-noun characterizations of the same trait (as with extraverted and extraversion): Either form suggests properties that exist in varying degrees. Type-noun characterizations, in contrast, imply a categorization—one either is an extravert or one is not—which in turn suggests the assumption of a bimodal frequency distribution of individuals on the trait. Such bimodal distributions appear to be rare. Although type-noun characterizations of personality traits have great popular appeal, their use by academic psychologists has become quite limited. There has been a search for categorical taxons underlying the traits related to certain mental disorders (e.g., Meehl 1995), but this task is not easy. Recent correlational studies indicate that symptoms of mental disorders, and of personality disorders in particular, are continuously distributed in the population, and have substantial overlap with measures of various personality traits (Costa and Widiger 1994).

In the last three decades, many psychologists have addressed the fundamental issue of whether personality traits are real, or whether they exist only in the eye of the beholder. Mischel’s early review (1968) suggested that personality measures were at best modest predictors of relevant actual behaviors. However, a consensus has emerged that when criterion behaviors are aggregated, across time or across situations, personality measures can become quite highly predictive (Kenrick and Funder 1988). This follows, of course, from definitions of personality offered above: Personality does not denote single behaviors but rather consistent patterns in multiple behaviors. Moreover, studies involving twins and adoptees have repeatedly indicated that personality-trait scores are partially (as much as 50 percent) heritable, implying biological underpinnings. Such behavior-genetic findings have stimulated the development of models that set forth biological explanations for personality variation, attempting to delineate phenotypic constructs so that they correspond directly to known biological mechanisms (e.g., Gray 1987; Rothbart et al. 1994; Zuckerman 1995). Nonetheless, studies with twins and adoptees also indicate that differences in experience, mainly of the type not shared by family members, have a profound effect on personality characteristics (Plomin 1990). There are also some indications of important effects of culture.

Although personality traits reflect behavioral consistencies across situations, a variety of research findings indicate that some situations facilitate expression of personality traits more than do other situations (Caspi and Moffitt 1993). Highly structured or ritualized social settings (e.g., a funeral home, a lecture hall) tend to attenuate the expression of personality differences, whereas relatively unstructured settings (e.g., a nightclub, a playground) bring out personality differences. This pattern has three important consequences for personality measurement. First, to the extent that a society’s social milieux are age-stratified, one might expect across-time continuity in apparent traits to be somewhat “heterotypic,” leading to different surface characteristics at different ages. Second, personality characteristics are best assessed by placing the individual in a relatively unstructured situation in which responses are relatively unconstrained by social norms; therefore, the stimuli on personality measures should not have correct responses. Third, to the extent that a cultural milieu is highly structured and ritualized, one might expect to see less emphasis on personality differences than would be found in relatively individualistic cultural milieux (Miller 1984).


Measurement can be defined as a set of rules for assigning numbers to entities (such as individuals) in such a way that attributes of the entities are faithfully represented. Measurement procedures typically set up, or specify, regularized administration conditions. Thus, measurement generally involves consistency of procedure; but personality measurement has an even deeper relation to consistency.

As noted above, a consensual definition of personality would emphasize characteristics that are internally rather than externally caused, psychological rather than overtly physical, and stable and enduring rather than transient. Because of the emphasis on stable and enduring qualities, reliability (the relative absence of measurement error) is of first importance in personality measurement. A reliable measure by definition registers characteristics that are consistent across time (as indicated by test-retest reliability coefficients) or across situations. Internal consistency, or inter-item reliability, is an analogue of cross-situational stability. Each item represents a unique situation in either of two ways: (1) its referent content may refer to a distinct situation (e.g., “I talk a lot at parties”), or (2) simply being presented with this item as distinguished from another item (e.g., “I talk very little at parties”) is a unique immediate situation for the respondent. Reliability is most often measured either by a stability coefficient, indicating the correlation between scores at one time and those at another, or by an internal consistency coefficient (e.g., coefficient alpha) that represents the average correlation across all possible split halves of the test items, although a variety of alternative reliability models are available. Until cross-time stability is established, a trait’s stability is only presumed and it might be better termed an attribute, since the latter term has fewer implications as to stability.
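The two reliability coefficients just named can be illustrated with a short computation. The following Python sketch is only an illustration under stated assumptions, not a procedure taken from any particular inventory: the function names pearson_r and cronbach_alpha and the small response matrix are hypothetical, and coefficient alpha is computed with its usual variance-based formula.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Correlation between two score lists, e.g. time-1 and time-2 scale scores."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


def cronbach_alpha(item_scores):
    """Coefficient alpha; item_scores holds one list of k item responses per respondent."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple of responses per item
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: five respondents answering three items on a 1-to-5 scale.
responses = [[4, 5, 4], [2, 2, 3], [5, 4, 5], [1, 2, 1], [3, 3, 3]]
alpha = cronbach_alpha(responses)

# Hypothetical scale totals for the same five people on two occasions.
time1 = [13, 7, 14, 4, 9]
time2 = [12, 8, 14, 5, 9]
stability = pearson_r(time1, time2)

print(round(alpha, 3), round(stability, 3))
```

Both coefficients approach 1.0 as measurement error shrinks; with so few respondents the values here are purely illustrative.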

Another criterion for the reliability of personality judgments is the extent to which different observers agree in rating a target. This is a more demanding criterion; ideally, to the degree an individual has a characteristic it will be obvious to both self and observers. However, a number of influences tend to attenuate agreement between observers. Some characteristics (e.g., sociability) are highly observable, whereas others (e.g., anxiety level) are less so. Interobserver agreement is typically reduced by having observed the target at different times or in different situations. Generally, we might expect the self to be the most privileged observer, but in certain ways the self-viewpoint can be misleading. Personality characterizations have social functions and, understandably, self-observer agreement is prone to be affected by conscious impression management and unconscious self-enhancement tendencies (Paulhus and Reid 1991). Finally, quite independently of content, observers differ in their use of measurement scales (e.g., differential tendencies to agree or disagree, respond extremely, or use the middle option if available). Given this minefield of potential difficulties, the moderate level of interobserver agreement documented in the research literature may be quite remarkable. It makes sense to capitalize on the conjoint perspective of multiple judges: the best arbiter of the degree to which an individual can be characterized with a certain trait may be the pooled judgments of several observers well acquainted with the subject (Hofstee 1994; Kolar et al. 1996), perhaps conjoined with self-ratings. Though greater acquaintance clearly increases judges’ accuracy, there is, particularly for the more observable traits, surprisingly good consensus among near strangers for the traits of a target (Borkenau and Liebler 1993).
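The gain from pooling judges can be made concrete with the Spearman-Brown formula, a standard psychometric result that is brought in here for illustration and is not part of the studies cited above. The Python sketch below assumes interchangeable judges and a known mean correlation between pairs of judges; the function name pooled_reliability and the agreement value of .30 are hypothetical.

```python
def pooled_reliability(mean_interjudge_r, n_judges):
    """Spearman-Brown estimate of the reliability of the mean rating of n_judges."""
    r, k = mean_interjudge_r, n_judges
    return (k * r) / (1 + (k - 1) * r)


# Hypothetical case: pairs of judges agree only modestly (r = .30).
for k in (1, 2, 4, 8):
    print(k, round(pooled_reliability(0.30, k), 2))
```

Under these assumptions, a modest agreement of .30 between single judges rises to about .63 when four judges are averaged and to about .77 with eight, which is one reason pooled judgments make an attractive criterion.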

Another important, but demanding, index of consistency is across-time stability: within a sample of individuals, the extent to which one’s relative standing at time 1 correlates with that at time 2. Across-time stabilities tend to be very high for short intervals (e.g., a day or a week) but diminish with greater intervals to a more moderate level. Even across long intervals, cross-time stabilities in adulthood—particularly after age 30—for most personality traits are impressively high (Costa and McCrae 1997). It appears, however, that the further the measurement intervals reach into childhood and especially infancy, the lower the stabilities become; judgments of infant temperament may not do much better than chance in predicting judgments of later adult personality. Stabilities may be held down by the incommensurability of the contexts within which infant and adult temperaments function: It is difficult to apply many adult traits (e.g., industrious, artistic, unselfish) to infants, so that any forms of continuity would have to be heterotypic. It seems likely, however, that levels of traits often do change from childhood to adulthood. Part of this change could be genetically programmed, as a different set of genes comes on-line with greater maturity, and an initial set goes off-line; on the other hand, much may change under the influence of experience. To Wordsworth’s assertion that ”the child is father of the man,” psychometricians offer an assent beset with caveats: ”usually,” ”in many ways,” ”with definite exceptions.”


Data on behavior patterns are most commonly elicited from self and observers or acquaintances using standardized measures of personality traits. Scores on these structured measures are compared within a sample in a ”nomothetic” manner, that is, seeking generalizations that can be applied to all individuals. Historically, the dominant position of this structured, nomothetic approach stems from the success of well-known inventories like (1) the Minnesota Multiphasic Personality Inventory (MMPI), which is actually more of a psychiatric symptom inventory than a personality inventory; (2) the California Psychological Inventory (CPI), which resembles the MMPI in numerous respects but taps rather different content, with scales labeled so as to stress the presence (or absence) of adaptive traits; and (3) the Myers-Briggs Type Indicator (MBTI), a measure based on parts of C. G. Jung’s typology. The MBTI has been criticized by psychometricians, ignored by academic researchers, yet bought up by the millions in other circles. Today, these older inventories have competition from numerous new inventories that in some cases are shorter and more efficient.

Nonetheless, there are potentially useful alternatives to the questionnaire. Some embody an idiographic approach—seeking individually unique constructs that are not generalized to all people—rather than a nomothetic one. For example, in George Kelly’s Role-Construct Repertory Test, each testee nominates a set of personally significant acquaintances, then derives idiographic constructs by comparing subsets of them. Such idiographic measures undoubtedly have a unique contribution: They may generate results that are more meaningful to the individual measured. But knowledge of general laws illuminates understanding of the individual case; it is possible to adapt many nomothetic measures to serve idiographic ends. Thus, improved nomothetic understanding lays the groundwork for improved idiographic understandings.

Questionnaires, whether used nomothetically or idiographically, are essentially overt and direct in their measurement approach. A trait is assessed with reference to a person’s behaviors, emotions, and cognitions. Descriptions, which may be at a rather broad level, are collected. This overt method can be highly efficient, but has a significant disadvantage: Because the descriptive content provides clues to what is being measured, respondents completing the measure could, if motivated, intentionally present an inaccurate picture. Moreover, responses can be provided thoughtlessly. Some of those who are dissimulating or not paying attention can be identified using so-called validity indexes. These indexes are computed by scouring the response pattern for various signs of less than honest and accurate responding: unusual levels of agreement with unfavorable items, disagreement with favorable items, denying common vices, claiming rare virtues, responding dissimilarly to items with similar content, or responding similarly to items with contradictory content.
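One of the signs listed above, inconsistent responding to items with similar or contradictory content, lends itself to a simple numerical index. The Python sketch below is a minimal illustration and does not reproduce the validity scales of any published inventory: the function name inconsistency_index, the item labels, and the item pairings are all hypothetical, and a 1-to-5 rating scale is assumed.

```python
def inconsistency_index(responses, similar_pairs, opposite_pairs,
                        scale_min=1, scale_max=5):
    """Sum of response discrepancies; larger values suggest careless answering.

    responses: dict mapping an item label to a rating on the scale.
    similar_pairs: item pairs whose content is alike, so ratings should match.
    opposite_pairs: item pairs with contradictory content, so one rating
        should mirror the other once it is reverse-scored.
    """
    total = 0
    for a, b in similar_pairs:
        total += abs(responses[a] - responses[b])
    for a, b in opposite_pairs:
        reversed_b = scale_min + scale_max - responses[b]  # reverse-score item b
        total += abs(responses[a] - reversed_b)
    return total


similar = [("enjoy_parties", "like_gatherings")]
opposite = [("talk_a_lot", "talk_little")]

attentive = {"enjoy_parties": 5, "like_gatherings": 4,
             "talk_a_lot": 5, "talk_little": 1}
all_agree = {"enjoy_parties": 5, "like_gatherings": 5,
             "talk_a_lot": 5, "talk_little": 5}

print(inconsistency_index(attentive, similar, opposite))
print(inconsistency_index(all_agree, similar, opposite))
```

A respondent who agrees strongly with everything scores higher on this index than one who answers attentively, because the contradictory pair exposes the indiscriminate agreement. An index of this kind cannot detect uniform midpoint responding, so in practice it would be one of several checks.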

Projective measures, in contrast, are covert measures of personality that are more resistant to dissimulating or inattentive responding. These measures assume a “projective hypothesis” first defined by Rorschach, Jung, and others early in the twentieth century: If an individual is presented with a vague or ambiguous stimulus, that individual’s response will be determined by habitual internal tendencies, preoccupations, and cognitive styles, rather than being affected by features of the stimulus. In a word, respondents “project” their proclivities onto the stimulus. Projective measures are potentially very sensitive receptors for personality variation. As noted above, personality differences are clearest when individuals are confronted with unstructured situations; vague and ambiguous stimuli are unstructured situations. One might simply place the individual in an unstructured situation and observe which behaviors, emotions, and thoughts ensue.

The most popular unstructured stimuli for these purposes have been inkblots (e.g., the Rorschach and Holtzman stimulus sets), sets of pictures, selected for their ambiguity, about which stories are elicited (e.g., the Thematic Apperception Test [TAT] and its derivatives), and figure drawings; in the last case the individual is presented with blank paper and asked to draw a certain object (e.g., person, house, tree). Other commonly used unstructured stimuli include incomplete sentences (e.g., “Most people ____.”) and single words for which an association is elicited. The raw material provided by the respondent must then be coded and interpreted with reference to response patterns of aggregate respondents. These measures capture aspects of personality covertly and indirectly; due to the ambiguity of the test materials, respondents are unlikely to guess what is being measured.

Though attractive in theory, projective measures have been problematic in practice. The rock upon which they are prone to founder is the crucial one of reliability. A first problem is that individuals’ responses to projective stimuli are affected by social context and environment, and to a considerable degree they change from one day to another. This problem may be partially solved by gathering responses to many stimuli, preferably on multiple occasions, and looking for consistent patterns across time and across stimuli. A second problem is that observers often have low levels of agreement with regard to coding and interpreting the stimuli; that is, there is often a great deal of interobserver noise obscuring any underlying signal. There have been recent attempts to create interpretive coding schemes with better reliability. The best example is Exner’s comprehensive system for the Rorschach (1986), which integrates features of several previous Rorschach scoring systems.

The Rorschach and the TAT remain fairly popular measures in clinical settings, and continue to generate a stream of research. Presently, however, many psychologists are skeptical about the usefulness of these measures, given the laborious, complex procedures for collecting and scoring data. One problem may be the sheer volume and range of the data that such free-response methods bring in; perhaps only a fraction of these data are of any importance, and we are not yet sure which fraction deserves the most attention. Projective measures might in the future become increasingly important, to the extent that they can be made more reliable, parsimonious, and efficient.

Other forms of data can be coded using the interpretive schemes developed for projective measures. For example, politicians’ speeches, reports of early memories, and virtually any autobiographical material can be analyzed in much the same way as stories elicited from the TAT. Most often, such material has been analyzed in terms of implicit motivational features (e.g., achievement, power, intimacy), and evidence suggests that such covertly measured motivation is not substantially correlated with indexes of similar content derived from structured measures of self-attributed motivation (McClelland et al. 1989). In general, autobiographical data seem to provide information outside that provided by personality questionnaires, given that individuals seem to store schematic beliefs about traits separately from autobiographical memories (Klein and Loftus 1993). Therefore, autobiographical data could become an important part of the comprehensive assessment of individuals.


Whether the measurement method is overt or covert, another crucial issue concerns the particular traits that one ought to measure. Most commonly, this issue has been handled within a scale-construction strategy that might be called ”rational”: A researcher decides which trait (i.e., construct) he or she wants to measure, creates a pool of potential items, tries them out on a sample of respondents, and perhaps iterates between data and preconceived theory to create a relatively efficient measure of the construct. A second, ”empirical” strategy is in some ways a variant of the first. The researcher includes in the sample of respondents one or more criterion groups (e.g., introverts, psychopaths, artists) and determines the set of items that best differentiates each criterion group from a control sample, thus leading to a ”criterion-keyed” scale for the construct (e.g., introversion, psychopathy, creative temperament). In either strategy, the researcher begins with an a priori conception of what ought to be measured, but in the empirical strategy this conception is identified with a criterion group. It is not difficult to combine rational and empirical strategies, as was done in the only major revision of the original MMPI.

Unfortunately, a field that accumulates a great host of a priori conceptions can become quite chaotic, and this was the predominant state of affairs in personality measurement until at least the 1970s, when expert compendiums on personality traits could still be organized alphabetically by trait (e.g., London and Exner 1978), as if there were no other way to order them. There were many constructs, and it was clear that some of them were related to others, but the structure underlying the whole set of constructs was unclear. From the early decades of the twentieth century, investigators seeking an ordering framework turned to a statistical technique called “factor analysis.” Factor analysis is a method for reducing a large number of observed variables to a smaller number of hypothetical variables (factors), by analyzing the covariances among the observed variables and identifying redundancies in the set of variables. Factor analysis can be used to identify parsimonious sets of variables within sets of items built by any scale-construction strategy. Historically, reviews of factor analyses of various collections of personality scales (e.g., French 1953) have not led to a consensus on a common framework (Goldberg 1972).
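The idea of reducing correlated variables to a smaller number of dimensions can be sketched in a few lines. The Python example below extracts only the first principal component of a correlation matrix by power iteration; principal components analysis is a simpler relative of factor analysis and is substituted here for brevity, so this is not the common-factor model itself. The function name first_component and the correlation matrix are hypothetical.

```python
def first_component(corr, iterations=200):
    """Return (eigenvalue, loadings) for the largest component of a correlation matrix."""
    n = len(corr)
    v = [1.0] * n  # starting vector for power iteration
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # keep the vector at unit length
    # Rayleigh quotient of the unit-length vector gives the eigenvalue.
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(n))
                     for i in range(n))
    # Loadings are the correlations of each variable with the component.
    loadings = [x * eigenvalue ** 0.5 for x in v]
    return eigenvalue, loadings


# Hypothetical correlations: three scales that overlap heavily with one
# another, plus a fourth scale that is nearly unrelated to them.
corr = [
    [1.0, 0.6, 0.6, 0.1],
    [0.6, 1.0, 0.6, 0.1],
    [0.6, 0.6, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
eigenvalue, loadings = first_component(corr)
print(round(eigenvalue, 2), [round(x, 2) for x in loadings])
```

In this constructed case the three overlapping scales load strongly on the first component while the fourth barely loads at all, which is the kind of redundancy the technique is meant to reveal. A full analysis would extract several components or factors and usually rotate them.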

Significant progress on the structural problem came largely by temporarily averting attention from the a priori constructs of experts in order to study those personality conceptions of laypersons that are embedded in the natural language. As noted at the outset, personality traits are socially meaningful phenomena about which laypersons comment and generalize, and the lexicon of any language is a repository of descriptors referencing a wide variety of human characteristics. The lexical hypothesis formalizes this state of affairs into a strategy for identifying necessary features for an organizing framework, or taxonomy, of personality attributes (Goldberg 1981). This hypothesis essentially states that the more important the attribute, the more likely people are to develop a word for it. The most important attributes will then be those represented by numerous terms (often representing specific aspects of broader concepts) within one language, and by recurrence across many languages. Once descriptors are gathered, for example from a dictionary, they can be used by individuals to describe themselves or others. Factor analysis of these data, in any language, can be used to search for a few dimensions underlying numerous descriptors.

Expert personality constructs are typically based on certain aspects of the lay descriptive vocabulary, but scientists may refine and extend lay distinctions in useful ways. Therefore, one cannot obtain a sufficient model of personality traits by studying lexical descriptors, but one can find necessary features for such a model, aspects that—based on their salience to lay observers—are too important to leave out. In this respect, lexical models of personality dimensions offer minimum-content criteria for other personality models, pointing clearly to some (but not all) of the trait concepts important enough to measure. In practice, lexical models have helped focus attention on important variables previously omitted from expert-derived models.

Lexical studies involve (1) culling descriptors from a dictionary; (2) omitting descriptors that are infrequently used or, by the consensus of multiple judges, refer to categories less relevant to personality (e.g., physical traits, temporary states); (3) aggregating the remaining descriptors (typically 300 to 400) into a questionnaire format with a multipoint (e.g., 1 to 5) rating scale; (4) administering the forms so constructed to a large (usually >400) sample of respondents for description of self, a well-acquainted peer, or sometimes both; and (5) factor-analyzing the descriptors to derive an indigenous or ”emic” personality structure for that language. Such studies have been conducted in over a dozen languages, including English, German, Dutch, Italian, Spanish, Hungarian, Czech, Polish, Filipino, Korean, Turkish, and Hebrew.

Findings of these lexical studies (reviewed by Saucier 1997) show some variations, probably due to differences in sampling of subjects and variables as much as to actual differences between languages. But the most common result has been a robust structure of five independent (uncorrelated) factors, with apparent cross-language universality for the three largest of these factors: extraversion (which includes sociability, activity, and assertiveness), agreeableness (which includes warmth, generosity, humility, patience, and nonaggressiveness), and conscientiousness (which includes dependability, orderliness, and consistency). The remaining two factors (one referencing aspects of emotional stability, the other aspects of intellect, imagination, and unconventionality) are generally smaller and more variant from one study to another. Despite these partial inconsistencies between one emic structure and another, the five-factor structure, often labeled the “Big Five,” has been shown to be easily translatable into a large number of languages (McCrae et al. 1998). Moreover, the Big Five appears to capture the structure of trait judgments about children as well as adults (Digman and Shmelyov 1996).

The Big Five has had considerable influence on personality questionnaires. For example, one prominent three-factor inventory (the NEO Personality Inventory) was revised to add the two missing factors from the Big Five, in this case agreeableness and conscientiousness (McCrae and Costa 1985). Moreover, the five factors have strong relations to the constructs measured by other prominent inventories, including the Myers-Briggs Type Indicator, the 16PF inventory, and the Personality Research Form. Four of the five factors (excluding intellect/imagination) are substantially correlated with measures of personality disorder symptoms and some mood, anxiety, and impulse control disorders cataloged in current psychiatric nosologies; a safe generalization seems to be that disorders tend to co-occur with extreme scores on personality dimensions (like the Big Five) on which there is wide variation in the general population (Costa and Widiger 1994). However, psychotic syndromes map rather poorly onto the Big Five, as do a few other clearly important individual-differences constructs, like religiousness and attractiveness (Saucier and Goldberg 1998).

Although the Big Five have obtained some degree of consensus as an organizing framework for personality characteristics, there are at least five remaining issues whose resolution might lead to a different consensual structure: (1) The generalizability of the Big Five factors to languages spoken in non-Western, nonindustrialized nations, and indeed in less complex societies, is as yet uncertain. (2) The Big Five represent very broad, global trait constructs, and groups of more specific constructs are typically more useful in prediction contexts; however, there is as yet little consensus about the particular specific subcomponents that make up each of the broad factors. (3) There may be factors in related domains, such as abilities or attitudes, that could arguably be added to the model. (4) The organization of personality variables into factors based on lexical representation might be reasonably superseded by a set of factors based on a superior rationale, for example, correspondence to main lines of biological or environmental influence. And (5) there may be constructs not well represented among natural-language descriptors that will prove to be of great importance. Personality-relevant constructs with apparently meager representation in natural-language descriptors, but attracting much current research interest, include those having to do with defense mechanisms (Paulhus et al. 1997), coping styles (Suls et al. 1996), and personal goals and strivings (Pervin 1989).

One important criterion by which personality constructs might be added to, or eliminated from, a basic descriptive model is validity. Validity should be clearly distinguished from reliability: Reliability concerns whether a scale is measuring anything at all, and is a prerequisite to any form of validity. Validity concerns the meaning of a scale score, that is, the accuracy of the inferences one can make from the scale. For a construct that demonstrates validity, there is a good argument for meaningfulness and usefulness with respect to other phenomena. Validity evidence is gathered in an ongoing process, rather than in any single study.

A prior question, of course, is whether personality measures have a substantial enough degree of validity to make them worthwhile. Mischel’s early critique (1968) suggested they did not. An increasingly large volume of studies documents the many ways in which they do. For example, conscientiousness is a valuable predictor of effective job-related performance (even after the predictive value of intelligence is accounted for and removed), and has also been associated with increased longevity. Low scores on emotional stability (i.e., high scores on neuroticism) are predictive of higher rates of divorce, of male midlife crises, and of health-related complaints (though not actually greater illness). Agreeableness has associations with aspects of conflict (or absence of conflict) in close relationships, and extraversion predicts variation in a wide range of social-interaction variables.


As the foregoing review indicates, the science of personality measurement is no longer at the primitive stage represented by early taxonomies of virtues (such as those of Plato or Confucius), of physiological humors (such as that of Galen), or by the pseudoscience of astrological signs which beckons from magazine racks. Unlike these approaches, current personality measurement is far more explicit about (1) defining personality, (2) measuring attributes in a standardized, reliable manner, (3) attending to multiple sources of data, (4) checking the validity of hypothesized models, and (5) placing the plethora of possible constructs within a parsimonious and empirically justifiable organizing framework. Nonetheless, current scientific practices are inevitably based on assumptions that are subject to being overturned. Rorer (1990) provides a conceptual review of personality assessment with attention to differences between the assumptions of mainstream and alternative paradigms.

Personality as a discipline has come to be located mainly within the larger umbrella of psychology. This is reasonable, given that personality deals with the behavior, affect, and cognitions of individuals, and with individual differences. But certain aspects of personality seem to be equally relevant to other sciences. Many of the more inborn, dispositional aspects of personality are clearly rooted in biology and genetics, and personality change may well be associated with physiological changes—as cause or effect. Moreover, humans are social as well as biological beings, and personality functions within a social context. The sociological aspects of personality need more attention: Although Goffman (1972) proposed aspects of one useful sociological theory of personality, and Bellah et al. (1985) described certain social-structural contexts that may foster narcissism, generally the relation of personality variation to its broader social context is but dimly understood.

In anthropology, one finds a rich tradition of studies of ”culture and personality.” Past work in this area has been hindered by use of less than adequate personality measurement models. For example, many studies used projective personality measures with insufficiently developed reliability. If improvements are made in analysis of projective data, it may be possible to take a new look at old data. Moreover, linguists and anthropologists may be uniquely well equipped to gather and judge evidence pertinent to the cross-cultural universality of basic personality dimensions, and help answer several important questions: Which dimensions of interindividual variation are not only measurable in any culture, but derivable from the indigenous language of any culture? What is the meaning of between-culture variation in the classification of personality characteristics? To what extent is ”modal personality” a viable way of differentiating societies, or of mapping cultural change, and how do temperament and social structure interact? Do cultures differ in how they organize the heterogeneity that personality variation introduces?

The fascinating generalizations offered by Mead, Benedict, and others must be considered provisional, given the probably limited range and value of data upon which they are based. If waves of advancement in personality measurement were joined to waves of advancement in other social sciences, a powerful current might ensue, which would yield a far better understanding of societies in terms of the diverse range of humans within each of these societies.

