EVALUATION RESEARCH

BRIEF HISTORY

There is no uniformly accepted definition of what constitutes evaluation research. At perhaps its narrowest, the field of evaluation research can be defined as "the use of scientific methods to measure the implementation and outcomes of programs for decision-making purposes" (Rutman 1984, p. 10). A broader, and more widely accepted, definition is "the systematic application of social research procedures for assessing the conceptualization, design, implementation, and utility of social intervention programs" (Rossi and Freeman 1993, p. 5). A much broader definition is offered by Scriven (1991), who suggests that evaluation is "the process of determining the merit, worth and value of things" (p. 1). Under the latter definition, what can be evaluated is not limited to a social program or a specific type of intervention but encompasses, quite literally, everything.

Any description of the history of evaluation research depends on how the term is defined. Certainly, individuals have been making pronouncements about the relative worth of things since time immemorial. In the case of social programs, proficiency requirements to guide the selection of public officials using formal tests were recorded as early as 2200 B.C. in China (Guba and Lincoln 1981). Most observers, however, date the rise of evaluation research to the twentieth century. For example, the programs established under the New Deal in the 1930s were viewed as great opportunities to apply social science methods to social planning by providing an accounting of program effects (Stephan 1935). Modern evaluation research, however, underwent explosive growth in the 1960s as a result of several factors (Shadish et al. 1991). First, the total amount of social programming increased tremendously under the administrations of Presidents Kennedy, Johnson, and Nixon. New programs were directed toward social issues such as education, housing, health, crime, and income maintenance. Second, along with these huge financial investments came congressional concern about whether the programs were achieving their intended effects. As a result, Congress began mandating evaluations. Third, program managers were concerned about whether programs were being implemented as intended, and consequently data were required to monitor program operations. In addition, there were intellectual questions about how best to implement programs and about the relative effectiveness of different approaches to addressing social ills; outcome data were needed to compare competing approaches. The result was a burgeoning demand for trained evaluators, and the large number of scientists engaged in the common enterprise of evaluation became sufficient to support the development of evaluation research as a scientific specialty area.

The field of evaluation research is no longer expanding at the rate it was in the 1960s and 1970s (Freeman 1992). By the 1980s, there was a substantial decline in the funding for evaluation activities that was motivated, in part, by the budget cuts of the Reagan administration. By then, however, the field of evaluation research had been established.

It continues to thrive for several reasons (Desautels 1997). First, difficult decisions are always required of public administrators and, in the face of continuing budget constraints, these decisions are often based on accountability for results. Second, an increasingly important aspect of service provision by both public and private program managers is service quality, and monitoring quality requires information about program practices and outcomes. Third, there is growing public demand for accountability in government, a view increasingly echoed by government representatives. Meeting these demands requires measurement of results and a management system that uses evaluation for strategic planning and tactical decision making.

Early in its history, evaluation was seen primarily as a tool of the political left (Freeman 1992). Clearly, that is no longer the case. Evaluation activities have demonstrated their utility to both conservatives and liberals. Although the programs of today may be different from those launched in the 1960s, evaluation studies are more pervasive than ever. As long as difficult decisions need to be made by administrators serving a public that is demanding ever-increasing levels of quality and accountability, there will be a growing market for evaluation research.

PURPOSES OF EVALUATION RESEARCH

A wide variety of activities are subsumed under the broad rubric of "evaluation research." This diversity proceeds from the multiplicity of purposes underlying evaluation activities. Chelimsky (1997) identifies three different purposes of evaluation: evaluation for accountability, evaluation for development, and evaluation for knowledge.

Accountability. From the perspective of auditors and funding agencies, evaluations are necessary to establish accountability. Evaluations of this type frequently attempt to answer the question of whether the program or policy "worked" or whether anything changed as a result. The conceptual distinction between program and policy evaluations is a subtle but important one. Programs are usually characterized by specific descriptions of what is to be done, how it is to be done, and what is to be accomplished. Policies are broader statements of objectives than programs, with greater latitude in how they are implemented and with potentially more diverse outcomes. Questions addressed by either program or policy evaluations from an accountability standpoint are usually cause-and-effect questions requiring research methodology appropriate to such questions (e.g., experiments or quasi-experiments). Studies of this type are often referred to as summative evaluations (Scriven 1991) or impact assessments (Rossi and Freeman 1993). Although the term "outcome" evaluation is frequently used when the focus of the evaluation is on accountability, this term is less precise, since all evaluations, whether conducted for reasons of accountability, development, or knowledge, yield outcomes of some kind (Scriven 1991).

Development. Evaluation for development is usually conducted to improve institutional performance. Developmental evaluations took on heightened importance as a result of public pressure during the 1980s and early 1990s for public management reforms based on notions such as "total quality management" and "reinventing government" (e.g., see Gore 1993). Developmental evaluations often address questions such as: How can management or organizational performance be improved? What data systems are necessary to monitor program accomplishment? What are appropriate indicators of program success, and what are appropriate organizational goals? Studies designed primarily to improve programs or the delivery of a product or service are sometimes referred to as formative or process evaluations (Scriven 1991). In such studies, the focus is on the treatment rather than its outcomes. Depending on the specific question being addressed, the methodology may include experiments, quasi-experiments, or case studies, and the data may be quantitative or qualitative. Formative or process evaluations may be sufficient by themselves if a strong relationship is known to exist between the treatment and its outcomes; in other cases, they may be accompanied by summative evaluations.

Knowledge. In evaluation for knowledge, the focus of the research is on improving our understanding of the etiology of social problems and on detailing the logic of how specific programs or policies can ameliorate them. Just as evaluation for accountability is of greatest interest to funding or oversight agencies, and evaluation for development is most useful to program administrators, evaluation for knowledge is frequently of greatest interest to researchers, program designers, and evaluators themselves. Questions might address the causes of crime, homelessness, or voter apathy. Since these are largely cause-and-effect questions, rigorous research designs appropriate to such questions are generally required.

CONTEMPORARY ISSUES IN EVALUATION

Utilization of Findings. Implicit in the enterprise of evaluation research is the belief that the findings from evaluation studies will be utilized by policy makers to shape their decisions. Indeed, such a view was espoused explicitly by Campbell (1969), who argued that social reforms should be regarded as social experiments and that the findings concerning program effectiveness should determine which programs to retain and which to discard. This process of rational decision making, however, has not been consistently embraced by policy makers and has been a source of concern and disillusionment for many evaluators. Rossi (1994) sums up the situation by noting:

Although some of us may have entertained hopes that in the “experimenting society” the experimenter was going to be king, that delusion, however grand, did not last for long. It often seemed that programs had robust lives of their own, appearing, continuing, and disappearing following some unknown processes that did not appear responsive to evaluations and their outcomes. (p. 26)

One source of the utilization problem, as Weiss (1975, 1987) has noted, is the fact that evaluations take place in a political context. Although accomplishing its stated objectives is important, it may not be the only, or even the most important, measure of a program's success. From this perspective, it is not that administrators and policy makers are irrational; they simply use a different model of rationality than do evaluators. Indeed, the view of policy makers and program administrators may be more "rational" than that of evaluators, because it has been shown repeatedly that programs can and do survive negative evaluations. Programs are less likely, however, to survive a hostile congressional committee, negative press, or a lack of public support. There are generally multiple stakeholders, often with competing interests, associated with any large program, and negative findings are of very little use to individuals whose reputations and jobs depend on program success. Thus, rather than bemoaning a lack of utilization of findings, evaluators need to recognize that evaluation findings represent only one piece of a complex political process.

Evaluators concerned with utilization frequently make a distinction between the immediate or instrumental use of findings to make direct policy decisions and the conceptual use of findings, which serves primarily to enlighten decision makers and perhaps influence later decision making (Leviton and Hughes 1981). In a related vein, Scriven (1993) makes an important distinction between "lack of implementation" and "lack of utilization." Lack of implementation merely refers to a failure to implement recommendations. Utilization, in contrast, is more ambiguous: it is often not clear what outcomes or actions actually constitute a utilization of findings. Evaluation findings can have great utility but may not necessarily lead to a particular behavior. For example, a consumer can read an evaluation of a product in a publication such as Consumer Reports and then decide not to buy the product. Although the evaluation did not lead to a particular behavior (i.e., purchasing the product), it was nonetheless extremely useful to the consumer, and the information can be said to have been utilized. Some observers have noted that the concern about underutilization of evaluation findings belies what is actually happening in the field of evaluation research. Chelimsky and Shadish (1997) provide numerous examples of evaluation findings that have had substantial impacts on policy and decision making, not only in government but also in the private sector, and not only in the United States but internationally as well.

Quantitative Versus Qualitative Research. The rise of evaluation research in the 1960s began with a decidedly quantitative stance. In an early, influential book, Suchman (1967) unambiguously defined evaluation research as "the utilization of scientific research methods and techniques" (p. 7) and cited a recent book by Campbell and Stanley (1963) on experimental and quasi-experimental designs as providing instruction in the appropriate methodology. It was not long, however, before the dominance of quantitative methods in evaluation research came under attack. Cook (1997) identifies two reasons. First, there has been a longstanding debate, especially in sociology, over the merits of qualitative research and the limits of quantitative methods; sociologists brought this debate with them when they entered the field of evaluation. Second, evaluation researchers, even those trained primarily in quantitative methods, began to recognize the epistemological limitations of the quantitative approach (e.g., Guba and Lincoln 1981). There were also practical reasons to turn toward qualitative methods. For example, Weiss (1987) noted that quantitative outcome measures are frequently too insensitive to detect program effects. Also, the expected time lag between treatment implementation and any observed outcomes is frequently unknown, with program effects often taking years to emerge. Moreover, because of limited budgets, time constraints, program attrition, multiple outcomes, multiple program sites, and other difficulties associated with applied research, quantitative field studies rarely achieved the promise they showed on the drawing board. As a result, Weiss recommended supplementing quantitative methods with qualitative ones.

Focus on the quantitative-qualitative debate in evaluation research was sharpened when successive presidents of the American Evaluation Association expressed differing views on the matter. On the qualitative side, it was suggested that the focus on rigor associated with quantitative evaluations may have blinded evaluators to "artistic aspects" of the evaluation process that have traditionally gone unrecognized or simply been ignored; the time had come "to move beyond cost benefit analyses and objective achievement measures to interpretive realms" in the conduct of evaluation studies (Lincoln 1991, p. 6). From the quantitative perspective, it was acknowledged that evaluations have frequently failed to produce strong empirical support for many attractive programs, but blaming that failure on quantitative evaluations is akin to shooting the messenger. Moreover, at a time when research and statistical methods (e.g., regression discontinuity designs, structural equations with latent variables) were finally catching up to the complexities of contemporary research questions, it would be a shame to abandon the quantitative approach (Sechrest 1992). The ensuing controversy only served to polarize the two camps further.

The debate over which approach is best, quantitative or qualitative, is presently unresolved and, most likely, will remain so. Each paradigm has different strengths and weaknesses. As Cook (1997) points out, quantitative methods are good for generalizing and describing causal relationships. In contrast, qualitative methods are well suited for exploring program processes. Ironically, it is the very differences between the two approaches that may ultimately resolve the issue because, to the extent that their limitations differ, the two methods used jointly will generally be better than either used singly (Reichardt and Rallis 1994).

Research Synthesis. Evaluation research, as it was practiced in the 1960s and 1970s, drew heavily on the experimental model. The work of Donald Campbell was very influential in this regard. Although he is best known for his explication of quasi-experimental research designs (Campbell and Stanley 1963; Cook and Campbell 1979), much of his work actually de-emphasized quasi-experimentation in favor of experiments (Shadish et al. 1991). Campbell pointed out that quasi-experiments frequently lead to ambiguous causal inferences, sometimes with dire consequences (Campbell and Erlebacher 1970). In addition, he noted that experiments have wide applicability, even in applied settings where random assignment may not initially seem feasible (Campbell and Boruch 1975). Campbell also advocated applying such rigorous methods in the evaluation of social programs (Campbell 1969). As a result, Campbell is frequently credited with proposing a rational model of social reform in which a program is first evaluated using rigorous social science methods, preferably experiments, and a report is then issued to a decision maker who acts on the findings.

Whatever its source, it was not long before the rational model was criticized as being too narrow to serve as a template for evaluation research. In particular, Cronbach and colleagues (Cronbach et al. 1980) argued that evaluation is as much a political process as a scientific one, that decisions are rarely made but more likely emerge, that there is rarely a single decision maker, and that programs are often amorphous undertakings with no single outcome. From Cronbach’s perspective, the notion that the outcome of a single study could influence the existence of a program is inconsistent with the political realities of most programs.

Understanding the ensuing controversy requires an understanding of the notion of validity. Campbell distinguished between two types of validity: internal and external (Campbell 1957; Campbell and Stanley 1963). Internal validity refers to whether the innovation or treatment has an effect. External validity, in contrast, addresses the generalizability of effects; specifically, "To what populations, settings, treatment variables, and measurement variables can this effect be generalized?" (Campbell and Stanley 1963, p. 5). Campbell clearly assigned greater importance to internal validity than to external validity. Of what use is it, he asked, to generalize experimental outcomes to some population if one has doubts about the very existence of the relationship that one seeks to generalize (Shadish et al. 1991)? Campbell's emphasis on internal validity was clearly consistent with his focus on experiments, since the latter are particularly useful in examining causal relationships.

In contrast, Cronbach (1982) opposed the emphasis on internal validity that had so profoundly shaped the approach to evaluation research throughout the 1960s and 1970s. Although experiments have high internal validity, they tend to be weak in external validity; and, according to Cronbach, it is external validity that is of greatest utility in evaluation studies. That is, decision makers are rarely interested in the impact of a particular treatment on a unique set of subjects in a highly specific experimental setting. Instead, they want to know whether a program or treatment, which may not always be administered in exactly the same way from agency to agency, will have an effect when administered to individuals and in settings other than those studied in the experimental situation. From Cronbach's perspective, the rational model of evaluation research based on rigorous social research procedures is flawed because there are no reliable methods for generalizing beyond the factors that have been studied in the first place, and it is the generalized rather than the specific findings in which evaluators are interested. As a result, Cronbach viewed evaluation as more of an art than a scientific enterprise.

The debate over which has priority in evaluation research, internal or external validity, seems to have been resolved by the increasing popularity of research syntheses. Evaluation syntheses represent a meta-analytic technique in which research results from numerous independent evaluation studies are first converted to a common metric and then aggregated using a variety of statistical techniques. The product is a meaningful summary of the collective results of many individual studies. Research synthesis based on meta-analysis has helped to resolve the debate over the priority of internal versus external validity in that, if studies with rigorous designs are included, the results will be internally valid. Moreover, by drawing on findings from many different samples, in many different settings, using many different outcome measures, the robustness and generalizability of findings can be evaluated as well.
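As a rough illustration of the aggregation step just described, the following Python sketch pools standardized effect sizes from several studies using inverse-variance (fixed-effect) weighting, one of the simplest common-metric approaches. The study labels and numbers are invented for demonstration and do not come from any evaluation discussed here.

# Minimal sketch of a fixed-effect meta-analysis (inverse-variance weighting).
# All study labels and values below are hypothetical illustrations.

import math

# Each tuple: (study label, standardized effect size d, variance of d)
studies = [
    ("Site A", 0.30, 0.020),
    ("Site B", 0.12, 0.015),
    ("Site C", 0.45, 0.040),
    ("Site D", 0.05, 0.010),
]

# Weight each study by the inverse of its variance, so more precise
# studies contribute more to the pooled estimate.
weights = [1.0 / var for _, _, var in studies]
pooled = sum(w * d for (_, d, _), w in zip(studies, weights)) / sum(weights)

# Standard error and a conventional 95 percent confidence interval.
se = math.sqrt(1.0 / sum(weights))
low, high = pooled - 1.96 * se, pooled + 1.96 * se

print(f"Pooled effect size: {pooled:.3f} (95% CI {low:.3f} to {high:.3f})")

In practice, evaluation syntheses often use random-effects models and weighting refinements beyond this sketch, but the core logic of converting results to a common metric and combining them with precision-based weights is the same.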

Although meta-analysis has many strengths, including increased power relative to individual studies to detect treatment effects, the results are obviously limited by the quality of the original studies. The major drawback of meta-analysis, then, lies in repeating, or failing to compensate for, the limitations inherent in the original research on which the syntheses are based (Figueredo 1993). Since many evaluations use nonexperimental designs, these methodological limitations can be considerable, although they potentially exist in experiments as well (e.g., a large proportion of experiments suffer from low external validity).

An emerging theory underlying research syntheses of experimental and nonexperimental studies, referred to as critical multiplism (Shadish 1993) and based on Campbell and Fiske's (1959) notion of multiple operationalism, addresses these issues directly. "Multiplism" refers to the fact that there are multiple ways of proceeding in any research endeavor, with no single way being uniformly superior to all others. That is, every study involves specific operationalizations of causes and effects that necessarily underrepresent the potential range of relevant components in the presumed causal process while introducing irrelevancies unique to the particular study (Cook 1993). For example, a persuasive communication may be intended to change attitudes about an issue. In a study to evaluate this presumed cause-and-effect relationship, the communication may be presented via television and attitudes may be assessed using a paper-and-pencil inventory. Clearly, the medium used underrepresents the range of potential persuasive techniques (e.g., radio or newspapers might have been used), and the paper-and-pencil task introduces irrelevancies that, from a measurement perspective, constitute sources of error. The term "critical" refers to the attempt to identify biases in the research approach chosen. The logic of critical multiplism, then, is to synthesize the results of studies that are heterogeneous with respect to sources of bias and to avoid any constant biases. In this manner, meta-analytic techniques can be used to implement critical multiplist ideas, thereby increasing our confidence in the generalizability of evaluation findings.
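One simple way to act on this logic in a synthesis, sketched below with invented data, is to group studies by a potential source of bias (here, study design) and compare the pooled estimates across groups; sharply divergent estimates would suggest a constant bias that should not simply be averaged away. The grouping variable, labels, and numbers are illustrative assumptions only.

# Hypothetical subgroup comparison in the spirit of critical multiplism.
import math
from collections import defaultdict

# Each tuple: (study label, design, standardized effect size d, variance of d)
studies = [
    ("Study 1", "randomized", 0.28, 0.020),
    ("Study 2", "randomized", 0.35, 0.030),
    ("Study 3", "quasi-experimental", 0.55, 0.025),
    ("Study 4", "quasi-experimental", 0.48, 0.035),
]

def pooled_effect(subset):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / var for _, _, _, var in subset]
    effects = [d for _, _, d, _ in subset]
    estimate = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))

# Group studies by the suspected bias source (study design).
groups = defaultdict(list)
for study in studies:
    groups[study[1]].append(study)

# Compare pooled effects across design types.
for design, subset in groups.items():
    estimate, se = pooled_effect(subset)
    print(f"{design}: pooled d = {estimate:.2f} (SE {se:.2f})")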

The increasing use of research syntheses represents one of the most important changes in the field of evaluation during the past twenty-five years (Cook 1997). Research synthesis functions in the service of increasing both internal and external validity. Although it may seem that the use of research syntheses is a far cry from Campbell's notion of an experimenting society, in reality Campbell never really suggested that a single study might resolve an important social issue. In "Reforms as Experiments" (1969) Campbell states:

Too many social scientists expect single experiments to settle issues once and for all. . . . Because we social scientists have less ability to achieve “experimental isolation,” because we have good reason to expect our treatment effects to interact significantly with a wide variety of social factors many of which we have not yet mapped, we have much greater needs for replication experiments than do the physical sciences. (pp. 427-428)

Ironically, perhaps, the increasing use of research syntheses in evaluation research is perfectly consistent with Campbell’s original vision of an experimenting society.

DIRECTIONS FOR THE FUTURE

The field of evaluation research has undergone professionalization since the early 1970s. Today, the field is characterized by its own national organization (the American Evaluation Association), journals, and professional standards. It continues to evolve as practitioners debate exactly what constitutes evaluation research, how it should be conducted, and who should do it. In this regard, Shadish and colleagues (1991) make a compelling argument that the integration of the field will ultimately depend on the continued development of comprehensive theories capable of integrating the diverse activities and procedures traditionally subsumed under the broad rubric of evaluation research. In particular, they identify a number of basic issues that any theory of evaluation must address in order to integrate the practice of evaluation research: knowledge construction, the nature of social programming and knowledge use, the role of values, and the practice of evaluation.

Knowledge Construction. A persisting issue in the field of evaluation concerns the nature of the knowledge that should emerge as the product of program evaluations. Issues of epistemology and research methods are particularly germane in this regard. For example, the controversy over whether quantitative approaches to the generation of knowledge are superior to qualitative methods, or whether any method can be consistently superior to another regardless of the purpose of the evaluation, is really an issue of knowledge construction. Other examples include whether knowledge about program outcomes is more important than knowledge concerning program processes, or whether knowledge about how program effects occur is more important than describing and documenting those effects. Future theories of evaluation must address questions such as which types of knowledge have priority in evaluation research, under what conditions various knowledge-generation strategies (e.g., experiments, quasi-experiments, case studies, or participatory evaluation) might be used, and who should decide (e.g., evaluators or stakeholders). By so doing, the field will become more unified, characterized by common purpose rather than by competing methodologies and philosophies.

Social Programming and Knowledge Use. The ostensible purpose of evaluation lies in the belief that social problems can be ameliorated by improving the programs or strategies designed to address them. Thus, a social problem might be remedied by improving an existing program or by replacing an ineffective program with a different one. The history of evaluation research, however, has demonstrated repeatedly how difficult it is to change social programming. Early evaluators from academia were, perhaps, naive in this regard. Social programs are highly resistant to change because there are generally multiple stakeholders, each with a vested interest in the program and with their own constituencies to support. Complicating the matter is the fact that knowledge is used in different ways in different circumstances. Several important distinctions concerning knowledge use can be made: (1) use in the short term versus use in the long term, (2) information for instrumental use in making direct decisions versus information intended for enlightenment or persuasion, and (3) lack of implementation of findings versus lack of utilization of findings. These different types of use progress at different rates and in different ways. Consequently, any resulting program changes are likely to appear slow and sporadic. The extent to which such change processes should be a source of disappointment and frustration for evaluators, however, requires further clarification. Specifically, theories of evaluation are needed that take into account the complexities of social programming in modern societies, that delineate appropriate strategies for change in differing contexts, and that elucidate the relevance of evaluation findings for decision makers and change agents.

Values. Some evaluators, especially early in the history of the field, believed that evaluation should be conducted as a value-free process. The value-free doctrine was imported from the social sciences by early evaluators as a by-product of their methodological training. This view proved problematic because evaluation is an intrinsically value-laden process whose ultimate goal is to make a pronouncement about the value of something. As Scriven (1993) has cogently argued, the value-free model of evaluation is also wrong on its own terms: statements such as "evaluative conclusions cannot be established by any legitimate scientific process" are clearly self-refuting because they are themselves evaluative statements. If evaluators cling to a value-free philosophy, then the inevitable and necessary application of values in evaluation research can only be done indirectly, by incorporating the values of other persons connected with the programs, such as program administrators, program users, or other stakeholders (Scriven 1991). Obviously, evaluators will do a better job if they are able to consider explicitly value-laden questions such as: On what social values is this intervention based? What values does it foster? What values does it harm? How should merit be judged? Who decides? As Shadish and colleagues (1991) point out, evaluations are often controversial and explosive enterprises in the first place, and debates about values only make them more so. Perhaps that is why values theory has gotten short shrift in the past. Clearly, however, future theory needs to address the issue of values, acknowledging and clarifying their central role in evaluation research.

The Practice of Evaluation. Evaluation research is an extremely applied activity, and in the end evaluation theory has relevance only to the extent that it influences the actual practice of evaluation research. Any theory of evaluation practice must necessarily draw on all the aforementioned issues (i.e., knowledge construction, social programming and knowledge use, and values), since they all have direct implications for practice. In addition, there are pragmatic issues that directly affect the conduct of evaluation research. One important contemporary issue concerns the relationship between the evaluator and individuals associated with the program. For example, participatory evaluation is a controversial approach to evaluation research that favors collaboration between evaluation researchers and individuals who have some stake in the program under evaluation. The core assumption of participatory evaluation is that, by involving stakeholders, ownership of the evaluation will be shared, the findings will be more relevant to interested parties, and the outcomes will be more likely to be utilized (Cousins and Whitmore 1998). From an opposing perspective, participatory evaluation is inconsistent with the notion that the investigator should remain detached from the object of investigation in order to remain objective and impartial. Not surprisingly, the appropriateness of participatory evaluation is still being debated.

Other aspects of practice are equally controversial and require clarification as well. For example: Who is qualified to conduct an evaluation? How should professional evaluators be trained and by whom? Should evaluators be licensed? Without doubt, the field of evaluation research has reached a level of maturity where such questions warrant serious consideration and their answers will ultimately determine the future course of the field.
