How much should we rely on student test achievement as a measure of success?

The use of standardized tests as a measure of student success and progress in school goes back decades. This practice was formalized by the 2001 passage of the No Child Left Behind Act (NCLB), which established the broader use of test scores as a measure of school quality nationwide. The 2009 Race to the Top federal grant program promoted teacher evaluation reforms that also included the use of standardized tests as a component of a teacher’s evaluation.

But there has been pushback against the use of tests. Some academics and advocates, prominently including the teachers’ unions, have raised various concerns about the consequences of reliance (or overreliance) on test scores for school and teacher accountability purposes. And while there is certainly academic and policy disagreement about the efficacy of using test scores for accountability purposes, there is no doubt that policymakers are scaling back the mandated use of tests. The 2015 passage of the Every Student Succeeds Act (ESSA), for instance, continues NCLB’s requirement that students be tested annually from third to eighth grade, but eliminates much of the federal role in enforcing test-based accountability.

More recently, however, policy scholars have begun to question whether test scores are a metric that we should really care about, pointing out that test score gains are not always associated with changes in other schooling outcomes. But this critique of tests as a measure has gone beyond a narrow academic policy debate. As recent headlines in Forbes show—for instance, “How Much Do Rising Test Scores Tell Us About a School?” and “Is the Big Standardized Test a Big Standardized Flop?”—this debate about tests as a measure of success is now reaching a much broader audience.

A recent report by Collin Hitt, Michael McShane, and Patrick Wolf (hereafter HMW) is the latest warning from a number of prominent education policy researchers about using test scores, though their focus is the use of tests as a measure in school choice initiatives. The report reviews a number of studies that examine the effects of different school choice programs on both student test scores and long-term outcomes (such as high school graduation and college enrollment), and examines how well the test score impacts in these studies align with the attainment impacts. The authors find very little correlation between the two and conclude that “test scores should be put in context and should not automatically occupy a privileged place over parental demand and satisfaction as short-term measures of school choice success or failure.”

The measured statement that test scores should not automatically occupy a privileged place as a measure of success is at odds with the much stronger claim by Peter Greene that “test scores do not tell you what they claim they tell you. They are less like actionable data and more like really expensive noise.” We think that stronger claim is premature, given the evidence connecting student success on tests with the later life outcomes we could probably all agree are important, such as college attendance and labor market success.

Test scores and later life outcomes?

The idea that tests measure knowledge and that it’s beneficial for students to acquire more knowledge in schools seems pretty uncontroversial. But it also isn’t crazy to argue that test scores are only an intermediate measure of what we really care about: the extent to which students are gaining knowledge in school that enhances their later life prospects. This view raises the question of whether this intermediate measure does a good job of capturing the underlying learning that really is important for later life success.

There is a vast literature linking test scores and later life outcomes, such as educational attainment, health, and earnings. Hanushek provides an excellent review of the extant literature on the relationship between cognitive skills, as proxied by test scores, and individual incomes in developed and developing countries, and concludes that there is considerable evidence that test scores are directly related to later life outcomes. For example, in the U.S. context, students who score one standard deviation higher on math tests at the end of high school have been shown to earn 12 percent more annually, or $3,600 for each year of working life in 2001. Similarly, Heckman, Stixrud, and Urzua find that test scores are significantly correlated not only with educational attainment and labor market outcomes (employment, work experience, choice of occupation), but also with risky behavior (teenage pregnancy, smoking, participation in illegal activities).
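To make the earnings figures above concrete, here is a minimal back-of-the-envelope sketch. It assumes that the 12 percent and the $3,600 describe the same average worker, in which case the two numbers together imply baseline earnings of about $30,000 a year. The 40-year career length is a hypothetical value chosen only for illustration, and the career total is undiscounted; neither comes from the studies cited.

```python
# Figures quoted in the paragraph above.
earnings_premium_share = 0.12      # 12 percent higher annual earnings
earnings_premium_dollars = 3_600   # dollars per year, as quoted

# Assumption: both figures refer to the same average worker, so the
# implied baseline is the dollar premium divided by the percentage premium.
implied_baseline = earnings_premium_dollars / earnings_premium_share

# Hypothetical career length, used only for illustration (not from the source).
working_years = 40

# Simple undiscounted sum of the annual premium over the assumed career.
undiscounted_career_premium = earnings_premium_dollars * working_years

print(f"Implied baseline annual earnings: ${implied_baseline:,.0f}")
print(f"Undiscounted premium over {working_years} years: "
      f"${undiscounted_career_premium:,}")
```

A fuller calculation would discount future earnings and allow the premium to vary over a career, so the total here is only a rough order of magnitude.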

The idea that students who learn more in school—and hence perform better on tests—have better later outcomes because of that learning has a good deal of face validity. It is no great leap to imagine that students who score in the ninetieth percentile on a science test are more likely to be successful scientists than those who score in the tenth percentile, and that this score reflects a better understanding of science. But we might be less sure that smaller differences in student test achievement are meaningful; and as Hanushek notes, these observed correlations do not necessarily reflect causal effects of schools on later life outcomes.

Maybe students who do well on tests are the same students who wake up in the morning, go to work on time, and work hard. Test achievement is also likely to largely reflect learning opportunities outside of school, such as the supportiveness of families or the communities in which students live. This is why scholars doubt that static measures of test performance alone reflect the contributions that schools or teachers make toward student learning, a popular critique among those who question the use of tests for school accountability purposes.

Both of the above arguments highlight why it is difficult to attribute the observed associations between test scores and later life outcomes to the causal effect of schools. The key question is whether interventions that boost students’ test scores are also likely to lead to better future outcomes for students. Certainly, a lack of an underlying causal link between test scores and long-term outcomes should lead us to consider downplaying (or perhaps eliminating) tests as a measure of student success. Unfortunately, definitively establishing such a causal link is challenging, given that it would be unethical to design an experiment where we randomly provide better education to some students, measure their test scores, and assess whether improvements in test scores lead to better life outcomes. Therefore, what we know about the causality of this relationship comes from a limited number of studies that examine the causal effects of different educational inputs (e.g., schools, teachers, classroom peers) on both student test scores and later life outcomes. If a study finds test score impacts and adult outcome impacts that are not in the same direction, this might be regarded as evidence that test scores do not affect the later life outcomes we care about.

So, what does the literature say about whether there is a causal link? While there are certainly studies that find test-score and long-term-outcome effects that are not in the same direction (as cited in the HMW report), our reading of the broader literature indicates that they are outnumbered by studies finding evidence of a strong causal link between test scores and later life outcomes. Perhaps the most influential study in this strand was conducted by Chetty, Friedman, and Rockoff. Examining the long-term effects of teacher quality (measured by teachers’ effects on student test scores), the authors find that students who are assigned to highly effective teachers in elementary school are more likely to attend college and earn higher salaries.

Another study by Raj Chetty and co-authors examines the long-term effects of peer quality in kindergarten (once again proxied by test scores) using the Tennessee Student Teacher Achievement Ratio (STAR) experiment, and finds that students who are assigned to classrooms with higher quality peers have higher college attendance rates and adult earnings. Similarly, using the Tennessee STAR experiment, a recent study by Susan Dynarski and colleagues looks at the effects of smaller classes in primary school and finds that the test score effects at the time of the experiment are an excellent predictor of long-term improvements in postsecondary outcomes. Two further studies, one by Lafortune, Rothstein, and Schanzenbach and one by Jackson, Johnson, and Persico, investigate the effects of school finance reform on test scores, educational attainment, and earnings, and find significant benefits of an increase in school spending on both test scores and adult outcomes.

Finally, there are a number of studies in the school choice context (cited in the HMW report) that show certain school choice programs having positive effects on both test scores and later life outcomes. For example, Angrist and colleagues examine the effects of Boston’s charter high schools and conclude that charter effects on college-related outcomes are strongly correlated with gains on earlier tests. Dobbie and Fryer find that attending a high-performing charter school not only increases test scores, but also significantly reduces the likelihood of engaging in risky behavior.

These two studies also highlight an important concern regarding the HMW report, which presents these studies as evidence of misalignment between test scores and long-term effects. While both studies find statistically insignificant, albeit positive, attainment effects (presented as evidence in the HMW report), they both find significant effects on other long-term outcomes that we care about (a shift from two-year to four-year college enrollment in the former study, and a significant effect on risky behavior in the latter).

Taken together, these studies suggest that interventions that move the needle on test scores also improve later life outcomes. Thus, we contend that the weight of empirical evidence lends support to the argument for using test scores as a measure of success in education systems. This does not mean that the test score effects of educational interventions will always align with their effects on adult outcomes. It is easy to make the case that some interventions can and do improve later life outcomes without affecting the cognitive skills of children. In short, test scores will not encompass the full impact of schools and teachers on students, and hence we should not expect them to capture all the contributions that schools and teachers make toward long-term student outcomes.

But we need to think carefully about what that might mean for education policy and practice. From a practical perspective, we can’t wait many years to get long-term measures of what schools are contributing to students. This does not mean that test scores ought to be the exclusive or even primary short-term measures, but if one believes in school accountability and that test scores ought to be down-weighted, it is important to consider what alternative measures of success are out there and how reliable they are. For instance, there are concerns that non-test outcomes, such as attendance, grades, suspensions, and high school graduation rates, are arguably more “gameable” than test scores. And we certainly know less empirically about the causal connections between these types of outcomes and long-term student success.

Where one lands on the use of test scores to measure student or schooling success is clearly a matter of subjective judgement and policy debate. But, importantly, that debate should be framed by the right interpretation of the empirical evidence, most of which does suggest that test scores are a good intermediate measure of student success.

Dan Goldhaber is the director of the National Center for Analysis of Longitudinal Data in Education Research, and the director of the Center for Education Data and Research at the University of Washington. Umut Özek is a senior researcher at the American Institutes for Research and an affiliated researcher with the National Center for Analysis of Longitudinal Data in Education Research.

Editor’s note: This essay was adapted from a CALDER policy brief of the same title.

The views expressed herein represent the opinions of the authors and not necessarily the Thomas B. Fordham Institute.