Test-based accountability deserves to be assessed against a valid hypothesis, not a straw man

5.2.2018

Last month I published a five-part critique of a recent AEI paper by Collin Hitt, Michael McShane, and Patrick Wolf that looked at the connection (or lack thereof) between test scores and long-term outcomes in school choice programs. Not surprisingly, last week Pat responded with a forceful rebuttal. I think many of his points missed the mark, as I noted on Twitter. But this one I liked:

The first rule of science is that you can’t prove a negative. The second rule of science is that the burden of proof is always on the person claiming that a relationship between two factors actually exists. One develops a theoretical hypothesis, such as “The achievement effects from school choice evaluations reliably predict their attainment effects.” One then collects as much good data as possible to test that hypothesis, certainly employing an expansive definition of school choice unless and until you have an overwhelming number of cases. One then conducts appropriate statistical tests on the data. If the results are largely consistent with the hypothesis, then one conditionally accepts the hypothesis: “Hey, it looks like achievement effects might predict attainment effects just as hypothesized.” If the results are largely inconsistent with the hypothesis, as in the case of our study, one retains a healthy amount of doubt regarding the association between achievement and attainment results of school choice evaluations. That’s what scientists do.

All fair, and a useful frame. But also telling, as we shall see.

Pat’s hyper-pithy hypothesis

“The achievement effects from school choice evaluations reliably predict their attainment effects.”

That is certainly parsimonious, but there are two problems with it. First, it’s simplistic. Which achievement effects? For all students, or certain subgroups? Are we talking about elementary, middle, or high schools? Which kinds? What counts as “attainment”? How big do the effects have to be? When would we expect these effects to move in the same direction, and when might we reasonably expect them to diverge? What exactly does “reliably” mean in this context? And what is the justification for that definition or standard?

The second problem with this hypothesis, as stated, is that it’s relevant to only a small subset of the policy debates that we’ve been having and that Pat et al. referenced in their original paper. Yes, if this hypothesis is proven wrong (if it turns out that test scores don’t reliably predict important long-term outcomes), it would indicate that policymakers should be cautious about killing off school choice programs prematurely. Instead they should wait to see what their long-term impacts are, too, because there’s a decent chance that they will be more positive. On this I agree.

But the evidence examined against this narrow hypothesis would not tell us anything about the wisdom of holding individual schools accountable for short-term test-score changes, either within school choice programs or writ large. For that we’d need to craft a hypothesis, or set of hypotheses, directly related to that question.

My hypotheses about test-based accountability

So let me take a crack at identifying a trio of hypotheses that those of us who support test-based accountability would embrace and like to test. The first is about students, the second about elementary and middle schools, and the third about high schools.

1. Students who learn dramatically more at school, as measured by valid and reliable assessments, will go on to graduate from high school, enroll in and complete postsecondary education, and earn more as adults than similar peers who learn less. This is the heart of the matter for test-based accountability: We think student achievement matters for individuals in the long run. Of course, there are a whole bunch of caveats that any reasonable person would apply. Learning just a little bit more probably isn’t enough to affect the longer-term outcomes much; to change a child’s life trajectory, the intervention has to be pretty dramatic. We are more likely to see big impacts for low-income kids, for whom schools matter more, than for affluent children, many of whom are likely to graduate from high school and college regardless of their K-12 experience. And if we had ways to measure other important skills, knowledge, and characteristics that schools work to inculcate in children but that don’t reveal themselves in tests of ELA and math, we might see an even stronger association between school-based learning gains and long-term outcomes. But still, kids who become a lot better at math, reading, and writing than they otherwise would have should go on to have better outcomes than those who don’t. If not, that’s a problem for judging schools based on test score changes.
2. Elementary and middle schools that dramatically boost the achievement of their students should also boost their long-term outcomes, including high school graduation, postsecondary enrollment, performance, and completion, as well as later earnings. All of the caveats from above apply here, too.
3. High schools that dramatically boost the achievement of their students should also boost their long-term outcomes, including postsecondary enrollment, performance, and completion, and earnings. Same caveats apply. But note, too, a critical difference from elementary and middle schools. For those schools, high school graduation is a legitimate “long term” outcome. But for high schools, it’s another short-term indicator, akin to test scores. And we know from prior research that high-expectations high schools may boost achievement while decreasing their graduation rates, as some kids decide they are not up for the challenge. So I would never hypothesize that we’d see high school achievement and graduation rates moving in the same direction. We also would have a different hypothesis for certain types of high schools, as I explained in my original critique. Career and technical education, early college, and selective enrollment high schools, in particular, would be expected to have different outcomes for achievement and attainment, given their idiosyncratic missions and student populations.

Note that all three of my hypotheses call for “dramatic” learning gains, as I believe those are what will lead to changes in students’ life trajectories. Many of us testing hawks are big fans of KIPP and other high-performing charter networks and want to see them replicated because they are real outliers when it comes to student achievement. We believe that they are changing lives because their students are making gains that are much larger than those of similar peers. But we wouldn’t necessarily assume that a school performing at the fifty-fifth percentile would yield better results than a school at the fiftieth percentile when it comes to real-world outcomes.

The same goes for the flip side of accountability: intervening in or closing down chronically low-performing schools. No state accountability system or charter school authorizer goes after institutions performing at, say, the fortieth or forty-fifth percentile in student achievement growth. Rather, they target those at the fifth or tenth percentile: those that are well below the mean, schools where students are making virtually no progress from year to year, or even going backwards. So what we want to know from research is: What are the odds that those chronically low-performing schools are having a positive impact on kids? That’s a very different question from the one Pat asked: whether schools or programs that do marginally better or worse on test scores do marginally better or worse on attainment. And my hypotheses, which should be tested empirically, assume that it’s extremely unlikely that very low-performing schools are somehow helping their students prepare for long-term success.

***

Pat is right, then, that we need to be clear about the hypothesis we’re testing. The review that he completed with Collin and Mike is appropriate for examining the relationship between achievement and attainment effects in school choice evaluations. (Though my serious qualms about which studies they included and how they analyzed the findings still stand.)

But that review is not at all appropriate for examining the assumptions—okay, hypotheses—upon which test-based accountability rests. Pat and his colleagues were stretching far beyond their findings when they wrote, in the original AEI paper, that “insofar as test scores are used to make determinations in ‘portfolio’ governance structures or are used to close (or expand) schools, policymakers might be making errors.”

Policymakers might be making errors—but we can’t know that from the studies that Pat and his colleagues examined. And that’s what’s wrong with their review: They went searching for evidence to disprove an overly simplistic hypothesis that is ultimately irrelevant to much of the debate over test-based accountability.

As Pat sometimes observes, I’m not a “scientist.” But I believe that the hypothesis that he and his colleagues claim to have disproved is what scientists would call a straw man, no?