Pencils down: What I learned from studying the quality of state tests

Editor’s note: This is the last in a series of blog posts that takes a closer look at the findings and implications of Evaluating the Content and Quality of Next Generation Assessments, Fordham’s new first-of-its-kind report. The prior five posts can be read here, here, here, here, and here.

It’s hard to believe that it’s been twenty-two months (!) since I first talked with folks at Fordham about doing a study of several new “Common Core-aligned” assessments. I believed then, and I still believe now, that this is incredibly important work. State policy makers need good evidence about the content and quality of these new tests, and to date, that evidence has been lacking. While our study is not perfect, it provides the most comprehensive look yet available. It is my fervent hope that policy makers will heed these results. My ideal would be for states to simply adopt multi-state tests that save them effort (and probably money) and promote higher-quality standards implementation. The alternative, as many states have done, is to go it alone. Regardless of the approach, states should at least use the results of this study and other recent and forthcoming investigations of test quality to inform their thinking about the next generation of state tests.

With all of that throat clearing out of the way, I thought I’d take advantage of this opportunity to briefly discuss five of the biggest lessons I’ve learned along the way in implementing this new methodology.

First, all of these tests (PARCC, Smarter Balanced, ACT Aspire, MCAS) are of high quality and deserve our praise. These are not your grandmother's multiple-choice tests. They require complex skills and ensure that students actually demonstrate understanding in order to succeed, rather than just selecting the right answer. Certainly there were flaws with each assessment—these are laid out in the report, and I hope the programs consider our reviewers’ suggestions in fixing them—but they nonetheless constitute a genuine step forward in terms of content. What is the benefit of having better assessments that match the expectations of the standards? Teachers receive more consistent messages about standards implementation—and hopefully, that implementation will improve over time.

Second, as Amber and Mike mentioned in the foreword to the report, these tests illustrate what is becoming my mantra about state tests (or really any educational policy): It's all about the tradeoffs. We want tests to include high-quality tasks that reinforce the major goals of the standards; however, measuring complex skills requires time. We want students to construct their own responses rather than simply selecting from a set of possible answers, but scoring those questions costs money and may be less than perfectly reliable. Simply put, all of our goals for state assessments cannot be satisfied with one test. The PARCC and Smarter Balanced tests include many of these desirable content emphases, but they take longer than conventional tests. ACT Aspire focused on measuring student progress across grades (and was shorter), but it didn’t include some of the key CCSS content. Is that a tradeoff we want to make? In short, states need to decide what they want out of state testing (ESSA offers them an opportunity to do this) and choose appropriate tests accordingly. 

Third, computer-adaptive tests (CATs) such as Smarter Balanced are just a different beast than traditional fixed-form tests. Thus, measuring the quality of CATs and fixed-form tests using the same methodology is fraught with difficulty. While you can shoehorn any test into this or that methodology, we need to develop better approaches to studying these assessments if they're going to be widely used. There is some work happening in this area already, but it’s not proceeding quickly enough. That said, there's a tendency for computer-adaptive advocates to excuse all manner of sins—about item quality, student exposure to aligned content, etc.—simply because the tests are adaptive, and that's not appropriate, either. Computer-adaptive tests clearly offer some advantages, but they offer challenges too. And we can't be naive about either their strengths or limitations.

Fourth, there is no perfect way to examine the quality of a state test. Any method is going to have flaws and challenges, and they all leave something important out. The methods that we used focus on the major shifts and expectations of new college- and career-ready standards. That's an admirable goal, and it’s something that's not so well covered by prior alignment methods. But it's also useful to simply say how well-aligned each test item is with a set of standards, and this methodology doesn’t do that. It evaluates instead a broader match to the concepts that are prioritized in the Common Core. In doing so, it provides much richer data about each item than prior alignment methods (such as the extent to which the reading items truly require close reading and analysis), but it's not obvious that all of these dimensions matter equally in the scheme of things. Certainly I think the methodology is a useful contribution, and I’m glad that I was a part of its first implementation. But as we and our reviewers point out in the report, this approach needs further refinement if it's going to be used again (to that end, we’re meeting with the relevant parties soon to help make that happen).

Last, but certainly not least, the shroud of secrecy around state tests has to be lifted. On many occasions during this study, we ran into security restrictions that impeded our work. Simply put, it's incredibly hard for experts conducting reviews to get access to state tests. Given this, I can’t help but wonder what it must feel like for a teacher or parent or student who wants to know more about the test. Certainly, there are practice tests available online, but we need to find a way to make the testing process much more open and less secretive. Doing this may also have the added bonus of helping to dispel some of the myths about state testing (e.g., the periodic sensationalist “leaked test item” news stories). Clearly, it will cost more to have a more open process, but I think that's a price worth paying.

One final observation bears mentioning: this study, while not published in a peer-reviewed journal and not “counting” toward traditional faculty metrics, was as sophisticated, complex, and important as anything I’ve written. I truly hope that the results are useful to policy makers and educators nationwide; that’s the reason I said “yes” when Amber Northern asked me to co-lead the work. I think that universities would be well advised to more seriously consider diverse measures of impact (while not diminishing the value of traditional forms of research and publication), because this kind of work can be just as rewarding and valuable as anything that goes into a peer-reviewed journal.

Finally, I’d like to thank Fordham for the opportunity to co-lead this work. In particular:

  • Nancy Doorey for being my copilot and really paying attention to the details that were needed to complete the work;
  • Amber Northern for inviting me to this project and helping Nancy and me keep our eyes on the big picture; and
  • Victoria Sears for her phenomenal job of managing the ridiculously complicated logistics of this project and only involving me when I needed to be involved.

It’s been fun. I’ve learned a lot. And I’m so glad it’s over.