How well do next-generation tests measure higher-order thinking skills?

Editor’s note: This is the fifth in a series of blog posts that takes a closer look at the findings and implications of Evaluating the Content and Quality of Next Generation Assessments, Fordham’s new first-of-its-kind report. The prior four posts can be read here, here, here, and here.

When one of us was enrolled in a teacher education program umpteen years ago, one of the first things we were taught was how to use Bloom’s taxonomy. Originally developed in 1956, it is a well-known framework that delineates six increasingly complex levels of understanding: knowledge, comprehension, application, analysis, synthesis, and evaluation. More recently—and to the consternation of some—Bloom’s taxonomy has been updated. But the idea that different questions and tasks demand different degrees of brainpower remains an enduring truth (well, sort of).

So it is no surprise that educators care about the “depth of knowledge” (DOK)—also called “cognitive demand”—required of students. Commonly defined as the “type of thinking required by students to solve a task,” DOK has become a proxy for rigor even though it concerns content complexity rather than difficulty. A clarifying example: A student may not have seen a particular word or piece of content before, so an item might be difficult for her; but that does not make it cognitively “complex.”

In our recent study, we used Webb’s depth of knowledge classifications to assess cognitive demand because it is by far the most widely used approach to categorizing DOK. It comprises four levels: Level 1 is the lowest (recall), Level 2 requires the application of a skill or concept, and Levels 3 and 4 require higher-order thinking skills (strategic and extended thinking, respectively).

We captured the degree to which tests’ cognitive demand matches that called for by the Common Core State Standards (CCSS) (an expectation gleaned from the Council of Chief State School Officers’ Criteria for Procuring and Evaluating High-Quality Assessments). Specifically, we compared the depth of knowledge of ACT Aspire, MCAS, PARCC, and Smarter Balanced to those of the CCSS, as independently coded by nearly forty English and math content experts. (For ease, we’re reporting here on the level of DOK and the match—but not the match rating.)

Overall, we found that PARCC tests generally had the highest DOK in ELA/literacy, while ACT Aspire had the highest in mathematics (see dark blue bands in the figures below). However, our expert review panels found significant variability in the degree to which the four assessments match the distribution of DOK in the Common Core standards, especially between the grade five and grade eight assessments for a given program.

To help contextualize these findings, we also compared (see below) the tests’ DOK distributions to those of fourteen highly regarded state assessments, as well as the distributions reflected in several national and international assessments—including Advanced Placement (AP), the National Assessment of Educational Progress (NAEP), and the Program for International Student Assessment (PISA). (These results were drawn from other studies, not ours.)

We found that the Common Core standards call for a greater emphasis on higher-order skills than did the fourteen highly regarded prior state assessments in ELA/literacy at both grades five and eight—as well as in grade eight mathematics (they are similar at grade five). In addition, the eighth-grade CCSS in both ELA/literacy and math call for greater emphasis on higher-order thinking skills than does NAEP, which is considered to be a high-quality, challenging assessment.

Overwhelmingly, the assessments included in our study were found to be more cognitively challenging than prior state assessments, especially in mathematics (where prior assessments rarely included items at DOK 3 or 4 at all). For instance, the percentage of score points at DOK 3 or 4 on PARCC’s eighth-grade tests (69 percent in ELA and 25 percent in math) exceeds that of Advanced Placement (AP) and NAEP in both subjects at the same grade level.[1] (See Appendix A of our report for more.)

In the end, a mix of items that together span various cognitive levels is an important part of a high-quality test. The next-generation assessments we studied have far higher proportions of DOK 3 and 4 items than did prior NCLB-era tests. Many proclaimed, hoped, or suspected that this was true before our study was conducted. Now we know that it is.

[1] The DOK results for the assessments reviewed in our study are based on percentages of score points, whereas the other assessment results are based on percentages of items.


Amber M. Northern, Ph.D.
Amber M. Northern, Ph.D. is the Senior Vice President for Research at the Thomas B. Fordham Institute.