Measures of Effective Teaching: Final Reports

Amber M. Northern, Ph.D., and Daniela Fairchild

1.9.2013

After three years, $45 million, and a staggering amount of video content, the Gates Foundation has released the third and final set of reports on its ambitious Measures of Effective Teaching (MET) project (the first two iterations are reviewed here and here). The project attempted to ascertain whether it’s possible to measure educator effectiveness reliably and, if so, how to do it. According to the project’s top-notch army of researchers (led by Tom Kane), it ain’t easy, but it can be done and done well.

First, the research team used predictive data from the 2009–10 school year to randomly assign about 800 teachers in grades four through eight to classrooms (within their original schools) for 2010–11. The data showed a strong correlation between the predicted achievement of teachers’ students and those students’ actual scores, including the magnitude of their success. That the study randomly assigned teachers lends credence to the researchers’ contention that teachers’ success can be determined (and isn’t merely a byproduct of the quality of students who enter their classrooms in September). Second, they tested a series of weightings to determine the ideal mix of past student-achievement data (value-added metrics, or VAM), classroom observations, and student surveys for identifying the most effective teachers. Ultimately, the authors determined that a model that relies on VAM for between 33 and 50 percent of total teacher evaluation is best, with student surveys making up 25 percent and classroom observations the rest.
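To make the arithmetic of that weighting concrete, here is a minimal sketch of a weighted composite score. It is an illustration only, not the MET researchers’ actual model: the function names are hypothetical, and it assumes the three measures have already been put on a common scale (which the review does not describe).

```python
# Illustrative sketch only; names and the common-scale assumption are ours,
# not taken from the MET reports.

def observation_weight(vam_weight, survey_weight=0.25):
    """Return the share left for classroom observations ("the rest")."""
    obs = 1.0 - vam_weight - survey_weight
    if vam_weight < 0 or survey_weight < 0 or obs < 0:
        raise ValueError("weights must be non-negative and sum to 1")
    return obs


def composite_score(vam, survey, observation,
                    vam_weight=0.50, survey_weight=0.25):
    """Weighted average of three measures assumed to share one scale."""
    obs_weight = observation_weight(vam_weight, survey_weight)
    return (vam_weight * vam
            + survey_weight * survey
            + obs_weight * observation)
```

Under these assumptions, a VAM weight of 0.33 leaves 0.42 for classroom observations, and a VAM weight of 0.50 leaves 0.25.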

Though the MET analysts concede that the model that best predicts student-achievement gains on state tests heavily weights (as in 65 percent or more) teachers’ prior student-achievement gains on those same tests, they assert that the combo approach, which attaches lower percentages to test gains and higher to observations and student surveys, “demonstrated the best mix of low volatility year to year and ability to predict student gains on multiple assessments” (the latter referring to supplemental assessments described as “cognitively challenging”). This is where disagreement about the study commences: Jay Greene (no fan of the MET study) argues that because metrics like classroom observations carry hefty price tags but don’t make the measure significantly more predictive (simply more reliable), we should be very wary of their inclusion. (We’ve heard this from him before.) Value-added experts agree that multiple years of test data are our best bet for reducing volatility in VAM, but caution against relying solely on such measures for high-stakes personnel decisions, given their imperfections. Though Kane and colleagues are clear in describing VAM alone as a superior predictive measure, they could have done a much better job describing the tradeoffs involved in relying on VAM, student surveys, observations, or any other measure, for that matter, and what other purposes such measures might serve (such as guarding against gains due to simple test prep).

In any case, those looking for the policy takeaways: read the short summary report on the findings. Those looking to expand their statistical minds: read the three companion research papers. And those looking to seriously nerd out: watch for the full data sets, which Gates will be making available to other researchers in coming months.

SOURCE: Thomas Kane, et al., Measures of Effective Teaching: Final Reports (Seattle, WA: Bill and Melinda Gates Foundation, January 2013).