03 February 2011

Evidence That Value Added Evaluation May Be Unfair

Colorado's legislature has recognized that evaluating teachers and principals based on absolute levels of CSAP performance is unfair, because at the classroom and school level, CSAP performance is largely a function of student affluence and of the schooling students received over their entire educational careers, rather than of current teacher and current school performance.

As a result, the state has moved to a system where "value added" is measured for each teacher. This approach looks at how much the children in a class improve on the CSAP over the course of that class. Are students improving more than average, less than average, or about as much as is typical?
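To make the idea concrete, here is a minimal sketch of a gain-based value added score. It is an illustration only, not Colorado's actual model: the function name, the record layout, and the scores are all made up, and it assumes each student has one prior score and one current score. Each teacher's score is the average gain of that teacher's students minus the average gain of all students.

```python
from collections import defaultdict
from statistics import mean


def value_added_by_teacher(records):
    """Return each teacher's mean student gain minus the overall mean gain.

    records: iterable of (teacher, prior_score, current_score) tuples.
    This is a hypothetical simplification, not the state's real formula.
    """
    # Gain for each student over the course of the class.
    gains = [(teacher, current - prior) for teacher, prior, current in records]
    overall_gain = mean(gain for _, gain in gains)

    # Group the gains by teacher.
    by_teacher = defaultdict(list)
    for teacher, gain in gains:
        by_teacher[teacher].append(gain)

    # Positive means "improving more than average", negative means less.
    return {t: mean(gs) - overall_gain for t, gs in by_teacher.items()}


# Made-up scores for two teachers with two students each.
records = [
    ("A", 500, 530), ("A", 450, 470),
    ("B", 600, 610), ("B", 520, 540),
]
result = value_added_by_teacher(records)
print(result)
```

In this toy data, teacher B's students start with higher scores but gain less, so B lands below average on value added even though B's class would look better on raw achievement.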

Clearly, this seems more fair. At the very least, it disentangles a lot of what happened before a child arrived in a teacher's class. But, does it really measure teaching quality?

A new study of U.K. twins finds that academic improvement measures like the one used in Colorado also have significant genetic components.

"These findings do not mean that educational quality is unimportant, in fact environmental factors were just as important as genetic factors. However, these results do suggest that children bring characteristics to the classroom that influence how well they will take advantage of the quality."

Surprisingly, according to the open access study, value added may actually be worse at assessing teacher performance than raw achievement scores:

Raw achievement shows moderate heritability (about 50%) and modest shared environmental influences (25%). Unexpectedly, we show that for indices of the added value of school, genetic influences remain moderate (around 50%), and the shared (school) environment is less important (about 12%). . . . At first glance, this high degree of genetic overlap between different cognitive and academic measures suggests that correcting achievement measures for general cognitive ability would remove the genetic influence on achievement. However, this genetic overlap is not 100%, so there could be residual genetic influences on achievement that are independent of those on general cognitive ability, or even previous measures of achievement. . . . The results were striking, indicating that even when previous achievement and a child's general cognitive ability are both removed, the residual achievement measure is still significantly influenced by genetic factors (heritabilities of 48% and 37% respectively for teacher-ratings and test data). The main point, to which we shall return, is that corrected-achievement scores are influenced by genetic factors that are independent of those influencing g or previous achievement.

Of course, large individual differences in the ability to improve don't necessarily undermine a value added measure of teacher performance based on test scores, if the mix of students in any given classroom of thirty kids, or school of hundreds of kids, tends to average out.

We know that raw achievement scores have a strong socio-economic component that varies greatly from school to school. We don't know, and the study doesn't tell us, if this achievement-controlled improvement factor varies in a systematic way that is unlikely to average out from school to school. For example, the study doesn't tell us whether there are strong ethnic components to that variation, and U.K. data would probably not be very helpful in measuring that in any case.

If value added components are strongly heritable, but the genetic component of a value added effect controlled for raw achievement is randomly distributed from one classroom to the next, a value added measure is still valid.

But, if there is a clear pattern in which some category of students routinely improves a lot even after controlling for past achievement, while another category of students routinely improves little, then the new measure may simply create a more subtle version of the problem that caused Colorado to move from absolute performance to value added based evaluations of teachers and schools in the first place.
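The difference between the random case and the patterned case can be put in rough numbers. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not anything taken from the study: improvement scores are assumed to be standardized to a variance of 1, the student-level component is assumed to be independent from student to student when assignment is random, and the group gap and group shares are purely hypothetical. Both function names are invented for this example.

```python
import math


def class_mean_sd(component_share, class_size):
    """SD of a class's average student-level component under random assignment.

    component_share: fraction of improvement variance due to the component
    (for example 0.5 for the roughly 50% heritability quoted above).
    Assumes standardized scores and independent students.
    """
    return math.sqrt(component_share / class_size)


def expected_school_gap(group_gap, share_in_school_x, share_in_school_y):
    """Expected gap in average improvement between two schools whose mix of
    two student categories differs. group_gap is a hypothetical difference
    in average improvement between the categories.
    """
    return group_gap * (share_in_school_x - share_in_school_y)


# Random assignment: the noise shrinks as the group gets larger.
sd_classroom = class_mean_sd(0.5, 30)    # a classroom of thirty
sd_school = class_mean_sd(0.5, 300)      # a school of three hundred

# Patterned assignment: the gap does not shrink with size at all.
gap = expected_school_gap(0.5, 0.8, 0.2)

print(round(sd_classroom, 3), round(sd_school, 3), round(gap, 3))
```

Under these assumptions, random variation in a classroom of thirty is about 0.13 standard deviations and falls to about 0.04 for a school of three hundred, while a systematic difference in student mix produces a gap of 0.3 standard deviations that no amount of averaging removes. Note that heritability within a population says nothing by itself about whether such differences between categories of students exist; that is exactly the open question.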

Eyeballing the latest round of Colorado value added evaluation data, the only distinct trend that I noted was that schools with a lot of English language learners seemed to have an edge in value added measures, presumably because the poor English skills of smart students suppressed their prior year achievement, an effect that eased rapidly as the students gained mastery of English.

Colorado's system also addresses concerns that learning disabilities might have an effect independent of raw achievement levels, by limiting the way that the system can be used in classes with many special education students.


Michael Malak said...

Measuring student improvement still doesn't control for the effect of affluence during the period under evaluation. In that sense, tying an individual teacher's salary to test score improvement is unfair.

But more importantly, it's counterproductive since, of course, it leads to "teaching to the test".

At most, standardized test results should be utilized at the most macro level -- such as in multiple schools. Practically speaking, having standardized testing presents too much of a temptation to school leaders to apply it to finer and finer levels, thus leading to the counterproductivity. So it's best to just not have standardized testing.

Andrew Oh-Willeke said...

Teaching to the test is fine, so long as the test accurately measures what you are seeking to accomplish. And, if it isn't testable at all, should we really be teaching it?

Affluence during the period of evaluation appears to be a relatively minor factor after controlling for prior performance.

Michael Malak said...

There are different kinds of test. The most limited kind of test is the pure standardized test with standard questions and "standard" (i.e. multiple-choice) answers. Even the SAT people have recognized this limitation and expanded into essays -- standardized questions but not standardized answers.

The broadest type of test but the least amenable to aggregation is the personal one-on-one or many-on-one assessment. Oral defense of a thesis is one example. Neither the questions nor the answers are standardized.

Finally, there are just some things that are difficult or impossible to test that are or should be taught in school. Part of the Montessori Elementary curriculum is the "Going Out Program", where students, as part of a research project or just a quick research question, identify a location to visit that would answer the question, and then use the telephone (along with, ideally, a printed telephone book, though this is getting more difficult with each passing year -- I almost bought a copy of 5280 for the elementary classroom for them to organize a school trip to the Western Stock Show, until I realized that its display ad contained no phone number or URL -- I guess they expected you to type in "Western Stock Show" into Google) to arrange the trip. This is part of the independence that is the greatest benefit that Montessori imparts. Independence is something that is much more amenable to evaluation as it happens rather than tested at a designated time and in a designated place, let alone a standardized test.