Popping Balloons: Data, Evaluation and Accountability

At the Monday December 9th Madison Board of Education meeting there was much discussion of data, evaluation, and by implication, accountability. Most of this was in the context of the Reading Recovery program and the broader discussion of teaching reading in MMSD that ensued, but some also occurred while the Strategic Plan items were on the table. It was very clear that Board members are looking to data and evaluations for the kind of clear guidance that, in the vast majority of the cases, data and evaluation cannot provide. This is why “data driven” policy making (as the Strategic Plan promises) is a mistake and a bit of a sham. Time to pop some of the data driven balloons.

I’m going to put forth some of the factors that limit the utility of the data schools have to work with and to explore these in the contexts of the issues discussed Monday. Note too that my list of factors is far from exhaustive, and that I am not saying “ignore data,” just that there needs to be a larger awareness of the limitations of such an approach.

Schools are not laboratories; Children are not lab animals.

This is another way of saying that there are too many variables that cannot be controlled for. With Reading Recovery, the implementation and effects will vary greatly by school and even year. There was also much talk of the changing demographics of the students and some concern about inconsistent administration of assessments. You can address some of this with statistical manipulation but public and policy-maker comprehension suffer, along with transparency (think Value Added).

Some factors are also missed and therefore not part of the statistical adjustments. I’m going to offer an anecdote about my older son to illustrate this (anecdotes have limits too, but I think this one is very telling).

At the end of second grade my son was identified for Summer remediation in reading. He was well behind where the charts said he should be. We already had him signed up for activities and did not take the offer.

We had read to and with him almost nightly since birth, but did nothing more before or after the remediation recommendation.

By the middle of July that Summer, he was reading. Throughout August (and ever since), we had to remind him to turn off the light, put the book away and go to sleep almost every night. When tested in the Fall he was reading over three grades above grade level.

Had he been part of the Summer remediation, he would have appeared in the data as a resounding success, yet his gains would have had nothing to do with the remediation. Had we done the remediation, we would have been impressed with the results. We wouldn’t have known better and neither would have anyone reading an evaluation of the program.
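The trap in that anecdote can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the children, the gains and the program are all simulated, the "natural" summer growth of about one grade level is a made-up number, and the program's true effect is set to zero on purpose. None of it comes from MMSD data.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumption for illustration: the program itself adds nothing.
TRUE_PROGRAM_EFFECT = 0.0


def summer_gain(enrolled: bool) -> float:
    """Simulated reading gain in grade levels over one summer.

    Every child gets hypothetical natural growth (mean 1.0, sd 0.5);
    enrolled children also get the program effect, which is zero here.
    """
    natural_growth = rng.gauss(1.0, 0.5)
    return natural_growth + (TRUE_PROGRAM_EFFECT if enrolled else 0.0)


enrolled_gains = [summer_gain(True) for _ in range(200)]
not_enrolled_gains = [summer_gain(False) for _ in range(200)]

mean_enrolled = sum(enrolled_gains) / len(enrolled_gains)
mean_not_enrolled = sum(not_enrolled_gains) / len(not_enrolled_gains)

# An evaluation that looks only at enrolled children sees a large gain.
print(f"Average gain, enrolled:     {mean_enrolled:.2f} grade levels")
# Only the comparison shows the gain would have happened anyway.
print(f"Average gain, not enrolled: {mean_not_enrolled:.2f} grade levels")
```

Looking only at the enrolled group, the program appears to deliver about a grade level of growth. The comparison group is what shows that the growth was coming regardless, and in real schools a comparison group that clean is rarely available.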

This is a success story; much of the concern is with failures. In the Reading Recovery report MMSD used things like poverty, a single parent home, mobility, and parental education to statistically control for factors that can make achievement harder. Yet these do not differentiate within the range of experiences in each category, nor the varying intensity of crises and struggles among those experiencing economic hardship. Things like lead exposure, fetal alcohol exposure and, according to new studies, early life stress and maternal early life stress can affect cognitive development. Every poor home, every single parent home, every home with low parental educational attainment is not the same; and each and every home changes constantly. Homes are not controlled laboratories either.

One thing that was not considered in the report, or at the meeting, is that when a comparison sample was constructed for Reading Recovery students, it is possible that there were unrecognized but significant differences between the two groups. The narrative tells us that at some point district personnel — within the limits of allocated availability — said these kids need more help via Reading Recovery and some others didn’t. To me, this points to the likelihood of unquantified differences.

Alluded to in the Reading Recovery report — but not part of the analysis — and discussed at the meeting, were a variety of school-based factors that may or may not have contributed to success and failure. These include other support services, fidelity of implementation, follow up services, teacher training, integration with the classroom reading program and more. I’m sure they missed some; school poverty level and class size are two possibilities. Whatever the list looks like, I don’t think that anyone would disagree that there are school based factors that are difficult or impossible to include and that some of these may be important.

I want to pause here to give some praise to the team that prepared the Reading Recovery report. They do point to many limitations of their data, mentioning factors that were not analyzed while presenting multiple statistical manipulations, all in an effort to present alternative adjustments for different factors. This is what honest researchers do. Still, the very existence of the report, the fact that time and effort were spent, and that relatively sophisticated techniques were employed, undercuts the messages about limitations, especially to those who are not intimate with education research and quantitative analysis.

Sample Size Matters

With the Reading Recovery report we are talking about fewer than 300 students a year. Once you drill down to the school level the numbers are often in single digits. When numbers get that small, confidence in the results diminishes (in statistical terms, the “confidence intervals” around any estimate get wider).
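To put rough numbers on that, here is a minimal sketch of how the margin of error on a success rate grows as the group shrinks. The 50% rate is an assumption chosen for illustration, not a figure from the MMSD report, and the function name is hypothetical. The formula is the textbook normal approximation, which is itself shaky at very small sample sizes, so treat the small-group figures as ballpark at best.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion.

    Uses the normal (Wald) approximation: z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(p * (1 - p) / n)


ASSUMED_RATE = 0.5  # assumption for illustration, not an MMSD number

# Roughly: a district-wide year, one classroom, one school's handful of students.
for n in (300, 15, 8):
    moe = margin_of_error(ASSUMED_RATE, n)
    print(f"n={n:>3}: {ASSUMED_RATE:.0%} plus or minus {moe:.0%}")
```

With 300 students the uncertainty is about 6 percentage points either way. With 15 it is about 25, and with 8 it is about 35, which means an observed 50% success rate at a single school is compatible with almost anything.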

This is one of the stronger reasons why so many are wary of using achievement test results to evaluate and compensate teachers. Jim Doyle, Arne Duncan and Barack Obama are just wrong on this. They may have good sound bites, but they don’t have a clue.

This is a good enough time to venture off on a little tangent on this issue of teacher compensation, since it offers some useful guidance from a similar area on how data use can intersect with sample size issues and other problems. Many of the better teacher compensation reforms fall under the heading of “Knowledge and Skills Based.” Part of what these seek to do is take large data pools (either from large studies or via meta-analysis) and use them to identify teacher knowledge and skills that are significantly correlated with student achievement. Once these have been identified, compensation systems can be (and have been) constructed to reward teachers who have the desired knowledge and skills. In this way performance and pay are linked, but indirectly, through a mechanism that minimizes the problems of sample size. What Doyle and Duncan and Obama want would be pay based on sample sizes as low as 15. I don’t understand how anyone familiar with research and data can think that is a good idea.

This leads back to local issues. The truth is that any local program evaluation will almost certainly not have sample sizes, controls, and results that will produce a clear policy choice. This is certainly the case with Reading Recovery (otherwise the discussion on Monday would have taken about five minutes instead of well over an hour). Large studies and meta-analyses don’t always give clear guidance either. The U.S. Department of Education’s Institute of Education Sciences What Works Clearinghouse sets some standards for evaluating research and identifying “what works.” Their report on Reading Recovery is here. At the time it was issued Education Week summarized the findings as follows:

Just one program was found to have positive effects or potentially positive effects across all four of the domains in the review—alphabetics, fluency, comprehension, and general reading achievement. That program, Reading Recovery, an intensive, one-on-one tutoring program…

I have trouble accepting that the small sample size findings of the MMSD report more accurately gauge the value of the program than the What Works process.

Appropriateness of Measures

Board Member Lucy Mathiak made much of the fact that the Primary Language Arts Assessment used to measure early grade reading achievement is based on principles shared by Balanced Literacy and Reading Recovery (by implication as opposed to Direct Instruction principles, which she seems determined to expand in Madison). This is very ironic given her repeated references to the National Institutes of Health Reading Panel work. One of the most cogent criticisms of that work was that the chosen measures privileged Direct Instruction, a bias that was carried over to the Reading First program, where the bias produced corruption and yielded profits for people directly connected to the NIH Panel (more on the NIH Panel here, Panel Minority View here and Reading First here). Pot…kettle.

This is yet another reason why the What Works process is so important. They strive to eliminate biased research.

I also want to note that Mathiak was silent when the Talented and Gifted Plan was enacted with absolutely inappropriate achievement measures being employed to assess ability.

Just because she is inconsistent doesn’t mean she is wrong about this. The problem is huge. We have standardized tests of varying quality designed to sort — not assess knowledge, learning or ability — being used as the measure of education (see the National Center for Fair and Open Testing for much, much more).

This is one effect of NCLB and an effect that Duncan and Obama seem to want to expand (Diane Ravitch’s “Obama and Duncan Launch NCLB 2.0” is very good on some of the other ways the current administration is following the path laid out by Spellings et al. of the last administration).

Everyone knows the WKCE is bad; some even know what it was designed for and what it wasn’t. But these same people (all of the MMSD Board included) continually expand the types of decisions based on WKCE and pat themselves on the back for being “data driven.” There is much in the Strategic Plan and elsewhere about finding, developing and employing better assessments, but meanwhile things go forward with the WKCE as the primary tool.

While reviewing the proposed Performance Measures for the Strategic Plan I was continually struck by the extensive use of the WKCE and appalled that MMSD had also embraced the NCLB fiction of making 100% proficiency the goal. We are planning for failure.

I understand that we have to use what we have, but would be much more comfortable if decision makers did so with more awareness and less confidence. Some time ago I highlighted a post from Sherman Dorn titled “How can we use bad measures in decisionmaking?” Sherman has some excellent thoughts. My short answer to his question is: “Very carefully and with our eyes wide open.”

Final Thoughts

Board President Arlene Silveira asked on Monday if they would be getting the Strategic Plan related evaluations in time for budget consideration. The answer was mostly no. What wasn’t said was that even if they do get them they aren’t going to make budget decisions much easier (if at all), and even if they do make them easier, they won’t necessarily be better.

I’m kind of back where I started, at my central point about the limited utility of data and evaluation. This can be hard to accept.

In the discussion of reading and Reading Recovery it was clear that all involved — with all their hearts — wanted children to learn to read. They saw lots of information saying that children were not learning to read and wanted to do something about it. They desperately want answers, and have been told, and have told themselves, that data and research will give them the answers. They have been lied to. There aren’t clear and simple answers.

There is complex and limited information which can and should inform — not drive — their decisions.

I too wish that some body of research said “There is one way to teach every child to read in a cost effective manner,” (I wish even more that some body of research said “There is one way to teach every child to love reading”). This will never be the case.

It is essential to stop pretending data belongs in the driver’s seat. Then we will be free to make good use of what we have: data and research, but also those parts of our knowledge and humanity that cannot be quantified.

Thomas J. Mertz

One response to “Popping Balloons: Data, Evaluation and Accountability”

Dick Schutz

December 10, 2009 at 11:04 am

The only “data” worthy of consideration is the % of “discontinuations”–the RR success rate. It’s clear that a high and stable % of children are not being taught to read. And the same practices are continued year after year–all in the name of “data driven.”

Some children learn to read with little or no formal instruction. Some learn (cite your son) in spite of shoddy instruction. Schools take credit for these accomplishments and attribute instructional failures to the student, parents, or society. And the beat goes on.
