Sunday, February 5, 2012


Perhaps the only aspect of education on which pretty much everyone agrees is that teacher quality is exceedingly important. Similarly, just about everyone agrees that we need to come up with better ways of evaluating teachers. From there, almost no one agrees on anything.

Yet, evaluating teachers isn't all that hard. Somehow we can all do it. While I've been wrong a few times, most years I can tell by the third minute of my daughter’s school's open house which teachers will give her (and us) fits, which ones are difficult but worth it, which ones sweat all the wrong details, and which ones will probably misplace the final exams and just give them all As. Malcolm Gladwell wrote in Blink of this concept, which is called “thin-slicing.” To over-simplify (hey, even the Wikipedia explanation is nearly a thousand words), our brains can evaluate a huge array of factors in an instant and lead us to judgments that we don’t even know we've made. Thin-slicing is extremely fascinating, and quite useful for deciding pretty important things such as which employee to hire and which candidate to vote for, but a fair and reasonable evaluation method can't be so steeped in the inner mysteries of the human gut (eeeww). We need a way to quantify this.

The easiest method of teacher evaluation is to use standardized test scores--preferably from tests we’re already giving. Coincidentally, this is the method that we hear about most often. If you want to see a thoughtful analysis of what’s wrong with testing in general, you can find one here. Still, we’ll certainly fix that soon--if the amount we complain about it is any indication--so let’s not abandon the idea quite so fast. Unfortunately, testing as a means of teacher evaluation is terrible even if we have better tests, because:
  • Not everything is tested. Until there is a PSSA for musical theater and another for Chinese, some teachers are going to be left out. This means that teachers of non-tested content either sit out the evaluation process, and any incentives therein, or they get lumped in with others. If you’ve ever participated in a group project, you know how well this will work.
  • We can make the scores whatever we want. By diverting resources from actual teaching to test preparation (see also: here—same link as above, if I've already fooled you once) we can improve test scores while making the learning worse.
  • It's perverse. Changing how we teach, and what we teach, to improve our teacher ratings, especially if this is tied to us making more money, goes against the most fundamental ethical standards of our profession. 
The next easiest method is to use administrators’ evaluations. People who come from a business background like this idea, because much of the time in business people can be promoted up through the system just by being good at their jobs. While there are always cases of idiots getting promoted in spite of incompetence (see also: here), many workplaces are at least in theory meritocracies.

It's not quite the same in education, where switching to administration requires specific degrees and certification. Since teaching already requires proficiency in two fields--education and content area (or many content areas for elementary folks)--those who pursue administration have already been separated from people who would rather become more advanced within their subject area. Since administrators sometimes make less than teachers, at least on a pay-per-day basis (which you'd hope they'd be clever enough to calculate, since they're supposed to be smart enough to be running the place), evaluating teachers based on the opinions of administrators starts to seem less obvious. While many administrators are fantastic in spite of all of this, we do see:
  1. Administrators who reduced the level of effort in their own classrooms to free up enough time to complete principal certification.
  2. Administrators who didn't love teaching quite enough to stay with it.
  3. Administrators who are in their current job in the possibly mistaken belief that it is just a way-station up the ladder toward superintendent. Some, of course, are on this path, but simple math tells you that not every assistant principal is going to make it all the way there.
Besides, no administrator is an expert in every subject area. And when they do have some expertise, it's sometimes the proverbial little knowledge that is most dangerous. If an administrator got a C- in your subject in high school, would he/she be more or less qualified to judge you?

After floating the administrator idea, the next idea is to apply easily-measured qualities. A colleague and I were discussing the relative quality of another teacher. She proposed to enter into evidence against this guy the fact that he never wore a tie. I looked down at my own chest, and saw no tie there either. This wasn't a surprise since I haven’t worn one, except for chamber choir gigs (not even the first day of school or meet-the-teacher night, these days), in about five years. She apologized awkwardly. Then I pointed out that, like me, this teacher often arrives at school by 6:45. She entirely discounted this qualification, based on the fact that she (who is, admittedly, awesome) hasn't darkened the door of the school much before 7:30 since, more or less, the days when I wore ties. 

In other words, we tend to be much more apt to apply scoring schemes that we ourselves would do well on.

The Internet tried to get in on this as well. Ratemyteacher uses Easiness, Helpfulness, Clarity, and Popularity. Ratemyprofessor adds Overall Quality and Hotness, because those are the things that differentiate school teachers from professors. 

Even with just a quick glance at these websites, a few problems become obvious. First, ratings almost always come from either the most disgruntled or most gruntled students, so most teachers' ratings show a Jekyll and Hyde dichotomy of saints who are also part-time serial killers. Second, this is apparently too much work for kids, at least in an age when they can bash their teachers on Facebook--lots of teachers' most recent rating is from circa 2006. The third problem is that we can all picture a teacher who is Easy, Helpful, Clear, and Popular (even Hot) who still sucks.

Maybe the problem is not having enough criteria. Perhaps something like:

One thing this well-designed (and copyrighted--before you get any flaky ideas) evaluation map does highlight is the relative complexity of the issues involved. You can see why the Ratemy folks limited themselves to only four qualities. Beyond being a little complex and untidy, though, the real problem is that this flowchart is screening, not diagnosis.

I’ve recently been educated on the difference between diagnostics and screening. Diagnosis looks at something and figures out whether it is okay or not. Screening looks at something, compares you to others who share the same trait, and works out the chances that some other thing is true of you, based on what has happened to those others. At first that sounds like a more complicated version of the first thing, especially if you're inclined to only half pay attention--I guess that's everyone I'm talking to, since the others are still trying to work out the flowchart (or the previous sentence). However, it's an important distinction.

Case in point: most pregnant women have two feet, but lots of people who have two feet aren't pregnant. Therefore, feet are strongly correlated with pregnancy, but not a very good factor for screening. For that, you would need to find something that pregnant women share among themselves, but not so often with the rest of us. On the other hand, looking at the fetus itself, or perhaps at a baby emerging from a person, is diagnostic.

It’s important to know which thing you’re looking at because screening and diagnostics have significantly different aims. With screening, you accept errors within a comparatively large margin, because it’s simply a way to decide if a more extensive look is advisable. Getting it wrong is okay if there’s another check down the road to keep people from getting things cut out, or added in, or replaced.

Screening comes not with an answer, but with a guess whose margin of error has been deemed acceptable based on the worst-case scenario for a false positive or false negative. In the above example, checking that the mother has feet yields far too high a false-positive rate. Waiting until you can see the baby's feet may be a bit too conservative.
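The feet-for-pregnancy example can be made concrete with Bayes' rule. The sketch below is my own, with invented sensitivity, specificity, and prevalence numbers, but it shows why a screen that catches everyone (like checking for feet) still tells you almost nothing when it rules almost no one out:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Chance that a positive screen is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# "Has two feet" as a pregnancy screen: it catches nearly every pregnancy
# (sensitivity ~1.0) but rules out almost no one (specificity ~0).
feet = positive_predictive_value(sensitivity=1.0, specificity=0.005,
                                 prevalence=0.02)

# A plausible real screen: still imperfect, but far more informative.
real = positive_predictive_value(sensitivity=0.95, specificity=0.95,
                                 prevalence=0.02)

print(f"feet test: {feet:.1%} of positives are actually pregnant")
print(f"real test: {real:.1%} of positives are actually pregnant")
```

With these made-up numbers, a positive "feet test" means only about a 2% chance of pregnancy, while the tighter screen gets you closer to 28%--still a guess, not a diagnosis.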

With teacher quality screenings, we need to decide what we are most afraid of:
- Missing good?
- Missing bad?
- Misidentifying good as bad?
- Misidentifying bad as good?

Again, this initially sounds like four things that mean pretty much the same thing, but not so fast. A screening that is perfect at catching every bad teacher will almost certainly snag some good ones in the net. Similarly, if you want to catch everyone who’s good, you will certainly get some who fake it well. Worse than not catching someone is putting them in the wrong category: that's a double failure, both missing what you're looking for and catching something you shouldn't have. The evaluation mechanism will almost certainly have a weakness in one of these categories, so be on the lookout.
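One way to see why the four fears above pull against each other is to put a cut line on a screening score. The scores below are entirely invented, but moving the threshold shows the trade: set it low enough to clear every good teacher and some bad ones slip through; set it high enough to catch every bad teacher and some good ones get snagged:

```python
# Hypothetical screening scores (0-100) for teachers whose true quality
# is known from some later diagnostic. All numbers are invented.
good = [62, 70, 74, 81, 88, 90]   # actually good teachers
bad = [35, 41, 55, 64, 72]        # actually bad teachers

def screen(scores_good, scores_bad, threshold):
    """Flag anyone scoring below `threshold` as 'bad'; count both errors."""
    missed_bad = sum(s >= threshold for s in scores_bad)    # bad rated good
    snagged_good = sum(s < threshold for s in scores_good)  # good rated bad
    return missed_bad, snagged_good

for t in (50, 65, 75):
    missed, snagged = screen(good, bad, t)
    print(f"threshold {t}: {missed} bad teachers missed, "
          f"{snagged} good teachers snagged")
```

With these invented numbers, a threshold of 50 misses three bad teachers but snags no good ones, while a threshold of 75 misses no bad teachers but snags three good ones. There is no threshold that gets both counts to zero, which is the whole point.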

In order to get to diagnosis, we want hard facts. Data. None of this fluffy complexity and sloppy margin of error. This is why the most commonly floated teacher evaluation method tends to be standardized test scores. Which is why we never really get anywhere in this discussion.

The fundamental question of teacher evaluation, particularly when you're talking to business folks, is what we are trying to produce. Most employees can be evaluated by the work they've done--widgets built, pages written, percent earnings on a portfolio compared to an index, quarterly profits, etc. Evaluating on the basis of standardized testing is to declare the tests the product. Evaluating on the basis of administrators' opinions is to declare those opinions the product. The problem is, we are trying to manufacture intelligent, informed, thinking people, and to do so with a very inconsistent raw material supply, manufacturing conditions, and tools for figuring out how we've done. Otherwise, no problem.
