Monday, March 4, 2019

Advice to former self #3: Grading is overrated

I don’t know about you, but grading used to absolutely terrify me. Maybe that’s because of my unproductive perfectionism, literal thinking, and tendency to complicate things, but I just could not get grading right. I mean, one can argue whether “right” is ever achievable, but I like to think that for most topics out there, I know at least the general direction of “better”, and so I can try to set off on a decent trajectory. For point-based grading, it was always different. There seems to be no “ideal grading”, no win-win solution. If you consider all possible ways to grade, you’ll get a weird optimization landscape, where all extreme cases are just plain horrible, and somewhere in the middle sits a mediocre maximum of “least painful grading”. It is a depressing, Leibniz-style philosophy: the best grading is the one that makes everyone about equally unhappy, with reasons for that unhappiness that are as diverse as possible (that is, mean(gradient) == 0, which in practice means that different students should complain about different things, without any single common theme).

Let me elaborate, and let me start with the most basic question: why do we grade? I can probably come up with three main reasons: (1) we want to give students some feedback; (2) we want to loan them a bit of our willpower, to help them do the work, by providing some external motivation; and finally (3) we want to make sure that good students are rewarded with signalling tokens, such as a good GPA (some people may call it “justice”, or “fairness”). Even if you don’t believe in objective justice, or in your ability to discern it (and I certainly have strong doubts here), we still want to trust the young doctors who will operate on us 20-30 years from now, right? Which means that I want to reward good students, rather than bad ones. So it all boils down to feedback, coaching, and fairness.

Now, if we try to translate these “goals” into practical criteria of a “good grading system”, we can probably capture the essence of it with five guiding principles:

  1. Grades should be informative (to serve as productive feedback)
  2. They should be encouraging (to help coaching)
  3. They should assess something that matters; something that is relevant (to be fair)
  4. They should be quantifiable, measurable (again, to be fair)
  5. The grading process should be time-efficient (time is limited, and we want to maximize impact)

The problem with these statements is that they are all, to some degree, contradictory. To be truly informative, grades should be brutally honest, but this would make them extremely disheartening. For example, if you only grade on a pass/fail basis, and each assignment can only be attempted once, then to achieve the highest information transfer, you would need to adjust your grading criteria for every student, to maintain an average failure rate of 50%. Can you imagine what would happen to a human if they kept failing at a 50% rate whatever they do? They would probably quit, wouldn’t they? I am not sure what the most “encouraging” rate of failure is, but judging from computer games, it should be about 10%, just to give it a bit of spice, while still keeping it safe. Which of course would make for a very inefficient training system.
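To put rough numbers on the 50% vs. 10% trade-off, here is a minimal sketch that treats each pass/fail outcome as a coin flip and computes its Shannon entropy in bits. Reading “information transfer” as binary entropy is my assumption here, and the function name `binary_entropy` is made up for illustration.

```python
import math

def binary_entropy(p_fail: float) -> float:
    """Bits of information carried by one pass/fail outcome
    when the probability of failing is p_fail."""
    if p_fail in (0.0, 1.0):
        # A guaranteed outcome tells us nothing new.
        return 0.0
    p_pass = 1.0 - p_fail
    return -(p_fail * math.log2(p_fail) + p_pass * math.log2(p_pass))

# Compare a few failure rates, including the "informative" 50%
# and the "encouraging" 10% mentioned in the text.
for rate in (0.5, 0.25, 0.1):
    print(f"failure rate {rate:.0%}: {binary_entropy(rate):.2f} bits per assignment")
```

Under this reading, a 50% failure rate gives exactly 1 bit per assignment, while a 10% failure rate gives a bit less than half of that, so the “encouraging” setting needs roughly twice as many assignments to say the same thing.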

Or consider another point: WHAT do we grade? Is it the final performance, the growth, or the effort? Final performance is easy to quantify (objective, measurable), but it is fundamentally unfair, as in every class, and especially in skill-based classes like math and CS, some students would start with a baggage of transferable skills, and some will have to catch up a lot. And it does not just “feel” unfair; grading of the final product is fundamentally misplaced, as it does not measure any relevant skills of each student: their ability to learn, to grow, to persevere, or their “true potential” (whatever it means, as “true potential” is of course unmeasurable by definition). So grading by performance is bad.

To battle this issue, you may be tempted to grade growth, but this approach contradicts principle #5 of time efficiency. To reliably measure a slope you need 3-4 times more point estimates than if you only measure the final product. Slope is just a very noisy thing to assess, and so, again, we are caught in a contradiction. Either our estimates of “growth” are so noisy that they become irrelevant, and thus unfair, or we spend all of our time on grading, which warps our curriculum, and hurts our teaching and research.
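As a rough illustration of why slopes are noisy, the sketch below computes the textbook standard error of a least-squares slope fitted to evenly spaced scores. The setup is entirely my assumption: every score carries independent noise with the same standard deviation, time runs from 0 to 1 over the term, and `slope_standard_error` is a hypothetical helper, not a measurement of any real class.

```python
import math

def slope_standard_error(n_assessments: int, noise_sd: float) -> float:
    """Standard error of a least-squares slope fitted to n evenly spaced scores.

    Time runs from 0 (start of term) to 1 (end of term), so the slope is the
    total growth over the term, in the same units as the scores themselves.
    """
    if n_assessments < 2:
        raise ValueError("need at least two assessments to estimate a slope")
    times = [i / (n_assessments - 1) for i in range(n_assessments)]
    mean_time = sum(times) / n_assessments
    # The more spread-out time points we have, the better the slope is pinned down.
    spread = sum((t - mean_time) ** 2 for t in times)
    return noise_sd / math.sqrt(spread)

# How the uncertainty of "growth" shrinks as assessments are added.
for n in (2, 4, 8, 16):
    print(f"{n:2d} assessments: growth known to within +/- {slope_standard_error(n, 1.0):.2f} noise units")
```

With only two assessments, the growth estimate is just the difference of two noisy scores, so it is about 1.41 times noisier than a single score; every further improvement has to be paid for with more grading.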

What options are left? We can try to grade on effort, as a time-efficient proxy for growth. But then again, effort can be gamed more easily than any other measure, and also effort is actually a rather bad proxy for growth, as efforts may be so easily misplaced (through inefficient work, procrastination, etc.).

In practice, most people I know use a system that somehow combines these three aspects into one “index”. Say, 40% of the grade comes from a final exam (product), 30% from participation (essentially, effort), and 30% from labs, with the 2 worst labs dropped (essentially, growth). Each part of this equation on its own is unfair, but the reasons are different, and we hope that the final formula is more-or-less OK “on average”.
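For concreteness, the 40/30/30 split with dropped labs could be sketched like this. The weights come from the example above; the function name `course_grade`, the 0-100 scale, and the sample scores are made up for illustration.

```python
def course_grade(exam: float, participation: float, labs: list[float],
                 drop_lowest: int = 2) -> float:
    """Combine exam, participation, and lab scores (all on a 0-100 scale)
    into one index: 40% exam, 30% participation, 30% labs,
    with the lowest `drop_lowest` lab scores ignored."""
    if len(labs) <= drop_lowest:
        raise ValueError("need more labs than the number being dropped")
    kept = sorted(labs)[drop_lowest:]  # discard the worst labs
    lab_average = sum(kept) / len(kept)
    return 0.40 * exam + 0.30 * participation + 0.30 * lab_average

# A hypothetical student who bombed the first two labs and then caught up.
print(course_grade(exam=80, participation=100, labs=[0, 50, 90, 90, 90]))
```

In this made-up case the two early disasters vanish completely, and the student ends up at about 89, which is the sense in which dropping the worst labs rewards growth rather than a flat average.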

But then we run into yet another issue: students are not good at understanding grading rubrics! Imagine somebody teaching Calculus I, and having a grading rubric that essentially uses formulas like grade = a*x1 + b*max(x2,x3), where each of the x values is also somewhat curved. No wonder students never understand grading rubrics! As a result, half of them give up on easy but critical assignments (which earns them a C), and the other half come to your office hours arguing about 1 point on a 20-point daily quiz, which translates to something like 0.00002 of their final GPA. They just don’t get it! I experienced it first-hand in my second year of teaching: for several months I was working on my rubric, hoping to make it fair, objective, and transparent. By the end of the semester I learned that all of my students were convinced that my grades were completely subjective, and took into account my personal guesses about the intrinsic “worth” of each person. In other words, they assumed the exact opposite of what I was telling them (or at least what I thought I was telling them), and of what was formally written in the syllabus. Because they didn’t understand it.

Ironically, it means that the more balanced your grading system is, the less transparent it seems to students, which makes them anxious, and violates requirement #2, as at this point grades no longer serve as a good encouragement.

And the contradictions don’t even end here! We have yet another one, about curving grades vs. using fixed thresholds. If you have adjustable thresholds for As and Bs (what point score corresponds to each letter grade), you can change assignments from one semester to another (which is good), or even on the fly (even better!), and you can also adjust your criteria if a snow day steals a lecture, or you get sick and fail to explain something well enough. You can correct your grading! But the price for that is that students get really upset, as they are unable to translate their test results into letter grades, which obviously increases anxiety, and decreases their performance. There is a solution for that: curving, where you give a fixed share of As, Bs, etc. But if you officially introduce curving in your syllabus, students start competing with each other, which ruins course dynamics, kills collaborative assignments, and makes everyone unhappy. So no luck here either. As Ken Bain describes in his famous book, good grading is always somewhere in between curving and fixed thresholds.

There’s also the question of how hard you should make your assignments, given that students come to class with different skills, and different abilities. Should you serve the top 10%, and explicitly give up on the bottom 50%? (That’s how the Soviet model of STEM education worked: recruit the best, let the rest die). Or should you aim at about the median student? (The American system). Or would you go for the lower third? (Don’t they do it in Finland?). Whichever option you choose, some students will complain that the course is going too fast, and some - that it is going too slow. And all you can do is make sure that the proportion of complaints is “on target”, whichever your target is (in the US, usually 50/50).

To sum up: grading is horrible, unpleasant, and imperfect by design. That’s probably why everyone hates it. (see also: https://rtalbert.org/traditional-grading-the-great-demotivator/ )

So, how to fix grading? Actually, at this point it seems that I have a very good solution (I wish I had found it earlier!), which was described in 2015 by Linda Nilson, and is in fact nothing short of revolutionary! But that will be the topic of a separate (next) post!!