Over the last few years there has been a fresh kerfuffle over game reviews frequently enough that they have all more or less rolled together into one ongoing undulation of confusion over what any of it means. Many trends are at fault, from a general confusion between product reviews and criticism, to the unhealthy way in which Metacritic scores rule the lives of many studios, but beyond all of that, I think, is something deeper. People simply don’t understand what grades mean.
In the most recent incarnation, sections of the internet are upset over the review scores given to Mad Max. Why this is the game people have fixated on is lost on me. I’m not sure why anyone should be surprised that a licensed game coming on the heels of a hit movie received mediocre reviews, but apparently some people are upset that Polygon’s Phil Kollar gave it a 5/10.
The crux of this Forbes article by Paul Tassi is that while 5/10 might sound like halfway between the worst possible score and the best, the real scale goes from 50% (F) to 100% (A). Basically Tassi is arguing that Polygon is that one annoying teacher who rails about grade inflation, insisting that C is average no matter how many smart students there were in AP Chem this year, and no I don’t care how you’re going to explain that on your college applications.
The problem with this is that it gets grading completely wrong. Grades expressed as a percentage are not directly comparable to letter grades, and you can’t force a direct translation between the two unless a conversion is explicitly given. They mean totally different things. And yet, I remember having this exact same argument about grades when I was in school.
When you get 80% on a test, that means something specific: out of 100 (or however many) questions, the grader scored you as correct on 80 of them and incorrect on 20. That 80% of questions answered correctly often translates to a B grade only because that is a traditional target many teachers have used over the years when designing tests. But not all tests are equal. What if you wrote a 10 question test with two questions so hard that you doubt anyone in the class could answer them? You might rarely get a right answer on those two, but when you did, it would be informative, wouldn’t it? If you don’t expect anyone in a given year to be able to answer them, is it fair to cap the expected grade at a B, with As handed out only on the rare occasion of a standout student? No. In this case you adjust the scale to your expectations, with 80% becoming an A.^{1}
Another test might ask questions where getting any one wrong would say something seriously troubling about your readiness to move on. Think of a driving test or some other certification. Sure you did 8 of the basic maneuvers right, but you blew through that red light and crunched the fender of the car in front when parking. That doesn’t sound like a B.
The trouble is, we used to think we could address game reviews this way. (Well, they got 8/10 possible points on graphics and 9/10 points for “gameplay”, so…) But judging a game is, at best, more like grading a paper. The criteria are so many and varied (and yes, personal) that you can’t spell it all out in terms of questions answered right or wrong out of a total. You just skip right to the letter grade. And that’s what outfits like Polygon do. If you look at how things shake out, it works pretty well. If 0 is an F and 2 is a D, 5 is a C+ or so, which sounds about right for a game where there’s something there, but not enough.
If Metacritic is set up to evaluate game reviews as if they’re ticking off right or wrong answers on an exam and interprets a 5 as getting half wrong, I think I see where the problem is, and it’s not with Polygon’s review scores.
Update: I should note that the perception problem Tassi is pointing out is real, just as grade-inflation countermeasures like my tongue-in-cheek teacher handing out abnormally low grades can be. But the problem here is with Metacritic. If people have accepted a standard under which 50% on Metacritic might as well be 0%, then Metacritic needs a way to apply a curve that translates the scoring systems of different reviewers, so that if Polygon’s 1 = F and Metacritic’s 50% = F, Polygon’s 5 doesn’t get interpreted by Metacritic as a 50%.
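The curve I have in mind is just the kind a teacher applies. Here’s a minimal sketch, assuming a straight linear translation; the anchor points are my own illustration (a reviewer’s 1/10 as an F lining up with Metacritic’s 50% as an F), not anything Metacritic actually does:

```python
# Hypothetical linear curve mapping a reviewer's letter-grade-style 0-10
# scale onto Metacritic's de facto 50-100 band. Anchors are assumptions:
# the reviewer's 1 (an F) lands on 50% (also an F); a 10 lands on 100%.

def curve(score, low=1.0, high=10.0, target_low=50.0, target_high=100.0):
    """Linearly map `score` from [low, high] onto [target_low, target_high]."""
    fraction = (score - low) / (high - low)
    return target_low + fraction * (target_high - target_low)

print(round(curve(1), 1))   # 50.0  -- an F stays an F
print(round(curve(5), 1))   # 72.2  -- a middling game reads as a C, not a failure
print(round(curve(10), 1))  # 100.0 -- a perfect score stays perfect
```

On this curve, Polygon’s 5/10 lands around 72, in the C range the letter-grade reading suggests, rather than being counted as a failing 50.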

Yes, I am aware that some schools give number grades rather than letter grades. This confuses the issue, because you now have two different percent scales: the final grade scale, representing an abstract judgment of how well you did in a class, and the raw score scale, which tells you how many questions you got right out of the total. This is where curves come into play, translating the one number into the other, so that if earning 60% of all possible points in a class is actually doing rather well, the scores get stretched and mapped so that a 60% score becomes an 89% grade, just as it might have become a B+. ↩