An Objective Scoring System - Some Suggestions
by Quentin D. Thompson
----------------------------------------------

All the talk about scoring systems in the immediate pre-Comp phase - and the multiplicity of ratings that have been assigned to different games following the actual event - has got me thinking about how exactly one might best rate a game, taking into account:

a. the fact that the game was submitted by an expectant I-F author, often a newcomer, who'll look forward to encouraging if not overwhelming ratings, and

b. the fact that if I ended up rating games based solely on whether I loved them or not, I'd probably end up ignoring the worth of a lot of good games out of sheer idiosyncrasy - or perverseness.

The question becomes important not just because I might, one day, end up judging a future I-F comp, but also because, come results time, there's often a bewildering multiplicity in the ratings assigned to a game. To take a well-known example, Lucian Paul Smith's excellent "The Edifice" (winner of the 1997 I-F Competition) received every possible rating from 1 (a rating that, to most of us, signifies absolute rock bottom) to 10 (a rating I'd seldom assign to any game: my tables of game rankings have more than a few 9s, but 10s are exceptionally rare). 1998 saw little deviation at the top, thanks to Photopia, though that merely makes me appreciate the few dissenting voices, such as those of Russell Wallace, Brandon Van Every and Duncan Stevens. But enough of Photopia, which has been discussed to death, and more of Competition statistics. In the 1999 IF Competition, a whopping twenty-two games were assigned at least one '10' (these included, besides the top twelve, one game that had been certified unwinnable due to major bugs), and every single game but the eventual second-place winner, "For A Change" by Dan Schmidt, received at least one '1'.
Strangely, this hardly aroused any discussion - except at Trotting Krips, never a page to shy away from facts - but it got me thinking. If people went around assigning 10s and 1s according to their whims and fancies, with no regard for the effort (or otherwise) or intentions that went into each I-F game, the ratings would lose much of their actual value. And it's not hard to see what might happen next.

Therefore, when I tried, as a test exercise, to rate this year's Competition games, I tried not to assign throwaway ratings, as the unknown soldiers above seem to have done ("Hmmm, cool puzzle... all right, it's a 7." "This game's full of sucky, literary stuff! Off with its head! Give it a 3." "Ycch! This sounds postmodern! Bring on the 2s!" "Long games? In a Comp? Sorry, no can do. Four out of ten." "Who cares if it's buggy? I liked it, buster. A perfect 10. You can't change my mind." "Oh, dear, a misogynistic game. It goes against my Ideology, so let's give it a 1." And so on.) Instead, I tried two approaches to scoring, which I shall discuss below along with the trial results.

So why am I doing this? In the vague, probably-bound-to-be-unfulfilled hope that someone - current judge, future judge, or just I-F lurker - might glance at these ratings (I'm not asking him or her to adopt the system, just to have a look) and realise that to relegate a game to the dungheap just because it didn't light their fuse, or to give it a full ten for capricious reasons, is bloody well unsporting and kills the spirit the Competition's generally been noted for. (Or, to paraphrase the troll from my game "Halothane": "NO!! UNFAIR! FOUL! OFFSIDE!")

Approach the First.

My first stab at an objective/subjective rating scheme - that is, one where both the game's own qualities and my response to them were represented - was a modification of the system used by C.E. Forman in his wonderful 1997 Comp reviews.
(I don't agree with all that Forman said in his reviews - I liked Babel, for one, and wouldn't have given it a 4 even on a bad day - but I admire his spirit.) Forman's system was deconstructive: each game started with a rating of 10, and lost one point for each of ten possible offences. There were no decimal points, and as he stated himself, it was too rigid. It was that rigidity I tried to avoid when I designed the following scheme, which I dubbed the Modified C.E. Forman Rating Scheme:

TABLE 1.1. The Modified C.E. Forman Rating Scheme

1 - Programming errors causing crashes or unwinnability.
2 - Gross non-fatal programming errors, numbering more than one. (In other words, first offenders don't get it in the neck.)
3 - Reasonable verbs, commands, objects or situations not implemented. (This is a wide field, but I shall be tolerant.) Incomplete plot development is also included here.
4 - Atrocious writing: typos, bad grammar, splices, non-sentences. [Occasional writing muffs earn -0.5, not -1.]
5 - Content that I personally found annoying. (Equivalent to C.E.F.'s "miscellaneous stuff that pisses me off". :-)
6 - Unfair/absurd puzzle, situation or game, or technique that unduly minimizes player initiative.
7 - Platform-specific annoyances (Javascript or TADS interpreter bugs, bad colour schemes, crashes on my fave interpreter, etc.)
8 - Scope too wide/narrow for competition. (In other words, a one-room game with a silly puzzle, or alternately a Losing Your Grip clone.)
9 - Lack of a single memorable feature. (I can't allow a game to score 10/10 if it was bug-free but didn't affect me at all, can I?)
10 - Lack of originality. Including cliches.
W - Wild card (+0, +0.5 or +1)

There were two ways in which this system was more flexible than Forman's original idea: first, a little latitude was given to bad writing, and second, the wild card allowed me to indulge my own foibles, raising the rating of a game I liked while not allowing me to unduly penalize a good game that I simply found unattractive. I did allow myself one rule: I could only use the wildcard when a game evoked a strong reaction in me. Of course, there are holes in this system: some of the headings arguably overlap, and there's too much deduction allowed for technical points and little for subjective ones. Still, it's an improvement over the sort of system that would, to quote one example, rate "Halothane" and "Punkirita Quest" at the same level. Here's how a not-so-random sample of Comp99 games fared on the Forman test:

a) The competition winner, "Winter Wonderland" (Comp. avg. 7.41)

Lost one point (heading no. 10) for the cliched fairy-tale setting and St. Nicholas ending, and another point (heading no. 5) because I was bugged by the ice-floe puzzle. Considered a wildcard for moral courage (i.e. not using Christmas cliches), but decided against it.

Rating: 8 out of 10.

b) The competition funnyman, "Death To My Enemies" (Comp. avg. 4.25)

Lost one point (heading no. 6) for guess-the-situation puzzles. I considered deducting one more (heading no. 8) because it wasn't even bite-sized, but I didn't.

Rating: 9 out of 10.
c) The competition's "Detective", "Outsided" (Comp. avg. 2.94)

Lost one point (heading no. 1) for a major programming crash, one point (no. 2) for numerous non-fatal errors, one point (no. 3) for poor object and NPC implementation, one point (no. 4) for having some of the worst writing since Rybread's glory days, one point (no. 5) because typing XYZZY made me incontinent, one point (no. 6) for guess-the-situation puzzles, and one point (no. 10) for a pretty obvious rip-off of the Photopia colour scheme. I also added a wildcard (+0.5) because of its sheer MiSTability.

Rating: 3.5 out of 10.

d) One of the competition's ambitious games, "Lomalow" (Comp. avg. 4.80)

Lost one point (no. 2) because the hint system was broken, one point (no. 3) for jerky plot development and inadequate scenery, one point (no. 5) because the writing and plot were a tad overwrought, one point (no. 6) because typing (non-spoiler equivalent) "ASK MARIE ABOUT CAKE" ten times isn't my idea of a game, and one point (no. 9) because, for a gimmick game, it really needed a spark.

Rating: 5 out of 10.

e) One of the competition's jokes, "Life On Beal Street" (Comp. avg. 4.41)

Lost one point (no. 2) because the CYOA interface printed buggy messages at times, one point (no. 3) because there weren't any verbs, one point (no. 5) because I don't like CYOA, one point (no. 6) for unfair minimization of interaction, one point (no. 8) because it was basically six moves long, and one point (no. 9) because, for a gimmick game, it needed more.

Rating: 4 out of 10.

f) One of the competition's graphical games, "Lunatix" (Comp. avg. 5.64)

Lost one point (no. 3) for making me play guess-the-syntax, one point (no. 6) because the puzzles were arbitrary, and one point (no. 7) because the parser was AGT-level.

Rating: 7 out of 10.

Now let's step back and see what we've achieved.
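For concreteness, the bookkeeping behind these examples can be sketched in a few lines of code. This is only an illustration - the function name and structure are mine, not part of the scheme itself - but it captures the rules as stated: start at 10, subtract a point (or half a point for occasional writing muffs) per infraction, optionally add a wildcard of 0, +0.5 or +1.

```python
# Sketch of the Modified C.E. Forman scheme of Table 1.1.
# Illustrative only: names and structure are not from the scheme.

FULL, HALF = 1.0, 0.5  # full-point infraction; occasional writing muffs

def forman_score(deductions, wildcard=0.0):
    """Start at 10, subtract each deduction, add an optional
    wildcard (0, +0.5 or +1), and keep within the 1-10 scale."""
    assert wildcard in (0.0, 0.5, 1.0)
    score = 10.0 - sum(deductions) + wildcard
    return max(1.0, min(10.0, score))

# "Outsided": seven full-point infractions plus a +0.5 wildcard.
print(forman_score([FULL] * 7, wildcard=0.5))  # -> 3.5
```

Capping at 1 reflects the Comp's own scale, where 1 is rock bottom; a game cannot deduct its way below it.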
In four out of our six samples (Winter, Outsided, Lomalow and Beal Street), the rating derived by this method came fairly close to the eventual competition average:

TABLE 1.2. Competition averages vs. C.E. Forman scores

Game title            Comp. avg.  Comp. SD  C.E.F. score  Gross dev.  Adjusted dev.
Winter Wonderland        7.41       1.83        8            +0.59        +0.59
Death To My Enemies      4.25       2.03        9            +4.75        +2.72
Lomalow                  4.80       1.84        5            +0.20        +0.20
Outsided                 2.94       1.48        3.5          +0.56        +0.56
Life On Beal Street      4.41       2.22        4            +0.41        +0.41
Lunatix                  5.64       2.18        7            +1.36        +0.82

Now, four out of six isn't bad; also, most of the ratings were on the higher side. In fact, after correcting for S.D., Lunatix can be added to the list, and five out of six is better, though not perfect. (We'll define a deviation of +1 or -1 as acceptable for this discussion's purposes.) Given a suitable degree of cynicism in a judge, it's possible to deduct more from Death (-1 for the small size; -1 for being full of in-jokes, thus infracting clause no. 3) and bring it down to around 6 or 7, which is close, but not close enough.

Clearly, though, this system has its points. What it doesn't have is representation: as I said above, five of those ten criteria are purely technical, and only one is purely subjective. What was needed, I felt, was a system that allotted equal points to both subjective and objective criteria, and that leads me to the second approach.

Approach the Second.

My second try, and the one I still endorse at present, is a little more complex, and is perhaps too tedious for what is, after all, a diversion and not a Summer Olympics or Booker Prize; but I feel it has its points, and it is flexible enough to allow a wide variety of ratings while not allowing too much crankiness. Without further ado, let me elaborate. The rating scheme I'm outlining below divides points into two broad headings: 5 for objective criteria, and 5 for subjective criteria.
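An aside before the details: the S.D. correction used in Table 1.2 (and again in Table 2.2 below) is never spelled out in the text. The rule sketched here is my own reconstruction, but it reproduces every row of both tables: a gross deviation already inside the acceptable +/-1 band is left alone, while a larger one is shrunk in magnitude by one competition standard deviation.

```python
# Hedged reconstruction of the "Adjusted dev." columns in
# Tables 1.2 and 2.2. The rule is inferred from the numbers,
# not stated in the article.

def adjusted_deviation(my_score, comp_avg, comp_sd):
    """Deviations within +/-1 pass through unchanged; larger
    ones are reduced in magnitude by one standard deviation."""
    gross = my_score - comp_avg
    if abs(gross) <= 1.0:
        return round(gross, 2)
    return round(abs(abs(gross) - comp_sd), 2)

# "Death To My Enemies": gross +4.75, S.D. 2.03 -> adjusted 2.72
print(adjusted_deviation(9.0, 4.25, 2.03))  # -> 2.72
# "Lunatix": gross +1.36, S.D. 2.18 -> adjusted 0.82
print(adjusted_deviation(7.0, 5.64, 2.18))  # -> 0.82
```

With that aside out of the way, back to the two broad headings of the second approach.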
(I facetiously refer to them as "Intelligence" and "Goodwill", but that's just personal terminology, and I don't consider it binding.) Each heading contains five sub-headings, each of which can be scored 0, 0.25, 0.5, 0.75 or 1, depending on the judge's opinion. Therefore, under the heading "Intelligence", there are five sub-headings:

TABLE 2.1. The Quentin D. Thompson O/S Scoring System

a) Story: the basic story of a game, plot development, plot twists, sub-plots and so on. Suggested guidelines:
   0: absolutely pointless
   0.25: bare-bones
   0.5: cliched, or original but inadequate
   0.75: cliched but effective, or original and adequate
   1: excellent

b) Writing: prose, sentence flow, grammar, spelling, etc. Suggested guidelines:
   0: absolutely terrible; early Rybread Celsius level (Outsided)
   0.25: full of typos (Detective); grossly overblown, poorly executed (Stone Cell)
   0.5: mediocre (Thorfinn's Realm); several typos/errors (Human Resources Stories, Yodel); over-ambitious (Lomalow)
   0.75: competent (Tapestry); good but with a few faults (Chix Dig Jerks); good but not as engaging as it should be (Worlds Apart)
   1: excellent (Babel, Firebird, Exhibition, Muse)

c) Puzzles
   0: obscure and irritating (Heist, SNOSAE)
   0.25: detract from gameplay, artificially stuck on (TATCTAE)
   0.5: passable (Lunatix, average puzzle-fests)
   0.75: clever (Enlightenment)
   1: excellent, integrated into the story (Firebird)

d) Coding
   0: absolutely wretched (Guard Duty)
   0.25: playable only with the walkthrough or minor deviations; numerous crashes (Four Seconds)
   0.5: passable, but a few crashes/several minor errors (Chix Dig Jerks)
   0.75: good, with a few fleas (release 1 of Delusions)
   1: good, with no crashes and no serious minor flaws

e) Parser
   0: Commute-level
   0.25: "Yodel"-level
   0.5: AGT-level, ALAN unenhanced
   0.75: AGT Advanced, AGiliTy, TADS, Inform, Hugo, ALAN enhanced
   1: Inform, TADS or Hugo with added goodies
and under "Goodwill", there are another five:

a) Humour
   0: none
   0.25: flashes, or too heavy-handed (Outsided, Yodel)
   0.5: Easter Eggs alone, in-jokes alone (Pass The Banana)
   0.75: Easter Eggs and/or occasional game humour (King Arthur's Night Out)
   1: excellent, integrated into the game or just plain hilarious (Firebird, Death To My Enemies)

b) Participation
   0: none (In The End)
   0.25: occasional, inadequate
   0.5: barely adequate or over-ambitious
   0.75: competent (most games)
   1: outstanding (Exhibition, Photopia)

c) Lack of annoyance
   This criterion is purely subjective, so I leave it to any judge to grade as he/she wants.

d) Philosophy/Game Idea
   0: pointless or loathsome (Stiffy Makane, Emy Discovers Life)
   0.25: inadequate (Thorfinn's), trite (Heist) or repulsive (Chix for some reviewers - self excluded)
   0.5: average (a cave crawl, say), vague (For A Change)
   0.75: original, good, understandable, but not perfect (Tapestry)
   1: outstanding (Exhibition)

e) Wildcard
   Again, call this one as you want. However, it can only be 0, 0.5 or 1, to minimize randomness.

Having outlined this system in excruciating detail, let's see how another sample of Comp99 games fared with it.

a) The second-place winner, "For A Change" (Comp. avg. 7.25)

Intelligence: 1 for coding and parser, 0.5 for puzzles (they're nothing _too_ original), 0.75 for writing (clever, but didn't grab me) and 0.5 for story (basically a quest game, but well implemented). Goodwill: 0 for humour, 0.75 for participation, 0.75 for lack of annoyance (I did find some of it wearing), 0.75 for game idea, and 1 wildcard (for originality and new vocabulary).

Total rating for "For A Change": 7 out of 10.

b) The biggest graphics extravaganza, "Six Stories" (Comp. avg. 7.08)

Intelligence: 1 for coding and parser, 0.75 for the single puzzle (it was original), 1 for writing, and 0.75 for story (vague but sort of satisfying).
Goodwill: 0.5 for humour (Easter Eggs and snarky parser replies), 0.5 for participation, 0.75 for lack of annoyance (it crawled on my machine), 0.5 for game idea, and 1 wildcard (the parser assumptions were outstanding).

Total rating for "Six Stories": 7.75 out of 10.

c) The Comp's Rybread entry, "L.U.D.I.T.E" (Comp. avg. 2.38)

Intelligence: 0.5 for coding, 0 for puzzles, 0.75 for parser, 0.25 for writing (he used to be funnier) and 0 for story (absolutely pointless). Goodwill: 0 for humour, 0.25 for participation, 0.25 for lack of annoyance, 0.25 for game idea, 0 wildcard.

Total rating for "L.U.D.I.T.E": 2.25 out of 10.

d) The Comp's first work of AIF, "Chix Dig Jerks" (Comp. avg. 4.22)

Intelligence: 0.5 for coding, 0.5 for puzzles, 0.75 for parser, 0.75 for writing, 0.5 for story (intriguing, but abrupt, and the cut-and-paste jars). Goodwill: 0.75 for humour, 0.5 for participation, 0.25 for lack of annoyance (I'm a conscientious objector to AIF), 0.5 for game idea (vague but intriguing), 1 wildcard.

Total rating for "Chix Dig Jerks": 6 out of 10.

e) The Comp's first IF-MUD in-joke game, "Pass The Banana" (Comp. avg. 3.35)

Intelligence: 0.75 for coding, 0.25 for puzzles, 0.75 for parser, 0.5 for writing, 0 for story. Goodwill: 0.5 for humour, 0 for participation (unless you're a MUD regular), 0.25 for lack of annoyance (too much wholesale ripping-off of Varicella), 0.25 for game idea (this was more suited to a Mini-Comp) and 0 wildcard.

Total rating for "Pass The Banana": 3.25 out of 10.

f) One of the Comp's short, whimsical pieces, "Calliope" (Comp. avg. 4.67)

Intelligence: 1 for coding, 0.5 for puzzles (few and hard), 0.75 for parser, 0.75 for writing, 0.5 for story (it's not very interesting). Goodwill: 0.75 for humour (the TV scenes were a scream), 0.75 for participation (I was an IF Comp author myself), 0.75 for lack of annoyance (too short and guess-the-puzzle), 0.5 for game idea (cute, but not enough), 1 wildcard.

Total rating for "Calliope": 7.25 out of 10.
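The bookkeeping for this second scheme is simple addition over ten sub-headings, plus a sanity check on the allowed mark values. Here is a sketch - the identifier names are mine, chosen for illustration, and not part of the scheme:

```python
# Sketch of the Q.D.T. O/S system: five "Intelligence" and five
# "Goodwill" sub-headings, each scored 0 to 1 in quarter-point
# steps, except the wildcard (0, 0.5 or 1 only). Names illustrative.

INTELLIGENCE = ("story", "writing", "puzzles", "coding", "parser")
GOODWILL = ("humour", "participation", "lack_of_annoyance",
            "game_idea", "wildcard")

def qdt_score(marks):
    """Validate the ten sub-heading marks and sum them to 0-10."""
    steps = (0.0, 0.25, 0.5, 0.75, 1.0)
    for name in INTELLIGENCE + GOODWILL:
        allowed = (0.0, 0.5, 1.0) if name == "wildcard" else steps
        assert marks[name] in allowed, f"bad mark for {name}"
    return sum(marks[name] for name in INTELLIGENCE + GOODWILL)

# "For A Change", per the worked example above:
for_a_change = dict(story=0.5, writing=0.75, puzzles=0.5,
                    coding=1.0, parser=1.0, humour=0.0,
                    participation=0.75, lack_of_annoyance=0.75,
                    game_idea=0.75, wildcard=1.0)
print(qdt_score(for_a_change))  # -> 7.0
```

Because the two tuples each hold five names, the objective and subjective halves contribute at most 5 points apiece, which is exactly the 5/5 split the scheme promises.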
Now for the lies, damned lies and "statistix":

TABLE 2.2. Competition averages vs. Quentin D. Thompson O/S scores

Game title            Comp. avg.  Comp. SD  Q.D.T. score  Gross dev.  Adjusted dev.
For A Change             7.25       1.91       7.00          -0.25        -0.25
Six Stories              7.08       1.85       7.75          +0.67        +0.67
L.U.D.I.T.E              2.38       1.83       2.25          -0.13        -0.13
Chix Dig Jerks           4.22       1.88       6.00          +1.78        +0.10
Pass The Banana          3.35       2.15       3.25          -0.10        -0.10
Calliope                 4.67       1.85       7.25          +2.58        +0.73

These results, happily, are even more in harmony with what eventually transpired at the Comp: four out of six are dead matches the first time around, and with S.D. correction, our score is six out of six. There are criteria that can be faulted here (for example, the inclusion of "Humour", or the use of "Parser" to boost the score of a bad game on a good system), but what we have here is a good starting point.

Conclusion.

Clearly, then, it's possible - by adopting certain measures, either minimal (the C.E. Forman method) or elaborate (the more accurate Q.D.T. score) - to minimize the amount of quirkiness or crankiness in our own Comp ratings, and to ensure fairer and more uniform scoring. The idea here is to encourage good IF without discouraging a new author; to be (in the words of Oscar Hammerstein II) "firm, but kind"; and not to rate, on the strength of a whim or a bout of migraine or whatever, an average game as outstanding, a good one as Rybread-level, and a long string of good games you simply can't relate to in the 2-4 bracket. While I don't expect anyone to take up these systems word-for-word, I hope they'll at least inspire a little impartial scoring and judging, especially among this year's "1-2-3" brigade.

Acknowledgements.

Thanks to Stephen Granade for collecting the Comp reviews, all the reviewers for their insights (most of them have nothing to do with cranky ranking), Trotting Krips for sparking this debate, Eric Mayer for good sense, and Roody Yogurt for throwing a spanner in the C.E.F. Scale's works.

Copyright 1999, 2000 Quentin D. Thompson.