An Objective Scoring System - Some Suggestions
by Quentin D. Thompson
----------------------------------------------

All the talk about scoring systems in the immediate pre-Comp phase - and the multiplicity of ratings that have been assigned to different games following the actual event - has got me thinking about how exactly one might best rate a game, taking into account:

a. the fact that the game was submitted by an expectant I-F author, often a newcomer, who'll look forward to encouraging if not overwhelming ratings, and

b. the fact that if I ended up rating games based solely on whether I loved them or not, I'd probably end up ignoring the worth of a lot of good games out of sheer idiosyncrasy - or perverseness.

The question becomes important not just because I might, one day, end up judging a future I-F comp, but also because, come results time, there's often a bewildering multiplicity in the ratings assigned to a game. To take a well-known example, Lucian Paul Smith's excellent "The Edifice" (winner of the 1997 I-F Competition) received every possible rating from 1 (a rating that, to most of us, signifies absolute rock bottom) to 10 (a rating I'd seldom assign to any game: my tables of game rankings have more than a few 9s, but 10s are exceptionally rare). 1998 saw little deviation at the top, thanks to Photopia, though that merely makes me appreciate the few dissenting voices, such as those of Russell Wallace, Brandon Van Every and Duncan Stevens. But enough of Photopia, which has been discussed to death, and more of Competition statistics. In the 1999 IF Competition, a whopping twenty-two games were assigned at least one '10' (these included, besides the top twelve, one game that had been certified unwinnable due to major bugs), and every single game but the eventual second-place winner, "For A Change" by Dan Schmidt, received at least one '1'.
Strangely, this hardly aroused any discussion - except at Trotting Krips, never a page to shy away from facts - but it got me thinking. If people went around assigning 10s and 1s according to their whims and fancies, with no regard for the effort (or otherwise) or intentions that went into each I-F game, the ratings would lose much of their actual value. And it's not hard to see what might happen next.

Therefore, when I tried, as a test exercise, to rate this year's Competition games, I tried not to assign throwaway ratings, as the unknown soldiers above seem to have done ("Hmmm, cool puzzle... all right, it's a 7." "This game's full of sucky, literary stuff! Off with its head! Give it a 3." "Ycch! This sounds postmodern! Bring on the 2s!" "Long games? In a Comp? Sorry, no can do. Four out of ten." "Who cares if it's buggy? I liked it, buster. A perfect 10. You can't change my mind." "Oh, dear, a misogynistic game. It goes against my Ideology, so let's give it a 1." And so on.) Instead, I tried two approaches to scoring, which I shall discuss below along with the trial results.

So why am I doing this? In the vague, probably-bound-to-be-unfulfilled hope that someone - current judge, future judge, or just I-F lurker - might glance at these ratings (I'm not asking him or her to adopt the system, just to have a look) and realise that to relegate a game to the dungheap just because it didn't light their fuse, or to give it a full ten for capricious reasons, is bloody well unsporting and kills the spirit the Competition's generally been noted for. (Or, to paraphrase the troll from my game "Halothane": "NO!! UNFAIR! FOUL! OFFSIDE!")

Approach the First.

My first stab at an objective/subjective rating scheme - that is, one where both the game's own qualities and my response to them were represented - was a modification of the system used by C.E. Forman in his wonderful 1997 Comp reviews.
(I don't agree with all that Forman said in his reviews - I liked Babel, for one, and wouldn't have given it a 4 even on a bad day - but I admire his spirit.) Forman's system was deconstructive: each game started with a rating of 10, and lost one point for each of ten possible offences. There were no decimal points, and as he stated himself, it was too rigid. It was that rigidity I tried to avoid when I designed the following scheme, which I dubbed the Modified C.E. Forman Rating Scheme:

TABLE 1.1. The Modified C.E. Forman Rating Scheme

1 - Programming errors causing crashes or unwinnability.
2 - Gross non-fatal programming errors, numbering more than one. (In other words, first offenders don't get it in the neck.)
3 - Reasonable verbs, commands, objects or situations not implemented. (This is a wide field, but I shall be tolerant.) Incomplete plot development is also included here.
4 - Atrocious writing: typos, bad grammar, splices, non-sentences. [Occasional writing muffs earn -0.5, not -1.]
5 - Content that I personally found annoying. (Equivalent to C.E.F.'s "miscellaneous stuff that pisses me off". :-)
6 - Unfair/absurd puzzle, situation or game, or technique that unduly minimizes player initiative.
7 - Platform-specific annoyances (Javascript or TADS interpreter bugs, bad colour schemes, crashes on my fave interpreter, etc.)
8 - Scope too wide/narrow for competition. (In other words, a one-room game with a silly puzzle, or alternately a Losing Your Grip clone.)
9 - Lack of a single memorable feature. (I can't allow a game to score 10/10 if it was bug-free but didn't affect me at all, can I?)
10 - Lack of originality. Including cliches.
W - Wild card (+0, +0.5 or +1)

There were two ways in which this system was more flexible than Forman's original idea: first, a little latitude was given to bad writing, and second, the wild card allowed me to indulge my own foibles, raising the rating of a game I liked while not allowing me to unduly penalize a good game that I simply found unattractive. I did allow myself one rule: I could only use the wildcard when a game evoked a strong reaction in me. Of course, there are holes in this system: some of the headings arguably overlap, and there's too much deduction allowed for technical points and little for subjective ones. Still, it's an improvement over the sort of system that would, to quote one example, rate "Halothane" and "Punkirita Quest" at the same level. Here's how a not-so-random sample of Comp99 games fared on the Forman test:

a) The competition winner, "Winter Wonderland" (Comp. avg. 7.41)

Lost one point (heading no. 10) for the cliched fairy-tale setting and St. Nicholas ending, and another point (heading no. 5) because I was bugged by the ice-floe puzzle. Considered a wildcard for moral courage (i.e. not using Christmas cliches), but decided against it.

Rating: 8 out of 10.

b) The competition funnyman, "Death To My Enemies" (Comp. avg. 4.25)

Lost one point (heading no. 6) for guess-the-situation puzzles. I considered deducting one more (heading no. 8) because it wasn't even bite-sized, but I didn't.

Rating: 9 out of 10.
c) The competition's "Detective", "Outsided" (Comp. avg. 2.94)

Lost one point (heading no. 1) for a major programming crash, one point (no. 2) for numerous non-fatal errors, one point (no. 3) for poor object and NPC implementation, one point (no. 4) for having some of the worst writing since Rybread's glory days, one point (no. 5) because typing XYZZY made me incontinent, one point (no. 6) for guess-the-situation puzzles, and one point (no. 10) for a pretty obvious rip-off of the Photopia colour scheme. I also added a wildcard (+0.5) because of its sheer MiSTability.

Rating: 3.5 out of 10.

d) One of the competition's ambitious games, "Lomalow" (Comp. avg. 4.80)

Lost one point (no. 2) because the hint system was broken, one point (no. 3) for jerky plot development and inadequate scenery, one point (no. 5) because the writing and plot were a tad overwrought, one point (no. 6) because typing (non-spoiler equivalent) "ASK MARIE ABOUT CAKE" ten times isn't my idea of a game, and one point (no. 9) because, for a gimmick game, it really needed a spark.

Rating: 5 out of 10.

e) One of the competition's jokes, "Life On Beal Street" (Comp. avg. 4.41)

Lost one point (no. 2) because the CYOA interface printed buggy messages at times, one point (no. 3) because there weren't any verbs, one point (no. 5) because I don't like CYOA, one point (no. 6) for unfair minimization of interaction, one point (no. 8) because it was basically six moves long, and one point (no. 9) because, for a gimmick game, it needed more.

Rating: 4 out of 10.

f) One of the competition's graphical games, "Lunatix" (Comp. avg. 5.64)

Lost one point (no. 3) for making me play guess-the-syntax, one point (no. 6) because the puzzles were arbitrary, and one point (no. 7) because the parser was AGT-level.

Rating: 7 out of 10.

Now let's step back and see what we've achieved.
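For concreteness, the bookkeeping behind these examples can be sketched in a few lines of code. This is only an illustration - the function name and structure are mine, not part of the scheme itself - but it captures the rules as stated: start at 10, subtract a point (or half a point for occasional writing muffs) per infraction, optionally add a wildcard of 0, +0.5 or +1.

```python
# Sketch of the Modified C.E. Forman scheme of Table 1.1.
# Illustrative only: names and structure are not from the scheme.

FULL, HALF = 1.0, 0.5  # full-point infraction; occasional writing muffs

def forman_score(deductions, wildcard=0.0):
    """Start at 10, subtract each deduction, add an optional
    wildcard (0, +0.5 or +1), and keep within the 1-10 scale."""
    assert wildcard in (0.0, 0.5, 1.0)
    score = 10.0 - sum(deductions) + wildcard
    return max(1.0, min(10.0, score))

# "Outsided": seven full-point infractions plus a +0.5 wildcard.
print(forman_score([FULL] * 7, wildcard=0.5))  # -> 3.5
```

Capping at 1 reflects the Comp's own scale, where 1 is rock bottom; a game cannot deduct its way below it.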
In four out of our six samples (Winter, Outsided, Lomalow and Beal Street), the rating derived by this method came fairly close to the eventual competition average:

TABLE 1.2. Competition averages vs. C.E. Forman scores

Game title            Comp. avg.  Comp. SD  C.E.F. score  Gross dev.  Adjusted dev.
Winter Wonderland        7.41       1.83        8            +0.59        +0.59
Death To My Enemies      4.25       2.03        9            +4.75        +2.72
Lomalow                  4.80       1.84        5            +0.20        +0.20
Outsided                 2.94       1.48        3.5          +0.56        +0.56
Life On Beal Street      4.41       2.22        4            +0.41        +0.41
Lunatix                  5.64       2.18        7            +1.36        +0.82

Now, four out of six isn't bad; also, most of the ratings were on the higher side. In fact, after correcting for S.D., Lunatix can be added to the list, and five out of six is better, though not perfect. (We'll define a deviation of +1 or -1 as acceptable for this discussion's purposes.) Given a suitable degree of cynicism in a judge, it's possible to deduct more from Death (-1 for the small size; -1 for being full of in-jokes, thus infracting clause no. 3) and bring it down to around 6 or 7, which is close, but not close enough.

Clearly, though, this system has its points. What it doesn't have is representation: as I said above, five of those ten criteria are purely technical, and only one is purely subjective. What was needed, I felt, was a system that allotted equal points to both subjective and objective criteria, and that leads me to the second approach.

Approach the Second.

My second try, and the one I still endorse at present, is a little more complex, and is perhaps too tedious for what is, after all, a diversion and not a Summer Olympics or Booker Prize; but I feel it has its points, and it is flexible enough to allow a wide variety of ratings while not allowing too much crankiness. Without further ado, let me elaborate. The rating scheme I'm outlining below divides points into two broad headings: 5 for objective criteria, and 5 for subjective criteria.
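An aside before the details: the S.D. correction used in Table 1.2 (and again in Table 2.2 below) is never spelled out in the text. The rule sketched here is my own reconstruction, but it reproduces every row of both tables: a gross deviation already inside the acceptable +/-1 band is left alone, while a larger one is shrunk in magnitude by one competition standard deviation.

```python
# Hedged reconstruction of the "Adjusted dev." columns in
# Tables 1.2 and 2.2. The rule is inferred from the numbers,
# not stated in the article.

def adjusted_deviation(my_score, comp_avg, comp_sd):
    """Deviations within +/-1 pass through unchanged; larger
    ones are reduced in magnitude by one standard deviation."""
    gross = my_score - comp_avg
    if abs(gross) <= 1.0:
        return round(gross, 2)
    return round(abs(abs(gross) - comp_sd), 2)

# "Death To My Enemies": gross +4.75, S.D. 2.03 -> adjusted 2.72
print(adjusted_deviation(9.0, 4.25, 2.03))  # -> 2.72
# "Lunatix": gross +1.36, S.D. 2.18 -> adjusted 0.82
print(adjusted_deviation(7.0, 5.64, 2.18))  # -> 0.82
```

With that aside out of the way, back to the two broad headings of the second approach.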
(I facetiously refer to them as "Intelligence" and "Goodwill", but that's just personal terminology, and I don't consider it binding.) Each heading contains five sub-headings, each of which can be scored 0, 0.25, 0.5, 0.75 or 1, depending on the judge's opinion. Therefore, under the heading "Intelligence", there are five sub-headings:

TABLE 2.1. The Quentin D. Thompson O/S Scoring System

a) Story: the basic story of a game, plot development, plot twists, sub-plots and so on. Suggested guidelines:
   0: absolutely pointless
   0.25: bare-bones
   0.5: cliched, or original but inadequate
   0.75: cliched but effective, or original and adequate
   1: excellent

b) Writing: prose, sentence flow, grammar, spelling, etc. Suggested guidelines:
   0: absolutely terrible; early Rybread Celsius level (Outsided)
   0.25: full of typos (Detective); grossly overblown, poorly executed (Stone Cell)
   0.5: mediocre (Thorfinn's Realm); several typos/errors (Human Resources Stories, Yodel); over-ambitious (Lomalow)
   0.75: competent (Tapestry); good but with a few faults (Chix Dig Jerks); good but not as engaging as it should be (Worlds Apart)
   1: excellent (Babel, Firebird, Exhibition, Muse)

c) Puzzles
   0: obscure and irritating (Heist, SNOSAE)
   0.25: detract from gameplay, artificially stuck on (TATCTAE)
   0.5: passable (Lunatix, average puzzle-fests)
   0.75: clever (Enlightenment)
   1: excellent, integrated into the story (Firebird)

d) Coding
   0: absolutely wretched (Guard Duty)
   0.25: playable only with the walkthrough or minor deviations; numerous crashes (Four Seconds)
   0.5: passable, but a few crashes/several minor errors (Chix Dig Jerks)
   0.75: good, with a few fleas (release 1 of Delusions)
   1: good, with no crashes and no serious minor flaws

e) Parser
   0: Commute-level
   0.25: "Yodel"-level
   0.5: AGT-level, ALAN unenhanced
   0.75: AGT Advanced, AGiliTy, TADS, Inform, Hugo, ALAN enhanced
   1: Inform, TADS or Hugo with added goodies
and under "Goodwill", there are another five:

a) Humour
   0: none
   0.25: flashes, or too heavy-handed (Outsided, Yodel)
   0.5: Easter Eggs alone, in-jokes alone (Pass The Banana)
   0.75: Easter Eggs and/or occasional game humour (King Arthur's Night Out)
   1: excellent, integrated into the game or just plain hilarious (Firebird, Death To My Enemies)

b) Participation
   0: none (In The End)
   0.25: occasional, inadequate
   0.5: barely adequate or over-ambitious
   0.75: competent (most games)
   1: outstanding (Exhibition, Photopia)

c) Lack of annoyance
   This criterion is purely subjective, so I leave it to any judge to grade as he/she wants.

d) Philosophy/Game Idea
   0: pointless or loathsome (Stiffy Makane, Emy Discovers Life)
   0.25: inadequate (Thorfinn's), trite (Heist) or repulsive (Chix for some reviewers - self excluded)
   0.5: average (a cave crawl, say), vague (For A Change)
   0.75: original, good, understandable, but not perfect (Tapestry)
   1: outstanding (Exhibition)

e) Wildcard
   Again, call this one as you want. However, it can only be 0, 0.5 or 1, to minimize randomness.

Having outlined this system in excruciating detail, let's see how another sample of Comp99 games fared with it.

a) The second-place winner, "For A Change" (Comp. avg. 7.25)

Intelligence: 1 for coding and parser, 0.5 for puzzles (they're nothing _too_ original), 0.75 for writing (clever, but didn't grab me) and 0.5 for story (basically a quest game, but well implemented). Goodwill: 0 for humour, 0.75 for participation, 0.75 for lack of annoyance (I did find some of it wearing), 0.75 for game idea, and 1 wildcard (for originality and new vocabulary).

Total rating for "For A Change": 7 out of 10.

b) The biggest graphics extravaganza, "Six Stories" (Comp. avg. 7.08)

Intelligence: 1 for coding and parser, 0.75 for the single puzzle (it was original), 1 for writing, and 0.75 for story (vague but sort of satisfying).
Goodwill: 0.5 for humour (Easter Eggs and snarky parser replies), 0.5 for participation, 0.75 for lack of annoyance (it crawled on my machine), 0.5 for game idea, and 1 wildcard (the parser assumptions were outstanding).

Total rating for "Six Stories": 7.75 out of 10.

c) The Comp's Rybread entry, "L.U.D.I.T.E" (Comp. avg. 2.38)

Intelligence: 0.5 for coding, 0 for puzzles, 0.75 for parser, 0.25 for writing (he used to be funnier) and 0 for story (absolutely pointless). Goodwill: 0 for humour, 0.25 for participation, 0.25 for lack of annoyance, 0.25 for game idea, 0 wildcard.

Total rating for "L.U.D.I.T.E": 2.25 out of 10.

d) The Comp's first work of AIF, "Chix Dig Jerks" (Comp. avg. 4.22)

Intelligence: 0.5 for coding, 0.5 for puzzles, 0.75 for parser, 0.75 for writing, 0.5 for story (intriguing, but abrupt, and the cut-and-paste jars). Goodwill: 0.75 for humour, 0.5 for participation, 0.25 for lack of annoyance (I'm a conscientious objector to AIF), 0.5 for game idea (vague but intriguing), 1 wildcard.

Total rating for "Chix Dig Jerks": 6 out of 10.

e) The Comp's first IF-MUD in-joke game, "Pass The Banana" (Comp. avg. 3.35)

Intelligence: 0.75 for coding, 0.25 for puzzles, 0.75 for parser, 0.5 for writing, 0 for story. Goodwill: 0.5 for humour, 0 for participation (unless you're a MUD regular), 0.25 for lack of annoyance (too much wholesale ripping-off of Varicella), 0.25 for game idea (this was more suited to a Mini-Comp) and 0 wildcard.

Total rating for "Pass The Banana": 3.25 out of 10.

f) One of the Comp's short, whimsical pieces, "Calliope" (Comp. avg. 4.67)

Intelligence: 1 for coding, 0.5 for puzzles (few and hard), 0.75 for parser, 0.75 for writing, 0.5 for story (it's not very interesting). Goodwill: 0.75 for humour (the TV scenes were a scream), 0.75 for participation (I was an IF Comp author myself), 0.75 for lack of annoyance (too short and guess-the-puzzle), 0.5 for game idea (cute, but not enough), 1 wildcard.

Total rating for "Calliope": 7.25 out of 10.
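The bookkeeping for this second scheme is simple addition over ten sub-headings, plus a sanity check on the allowed mark values. Here is a sketch - the identifier names are mine, chosen for illustration, and not part of the scheme:

```python
# Sketch of the Q.D.T. O/S system: five "Intelligence" and five
# "Goodwill" sub-headings, each scored 0 to 1 in quarter-point
# steps, except the wildcard (0, 0.5 or 1 only). Names illustrative.

INTELLIGENCE = ("story", "writing", "puzzles", "coding", "parser")
GOODWILL = ("humour", "participation", "lack_of_annoyance",
            "game_idea", "wildcard")

def qdt_score(marks):
    """Validate the ten sub-heading marks and sum them to 0-10."""
    steps = (0.0, 0.25, 0.5, 0.75, 1.0)
    for name in INTELLIGENCE + GOODWILL:
        allowed = (0.0, 0.5, 1.0) if name == "wildcard" else steps
        assert marks[name] in allowed, f"bad mark for {name}"
    return sum(marks[name] for name in INTELLIGENCE + GOODWILL)

# "For A Change", per the worked example above:
for_a_change = dict(story=0.5, writing=0.75, puzzles=0.5,
                    coding=1.0, parser=1.0, humour=0.0,
                    participation=0.75, lack_of_annoyance=0.75,
                    game_idea=0.75, wildcard=1.0)
print(qdt_score(for_a_change))  # -> 7.0
```

Because the two tuples each hold five names, the objective and subjective halves contribute at most 5 points apiece, which is exactly the 5/5 split the scheme promises.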
Now for the lies, damned lies and "statistix":

TABLE 2.2. Competition averages vs. Quentin D. Thompson O/S scores

Game title            Comp. avg.  Comp. SD  Q.D.T. score  Gross dev.  Adjusted dev.
For A Change             7.25       1.91       7.00          -0.25        -0.25
Six Stories              7.08       1.85       7.75          +0.67        +0.67
L.U.D.I.T.E              2.38       1.83       2.25          -0.13        -0.13
Chix Dig Jerks           4.22       1.88       6.00          +1.78        +0.10
Pass The Banana          3.35       2.15       3.25          -0.10        -0.10
Calliope                 4.67       1.85       7.25          +2.58        +0.73

These results, happily, are even more in harmony with what eventually transpired at the Comp: four out of six are dead matches the first time around, and with S.D. correction, our score is six out of six. There are criteria that can be faulted here (for example, the inclusion of "Humour", or the use of "Parser" to boost the score of a bad game on a good system), but what we have here is a good starting point.

Conclusion.

Clearly, then, it's possible - by adopting certain measures, either minimal (the C.E. Forman method) or elaborate (the more accurate Q.D.T. score) - to minimize the amount of quirkiness or crankiness in our own Comp ratings, and to ensure fairer and more uniform scoring. The idea here is to encourage good IF without discouraging a new author; to be (in the words of Oscar Hammerstein II) "firm, but kind"; and not to rate, on the strength of a whim or a bout of migraine or whatever, an average game as outstanding, a good one as Rybread-level, and a long string of good games you simply can't relate to in the 2-4 bracket. While I don't expect anyone to take up these systems word-for-word, I hope they'll at least inspire a little impartial scoring and judging, especially among this year's "1-2-3" brigade.

Acknowledgements.

Thanks to Stephen Granade for collecting the Comp reviews, all the reviewers for their insights (most of them have nothing to do with cranky ranking), Trotting Krips for sparking this debate, Eric Mayer for good sense, and Roody Yogurt for throwing a spanner in the C.E.F. Scale's works.

Copyright 1999, 2000 Quentin D. Thompson.