Jumping Conclusions Scientifically: Reading IF Reviews

As a boy, when I learned the difference between induction and deduction, I was deeply impressed, and went looking for instruction on how to do induction.  Everybody knows how to do deduction:  Socrates is a man; all men are mortal; therefore, Socrates is mortal.  — But where do you get the rules?

You get them, of course, from induction.  But all the material I found on induction was really stupid.  One explained that you look at Mercury and determine it’s spheroidal; and at Venus, and determine it’s spheroidal; and so on to Pluto; and from this you determine “inductively” that all planets are spheroidal.

Which is useless, of course.

John Stuart Mill, of intro philosophy course fame for his ethical theory, identified and formalized the rules we intuitively use to work from specific cases to general ones.  Get good at them and you can work with fuzzy, non-quantifiable data scientifically. 

These are the basic rules that Jared Diamond used to organize his historical observations in Guns, Germs, and Steel.  I’m writing them up to encourage you to use them for cross-comparison of IF Comp reviews this year.

Key.  We’ll have A, B, C, D, E, F, G refer to properties of the game being reviewed, and t, u, v, w, x, y, z refer to opinions of the reviewer.  The question is: which game properties reliably elicit which reviewer opinions?

Agreement.  If two or more games that elicit the same reviewer opinion have only one game property in common, and games with that property are invariably given that opinion, then that game property causes that opinion.

  1. A, B, C, D -> w, t, u, v; and
  2. A, E, F, G -> w, x, y, z; then
  3. A -> w

For example, if we look at a number of games that several reviewers have considered “highly immersive,” and notice that they all have an introduction with a strong narrative hook, and there are no games with a strong narrative hook that reviewers have said were not immersive, then we’ll conclude that the narrative hook causes reviewers to consider the game immersive.
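
If you wanted to mechanize that cross-comparison, a minimal sketch in Python might look like the following.  The property and opinion labels, and the little table of “reviews,” are invented purely to illustrate the form; substitute whatever tags you actually pull out of the Comp reviews.

    # Method of agreement: find properties shared by every game that drew a
    # given opinion, excluding any property that also shows up in a game
    # where reviewers withheld that opinion.  All labels and data are made up.

    reviews = {
        "Game 1": {"properties": {"narrative_hook", "chatty_npcs", "easy_puzzles"},
                   "opinions":   {"immersive", "good"}},
        "Game 2": {"properties": {"narrative_hook", "custom_parser", "dark_tone"},
                   "opinions":   {"immersive", "buggy"}},
        "Game 3": {"properties": {"chatty_npcs", "dark_tone"},
                   "opinions":   {"realistic"}},
    }

    def method_of_agreement(reviews, opinion):
        """Properties common to every game given `opinion`, and absent
        from every game not given it."""
        with_op = [r["properties"] for r in reviews.values() if opinion in r["opinions"]]
        without_op = [r["properties"] for r in reviews.values() if opinion not in r["opinions"]]
        if not with_op:
            return set()
        shared = set.intersection(*with_op)
        for props in without_op:
            shared -= props          # drop counter-examples
        return shared

    print(method_of_agreement(reviews, "immersive"))   # -> {'narrative_hook'}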

Difference.  If two games are identical except one has and the other lacks a certain property, and the game that has the property is given a particular opinion, which the other game is not, then the game property causes reviewers to have that opinion.

  1. A, B, C, D -> w, x, y, z, and
  2. B, C, D -> x, y, z, then
  3. A -> w

For example, let’s say we have two remarkably similar games.  One has a well-characterized PC and the other does not.  Reviewers consistently say that the first game gives a good sense of player agency while the second does not.  From this we will conclude that PC characterization causes reviewers to have a strong sense of agency.
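
The same bookkeeping handles the method of difference.  Again, the two game profiles below are made up for illustration:

    # Method of difference: given two otherwise similar games, the properties
    # and opinions that differ between them are the candidate cause and effect.
    # Hypothetical data.

    def method_of_difference(game_a, game_b):
        """Return (properties only A has, opinions only A drew)."""
        return (game_a["properties"] - game_b["properties"],
                game_a["opinions"] - game_b["opinions"])

    game_a = {"properties": {"characterized_pc", "easy_puzzles", "chatty_npcs"},
              "opinions":   {"sense_of_agency", "good", "realistic"}}
    game_b = {"properties": {"easy_puzzles", "chatty_npcs"},
              "opinions":   {"good", "realistic"}}

    print(method_of_difference(game_a, game_b))
    # -> ({'characterized_pc'}, {'sense_of_agency'})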

Joint Agreement and Difference.  This is just the two above methods applied together.

Covariation.  This is an analog version of Agreement and Difference.  If one game has a little of some property, and reviewers have a certain mild opinion about it, and another game has a lot of the same property, and reviewers have a much stronger form of the same opinion about this game, then we will conclude that this property causes reviewers to have that opinion.

  1. A, B, C -> x, y, z, and
  2. AA, B, C -> xx, y, z, then
  3. A -> x

For example, let’s say one game has no puzzles, and is considered “okay.”  Another game has easy puzzles, and is considered “good.”  A third game has tricky puzzles with no hints, and is considered “bad.”  A fourth game has killer difficult puzzles, which are well-hinted, and is considered “very good.”  If we order these games by how soluble the puzzles were, we find we’ve also ordered them by how good reviewers said they were.  Therefore, we will conclude that puzzle solubility causes reviewers to rate a game well.
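
Covariation comes down to checking whether two orderings agree.  A rough sketch, with invented scores for how soluble each game’s puzzles are and how warm the reviewers’ verdict is:

    # Covariation: if ranking the games by "amount of property" also ranks
    # them by "strength of opinion," the two vary together.  Scores invented.

    games = {   # (puzzle_solubility, reviewer_verdict), both on rough 0-3 scales
        "no puzzles":               (1, 1),   # "okay"
        "easy puzzles":             (2, 2),   # "good"
        "tricky, unhinted puzzles": (0, 0),   # "bad"
        "hard but well-hinted":     (3, 3),   # "very good"
    }

    def ranks(values):
        """Rank of each value within the list (0 = smallest)."""
        order = sorted(set(values))
        return [order.index(v) for v in values]

    solubility = [s for s, _ in games.values()]
    verdict    = [v for _, v in games.values()]

    print(ranks(solubility) == ranks(verdict))   # -> True: they covary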

Residues.  This is incredibly useful.  If a game has a complex set of properties, and a reviewer has a complex set of opinions, we can cancel out the known patterns of cause and effect, looking only at those game properties and opinions we don’t have maps for.

  1. A, B, C, D, E -> v, w, x, y, z 
  2. B -> w is known
  3. C -> x is known
  4. D -> y is known, and
  5. E -> z is known, then
  6. A -> v

For example, let’s say we have a game with well-hinted puzzles of moderate difficulty, a generic PC, several chatty NPCs, and a strong narrative hook; and we want to compare this to another game with no narrative hook, a strongly characterized PC, and extremely difficult puzzles with no hints.  Reviewers say the first game is pretty good because it’s immersive and realistic, while the second is not too good, because, although it has a good sense of agency, it lacks immersiveness.

Looking at the patterns of cause and effect we’re pretending we found in prior examples, we say:  The first game is “pretty good” because it had well-hinted puzzles; it’s immersive because it had a strong narrative hook; and we don’t know why it’s “realistic.”  The second game is “not too good” because its puzzles weren’t hinted well enough; it lacked immersiveness because it lacked a narrative hook; and it has a good sense of agency because it had a well-developed PC.

Ignoring all these knowns, we see that the first game featured conversation with several NPCs, and was considered “realistic,” while the second did not and was not.  Now we apply the method of difference, to conclude that conversation with NPCs causes reviewers to call a game “realistic.”
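
In code, residues is just a matter of cancelling the known property-to-opinion pairs out of a game’s profile and seeing what’s left.  The “known causes” map below encodes the pretend findings from the earlier examples; you would then run the leftover properties through the method of difference, as above, to narrow them down further.

    # Method of residues: subtract explained property/opinion pairs from a
    # game's full profile; the residue is the unexplained remainder.
    # The known-cause map and the game profile are hypothetical.

    known_causes = {
        "well_hinted_puzzles": "pretty_good",
        "narrative_hook":      "immersive",
        "characterized_pc":    "sense_of_agency",
    }

    game = {"properties": {"well_hinted_puzzles", "generic_pc",
                           "chatty_npcs", "narrative_hook"},
            "opinions":   {"pretty_good", "immersive", "realistic"}}

    def method_of_residues(game, known_causes):
        props, opinions = set(game["properties"]), set(game["opinions"])
        for prop, opinion in known_causes.items():
            if prop in props and opinion in opinions:
                props.discard(prop)        # this pair is already explained
                opinions.discard(opinion)
        return props, opinions

    print(method_of_residues(game, known_causes))
    # -> ({'generic_pc', 'chatty_npcs'}, {'realistic'})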

Other Methods

There’s one other thing to look at that J. S. Mill doesn’t talk about.  This is not too useful in comparing reviews, but it’s very useful when reading a commented transcript, or a review written in real time.

Timing.   If the game does something new, and the player’s attitude changes, then the thing the game did caused the change in attitude.

(The other one that’s useful is covariation:  if the game does a little of something, and the player has a small change in attitude, and later the game does it a lot more, and the player has a big change in attitude, then that thing the game is doing is driving the attitudinal change.)

These may seem obvious, but they can take a lot of mystery out of player responses.  When a player says or does something remarkable — decides to stop playing the game, for example — look at what the game just did.  Did it just symbolically abuse an NPC the player might have been in sympathy with?  –and so on.

The problem with timing is that it’s an over-simplification.  We very often build up a response over time, until suddenly something happens to trigger it.  What’s happening here is that the prior game events are building up a particular response potential — they’re drops gradually filling the player’s glass, which the last game event tips over.
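
If you’re working from a commented or timestamped transcript, the naive timing heuristic is easy to automate: pair each recorded attitude shift with the game event that immediately preceded it.  (The toy transcript below is invented, and, per the caveat above, the “trigger” it finds may only be the last drop in an already full glass.)

    # Timing: walk a commented transcript in order; whenever the player's
    # recorded attitude shifts, note the game event that came just before it.

    transcript = [
        ("game",   "Opening scene with a strong narrative hook"),
        ("player", "attitude:+ this is intriguing"),
        ("game",   "NPC the player sympathizes with is abused"),
        ("player", "attitude:- I don't like where this is going"),
        ("game",   "Unhinted maze"),
        ("player", "attitude:- quitting now"),
    ]

    def attitude_triggers(transcript):
        """Pair each attitude shift with the most recent game event."""
        last_event, triggers = None, []
        for speaker, text in transcript:
            if speaker == "game":
                last_event = text
            elif text.startswith("attitude:"):
                triggers.append((last_event, text))
        return triggers

    for event, reaction in attitude_triggers(transcript):
        print(f"{reaction!r} followed {event!r}")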

I may be away-from-computer during the IF Comp this year, in which case I won’t be able to write up my reviews of reviews.  That’s all right; if you’re interested in unearthing patterns of cause and effect, game design-wise, you can do it yourself.  You’re qualified — this is all stuff we know how to do automatically; we just don’t always know when to do it.


12 Comments

  1. You’re pushing ALL of my “correlation is not causation” buttons here. Applying Mill’s induction to social science without a giant helping of salt is going to cause all kinds of problems. I don’t think there’s anything wrong with observing that good characterization makes people feel more invested in their characters, but I don’t think it follows your logic chain.

  2. Yeah, I agree with Gravel — it’s certainly quite plausible that a game that gives a player a thorough sense of agency might lead to the perception that the protagonist was well-characterized, rather than the other way around.

    I still think the analyses of game reviews you do will be very informative, but we should take care not to make sweeping claims about which properties of gaming experience are fundamental and which are derived without supporting data and analysis.

  3. I have never read Mill’s book myself, but I am under the impression that he is describing the inductive reasoning you use in scientific experimentation — that is, in a situation where you can actually fiddle with the input and observe changes in the output. This would address Gravel’s and Matt’s worry, but also make it somewhat harder to apply to the IF Comp reviews. :)

    Nevertheless, correlation is surely almost as interesting as causation?

    It seems to me that the bigger problem is the one which I, when I’m lecturing on scientific method, would stress at this point: you’ve got to come up with the concepts A, B, C, D, w, x, y and z, and you cannot do this in a neutral, theory-independent way, which means that you’re always stuck in a certain mindset and might be missing out on the really important stuff that requires a conceptual leap.

    But again: even finding correlations between a set of non-ideal descriptions would be very interesting.

  4. Gravel,

    The correlation-causation problem is, of course, inherent to the question of induction! Mill talks about it quite a bit. The way I deal with it when I teach Mill’s methods is to say that the resulting inference is a hypothesis that we should then test.

    Mill’s discussion is far more thorough than I can get into in a blog post. One of the more interesting points he makes is about what I’ll call the domain of observation influencing the certainty of our inferences.

    For example, if we combine two chemicals and observe a reaction, we’re pretty willing to accept as true that those two chemicals cause that reaction to happen — after only one observation.

    On the other hand, we’re remarkably reluctant to accept as true a statement like, “All crows are black,” even after observing many thousands of crows.

    Mill asks, Why is that? — but doesn’t answer. !-)

    Matt,

    Your cautionary advice is well-taken. This is basically the approach I took to last year’s review of Comp reviews. You just pepper in a few “perhaps”‘s, “it seems that”‘s, and “apparently”‘s and you can get away with all sorts of things.

    But, in fact, I’m hoping you guys will start collating the information from reviews this way since, as I say, I may unavoidably be Away From Computer this season.

    Victor,

    Actually, Mill’s methods are a major way of making inferences — I want to say they’re the only way of making inferences — even in non-experimental, strictly observational sciences. But this is an entirely separate issue from the ability to collect quantifiable data.

    For example, until nuclear physics, astronomy was a strictly observational science. But nevertheless, astronomers could collect hard data. And this passively observed hard data was enough to identify, for example, the main sequence, which was accomplished strictly through the …mmm… Mill’s methods-esque analysis of that data.

    In contrast, we also often have domains of observation where we can exert control, but cannot collect well-quantified data. For example, a few decades ago psychologists suggested to subjects (under hypnosis) that they have certain dreams, and report them during the next session. Then they wrote down what dreams the subjects reported having. And those results are illuminating; but they don’t make it a hard science.

    (I’m a big believer in Mill’s methods, because I applied them to dream interpretation — no hypnosis; just working with reports — for good results.)

    As for your point about knowing what to look for and what concepts to apply — well, yeah! Otherwise, it’d be easy!

    Conrad.

  5. clarification: The examples given are just examples. They’re meant to be not offensive to reason and to illustrate the inductive forms.

    I have no stake in whether PC characterization really does cause reviewers to praise a game’s immersiveness. There might or might not be a relation between the two.

  6. Mill’s discussion is far more thorough than I can get into in a blog post. One of the more interesting points he makes is about what I’ll call the domain of observation influencing the certainty of our inferences.

    For example, if we combine two chemicals and observe a reaction, we’re pretty willing to accept as true that those two chemicals cause that reaction to happen — after only one observation.

    On the other hand, we’re remarkably reluctant to accept as true a statement like, “All crows are black,” even after observing many thousands of crows.

    Mill asks, Why is that? — but doesn’t answer. !-)

    That’s pretty interesting. I wonder how much our confidence in the inference is due to how much prior knowledge we have about what is actually going on. For example: would someone with no knowledge of chemistry, observing a dramatic reaction when a red liquid is added to a blue liquid, assume that _any_ red liquid and blue liquid would cause a similar reaction? Or would they be less certain of such a result due to less knowledge of what actual substances are participating in the reaction?

    Are Mill’s rules addressed in a particular book? I didn’t see a reference above.

  7. Correlation may be interesting, but it’s not particularly scientific. For example, divorce rates in the US are strongly inversely correlated with the number of sheep in a given county. The more sheep you have, the lower the divorce rates. You *might* be able to extrapolate something from that, or it might point you in another direction, but it’s not causation, nor is it a recommendation to buy a bunch of sheep.

    And that’s using quantitative data with good sampling and spread.

    I’d like to see way, way more base data before drawing hard conclusions from transcripts or reviews. Just to take a silly example – how long it takes a player to quit. Transcripts aren’t even timed – there’s no way to know if the player has been going for two hours already, or if it’s dinnertime, or if players normally play half an hour and then stop, or any number of other variables. Without a controlled environment or historical data, anything pulled out is qualitative, not quantitative, which is FINE, it’s just not enough to do analysis with.

    Social scientists do do literature and language surveys that try to extrapolate data from documents, but that generally requires sampling on orders of magnitude greater than the entire IF database, much less the handful of reviews for Comp.

  8. A System of Logic, Ratiocinative and Inductive. See book 3, chapter 3: On the Grounds of Induction for the discussion on variable-colored crows.

    It’s available as a free ebook via Google or Project Gutenberg. Keep in mind it was written in the 1800s, so you’ll come across, for example, references to ‘the region of the fixed stars,’ and so forth.

    The Gutenberg edition.

    @Victor – rereading a bit of Mill’s book, I see you’re right, in that he talks about the modern scientific idea of ‘interrogating nature,’ usually through experiment, as opposed to the more naive passive method of observing what nature offers.

  9. I’d like to see way, way more base data before drawing hard conclusions from transcripts or reviews. Just to take a silly example – how long it takes a player to quit. Transcripts aren’t even timed…

    Hey, I’d like to see any number of things. I mean — yes, you’re right: but the data we have are the data we have.

    Work with what we have. When we get better data we’ll revise our conclusions. Until then, we’ll forgo the grant money.

  10. Matt: “I wonder how much our confidence in the inference is due to how much prior knowledge we have about what is actually going on.”

    A lot, I think, and a lot of framing. That is, most of us don’t see chemicals combining on an everyday basis – we tend not to see isolated events happen at all. We see crows in their natural environment – pieces of the whole, but rarely something that looks complete. And often if you look closer, it’s not complete after all. Fruit flies don’t spontaneously generate from bananas; disease can come from misfolded proteins; stars aren’t fixed; the universe isn’t fixed! It’s almost always more complicated than that (for any given value of “that”).

  11. Everyone seems so ready to jump at Conrad… I say relax. I don’t think he’s proposing we take any of the conclusions from his proposal as cold, unchangeable fact. I think his experiments, whatever you make of them, do good things for the IF community.

  12. Gravel… the purpose here is usefulness in relation to understanding art. It is not to arrive, even over several blog posts, at Certainty and Truth.

    Usefulness does fine.

    These posts are written for IF authors. Now, which is more useful to them: waiting until we have enough data that we can make statistically iron-clad inferences, or using cross-comparison and testing to develop ideas about what makes IF go, with the understanding they won’t be flawless?

    In your simulator thing that you’re building, you’re very careful to get details right. You just posted something about simulating plant growth rates according to temperature. Now, my question is this —

    When you get around to putting stars in your simulation, as I assume you will — are you going to model thousands of nuclear furnaces at vast simulated distances from your landscape? –Or are you going to render a bitmap a few clicks over the player’s head?

    Because if you do the second, I want to encourage you to call that bitmap, “region_of_fixed_stars.bmp”

