Can Winograd Schemas Replace Turing Test for Defining Human-Level AI?

The Turing Test is a flawed metric for AI, and now we’ve got something better



Earlier this year, a chatbot called Eugene Goostman “beat” a Turing Test for artificial intelligence as part of a contest organized by a U.K. university. Almost immediately, it became obvious that the competition hadn’t proved that a piece of software had achieved human-level intelligence; it had only shown that a piece of software had gotten fairly adept at fooling humans into thinking they were talking to another human, which is very different from measuring the ability to “think.” (In fact, some observers didn’t think the bot was very clever at all.)

Clearly, a better test is needed, and we may have one in the form of a type of question called a Winograd schema, which is easy for a human to answer but a serious challenge for a computer.

The problem with the Turing Test is that it’s not really a test of whether an artificial intelligence program is capable of thinking: it’s a test of whether an AI program can fool a human. And humans are really, really dumb. We fall for all kinds of tricks that a well-programmed AI can use to convince us that we’re talking to a real person who can think.

For example, the Eugene Goostman chatbot pretends to be a 13-year-old boy, because 13-year-old boys are often erratic idiots (I’ve been one), which excuses many of the circumstances in which the AI simply fails. So really, the chatbot is not intelligent at all; it’s just very good at making you overlook the times when it’s stupid, while emphasizing the interactions in which its algorithm knows how to answer the questions you ask it.

Conceptually, the Turing Test is still valid, but we need a better practical process for testing artificial intelligence. A new AI contest, sponsored by Nuance Communications and CommonsenseReasoning.org, is offering a US $25,000 prize to an AI that can successfully answer a set of questions called Winograd schemas, named after Terry Winograd, a professor of computer science at Stanford University.

Here’s an example of one:

The trophy doesn’t fit in the brown suitcase because it is too big. What is too big?

The trophy, obviously. But it’s only obvious to us, because we know all about trophies and suitcases. We don’t even have to “think” about it; it’s almost intuitive. For a computer program, though, it’s unclear what “it” refers to. To answer a question like this successfully, an artificial intelligence must have some background knowledge and the ability to reason.
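To see why this is hard for software, here’s a toy sketch (in Python; the function and heuristic are my own illustration, not any real coreference system) of a naive rule that resolves a pronoun to the nearest preceding noun phrase. On the trophy sentence, it confidently picks the wrong answer:

```python
# Toy sketch: a naive "nearest noun phrase" pronoun resolver.
# This is an illustration only, not a real coreference system.

def nearest_noun_phrase(sentence: str, pronoun: str, candidates: list[str]) -> str:
    """Resolve a pronoun to the candidate mentioned closest before it."""
    pronoun_pos = sentence.index(f" {pronoun} ")
    return max(
        (c for c in candidates if sentence.index(c) < pronoun_pos),
        key=lambda c: sentence.index(c),
    )

sentence = "The trophy doesn't fit in the brown suitcase because it is too big."
print(nearest_noun_phrase(sentence, "it", ["trophy", "suitcase"]))
# Prints "suitcase" -- the nearest noun phrase, and the wrong answer.
# Getting "trophy" right requires knowing that big things don't fit
# into small ones, which no surface heuristic encodes.
```

No amount of tuning a surface rule like this fixes the underlying problem: the right referent depends on knowing that large objects don’t fit inside smaller ones.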

Here’s another one:

Jim comforted Kevin because he was so upset. Who was upset?

These are the rules that Winograd schemas have to follow (a sketch encoding them as data appears after the list):

1. Two parties are mentioned in a sentence by noun phrases. They can be two males, two females, two inanimate objects or two groups of people or objects.

2. A pronoun or possessive adjective is used in the sentence in reference to one of the parties, but it is also of the right sort for the second party. In the case of males, it is “he/him/his”; for females, it is “she/her/her”; for inanimate objects it is “it/it/its”; and for groups it is “they/them/their.”

3. The question involves determining the referent of the pronoun or possessive adjective. Answer 0 is always the first party mentioned in the sentence (but repeated from the sentence for clarity), and Answer 1 is the second party.

4. There is a word (called the special word) that appears in the sentence and possibly the question. When it is replaced by another word (called the alternate word), everything still makes perfect sense, but the answer changes.
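To make those four rules concrete, here is a minimal sketch of how a schema might be encoded as data, using the trophy example from above. The class and field names are my own illustration, not the official challenge format:

```python
# Minimal sketch of a Winograd schema as a data record.
# Field names are illustrative, not the official challenge format.

from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence: str              # mentions two parties (rule 1) and a pronoun (rule 2)
    question: str              # asks for the pronoun's referent (rule 3)
    answers: tuple[str, str]   # answer 0 = first party, answer 1 = second party
    correct: int               # index of the correct answer
    special_word: str          # rule 4: swapping this word...
    alternate_word: str        # ...for this one changes the answer

trophy = WinogradSchema(
    sentence="The trophy doesn't fit in the brown suitcase because it is too big.",
    question="What is too big?",
    answers=("the trophy", "the suitcase"),
    correct=0,
    special_word="big",
    alternate_word="small",
)

def flipped(schema: WinogradSchema) -> WinogradSchema:
    """Apply rule 4: substitute the alternate word and flip the answer."""
    return WinogradSchema(
        sentence=schema.sentence.replace(schema.special_word, schema.alternate_word),
        question=schema.question.replace(schema.special_word, schema.alternate_word),
        answers=schema.answers,
        correct=1 - schema.correct,
        special_word=schema.alternate_word,
        alternate_word=schema.special_word,
    )

print(flipped(trophy).sentence)  # "...because it is too small."
```

The rule-4 swap is what makes the schemas hard to game: “big” becomes “small,” the sentence barely changes on the surface, and yet the correct answer flips from the trophy to the suitcase.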

For more details (including some examples of ways in which certain Winograd schemas can include clues that an AI could exploit), the paper proposing the challenge, “The Winograd Schema Challenge” by Hector Levesque, Ernest Davis, and Leora Morgenstern, is easy to understand and well worth reading. In fact, it’s so well worth reading that I’m going to steal its conclusion and post it here:

Like Turing, we believe that getting the behaviour right is the primary concern in developing an artificially intelligent system. We further agree that English comprehension in the broadest sense is an excellent indicator of intelligent behaviour. Where we have a slight disagreement with Turing is whether a free-form conversation in English is the right vehicle. Our WS [Winograd schemas] challenge does not allow a subject to hide behind a smokescreen of verbal tricks, playfulness, or canned responses. Assuming a subject is willing to take a WS test at all, much will be learned quite unambiguously about the subject in a few minutes. What we have proposed here is certainly less demanding than an intelligent conversation about sonnets (say), as imagined by Turing; it does, however, offer a test challenge that is less subject to abuse.

It’s worth pointing out that we’re a bit skeptical that you can really “test” for human-level AI in this manner. In a highly structured test with specific questions whose answers are unambiguously right or wrong, there’s a lot of potential for a clever (but not thinking) AI to find ways to game the format.

The question, then, becomes whether “intelligence” is simply the label we give to a technological system that is sufficiently complex to correctly answer a series of questions that a slightly more complex biological system (us) has arbitrarily decided constitutes a measure of thinking.

It seems inevitable that at some point, we’ll have to say that true intelligence is feeling as well as thinking, and “Blade Runner” is way ahead of us.

[ Winograd Schema Challenge ] via [ BusinessWire ]
