Winograd Schema Challenge Results: AI Common Sense Still a Problem, for Now

A Turing test alternative, the Winograd Schema Challenge aims to determine how well AI handles commonsense reasoning

After a chatbot pretending to be a 13-year-old named Eugene Goostman “passed” a Turing test a few years ago, experts in artificial intelligence got together and decided that a traditional Turing test might not be all that effective in measuring the intelligence of a computer program after all. Instead, they came up with (among many other things) the Winograd Schema Challenge, which is intended to determine how well an artificial intelligence system handles commonsense reasoning: understanding the basics about how the world works, and implementing that knowledge in useful and accurate ways.

A few weeks ago, the very first Winograd Schema Challenge took place at the International Joint Conference on Artificial Intelligence in New York City. We spoke with Charlie Ortiz, director of the Laboratory for AI and Natural Language Processing at Nuance Communications and one of the organizers of the Winograd Schema Challenge, about how things went, why the challenge is important, and what it means for the future of AI.

The Winograd Schema Challenge tasks computer programs with answering a specific type of simple, commonsense question called a pronoun disambiguation problem (PDP). Here’s an example of a PDP, taken from a children’s book:

Babar wonders how he can get new clothing. Luckily, a very rich old man who has always been fond of little elephants understands right away that he is longing for a fine suit. As he likes to make people happy, he gives him his wallet.

The last two sentences of this passage contain five pronouns: three instances of “he,” plus “him” and “his.” Each could refer to either Babar or the very rich old man. Solving the problem means successfully determining whether each pronoun refers to Babar or to the old man.

As humans, we use common sense and context to figure out who each pronoun is referring to, but this is a real challenge for AI systems. Let’s look at that first pronoun: “he is longing for a fine suit.” It’s obvious to us that the “he” is referring to Babar, because in the previous sentence Babar wants new clothing, and we know that a fine suit is a kind of clothing. For a piece of software to arrive at the same conclusion, it would have to understand the same thing. It could potentially do this through word analysis, which would likely show that statistically, “suit” and “clothing” are related words.
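
To make that idea concrete, here is a minimal sketch, in Python, of the kind of word-relatedness check described above. The tiny embedding vectors are invented for illustration; a real system would learn vectors or co-occurrence statistics from a large text corpus, and nothing here reflects any particular challenge entrant’s method.

    import math

    # Toy, hand-picked vectors standing in for embeddings learned from a corpus.
    EMBEDDINGS = {
        "clothing": [0.9, 0.1, 0.0],
        "suit":     [0.8, 0.2, 0.1],
        "rich":     [0.1, 0.9, 0.2],
        "wallet":   [0.2, 0.8, 0.3],
    }

    def cosine(u, v):
        """Cosine similarity between two vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms

    # "Suit" sits close to "clothing," Babar's context...
    print(cosine(EMBEDDINGS["suit"], EMBEDDINGS["clothing"]))  # roughly 0.98
    # ...and much farther from "rich," the old man's context.
    print(cosine(EMBEDDINGS["suit"], EMBEDDINGS["rich"]))      # roughly 0.36

A statistical system could use scores like these to guess that the “he” longing for a suit is the elephant who wanted clothing, without any genuine understanding of suits or elephants.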

The problem gets a bit more difficult if we choose a different pronoun, like the second one: “he likes to make people happy.” To figure out who this “he” refers to, you have to understand that giving people (or elephants) things makes them happy, and that the old man, being rich, is in a position to give Babar the thing that he wants. This is common sense for us, but not for an AI.

A Winograd schema is a specific kind of pronoun disambiguation problem that adds one extra feature: a “special word” that can be swapped for an alternate word, which reverses the correct answer without otherwise changing the sentence. Here’s an example that Stanford computer scientist Terry Winograd himself came up with:

The town councilors refused to give the demonstrators a permit because they feared (advocated) violence. Who feared (advocated) violence? 

By switching “feared” to “advocated,” you change the referent of “they” from the town councilors to the demonstrators. Having the special word in there makes the problem more difficult to attack statistically, since it becomes less useful to simply search for cases where two words are textually co-located.
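
To spell out that structure, here is a small illustrative representation of a Winograd schema in Python. The field names and layout are our own invention rather than the official challenge format; they simply show how one sentence template, two candidate referents, and the two variants of the special word fit together.

    from dataclasses import dataclass

    @dataclass
    class WinogradSchema:
        template: str      # sentence with a slot for the special word
        candidates: tuple  # the two possible referents of the pronoun
        answers: dict      # special-word variant -> correct referent

    schema = WinogradSchema(
        template=("The town councilors refused to give the demonstrators a "
                  "permit because they {special} violence."),
        candidates=("the town councilors", "the demonstrators"),
        answers={
            "feared": "the town councilors",
            "advocated": "the demonstrators",
        },
    )

    # Swapping the special word leaves the sentence intact but flips the answer.
    for variant in ("feared", "advocated"):
        print(schema.template.format(special=variant))
        print("  'they' refers to:", schema.answers[variant])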

Now that we’ve gone over how Winograd schemas work and why they’re a challenge for AI, here’s our interview with Charlie Ortiz on the results of the first Winograd Schema Challenge.

IEEE Spectrum: Can you tell us about the history of the challenge?

Charlie Ortiz: It all started about three years ago. [The AI research community has] a biennial symposium on commonsense reasoning, and we met in Cyprus to try to identify a good challenge problem to gauge AI progress. It was known that there were issues with the Turing test, and there were many research groups in other areas such as learning, natural language processing, and computer vision that had challenge problems, whereas we really didn’t have one. We talked a lot about this, and we identified the Winograd Schema Challenge as being something that might fit the bill. Since I’m at Nuance, I was able to get Nuance to support this with a prize fund.

What is commonsense reasoning in the context of artificial intelligence, and why is it important enough to have a dedicated challenge?

It’s been known for many years that one of the main obstacles to general artificial intelligence is the lack of common sense, and there is so much [commonsense knowledge] that it’s hard to encode it all. These are facts about the world that you and I take for granted, and when you’re building AI systems, you need a lot of that in order to make sense of conversations and interact with humans. It’s fundamental to making progress in general AI, and most of the major breakthroughs in AI have been in focused, limited domains. This challenge is an effort to bring this issue more into the forefront, so that research groups will become involved in trying to solve this problem.

What makes Winograd schemas ideal for this challenge?

First of all, in contrast to the Turing test, you have to answer them. It’s multiple choice, so you can’t trick your way out of it, and if you answer it incorrectly, you get it wrong. So, we can measure how well a program is doing. With a Turing test, there’s just no way of calibrating things in that way. In a pronoun resolution problem, the examples are ones that involve common sense in order to disambiguate the referent of the pronoun. They require more than just knowledge of language to get the right answer. That makes for a very crisp, concrete test. The hope is that now we have a test that’s better for commonsense knowledge and reasoning, and accurately measures whether or not a program really, actually has that commonsense background knowledge, and isn’t just using statistics to guess a solution to the problem.

How did the challenge go this year?

We didn’t know what to expect, but we were happy that we got a number of groups participating. When we presented the results, we were also happy that we had a completely full room, a full audience of researchers who came to listen to the results and participate in the discussion.

We had six entries; one of the teams submitted three different approaches. The scores, I think, highlight the difficulty of the challenge. The highest score was 58 percent, and the lowest score was 32 percent. There were 60 questions, and because some of them offered more than two candidate answers, completely random answering of the test would have yielded a result of 44 percent. So, it’s a bit better than random, but obviously, it’s not good enough, and it emphasizes the difficulty of what seems like a very simple problem. We’re hoping to have a much larger group participate in the next competition, which will be at the AAAI conference in 2018.

Do you have a sense of how these algorithms were able to answer the challenge questions as well as they did?

One of the best approaches was from Nikos Isaak at the Open University of Cyprus. They used a hybrid approach that combined extraction of knowledge from the web with a probabilistic engine. Quan Liu, from China, had three approaches that were variations on deep learning. We also did some [human] subject tests, and the humans who took these tests didn’t get them all correct. The percentage was a little over 90.

As AI gets better at this challenge, how will this make a difference in software that we might experience?

Two places. One claim is that any program that purports to be intelligent will require commonsense knowledge and reasoning. So, as performance on this challenge improves, we expect that that will be reflected in how well personal systems and other AI programs might perform. The other clear place would be in conversation and dialogue systems, where you have to interact with a human user in more than just a one-shot question and answer.

Do you have expectations for the next challenge in 2018 based on what you saw this year and what you know of the space?

No, we really don’t know. We’re hopeful that we’ll have greater participation and that we’ll see some sort of improvement. I think after a couple of these, we’ll be better positioned to understand what might need to be done and what approaches are doing well. But it’s a foundational problem in AI, it’s a very, very difficult problem, and I don’t think by the next one anybody is going to pass. But we’ll see.

If a computer program is able to convincingly win this challenge, what would you be able or willing to say about that program in terms of artificial intelligence?

That it’s demonstrated a significant amount of commonsense knowledge, the kind that humans make use of in their day-to-day lives, and that autonomous systems will be better because of that. That doesn’t mean that the entire AI problem will be solved; there are lots of other problems, but it would indicate a milestone in AI advances.

I think everyone would agree that the program will have reached a level of intelligence that’s very desirable from a research point of view. The Winograd Schema Challenge is just one alternative to the Turing test that addresses a different aspect of intelligence that needs to be measured. It’s meant to motivate research in this area, which is a very important area, and to provide some kind of metric for [general AI] progress.

How do you see this fitting into the overall future of general artificial intelligence, where a computer is aware of things on some level that’s comparable to a human?

We want AI programs that have general intelligence. Most of the successes in AI have been in limited domains, the most recent being Go, and before that we had “Jeopardy!” and chess. These are all well-defined problem domains, but they don’t demonstrate the kind of general intelligence that, if it were instantiated in a robot or a piece of software, could solve the problems that you and I encounter every day. That’s where the field is going, and there needs to be an effort in that direction, and this is part of that effort.

The reason that Nuance is backing the Winograd Schema Challenge is that they’ve been working on conversational AI for use in interactive personal assistants. The next generation of these assistants (think Siri or Cortana) will be able to engage in dialogue with you, rather than just answering questions. To do that, they’ll need to have common sense and contextual understanding, which is at the core of the Winograd Schema Challenge.

The next Winograd Schema Challenge will be held at the Association for the Advancement of Artificial Intelligence (AAAI) Conference in New Orleans, La., in February of 2018.
