Eugene Goostman, a chatbot masquerading as a 13-year-old Ukrainian boy (with a notably short attention span), finally passed his exams. The program, whose development started in 2001—so he is really 13 years old—beat competitors by scoring 33 percent in Turing Test 2014, an event held at the Royal Society in London and organized by the University of Reading.
Eugene (or “Zhenya,” as he told one judge he likes to be called) thus became the first in that competition to meet the criteria of the “imitation game” for artificial intelligence, suggested by cybernetics pioneer Alan Turing in 1950. The test is simple: converse so naturally that human interlocutors think they are talking to another person.
Eugene Goostman started a-life in Saint Petersburg, as a project of Vladimir Veselov (whose LinkedIn page indicates that his day job is engineering software for Amazon), Eugene Demchenko, and Sergey Ulasen. According to Eugene’s Wikipedia page, he hails from Odessa, is the son of a gynecologist, and owns a pet guinea pig.
As impressive as the technology is, the developers’ ability to craft a backstory that makes the machine’s responses credible is the real key. Context is all.
“Our main idea,” said Veselov after the win, “was that he can claim that he knows anything, but his age also makes it perfectly reasonable that he doesn’t know everything. We spent a lot of time developing a character with a believable personality. This year we improved the ‘dialog controller’ which makes the conversation far more human-like when compared to programs that just answer questions. Going forward we plan to make Eugene smarter and continue working on improving what we refer to as ‘conversation logic.’”
The competition called for a judge to conduct a series of five-minute conversations in real time with two “people,” one of whom was carbon-based and the other, digital. Each of the five competing AIs participated in 30 conversations. Eugene Goostman convinced the judges that he was the human participant in 10 of his 30 conversations, or 33 percent.
The competition’s cut-off is 30 percent. Eugene Goostman hit 29 percent in Turing Test 2012, the previous competition. That was the top mark of the year, but not quite enough to be declared (almost) a real, live boy. This year’s 33-percent score was enough to pass.
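The arithmetic behind the result can be checked in a few lines. The sketch below is only an illustration: the function names are hypothetical, and treating a score exactly at the cut-off as a pass is an assumption, since the article says only that the cut-off is 30 percent. The 10-of-30 figure is the one reported for Eugene Goostman in 2014.

```python
from fractions import Fraction

# Cut-off reported for the competition: 30 percent.
CUTOFF = Fraction(30, 100)


def pass_rate(fooled: int, conversations: int) -> Fraction:
    """Share of conversations in which the judge took the program for a human."""
    if conversations <= 0:
        raise ValueError("conversations must be positive")
    if not 0 <= fooled <= conversations:
        raise ValueError("fooled must be between 0 and conversations")
    return Fraction(fooled, conversations)


def passes(fooled: int, conversations: int, cutoff: Fraction = CUTOFF) -> bool:
    # Assumption: a score exactly equal to the cut-off counts as a pass.
    return pass_rate(fooled, conversations) >= cutoff


if __name__ == "__main__":
    rate = pass_rate(10, 30)             # Eugene Goostman, Turing Test 2014
    print(f"score: {float(rate):.0%}")   # prints "score: 33%"
    print("passed:", passes(10, 30))     # prints "passed: True"
```

Exact fractions are used rather than floating-point numbers so that a borderline score, such as 9 of 30, compares cleanly against the cut-off.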
The developers have reportedly made a version of Eugene Goostman available online, but the surge in traffic following the Turing Test 2014 announcement seems to have crashed the server. Transcripts of this year’s Turing Test are not yet available, but Time writer Doug Aamoth published an interview with Eugene Goostman on 9 June. As Aamoth observes:
“Passing the Turing Test is less about building machines intelligent enough to convince humans they’re real and more about building programs that can anticipate certain questions from humans in order to pre-form and return semi-intelligible answers.”
In their opening exchange, Aamoth asks Eugene, “How are you adjusting to all your new-found fame?” To which Eugene replies, “I would rather not talk about it if you don’t mind. By the way, what’s your occupation? I mean—could you tell me about your work?”
The Guardian published samples of Eugene’s conversations from Turing Test 2012. Those interested in a more detailed report of that test should see a 2013 paper by the Turing Test director in IEEE Transactions on Computational Intelligence and AI in Games.
In both sets of exchanges, there are lapses of grammar, gaps in knowledge, and sudden changes of subject that might only be plausible in an early adolescent whose native language is not English, and who may have just a touch of attention deficit hyperactivity disorder. But the pattern will not seem totally alien to anybody who has had a tween child, taught in middle school, or, indeed, been a 13-year-old boy himself.
In 2011, at the Techniche festival in Guwahati, India, an application called Cleverbot took part in a Turing-type test and was perceived to be human by 59.3 percent of its interlocutors (compared with a score of 63.3 percent for the average human participant). However, because the program draws on a database of real conversations, many disputed whether it was in fact exhibiting true “intelligence.”