Stephen Cass: What makes a book successful? There is a small industry today catering to publishers and authors who would very much like to know the answer to that question. But predicting the reception a book will receive has tended toward qualitative, rather than quantitative, approaches.
Yejin Choi and her colleagues Vikas Ashok and Song Feng, at the department of computer science at Stony Brook University, in New York, are changing that. They are pioneering a technique known as “statistical stylometry,” which looks at the structure and word choices used to construct sentences. By analyzing thousands of books in eight genres, they believe they can predict the success of a book with as much as 84 percent accuracy, depending on the genre.
Yejin Choi joins us today by phone from her office in Stony Brook to discuss statistical stylometry and some of its implications.
Yejin, welcome to “Techwise Conversations.”
Yejin Choi: Hi. Thank you.
Stephen Cass: Let’s talk about your methodology. First of all, what do you mean when you classify a book as “successful”?
Yejin Choi: So, actually, defining success is a tricky question. In our study we mostly rely on the download counts provided by Project Gutenberg, the website that has a collection of novels freely available. We found that the download counts correlate with literary quality, the kind of quality that might have led some books to receive prizes, like a Pulitzer Prize or a Nobel Prize. In those cases, commercial success would usually follow. But as we all know, there are books that were great commercial hits without literary quality. So in our study, what we mean by “success” aligns more with literary quality than with commercial success, even though the two can sometimes correlate.
Stephen Cass: What kind of books did you analyze? You broke them into genres. What did those genres include?
Yejin Choi: So these genres included short stories, sci-fi, adventure, love stories, and historical fiction. We even included poetry, and we also included movie scripts.
Stephen Cass: Without getting too technical, what were some of the metrics you used to analyze and dissect these texts?
Yejin Choi: There is definitely a rich spectrum of linguistic elements that we can think about for writing style. But in the end we need to think about the kinds of things computers can understand and analyze, and those are actually the simpler ones. For example, the distribution of lexical choices, in particular single-word choices and two-word phrases, or the distribution of word categories, and what I mean by that is parts of speech, such as nouns, verbs, prepositions, articles, and so forth. We also looked at sentence structure using what we call a probabilistic context-free grammar. This is a sort of grammar that lets computers analyze human language. It’s quite different from English grammar as we know it, much simpler, but expressive enough to capture certain grammatical constructs that characterize writing styles, for example, whether or not more relative clauses are used in a sentence.
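To make the features Choi lists more concrete, here is a minimal Python sketch of two of them: the distribution of single words and two-word phrases, and the relative frequency of grammar rules read off a parse tree. It is an illustration only, not the code used in the study. The tokenizer, the function names, and the hand-written parse tree are all assumptions. A real pipeline would get its trees from a statistical parser, and a true probabilistic context-free grammar would estimate each rule's probability conditioned on its left-hand symbol rather than over all rules.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freq(items):
    """Turn raw counts of items into a distribution that sums to 1."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def productions(tree):
    """Yield (parent, children) grammar rules from a nested-tuple parse tree.

    A tree is (label, child, child, ...); a leaf child is a plain string.
    Rules that rewrite a tag as a word are skipped, so only the
    sentence structure is counted.
    """
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from productions(child)


tokens = tokenize("The ship sank. The ship was old.")
unigrams = relative_freq(tokens)
# Note: this simple version lets bigrams run across sentence boundaries.
bigrams = relative_freq(zip(tokens, tokens[1:]))

# Hypothetical hand-written parse of "the ship that sank was old".
# The NP -> NP SBAR rule is where the relative clause shows up.
toy_tree = (
    "S",
    ("NP",
     ("NP", ("DT", "the"), ("NN", "ship")),
     ("SBAR", ("WHNP", ("WDT", "that")), ("S", ("VP", ("VBD", "sank"))))),
    ("VP", ("VBD", "was"), ("ADJP", ("JJ", "old"))),
)
rule_freq = relative_freq(productions(toy_tree))
```

Each of the three dictionaries could serve as one block of features for a classifier. A text with many relative clauses would show a higher share for rules such as `("NP", ("NP", "SBAR"))`.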
Stephen Cass: What typifies successful and unsuccessful novels?
Yejin Choi: So, there are a number of things I could go through. One thing that surprised us is that successful novels use more nouns and adjectives than verbs and adverbs, and use more prepositions and conjunctions. These distributions of parts of speech are, in fact, somewhat similar to the writing style of journalism. In other words, highly successful novels resemble the writing style of journalism a little more than their less successful counterparts do.
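The part-of-speech comparison Choi describes can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the study's method: the tiny tag lexicon below is hand-made and hypothetical, and a real analysis would run a trained part-of-speech tagger over full novels.

```python
from collections import Counter

# Hypothetical hand-made lexicon mapping a few words to coarse tags.
# It stands in for a real part-of-speech tagger.
TOY_LEXICON = {
    "the": "DET", "old": "ADJ", "grey": "ADJ",
    "ship": "NOUN", "harbor": "NOUN",
    "sailed": "VERB", "slowly": "ADV", "into": "PREP",
}


def pos_distribution(tokens, lexicon=TOY_LEXICON):
    """Return the share of each part-of-speech tag among the tokens."""
    tags = Counter(lexicon.get(token, "OTHER") for token in tokens)
    total = sum(tags.values())
    return {tag: n / total for tag, n in tags.items()}


def nominal_and_verbal_share(dist):
    """Return (nouns + adjectives, verbs + adverbs) as shares of all tokens."""
    nominal = dist.get("NOUN", 0) + dist.get("ADJ", 0)
    verbal = dist.get("VERB", 0) + dist.get("ADV", 0)
    return nominal, verbal


tokens = "the old grey ship sailed slowly into the harbor".split()
dist = pos_distribution(tokens)
nominal, verbal = nominal_and_verbal_share(dist)
```

Computing the same two shares for a more successful and a less successful book would give one simple way to compare them on this dimension.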
Stephen Cass: Fascinating. So I noticed that in some genres your techniques were more successful than others. What were the most successful genres for you?
Yejin Choi: That was the adventure genre, for which we were able to achieve about 84 percent accuracy. On average we were able to achieve about 77 percent accuracy.
Stephen Cass: And what category were you least successful in?
Yejin Choi: That was historical fiction, and honestly, I’m not sure why that was the case.
Stephen Cass: Have you had any interest from publishers looking for a “prediction machine”?
Yejin Choi: Yes, actually. I’ve been receiving inquiries about whether we could make these technologies available, for a variety of reasons. Some were from publishers trying to assess the writings of novelists more efficiently and effectively. I was also getting inquiries from novelists, wondering if they could somehow make use of our technologies.
Stephen Cass: So that actually leads me on to my final question. Is there a danger that if statistical stylometry does become more commonplace, its predictions will become first self-fulfilling prophecies and then less acute at sorting the good from the mass of the not so good, because writers will deliberately try to write in ways tuned to score highly?
Yejin Choi: I like that question. I think it may not be very easy for a writer to just try writing differently to fool the classifier, because that takes a systematic change in writing. Just tuning the few cues we mentioned in our study will not be enough, because the computers are looking at hundreds of thousands of little pieces of evidence, so you would actually need to address these collectively. Another question is whether these technologies would always lead publishers to the correct decision. I would think the answer is no, but these technologies could serve as a reference point for humans in the end.
Humans need to be the final judges, because computers can’t appreciate the content of the novels in the way humans can. Computers don’t have access to commonsense knowledge, and they don’t share the spectrum of emotions that humans do. That being said, I think it could potentially increase the degree of connection between publishers and novelists. Instead of going through manuscripts one by one, the publishers might be able to browse a larger collection of books more efficiently that way.
Stephen Cass: Terrific. Well, Yejin, thank you for joining us today.
Yejin Choi: Thank you for having me. It was my pleasure.
This interview was recorded Wednesday, 26 February 2014.
Audio engineer: Francesco Ferorelli
Segment producer: Barbara Finkelstein
Photo: Tim Hester/iStockphoto
NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum’s audio programming is the audio version.