The Liberal Arts Goes Data Mining

Google's Ngram Viewer gives historians and linguists a new perspective on the past two centuries

Steven Cherry: Hi, this is Steven Cherry for IEEE Spectrum's "This Week in Technology."

The late Jim Gray, database researcher extraordinaire, was one of the first prominent voices for data-driven science. Distinguishing it from the three existing science paradigms of theory, experimentation, and computation, Gray wrote that the "techniques and technologies for data-intensive science are so different that it is worth distinguishing [it] as a new, fourth paradigm for scientific exploration."

The Large Hadron Collider, for example, will generate about 15 petabytes of data per year—a petabyte is a million gigabytes. That's nothing compared to what happens when we map an entire brain, which will involve about a million petabytes of data. Neurobiology, genetics, particle theory, astronomy, chemistry, climate studies, materials science, and network theory are just a few areas already being transformed by large databases.

Now this revolution is coming to the humanities. Google's massive book program, which has already digitized about 15 million books, has spun off an application that gives researchers access to a database of about 500 billion words across 10 language sets and two centuries. Ngram Viewer, as it's called, does more than provide a unique look at the history of words. It promises to change how historians do their work and to change our very picture of history itself.

My guest today is Peter Norvig. He's the director of research at Google and one of the coauthors of the paper in the journal Science that introduced the database to the world. Peter, welcome to the podcast.

Peter Norvig: Great to be here, thanks.

Steven Cherry: Peter, one discovery already made has been "lexical dark matter"…

Peter Norvig: Yeah. So this is the idea that if we look at a dictionary there is a certain set of words that are in there, but if we look at what's actually published in books, there are other words that just don't show up in the dictionaries. Does a word meet a certain level of frequency where it seems to be coming into the language, and then at what point does it officially show up in the dictionaries? And we show that that process doesn't always match, that there are words that are becoming more common that are actually in common usage for many years before they show up in the dictionary, and now we have a way to track that. So somebody was pointing out, you know, why is it that the word "polio" didn't become common until the middle of the last century, whereas you would think it would be much earlier, and it turns out that the answer is that it wasn't called polio until the '50s or so. Before that it was called "infantile paralysis."

Steven Cherry: So etymology is going to be transformed as well—this is just really wild. Tell us some other things that people are already using the database for.

Peter Norvig: I thought one of the interesting things was tracking the rise of fame. Andy Warhol said, of course, that all of us are going to get 15 minutes, and what we can see is that's coming true to some degree, in that the pace of fame is increasing. So we look at famous people for each time period in the past, and as we go farther into the past, you see that there's a slope of fame going up, that mentions of someone's name become more common during their lifetime and that it reaches a peak, and that it starts to trail off. So you have a slope going up and a slope going down as you look at the time graph. And we find that over time as we get closer to the present, both of those slopes are getting faster. So we find that people are becoming famous faster and becoming forgotten faster.

Steven Cherry: That's an interesting thing to find. But the database is limited to books, and I'm just wondering how hard you think it is to draw conclusions such as this one about fame, when fame goes beyond books itself—I mean magazines, newspapers, just to talk about printed matter…

Peter Norvig: Yeah, I think that's right. So this is providing one look on the world, and books do have a different quality than other takes that we do find in a magazine or a newspaper or an Internet message or something. We think books are important for a number of reasons. There's a certain level of quality control. You're going to find fewer typos and so on because books tend to be proofread. The editors, the publishers thought this was noteworthy enough to actually put it into the record, so it's more formal in some sense than other corpora of information you could find, but it has a certain level of authority and something backed up behind it. And that means it is going to lag a little bit. So that's why we think it's important to have this historical record so that if you really wanted to know what's happening right now, you probably wouldn't look at books because that would lag a couple of years behind. But if we have this record going back in time, then we think it's very useful.

Steven Cherry: I was interested to see that some people are already looking at censorship and propaganda, and it looked like those were pretty timely. There were some examples such as Marc Chagall in Germany. I guess his work was censored?

Peter Norvig: Well, it's not so much his work as the mentions of him. So we looked at people at various places in time, and look at it in different cultures, and we saw that when various artists didn't fit what the local state wanted, their mentions started to go down. So if you didn't fit with the Nazis' view of the world, then you weren't mentioned in German books at that time, whereas mentions in the rest of Europe or the United States were still high, indicating that there was some fame for somebody like Chagall.

Steven Cherry: Here again, it seems like books are giving us a view that maybe does have some errors around the margins. So, for example, we're looking at mentions of Chagall during particular years in the German language and in the English language, but of course a German language book could be printed in England, for example, where there wouldn't be censorship of Chagall's name. And similarly, China with regards to Tian'anmen Square, someone in another country than China could publish a book about Tian'anmen Square in Chinese.

Peter Norvig: That's right. So we can't make the determination of doing that. So I think this is a comparison of sort of the wholesale versus retail—that a historical scholar can go through book by book, or original source by original source, and say here's the exact context of where this was published, who it was published by, and how that lays down. But it takes a long time to do that, so you can only do a certain number of them. What we're doing is providing another way to look at it, and not taking anything away from that approach of carefully looking at each piece, but now we're saying we're looking at the bulk. And you're right, we're not going to get everything exactly right. There are going to be these borderline cases.

Steven Cherry: Let's talk about the database itself. We just mentioned the languages. About three-quarters of it is in English?

Peter Norvig: Yeah, so it's concentrating on English. We have the other languages as well. There's about half a trillion words altogether. So it's English, and we have that broken up into American and British, and we have a separate part broken out just for fiction. And we have Chinese, French, German, Russian, Spanish—I think that's it.

Steven Cherry: After the books are scanned into Google Books, what's involved in turning them into this database?

Peter Norvig: So we scan the books, then we do optical-character recognition over the books to figure out the sequences of words, and then we just count them up. How often does each word occur, each two-word sequence, three-, four-, and five-word sequence—count all those up, put them together, and then put them in the right pile. You know, say "Which language is this?" If it's in English, is it British versus American, is it fiction or not, and so on.
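The counting step Norvig describes can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Google's actual pipeline: the function name `count_ngrams` is hypothetical, the text is assumed to be already OCR'd, and words are split naively on whitespace with no handling of punctuation, language detection, or sorting into corpora.

```python
from collections import Counter


def count_ngrams(words, max_n=5):
    """Count every 1- to max_n-word sequence in a list of words.

    Hypothetical helper for illustration; keys are tuples of words.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the word list.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Stand-in for the OCR output of one page (an invented sentence).
text = "the cat sat on the mat and the cat slept"
counts = count_ngrams(text.lower().split())

print(counts[("the",)])        # how often the single word "the" occurs
print(counts[("the", "cat")])  # how often the two-word sequence "the cat" occurs
```

In a real corpus the same tallies would presumably be kept per year and per language set, which is the "put them in the right pile" step mentioned above.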

Steven Cherry: You know, Peter, I kind of hyped the potential for these large databases to change scholarship in my intro. Do you think I overhyped it?

Peter Norvig: I think it's a tool that provides people the ability to do things they haven't been able to do before. I think like anything else it's just a tool. Maybe if you compare it to paleontology, it's more like the bulldozer than the fine pickax and brush. You know it's not going to find the fine details, but it's going to allow you to look at a lot more than you've been able to look at before. But people are still going to have to have the right ideas. We give you access to this Ngram Viewer, but it's a blank box, and you have to decide what's worthwhile to look at.

Steven Cherry: Well, I'm really enjoying the fact that it's open to the general public as well. A journalist from Mother Jones, in his article about Ngram Viewer, compared "data is" to "data are" and found that the wide discrepancy between them is closing, and that in other words "data is" is becoming almost as common as "data are," which is tremendous ammunition for me to direct at our copy desk.

Peter Norvig: [Laughs] Well, that's good—always happy to help out there. I think "data" is more of a mass noun, so I think that makes sense to me.

Steven Cherry: I agree. Let me ask you this. All of our listeners can just jump into the database with a simple Web form, basically—you just put in the language you want, the term or terms you want to track, and the years. But I guess you're also making the database available in its entirety to researchers?

Peter Norvig: That's right. So it's easy for anybody to just go in and do a search, but if you're more of a serious researcher and you have more specific questions that can't be answered one search at a time, you can download all the original data, you can run whatever programs of your own you want over that. And that would allow people to do whatever kind of research they want.
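As a rough idea of what running "programs of your own" over downloaded counts might look like, here is a minimal Python sketch that turns raw counts into per-year relative frequencies. Everything specific in it is an assumption: the three-column tab-separated layout (n-gram, year, count) is a simplification and the real files may be laid out differently, the function name `relative_frequency` is hypothetical, and the numbers are made-up placeholders, not real measurements.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows, NOT real counts.
# Assumed layout: ngram <TAB> year <TAB> match_count
SAMPLE = (
    "polio\t1940\t10\n"
    "polio\t1955\t90\n"
    "infantile paralysis\t1940\t40\n"
    "infantile paralysis\t1955\t10\n"
)

# Made-up totals of all words counted in each year.
TOTAL_WORDS = {1940: 1000, 1955: 2000}


def relative_frequency(rows, totals):
    """Map each n-gram to {year: count / total words counted that year}."""
    freq = defaultdict(dict)
    for ngram, year, match_count in rows:
        year = int(year)
        freq[ngram][year] = int(match_count) / totals[year]
    return freq


rows = csv.reader(io.StringIO(SAMPLE), delimiter="\t")
freq = relative_frequency(rows, TOTAL_WORDS)

for ngram, by_year in freq.items():
    for year in sorted(by_year):
        print(f"{ngram}\t{year}\t{by_year[year]:.4f}")
```

Dividing by the yearly total matters because far more books were published in later years, so raw counts alone would make almost every word look as if it were on the rise.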

Steven Cherry: Tell us about the team you've been working with. I guess it's led by a couple of Harvard researchers?

Peter Norvig: The core team is two researchers at Harvard: J.B. Michel and Erez Lieberman Aiden working in Steven Pinker's group. So they worked together with some of the engineers at Google to make the work available, and then there's about a dozen people altogether who pitched in, most of them at Harvard, and then some advisors from Encyclopaedia Britannica, from Houghton Mifflin, some people associated with MIT, so quite a team of people helping out.

Steven Cherry: Well, very good. On behalf of all lovers of language and history, let me thank you and Google and the research team for this amazing resource.

Peter Norvig: Great, and go on out there and make some more discoveries.

Steven Cherry: Very good, thanks a lot. We've been speaking with Peter Norvig of Google research about Google's new database of 500 billion words and what it might mean for historical research. For IEEE Spectrum's "This Week in Technology," I'm Steven Cherry.

NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum's audio programming is the audio version.