The Cultural Treasures in Google Ngram

A database of words yields new findings for historians, linguists, and psychologists


Steven Cherry: Hi, this is Steven Cherry for IEEE Spectrum’s “Techwise Conversations.”

In the classic children’s book The Phantom Tollbooth, the Kingdom of Wisdom is divided between two rulers, the Mathemagician and King Azaz the Unabridged—in other words, numbers versus words.

Quite some time ago, computers and databases transformed the way we deal with numbers. It’s inconceivable that somebody would study physics or finance without mining millions of bits of data. On the arts and letters side of the Kingdom of Wisdom, though, scholarly research is still done by sifting through facts a few at a time, like the gleaners of wheat in old paintings.

Until recently. We now have the ability to run words through the threshing machine, thanks to a remarkable tool. We first reported on the Google Ngram database a couple of years ago in a “Techwise Conversation” with Peter Norvig, Google’s director of research. Since then, historians, linguists, sociologists, and psychologists have begun to see what riches the database can yield.

My guest today is Jean-Baptiste Michel. He’s a post-doctoral fellow in the Department of Psychology at Harvard and coauthor of a paper, “Quantitative Analysis of Culture Using Millions of Digitized Books,” published last year in the journal Science. A highly entertaining version of the paper was given as a TED Talk as well. He joins us today by phone from New York City.

J.B., welcome to the podcast.

Jean-Baptiste Michel: Thank you.

Steven Cherry: Let’s start with the Ngram database. Remind us where it comes from and what an Ngram is.

Jean-Baptiste Michel: Yes. First, I’d like to mention that this work was done in collaboration with Erez Lieberman Aiden, who is also at Harvard. And we’ve worked with Peter Norvig and many other people to create this database. So an Ngram, the thing that we use to study culture, is a single word. And a single word can be understood as a “1-gram.” So a gram is a unit of—a unit of meaning, really, in computer science. It’s a computer science term. So a 1-gram means that you have one unit; “2 grams” means that you have two units. So typically, a gram could be a character, but in this context, a gram is really a string of characters that’s separated by spaces. So you can understand a 1-gram as a word, but you can also have numbers that are 1-grams; like 3.14 would be a 1-gram. A word that is misspelled would also be a 1-gram. So 1-grams are a bit larger than words.
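The definition J.B. gives—a gram is any string of characters separated by spaces—can be sketched in a few lines of Python. This is an illustrative toy, not Google’s actual extraction pipeline:

```python
# Toy n-gram extractor matching the definition above: tokens are
# whitespace-separated strings, so words, numbers like "3.14", and
# misspellings all count as 1-grams. Illustrative only.

def ngrams(text, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("pi is about 3.14", 1))
# [('pi',), ('is',), ('about',), ('3.14',)]
print(ngrams("pi is about 3.14", 2))
# [('pi', 'is'), ('is', 'about'), ('about', '3.14')]
```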

Steven Cherry: Very good. And so this all comes from the Google Books database.

Jean-Baptiste Michel: So, actually the portion that we worked with, which is taken from 5 million books—5.2 million books, really—represents 4 percent of the total number of books that have ever been written. So it’s a pretty sizable portion of culture as it can be found in books. Now, Google has digitized more books than that. At the time when we did this study, it was around 18 million, I believe. We did not use the remaining books because of one of the issues you have when you deal with 18 million books: you need to be sure that a book really was written when it’s said to have been written.

So let me explain this a little bit more. Google went and digitized all these books. But then they relied on some other people like metadata providers, university libraries, to tell them what the book was, who wrote it, when it was written, what it is about. That association between metadata and data is sometimes problematic. So you would often find a book in the Google Books database that has a mistake in the date. So you think that the book was written in 1917, but actually it’s been written in 1920.

Now when you have—when you want to do computational analysis of these books, you need to weed out all these mistakes. So that’s why we selected a subset of these 18 million books that had the highest quality in terms of its metadata. We selected these 5.2 million books for which we were most certain that the date was correct. It took us a while to do that, but the result means that when you plot these 1-grams, these trajectories of words over time, you can be relatively certain that the trajectory really reflects what actually happens in the book record.

Steven Cherry: Very good. So you’re using the Ngram database to study culture, and in the paper you used the term culturomics, which is by analogy with genomics, I guess.

Jean-Baptiste Michel: Yes, exactly.

Steven Cherry: The study of the genome. So, for example, you mentioned in the paper the polarization of the United States into North and South in the run-up to the Civil War in 1861. How does that show up in the database?

Jean-Baptiste Michel: So you can use the database in two ways, really. One way is you know of something that happened in the past and you want to see how it’s reflected in the language, in our culture, in the books that were written over the past 200 years. And so here, for instance, you know that the Civil War occurred at a particular date, and you know that, of course, the Civil War involved the North and the South. And so you can just plot these two words and see that the North and the South are words that gain in prominence. They’re being written about more. The words the North and the South are used more and more in the run-up toward the Civil War.

During times of war, actually, you see an increased frequency, an increased usage, of words like war, but also peace or the enemy or soldiers or things like that—of course, obviously you’re going to talk more about war and peace during war and peace. Yet it’s pretty interesting to see exactly to what extent these words are used more or less, and what their dynamics were before, during, and after the event that you are concerned with.

Steven Cherry: Yeah. The war is an interesting example. I guess the term for World War I—before there was a World War II—was “the Great War.” And you can see that shift from “the Great War” to “World War I” in the database as well. You mentioned the two ways of looking at the database, and I was going to ask you about that anyway because you found something interesting about censorship. And Peter Norvig mentioned this when he was on the show, but I confess I didn’t get the full significance at the time. His example was the artist Marc Chagall. And he said you can see what you would expect in the rise of the use of Chagall’s name as he became more famous through the 1920s and 1930s, except in German texts. And this corresponded to his work being censored by the Nazis because he was Jewish. So you can see the censorship in the Ngram graphs. But in your paper—and this is what I missed with Peter when I talked with him—you talk about going the other way around. That you can mine the data directly for these gaps and find occurrences of censorship that you might not otherwise have known about. So, how does that work?

Jean-Baptiste Michel: Yes, exactly. So that’s actually a very, very powerful signal. We did not really expect to see that. That’s one of the beauties of this data: there’s just so much of it that you discover so many new things that you just did not expect. So here in particular, we had spent some time looking at the trajectories in our culture of famous people to learn more about how fame is reflected in our culture, in the book record. And typically, you have a name that grows and grows and grows, and then, at some point years later—some kind of optimum, like 70 years after the birth of the famous person—it starts declining little by little. Basically, it grows until it reaches some optimum.

And we looked at, indeed, the German language, and we saw that for a fair number of names there was actually a huge depression. Like, the name would start growing, growing, growing, but then at some point would start declining, reaching almost zero for five to eight years, and then picking back up again. And that’s, of course, during the Nazi regime, where there was not only censorship but suppression more generally.

So my point is that these signals were very, very strong. You could really clearly see the mark of suppression. Like, you can imagine—if you can visually imagine this—you have a line that kind of goes up, and then you imagine taking your thumb and pressing down on that line at a particular point till it hits the ground, ground zero, where nobody mentions that name anymore. That’s a dramatic demonstration of the effect that a particular person can have on the lives of people and their representation in culture.

So these signals were so strong that we thought, well, we don’t need to know who’s censored. We should be able to discover it by looking for that signal, by looking for basically things you don’t expect. So you could take—in particular, the way we did it was the following. We devised a suppression index, which is—which worked the following way. You take a name, and there’s a specific period in time that you’re interested in—say, during the Nazi regime. And you’ll say, “Well, let’s look at the fame before and after and do a linear interpolation for what I expected, what I would expect the fame should be during the period of oppression. And then I can also look at what the actual fame was during that period of oppression. By dividing one by the other, I have a suppression index.”

And so you can compute that suppression index for any name, for any word for that matter, for any Ngram. So we computed that suppression index for names in the English language, all right? And then we saw that this distribution of suppression indices was very wide. In particular, there were many people to the very far left of this curve. Meaning that, for instance, Pablo Picasso was suppressed 10 times as much as—was suppressed by an index of 10. Meaning that he was talked about 10 times fewer than you would have expected otherwise.
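The calculation just described—interpolate the expected fame across the suppression window, then divide actual by expected—can be sketched as a toy in Python. All of the frequencies and years below are invented for illustration; an index well below 1 marks a name mentioned far less than the interpolation predicts:

```python
# Toy suppression index, per the description above: compare actual
# mean frequency during a period against a linear interpolation
# between the frequencies just before and just after it.
# All numbers are invented for illustration.

def suppression_index(freq, start, end):
    """Actual mean frequency over [start, end] divided by the
    linearly interpolated expectation for the same years."""
    f0, f1 = freq[start - 1], freq[end + 1]   # bracketing years
    span = (end + 1) - (start - 1)
    years = range(start, end + 1)
    expected = sum(f0 + (f1 - f0) * (y - start + 1) / span for y in years) / len(years)
    actual = sum(freq[y] for y in years) / len(years)
    return actual / expected

# A name rising steadily, then nearly absent from print 1933-1945
freq = {y: (y - 1900) * 1e-6 for y in range(1900, 1961)}
for y in range(1933, 1946):
    freq[y] = 1e-7
print(suppression_index(freq, 1933, 1945))  # well below 1: suppressed
```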

When you have huge databases of text like that, you can detect suppression in the past, but you can also detect suppression in the present, which is extremely important. So using the methods we introduced in this paper, one could imagine looking for depressions, looking for statistical suppression, in contemporary texts, in contemporary Web pages, in contemporary regimes. So this actually opens the door to the possibility of automatically detecting suppression, quantifying its extent, and describing where exactly in the cultural record its weight falls, which is, I think, a really interesting direction.

Steven Cherry: I’m glad we’re talking about all these artists, Chagall and Picasso. Because it was another famous artist, Andy Warhol, who said everyone has 15 minutes of fame. And I gather you think that will soon be down to, I don’t know, seven and a half, if it isn’t already. So tell us what you found out about fame itself.

Jean-Baptiste Michel: So we basically took all the names that we found in Wikipedia; there’s, like, 700 000 names. And Adrian Veres, who was a student of ours working on this paper, narrowed this list of hundreds of thousands of people down to, like, 200 000, I believe, so as to remove duplicate names. One of the issues is that some names appear multiple times. Some famous people share their names with other famous people, which is somewhat problematic.

In any case, we took hundreds of thousands of names and their birth dates, and we looked at the trajectories of these names in the book record. And we assembled cohorts of famous people—say, all the famous people born in 1871. And then we looked at their trajectories, and we devised a measure for their fame. Actually, Adrian Veres and John Bohannon, a journalist at Science, devised what they call the milli-Darwin, which is a unit of fame. So the unit of fame is the average frequency with which your name is mentioned between the year you turn 30 and the year 2000, in this particular book record.

And then you compare that measure with that of Charles Darwin, a very famous name. And this division here corresponds to the milli-Darwin. So Charles Darwin has 1000 milli-Darwins. You see, Noam Chomsky has 500 milli-Darwins, which is very, very large. Somebody like Steven Pinker would have 32 milli-Darwins, I believe.
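As a toy illustration of the unit (with invented, flat frequencies standing in for real corpus data):

```python
# Toy version of the milli-Darwin: a person's mean n-gram frequency
# from the year they turn 30 through 2000, divided by Charles
# Darwin's (born 1809), times 1000. Frequencies here are invented.

def mean_fame(freq, birth_year):
    years = range(birth_year + 30, 2001)
    return sum(freq.get(y, 0.0) for y in years) / len(years)

def milli_darwins(freq, birth_year, darwin_freq, darwin_birth=1809):
    return 1000 * mean_fame(freq, birth_year) / mean_fame(darwin_freq, darwin_birth)

darwin = {y: 2e-6 for y in range(1839, 2001)}    # flat, hypothetical
someone = {y: 1e-6 for y in range(1930, 2001)}   # flat, hypothetical
print(round(milli_darwins(someone, 1900, darwin)))  # 500
```

A person mentioned half as often as Darwin over the window thus scores 500 milli-Darwins, matching the Chomsky example above.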

Anyway, so you can rank people. For each birth cohort, you can rank the people by their milli-Darwins. And then you can look at the most famous people, the top 50 people. And then you realize that when you look at the trajectories of these 50 people—of course, every one trajectory is very different from the others. Like the trajectory of Orville Wright, one of the inventors of flight, would be very different from the trajectory of Rutherford, the chemist. Nevertheless, on average, the 50 most famous people share a common trajectory.

So here’s how it works. After they’re born, they’re of course not talked about; they are not famous yet. And then, after some point, the median fame of these 50 most famous people increases exponentially. It doubles every so many years—there’s a doubling rate of fame. That phase lasts for about 40 years. Then this median fame reaches an optimum at the age of about 70, and then it starts decreasing, again exponentially but at a much lower rate. The half-life there is about 70 years.

So you see that—we basically saw that for extreme fame—basically, the cohort of the most famous people born in a given year—extreme fame has a very specific shape that can be captured by four core parameters: when they become famous, how fast their fame rises, when they reach an optimum, and how fast their fame declines. Now what’s striking is that we can observe how these parameters have changed over the course of the last 200 years.
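The four-parameter shape described above can be written down as a toy curve. The parameter values here (onset age, doubling time, peak age, half-life) are illustrative assumptions, not the paper’s fitted numbers:

```python
# Toy four-parameter fame curve per the description above:
# exponential rise up to an optimum around age 70, then exponential
# decay with a ~70-year half-life. Parameter values are illustrative.

def fame(age, onset=30, doubling=8.0, peak_age=70, half_life=70.0, peak=1.0):
    if age < onset:
        return 0.0  # not yet talked about
    if age <= peak_age:
        return peak * 2 ** ((age - peak_age) / doubling)   # rising phase
    return peak * 2 ** (-(age - peak_age) / half_life)     # decline

print(fame(70))   # 1.0, the optimum
print(fame(62))   # 0.5: one doubling time before the peak
print(fame(140))  # 0.5: one half-life after the peak
```

Shrinking `doubling` from 8 to 3 years reproduces the shift J.B. describes next: later cohorts’ fame rises much faster.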

So we discovered that fame comes earlier in life, it rises faster, but it also goes away somewhat faster. To give you an example, the cohort of, I believe, 1800 would become famous in their late 40s. Somebody like Che Guevara, for instance, was famous way before he was 30. I think he was 27 when he was fomenting revolutions in South America. That’s a pretty impressive feat. And so fame comes earlier in life, but it also rises faster. The doubling time of fame of the cohort of 1925, let’s say, was, I think, three years. So it would double every three years, compared with eight years a century and a half before.

So these dynamics that we’ve observed are what led us to joke that Andy Warhol might be right in principle, but the numbers are actually a bit faster than what he thought.

Steven Cherry: Yeah, I guess they will probably continue to keep changing. You mentioned the duplicates, and I wanted to talk a little bit about the limitations of the database. If somebody just uses the term Bush, it could mean the 41st president or the 43rd president—or perhaps just the shrubbery. You have to distinguish between Apple, the computer company, and the apples that we eat. How is this done, and to what extent can you do it?

Jean-Baptiste Michel: So in most cases we cannot do it, to be clear. There are some cases when we can. First of all, the data that we have is case sensitive, so Bush with a capital B would not be the same 1-gram as bush with a small b. So there you can distinguish the vegetable from the person. It’s the same with Apple, for instance. A lowercase apple would be the fruit, and an uppercase Apple would be the company, most of the time.

Steven Cherry: Right. But the two presidents you would not be able to distinguish.

Jean-Baptiste Michel: Right.

Steven Cherry: And you used the example of 3.14 before, but if somebody has 3.14159, we would say they’re referring to the same thing, but they’re different Ngrams.

Jean-Baptiste Michel: Yes, exactly. No, definitely, you’re completely right. That’s a very good point you’re making. So in the case of George Bush Senior and Junior, there’s a major problem: we cannot distinguish them just on the basis of their names.

Steven Cherry: Another limitation is you’re only looking at books. It occurs to me that—well, speaking of the Bushes, you found that being a politician was better from a fame point of view in the long run than being an actor, say, but being an actor was better in the short run. But it occurs to me that that’s to some extent, I would think, a consequence of looking at books instead of newspapers and magazines, right? I mean, magazines are still writing about Michael Jackson and Amy Winehouse, and they’re not writing about the Bushes so much.

Jean-Baptiste Michel: Yeah, I think you’re right. Of course, we’re looking at the book record. And one thing needs to be really clear—we tried to make it clear in the paper and every time we speak—this is just one fraction of culture, and culture as a whole, of course, has infinitely more facets than what can be seen in books. Some aspects of culture are not even written, like the way you wear your baseball cap, for instance—whether it’s turned around. The way you put the hat on your head would be an element of culture that is not really recorded in books. So there’s a big part of culture that cannot be found in books.

We can study 4 percent of all books ever written, which is large. It’s not yet easy to access all the newspapers. It’s not yet possible to access all the artwork. It’s not yet possible to access all the archeological record. But in my lifetime, clearly, that will be the case. And then we can decide, we can discover what parts of the early findings were specific to the medium that they were—that they emanated from, as opposed to what parts of them are general.

Steven Cherry: J.B., you mentioned Steven Pinker, the famous psychologist. And you, yourself, have an M.S. in applied mathematics and a Ph.D. in systems biology. So, what’s the connection to biology?

Jean-Baptiste Michel: So to me, language and culture are biological objects, in essence. And they’re extremely important to—I mean, they are obviously important to all life on earth. Like, human culture is transforming all life on earth. So it’s a really important subject of biological study.

Steven Cherry: Well, I think it’s just an amazing branch of study that you’re opening up. I mean, we can see—I mean, there’s the whole comparison between memes and genes. In the long run, perhaps, with a little more metadata we can see ideas spread across different societies as well as time.

Jean-Baptiste Michel: Yeah, exactly. I mean, one of the intents was really that—so memetics is a very interesting theory, but it remains a theory because there was really no data, essentially, to compare it against. Extremely limited data. And so we thought, “Well, because of the digitization of the historical record, we can now have data against which we can really measure the effectiveness of theories like memetics.”

So when we started this study, we had in mind things like, we want to quantify the evolutionary rate of language, things like that. And so that was our angle into the data. But then when we got the data, we realized that there’s just so much more to it.

Steven Cherry: Well, J.B., we really appreciate your work in threshing through all of this data and for joining us today.

Jean-Baptiste Michel: Sure, no problem. Thank you very much for having me on the show.

Steven Cherry: We’ve been speaking with Jean-Baptiste Michel, a mathematical biologist at the Department of Psychology at Harvard, about the cultural treasures being revealed in the Google Ngram database. For IEEE Spectrum’s “Techwise Conversations,” I’m Steven Cherry.

Announcer: Techwise Conversations is sponsored by National Instruments.

This interview was recorded 26 June 2012.
Segment producer: Barbara Finkelstein; audio engineer: Francesco Ferorelli.


NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum’s audio programming is the audio version.
