Facebook Knows Your Friends—Even if They’re Not on Facebook

Facebook can infer many things, even about people who deliberately stay away

Hi, this is Steven Cherry for IEEE Spectrum’s “Techwise Conversations.”

We all have things we don’t want to put on Facebook, and for some, the loss of privacy is so large that they stay off the social network entirely. But it turns out that, to quote heavyweight champion Joe Louis, “You can run, but you can’t hide.”

To quote a research paper published last month in PLoS One, “With the help of machine learning, social network operators can make predictions regarding the acquaintance or lack thereof between two nonmembers with a high rate of success.”

It’s been known for a while that Facebook makes a shadow profile of people it learns about who aren’t on Facebook. What the researchers here found was that they could predict, with a surprising degree of accuracy, whether two such nonmembers were acquainted with each other.

The paper, entitled “One Plus One Makes Three (for Social Networks),” was written by a team from the University of Heidelberg, and my guest today is the corresponding author, Katharina Anna Zweig. Nina, as her colleagues call her, is a professor of theoretical computer science, specializing in network analysis and graph theory. She joins us by phone from her home in Heidelberg, where it’s late at night already.

Nina, welcome to the podcast and guten Abend.

Katharina Anna Zweig: Thank you very much for this invitation to speak to you.

Steven Cherry: So, there are maybe several parts to this. First, social networks often make guesses about whether two people know each other, and we’ve all seen this. LinkedIn, for example, has a “people you may know,” Facebook has a whole column on the side called “Know them?” Google+ has something similar. How do these networks make those inferences?

Katharina Anna Zweig: So, I don’t know, actually. So we—actually, this was the starting point of our project. So my colleague, Professor Hamprecht, got such an e-mail, like “You are a nonmember, but you might know these members of”—in this case—“Facebook,” and he was startled. Because these people were not people that invited him personally. It was—it was a good set of people, and he actually knew almost all of them, and he’s in machine learning, so he—he was really startled, and so the...how? How do they know? And then we found out that whenever you register with Facebook, Facebook will ask you for your full e-mail address book. And so it’s quite easy for them to know that these members know him, even if they didn’t choose to invite him personally.

Steven Cherry: Okay, so tell us about your research and how you tried to figure out what Facebook was doing.

Katharina Anna Zweig: So we teamed up, a machine learner and a network analysis guy, and we wanted to understand whether we could infer something about the relationship between nonmembers if we had information about the network between members and who they know outside of any social network. And so what we did is that for each two nonmembers, we looked at the people that are on the member side and how they are connected to each other. And you can call this “features.” So, for example, we looked at person A and person B, and these are nonmembers. And we looked at how their friends on the social network platform would be connected. And this would give a vector of 15 features, or properties. And then with a quite standard machine-learning algorithm, we tried to give a prediction value to these vectors, and what happens then is that for each two people we get a score. And we would then say that the highest 10 percent or 20 percent we would predict as being connected, and then we can check in our data whether they really are connected or not. And this is what we did. And yeah, we saw that, under some assumptions, we can predict about 40 percent of the connections between nonmembers correctly.
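
As a rough illustration of the procedure described above, the following Python sketch builds a small feature vector for each pair of nonmembers, scores it, and predicts the top-scoring fraction as connected. Everything in it is an assumption made for illustration: the toy graph, the three features (the study used 15, which are not listed in this interview), and the hand-written `score` function, which merely stands in for a trained model.

```python
from itertools import combinations

# Hypothetical toy data: friendships among members, plus the nonmember
# contacts that members have disclosed (for example via address books).
member_edges = {("m1", "m2"), ("m2", "m3"), ("m1", "m3"), ("m4", "m5")}
contacts = {  # nonmember -> members who know them
    "A": {"m1", "m2", "m3"},
    "B": {"m1", "m2"},
    "C": {"m4"},
}

def linked(u, v):
    """True if two members are friends on the platform."""
    return (u, v) in member_edges or (v, u) in member_edges

def features(x, y):
    """Three illustrative features for a pair of nonmembers."""
    common = contacts[x] & contacts[y]
    union = contacts[x] | contacts[y]
    pairs = list(combinations(sorted(common), 2))
    # Share of the common member friends who are friends with each other.
    density = sum(linked(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
    jaccard = len(common) / len(union) if union else 0.0
    return [len(common), jaccard, density]

def score(vec):
    """Placeholder for the trained model: an unweighted sum (assumption)."""
    return sum(vec)

def predict_top(fraction):
    """Predict the highest-scoring fraction of nonmember pairs as connected."""
    pairs = list(combinations(sorted(contacts), 2))
    ranked = sorted(pairs, key=lambda p: score(features(*p)), reverse=True)
    k = max(1, round(fraction * len(pairs)))
    return ranked[:k]

print(predict_top(0.34))  # → [('A', 'B')]
```

Here A and B share two member friends who also know each other, so their pair ranks first; C shares no member friends with anyone and scores zero.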

Steven Cherry: We should point out that, I guess, you were looking at both predicting that nonmembers were friends when they were and were not when they were not. And that both went into the 40 percent.

Katharina Anna Zweig: Yes, it did. In a way, because it’s always easy if you predict everyone to be connected, then you will always get a 100 percent score, but you will make a lot of wrong guesses.
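
The trade-off mentioned here can be made concrete with two standard measures, usually called precision and recall (terms not used in the interview itself). The sketch below uses made-up pairs to show that predicting every pair as connected finds all true links but is wrong most of the time.

```python
def precision_recall(predicted, actual):
    """Precision: share of predictions that are right.
    Recall: share of true links that were predicted."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: four nonmembers, only one real link among them.
all_pairs = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D")]
true_links = [("A", "B")]

# Predicting everyone as connected: recall is 1.0, precision only 1/6.
print(precision_recall(all_pairs, true_links))
```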

Steven Cherry: Yeah, I guess we should talk about the machine learning a little bit. I gather this involves taking a training set, so you take a sample where you do know everything, and you figure out basically an algorithm, and then you—I’m sorry—you figure out an algorithm, and then you point it to a training set where you already know all of the answers.

Katharina Anna Zweig: And of course we also know it for doing the quantitative analysis. So we not only know the answers for the training set, but we also know them for the nontraining sets, and this information is used to assess the quality of the algorithms. So I told you earlier about these 15 properties in the vector, and not all of these properties are really important. The training phase is used to find out which ones are: the machine-learning algorithm helps us to find out which of the properties are really important for predicting whether two nonmembers are connected or not. And the machine-learning algorithm learns this based on some training sets where we tell it which of the nonmember pairs are really connected, are friends. Based on this, the machine-learning algorithm learns ways. So, for example, it learns that if two nonmembers have at least five friends in common on the social network platform side, and these five friends are also connected with a probability of 0.5, then this is a very good indicator that the two nonmembers are also connected or are also friends. And so this is the first phase of the machine-learning algorithm, where we learn the ways of the teachers. And then [in] the second phase, we will then make our predictions on a set of samples that the algorithm has not seen before. And these two sets are really independent, so there’s no chance of doing anything spooky there. And in the second phase, we will then evaluate how good our algorithm was. So hopefully in the second phase we do have the real relationships, and we only try to predict as many relationships as there really are in the data sets. And based on this known “ground truth,” as we call it, we can say how many of the ones that are really there were predicted by our algorithm.
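
The two phases described above can be sketched as follows. This is not the study's algorithm: a single-threshold rule (a "decision stump") is swapped in as the simplest possible learner, and the feature vectors and labels are invented. The training phase picks the feature and threshold that best separate connected from unconnected pairs; the evaluation phase ranks unseen pairs by that feature, predicts exactly as many links as truly exist, and reports the share recovered.

```python
def train_stump(samples):
    """samples: list of (feature_vector, is_connected).
    Returns the (feature index, threshold) whose rule
    'value >= threshold' gets the most training samples right."""
    best_index, best_threshold, best_correct = 0, 0.0, -1
    for i in range(len(samples[0][0])):
        for t in sorted({vec[i] for vec, _ in samples}):
            correct = sum((vec[i] >= t) == label for vec, label in samples)
            if correct > best_correct:
                best_index, best_threshold, best_correct = i, t, correct
    return best_index, best_threshold

def evaluate(samples, feature_index):
    """Rank held-out pairs by the chosen feature, predict as many links
    as really exist, and return the fraction of true links recovered."""
    k = sum(label for _, label in samples)
    ranked = sorted(samples, key=lambda s: s[0][feature_index], reverse=True)
    found = sum(label for _, label in ranked[:k])
    return found / k if k else 0.0

# Hypothetical vectors: [common member friends, density among those friends]
train = [([6, 0.7], True), ([5, 0.5], True), ([1, 0.9], False),
         ([0, 0.0], False), ([2, 0.4], False)]
test = [([7, 0.6], True), ([1, 0.8], False),
        ([3, 0.5], True), ([4, 0.1], False)]

feature_index, threshold = train_stump(train)  # phase 1: learn from labels
recovered = evaluate(test, feature_index)      # phase 2: unseen pairs only
print(feature_index, threshold, recovered)     # → 0 5 0.5
```

On this toy data the learner settles on "at least five common friends" as its rule, and on the held-out pairs it recovers one of the two real links.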

Steven Cherry: So what was your—what was the set of people that you worked on?

Katharina Anna Zweig: Okay. So, this is a very good and crucial point. We were able to get access to five real Facebook data sets from five universities. So this is data from 2005, and at this point almost 80 percent of all of the students in these five universities were really members of Facebook. And we know the total network structure there. What we did not know is who these people knew outside. Yeah, so we just had the data between the people on Facebook. So, what can you do if you don’t have the data that you need? You need to emulate it in a way. So we took this data and had five very different evolutionary models of how people decide to become a member of a new social network platform. So how did you decide to become a member of any social network platform? Maybe one of your friends was already there. Or you thought it was just a very cool social network platform. And so our five models scale between a totally dependent behavior, where your own decision depends on how many of your friends already are members, and a totally independent decision. And we have five different models, but we’re kind of scaling between these extremes. And we used these models to artificially divide this real data set into members and nonmembers. For this artificial division, we knew exactly who is connected to whom, but our algorithm would not know this. So for all five evolutionary models of how somebody decides to become a member, we got almost the same results qualitatively.
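
One way to picture this artificial division is the simulation sketched below. It is a guess at the general idea only: the single `dependence` parameter, which slides between fully independent joining (0) and joining driven purely by the share of friends already on the platform (1), is an assumption and does not reproduce any of the study's five models. The ring-shaped friendship graph is likewise invented.

```python
import random

def split_members(friends, target_fraction, dependence, seed=0):
    """friends: dict node -> set of friends (the full, known network).
    Nodes join one at a time until target_fraction are members."""
    rng = random.Random(seed)
    nodes = sorted(friends)
    members = {rng.choice(nodes)}  # one initial member
    target = round(target_fraction * len(nodes))
    while len(members) < target:
        candidates = [n for n in nodes if n not in members]
        weights = []
        for n in candidates:
            # Share of this person's friends who are already members.
            share = len(friends[n] & members) / len(friends[n]) if friends[n] else 0.0
            weights.append(dependence * share + (1 - dependence))
        if sum(weights) == 0:  # nobody has a member friend: pick uniformly
            weights = [1.0] * len(candidates)
        members.add(rng.choices(candidates, weights=weights, k=1)[0])
    return members, set(nodes) - members

# Hypothetical network: ten people in a ring, each knowing two neighbours.
friends = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
members, nonmembers = split_members(friends, 0.8, dependence=1.0)

# Ground truth kept back from the predictor: links between nonmembers.
hidden = {(u, v) for u in nonmembers for v in friends[u] & nonmembers if u < v}
print(len(members), len(nonmembers), hidden)
```

Because the full network is known before the split, the links in `hidden` can later be used to check any prediction about nonmember pairs.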

Steven Cherry: Now, it turns out that the prediction is more accurate, I gather, the more, the higher percentage of people are on Facebook. So I guess in the United States, for example, something like 70 percent of all adults in the U.S. are on Facebook, and that makes it relatively easy, I gather, to make predictions about the remaining 30 percent.

Katharina Anna Zweig: Yes it does. So this is one parameter we took into account. I really need to stress that we are not talking about Facebook specifically, because we are talking about all social network platforms that have information about their members and the connections of members to nonmembers. So of course you’re totally right that this parameter is very important. If we were looking at a social network platform where only 5 percent or 10 percent of a population are members, then we couldn’t do much, because basically machine learning always depends on statistics. But what we saw is that, first of all, we don’t need to look at the full population. Sometimes we might be only interested in schoolkids or students of a given university. And in the subpopulation, as you said, the percentage might be very high and might go to 70, 80 percent. For our machine-learning algorithm, we don’t need much more than 50, 60 percent.

Steven Cherry: So I guess the lesson here is that if people stay off of a social network, whether it is Facebook or any other, because they basically don’t want that social network to have information about them, in a way, staying off may not be sufficient, because the social network may know about them through the people who are members and can start to collect information and make inferences about them, even if they’re not members.

Katharina Anna Zweig: You know, I’m a scientist, I always want to say that “What can I really infer, and what can I not infer?” Ah, yeah, there is an important point. The important point is that in our study, we only used the contact data. Because we didn’t have any other social information about the people. So if we were a social network platform, we would have additional information on the age, on the location, on the education, whatever. And if you take this additional information into account, it’s very likely that you can do a better inference than what we did now.

Steven Cherry: So these supplement the 15 features or replace some of them.

Katharina Anna Zweig: Totally.

Steven Cherry: And the accuracy could potentially be much, much higher than that 40 percent.

Katharina Anna Zweig: Yes. So this is what my experts, my colleagues in machine learning, say. They expect that if you take this additional information into account... We know that a connection between people is always driven by homophily, so it’s much more likely that two people of the same age will be connected than two people that are very different in age, and so on. So taking this additional data into account would certainly improve the quality. And also the social network platform has time, or dynamic, information, because they can see when a former nonmember...whether their predictions were right or not. So...and it’s very, very important to us that we don’t say that social network platforms are currently doing this, and I also wouldn’t know what type of harm can be done with it. And so our study is not about this. We don’t know anything about this. It’s just about this fact that staying away from social network platforms is not enough. And I was very surprised, because when I thought about these results, I felt guilty. So we have a very famous German magazine, and I often send an article to my brother, to my father, via this website, yeah? I just click on the link “send article to friend.” And so I’m actually doing the same thing, because this platform can see what types of articles I like. And when I give information or when I enter the e-mail address of my brother, then this platform also gets information that I have a relationship to this person and that I think this person is interested in the article. So I think actually our study hints at something much bigger. We are constantly leaving information about relationships to others, and a relationship is by definition information between two people. But if I give you the information that I know my brother, I know this person, can he use this information? Is this okay or not? And I think as a society we need to speak about this. What do we want companies or others to do with information about a relationship that is only revealed from one of the two sides?

Steven Cherry: And the information implied by these connections can be pretty personal. They could involve political beliefs; they could involve, perhaps, illnesses that a person might be suffering from; they could involve a person’s sexual orientation—or anything.

Katharina Anna Zweig: Yes, definitely. So I saw on some wall on a social network platform something like “Get well soon.” And this...this was information that the other person revealed about the person whose wall it was. So I think we cannot really prohibit our friends from posting this information, but we need to talk about whether this information can be used or should be free in any way.

Steven Cherry: Very good. Well Nina, as social networks become more powerful, it’s very important to figure out just how powerful they’re becoming, so thank you for your research, and thank you for joining us today.

Katharina Anna Zweig: Yeah, thank you very much.

Steven Cherry: We’ve been speaking with network researcher Katharina Anna Zweig about how when a social network gets big enough, it can figure out a lot about the people who stay off the network. For IEEE Spectrum’s “Techwise Conversations,” I’m Steven Cherry.

Announcer: “Techwise Conversations” is sponsored by National Instruments.

This interview was recorded 9 May 2012.
Audio engineer: Francesco Ferorelli

NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum’s audio programming is the audio version.