Your Favorite Stores Know You All Too Well

A Techwise Conversation with data-mining expert Eric Siegel


Hi, this is Steven Cherry for IEEE Spectrum’s “Techwise Conversations.”

Back in 2004, Spectrum had a special issue on technology and privacy. In an article with the literary title Sensors and Sensibility, we quoted a privacy expert who said, “Cash registers are no longer adding machines with cash drawers. They’re high-speed, data-collecting computers with connections to the Internet. And shopper loyalty cards tie that data to your identity. The whole goal is to figure out everything you can learn about your customer. We’re creating a retail zoo, where customers are the exhibits.”

That was eight years ago. Since then, computers have improved at a Moore’s law pace, data-modeling methods have improved even faster, and all those databases have eight more years’ [worth] of data. It’s not just what we tell our favorite shops about ourselves, it’s what they can also infer—a recent New York Times article noted that stores can now tell that a woman is pregnant almost as soon as she knows it herself, and they can send her coupons for the right products right when she needs them.

This new science of figuring out what you need and when—all to get you to click or swipe or reach for your wallet—is called predictive analytics, and my guest today is Eric Siegel, president of Prediction Impact and the conference chair of Predictive Analytics World. He has a Ph.D. in computer science from Columbia University and taught there for several years. His company is in San Francisco, but he joins us by phone from New York City.

Steven Cherry: Eric, welcome to the podcast.

Eric Siegel: Thanks, Steven. It’s a pleasure to be here.

Steven Cherry: Eric, yesterday I watched a 13-minute video on your site. Before I could watch it, I had to give an e-mail address. There was other information I could give—my name, phone number, and so on—but only the one required field. Was that just to get me on your mailing list, or how much can you figure out from an e-mail address?

Eric Siegel: Hah. Well, it depends on the person’s e-mail address. It’s actually a pretty tantalizing example when we’re talking about predictive analytics. So, you know, taking a step back, predictive analytics, we define it as data-driven technology that generates data-driven scores for each individual consumer or prospect or other organizational elements. And in general the value, the thing we want to figure out about people as companies, and in business applications—although the applications go far beyond other sectors, including law enforcement and health care—is something about what they’re going to do. So it does come down to prediction, where you’re going to get value. And an e-mail address often is very predictive of the chances that you’re going to convert, for example, from a free to a paying member. Not in the case of my business, but in the case of clients I’ve worked for, such as online dating sites, where there may be a free membership level. And the chances that somebody’s going to upgrade their membership to some kind of premium or paying does turn out to be very well correlated with the domain of their e-mail address. And so it’s a predictive thing. Yes—you know, we’re a very small business, but it behooves us to collect contact information to have marketing e-mail lists.
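As a rough illustration of the point about e-mail domains, here is a minimal sketch of how one might tabulate upgrade rates by domain. The function name, the record layout, and the sample data are all hypothetical and made up for illustration; nothing here comes from Siegel's clients or tools.

```python
from collections import defaultdict


def conversion_rate_by_domain(records):
    """Map each e-mail domain to the fraction of its users who upgraded.

    `records` is an iterable of (email, upgraded) pairs; this layout is an
    assumption made for the sketch.
    """
    totals = defaultdict(int)
    upgrades = defaultdict(int)
    for email, upgraded in records:
        # Take everything after the last "@" and normalize the case.
        domain = email.rsplit("@", 1)[-1].lower()
        totals[domain] += 1
        if upgraded:
            upgrades[domain] += 1
    return {domain: upgrades[domain] / totals[domain] for domain in totals}


# Made-up sample records: (e-mail address, upgraded to paid membership?)
SAMPLE = [
    ("a@example.com", True),
    ("b@example.com", False),
    ("c@example.org", False),
    ("d@example.org", False),
    ("e@Example.com", True),
    ("f@example.org", True),
]

rates = conversion_rate_by_domain(SAMPLE)
for domain, rate in sorted(rates.items()):
    print(f"{domain}: {rate:.0%} upgraded")
```

A real model would combine the domain with many other signals; a per-domain rate like this is only the simplest form of the correlation described above.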

Steven Cherry: An e-mail address is nothing compared to those shopper loyalty cards, and we know how these work for the customer: You let the supermarket or drugstore or whatever swipe your card or bar code tag on your key chain and you get a discount or some other reward. But why don’t you tell us what goes on from the store’s point of view?

Eric Siegel: Well, I thought that that was an entertaining quote that you read, that retail stores are now the new zoo, and the customers are the exhibit. I think that when people are being observed, whether it’s part of the social sciences or it’s in the name of improving the efficiency of marketing, just because you’re observing a person does not mean that they’re being treated like an animal. It’s in fact an incredible value, not only to the company but even to the end consumers and many applications where marketing is made more targeted and more efficient. For example, you’re going to get less junk mail at home if they can predict ahead of time which of these things—[they] are going to spend the money printing and mailing, not to mention whatever environmental impact there is from the use of trees to make paper and such—before doing that, if they can predict, “Hey, this person’s really much less likely than average to pay any attention and to respond positively to this marketing outreach.”

Steven Cherry: So let’s just talk about the processes themselves. There are software models, I guess provided by companies like yours, and then the data is kind of poured through them. Where do the models come from?

Eric Siegel: Yes, so predictive analytics is essentially engaging in a learning process, otherwise known as predictive modeling. In an academic and research world, it’s often referred to as machine learning. Because basically by using data, and by leveraging it as the true asset it is, we’re able to generate these models—not that they necessarily predict accurately. So, for example, the recent story in The New York Times that you alluded to actually originally broke a year and a half ago at the conference I chair, Predictive Analytics World. The speaker from Target was presenting a keynote, and he told the audience that they were endeavoring to predict which of their female customers were pregnant in order to target certain marketing activities to those customers. The New York Times article implies very clearly that those predictions were accurate. But they weren’t. These kinds of things are never accurate. It’s not medical diagnosis; it’s not based on medical information, not based on prescription information. We can assume it’s not based on pregnancy tests. It’s general products that were purchased at Target, such as vitamins and lotions, which are two of the examples which have been disclosed publicly. But in fact, just predicting better than guessing really turns out to be extremely valuable for all kinds of organizations. Saying, “This customer is three times more likely to respond or to make a purchase.” They still may only be 6 percent out of 100 probability of making the purchase. They still may not make the purchase, but it’s about tipping the balance and playing a numbers game in a different way.
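The "three times more likely" and "6 percent" figures can be worked through as a small arithmetic sketch. The 2 percent baseline is inferred here by dividing 6 percent by three, and the mailing size of 10,000 is a made-up number for illustration; neither is stated in the interview.

```python
def lift(segment_rate, baseline_rate):
    """Ratio of a targeted segment's response rate to the overall rate."""
    return segment_rate / baseline_rate


def expected_responses(n_contacts, rate):
    """Expected number of responders among n_contacts at a given rate."""
    return n_contacts * rate


SEGMENT_RATE = 0.06   # the 6 percent mentioned in the interview
BASELINE_RATE = 0.02  # inferred: 6 percent / 3 (an assumption)
MAILINGS = 10_000     # hypothetical campaign size

segment_lift = lift(SEGMENT_RATE, BASELINE_RATE)
untargeted = expected_responses(MAILINGS, BASELINE_RATE)
targeted = expected_responses(MAILINGS, SEGMENT_RATE)

print(f"lift: {segment_lift:.1f}x")
print(f"expected responses, untargeted mailing: {untargeted:.0f}")
print(f"expected responses, targeted mailing:   {targeted:.0f}")
```

Under these assumptions, 94 percent of the targeted recipients still do not buy, which is the sense in which the prediction is "never accurate" yet still valuable: the same number of mailings yields roughly three times the responses.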

Steven Cherry: So, companies like Netflix and Amazon make predictions about the things that you’ll like, and they do it by comparing you to similar shoppers and knowing what other things those people like. Is this similar to that?

Eric Siegel: Yeah, very much so. I mean, what you’re learning about customers is sort of the types of patterns and things that hold in general about people and their buying behavior relative to what’s known in their profile about what they’ve purchased before, anything demographic that might be known about them, and indeed what the learning process is really discovering is trends across people. And how people can...basically in many methods, many of the modeling methods, it’s basically about grouping or clustering and to sort out birds of a feather, as it were.

Steven Cherry: You were in the natural-language-processing group at Columbia. How important is that to analyzing these databases? And maybe just start by reminding us what natural language processing is.

Eric Siegel: Oh, sure. Well, [with] natural language they’re referring to languages of human beings. So, English and languages spoken across the globe. You know, people these days are throwing around a broad estimate, saying that 80 percent of the data—you know, data which I’ve pointed out here has this value of being an encoding of history—that 80 percent of it is in the form of text. Textual. So again, it’s sort of the difference between academic and commercial nomenclature, is that [the] academic version of predictive modeling is called machine learning, the commercial version of natural language processing is called text mining or text analytics. So when you talk about “textual” or “unstructured”—that is to say, untagged—data, you’re often referring to data that was written by a human being, you know, of course with the intention of conveying information to other human beings, which means certainly it’s very rich, but you need to have special techniques to try to extract its meaning, or what meaning may be in one way or another encoded there. And so certainly, you know, all the magazine articles and blogs and Web pages and tweets and such—you know, there’s a great anecdote that’s gotten a lot of press over the last year, that if you do an analysis of the general mood of the population by looking at all the tweets, you know, all the Twitter feeds going on right now, and you look at the mood of the general population, it turns out to be predictive of the Dow Jones Industrial Average. And there are live hedge funds basing their automated decision making on that type of technique.

Steven Cherry: To what extent are stores combining their in-store data with other information? I mean, you know, driver’s license data can be bought from the government, for example, if you’re using that, credit card information, if you’re using that—just briefly.

Eric Siegel: Yeah, no, that’s a great question. So, generally companies accumulate so much information, and it’s their internal information which is either sufficient, or it’s the lion’s share of where they’re going to get value. On the other hand, of course, there’s plenty of cases where people will then purchase external data to augment it. Usually it can be pretty expensive to do that, so it may not be worth it, but the question is so tantalizing: How many people are just sticking with their own internal or going with their external?

Steven Cherry: Eric, some people feel like their privacy is being violated when a store knows more than a few things about them or starts making inferences about them. The New York Times article that we’ve been talking about, that pregnancy example, you know, it was the department store Target that was sending coupons to a regular customer that it had deduced was pregnant and who turned out not only to be pregnant but a teenager, and her father saw these coupons and was, like, you know, why are you getting ads for maternity clothing? And she hadn’t told him yet that she was pregnant. So what are the ethical implications of all this data mining?

Eric Siegel: Well, you know, that’s an unfortunate story for sure, if it’s true; it’s unsubstantiated. And even if it’s true, it’s certainly not clear whether it’s connected to their initiative to predict pregnancy. Target themselves has made it very clear that they understand they don’t want customers to know that they’re considering them possibly pregnant if they haven’t volunteered the information themselves. If indeed that story happened, it was a gaffe and may not have had anything to do with math or analytics. Somebody was put in the wrong marketing bin. That’s a serious gaffe, and I think that they’re paying the price for that—that particular anecdote. More broadly, the ethical issues are extremely difficult to fathom. I think that there’s an inherent tension. People on my side of the fence who are practitioners of this type of analysis simply are going to have an inherent bias toward wanting access to more information. But keep in mind that by performing the analysis, you’re actually doing the opposite of invading an individual’s privacy. You’re not drilling down to look at one person’s records. You’re not drilling down to…excuse me [coughs]...sorry about that. You’re not just drilling down to look at one person’s records, individual record, but it’s the opposite: You’re doing analysis across great numbers of data records in order to ascertain what the general patterns and trends are that hold across large populations of customers. So the problem with privacy is first and foremost what data the company has in the first place, rather than what analysis they’re doing. What they’re allowed to store, who amongst the internal employees have access to that data. Any e-mail provider, right? Gmail, Hotmail, anybody who’s hosting your e-mail or doing anything else in the cloud on your behalf, has that data. Does that mean any of the employees can go look at all the data they have about you? So that’s about data access and data security, and policies around that. Not about analysis. On the other hand, if I peer into my friend’s shopping cart while we’re walking next to each other and I think to myself, Gee, she bought this lotion, and I’ve seen that lotion before...I wonder if my friend is pregnant, have I just committed a thought crime? And if so, are people who are critics of Target basically holding them accountable for what could sort of in certain contexts be considered a thought crime? So I would say, ultimately, at the end, I would have to take an agnostic stance. I think it’s a really difficult question to answer but that we have a lot of work to do in terms of aligning expectations and cultural context, you know, between the consumers and the organizations to try to resolve these issues.

Steven Cherry: You know, the 2004 Spectrum special report that I mentioned at the top of the show was about databases, but it was also about all the sensors that we expect to be deployed in the future. That’s been slow in coming, but is there another revolution in the works when Google and Starbucks and Target know where you are all the time or start to get to know your physical shopping habits?

Eric Siegel: Well, the ability of the device to have the GPS or other types of technology that knows your [location]...my sense within the telecommunications industry—and I’ve never sat squarely within that industry—but my sense has always been that they know darn well that that’s extraordinarily sensitive stuff. See, that’s the thing about all this. The data that’s the most valuable is also intrinsically the most sensitive. So there’s inevitably a certain tension that exists and a great need for us as a society to work through the dialogue and figure out what’s sensitive and does it become less sensitive if we can define exactly how it’s used. But you know, something like location is extremely meaningful and important. It’s about relevancy and immediacy, and in many cases and in many ways that can be used to great value and convenience to the end consumer. But it could also be abused as well.

Steven Cherry: Can you give us some examples of how location information might be used?

Eric Siegel: I mean, when you place ads on Google AdWords, you can target the ads by geography, right? I mean, any small business, even a one-person business such as myself. In the case [of my own business, you] know, I conduct these training seminars in predictive analytics, which take place twice a year in North America, and, you know, I place Google ads that are in the geographical vicinity of, you know, people who are going to view the ads and depending on where the location of the workshop is going to be conducted. And I think the same thing will apply in a much more minute-to-minute real-time, much more granular way. You’re walking by a Starbucks, but in both cases, you know, just like any other marketing endeavor, there’s this balance between spam and, you know, too much content versus value, so…

Steven Cherry: Well, I mean, are we going to see things like you know, eight times in the last year you’ve gone to a movie on a Friday night, it’s a Friday night, you’re at the mall, and suddenly you’ll get a discount offer on your phone, that, you know, a movie, that it also knows exactly what type of movie you tend to go see on a Friday night, that movie is starting in 15 minutes—here’s a dollar-off coupon?

Eric Siegel: I mean, yeah, sure. So that…I mean, again, this isn’t so much about, you know, analysis or math or predictive analytics, but just in general if organizations can know where you are and potentially send you relevant offers—and I can only imagine as well as anybody that there’d be a balance between getting too many offers pushed onto your mobile device too often versus the convenience.

Steven Cherry: Well, Eric, it’s a brave new world of marketing out there. Thanks for walking us through your part of it.

Eric Siegel: Sure, thanks a lot, Steven. It’s a pleasure to speak with you.

Steven Cherry: We’ve been speaking with data-mining expert Eric Siegel about the power of databases to know what you’ll be buying, from whom, and when.
For IEEE Spectrum’s “Techwise Conversations,” I’m Steven Cherry.

Announcer: “Techwise Conversations” is sponsored by National Instruments.

This interview was recorded 27 March 2012.
Segment producer: Barbara Finkelstein; audio engineer: Francesco Ferorelli


NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum’s audio programming is the audio version.