Steven Cherry: Hi, this is Steven Cherry for IEEE Spectrum’s “Techwise Conversations.”
In a recent podcast, I was surprised to learn there were 93 000 data scientists registered with Kaggle, the site that creates competitions among them and helps award freelance contracts. I’m not the only one. The article in The Atlantic that brought Kaggle to our attention had a parenthetical exclamation: “Who knew there were that many data scientists in the world!”
The next obvious question is, How do I get myself in on the lucrative area of data science? As it happens, the trusty New York Times wrote about that back in April. Its premise was, universities offer courses in a hot new field, data science. Data science sure is hot. Is it new? Sort of.
The Times article quoted an adjunct professor at Columbia University, who described a data scientist as “a hybrid computer scientist/software engineer/statistician.” And it says that this fall, “Columbia will offer new master’s and certificate programs heavy on data. The University of San Francisco will soon graduate its charter class of students with a master’s in analytics. Other institutions teaching data science include New York University, Stanford, Northwestern, George Mason, Syracuse, UC Irvine, and Indiana University.”
My guest today, also with Columbia University, is Chris Wiggins, a professor of applied mathematics there. He’ll walk us through what’s old and new in the academic field of data science. He joins us by phone.
Chris, welcome to the podcast.
Chris Wiggins: Thanks, Steven, for having me.
Steven Cherry: So what’s meant by a “master’s program heavy on data”?
Chris Wiggins: Well, in general, in data science, part of what makes you a data scientist is your ability to deal with real-world data. So normally in the classroom, if you picture professing the way it’s been done for several centuries, there’s somebody at the front of the room, and they talk to you, maybe you write down what they said, and a few months later you repeat it back to them. We’re hoping not to do that in terms of a data-science education. We’re hoping to incorporate a lot of dealing with real data, and you can’t deal with real data without understanding computation. So the pedagogy will be coupled to some very applied work in applied statistical computation.
Steven Cherry: So how is this different from statistics or applied math?
Chris Wiggins: Part of it is historical. I think within statistics, in particular, you see this very strong, almost heretical, trend ever since World War II of statisticians saying, “We really need to be less about mathematical statistics and more about applied statistics and really trying to understand real-world data.” You can see this in writings of John Tukey, who was a mathematical statistician, who really wrote sort of a call-to-arms paper in 1962 advocating a much more data-driven branch of statistics be created.
In terms of applied mathematics, similarly, post–World War II applied mathematics was largely about partial differential equations and solving them numerically. This field, data science, is more about trying to deal with real-world data than most of applied mathematics has been, in part because most of applied mathematics has been about “How do we solve models which we understand?” And data science is often about problems where the real world hands you data, but the real world does not hand you a particular model to use as the way we understand those data.
Steven Cherry: So where do the engineers in our listenership fit in here? Would they have no trouble getting the certificate? Would it be a waste of time? Should they get the master’s degree?
Chris Wiggins: I wouldn’t say it’s a waste of time. The choice between certification and master’s is in part about your eventual career trajectory and about how many hours and how many years you want to spend on the process. You know, a master’s program is usually a three-semester program. A certification can be as little as, you know, in our case it’s four classes—it’s four new classes that are specifically engineered around data science.
The particular skill sets that come to be useful in data science include computation, mathematical modeling, and some understanding of how to build mathematical models of the real world. That’s why you see data scientists coming from fields as broad as computer science, astronomy, physics, applied mathematics—they come from many different trainings.
Steven Cherry: The adjunct professor that described the data scientist as a hybrid computer scientist/software engineer/statistician is herself a research scientist from industry. Is that something you’re looking for?
Chris Wiggins: Absolutely. You know, in engineering in particular, and I am a professor at an engineering school, a large part of the goal is to build and to make. So engineering education in the United States over the last century has benefited from a very close interplay with practitioners, and often the people who are dealing with things in practice are using technologies that are not yet widespread in academia, but will be. So part of the best thing that engineering faculty can do is to be literate in the technologies that will become textbook material in the coming decades. Fortunately for us, New York City is a fantastic place for that. New York City has an abundance of tech companies and data-oriented start-ups. Even in the traditional verticals like advertising or publishing, these fields themselves have become incredibly data driven, so New York City is a fantastic environment for an engineering education in data science.
Steven Cherry: Yeah, I was going to point that out. Both Columbia and New York University here in Manhattan are jumping on this bandwagon, and Google, Microsoft, and Yahoo all have research labs here. Twitter has a presence. How much of this is being driven by the Web and social media?
Chris Wiggins: In New York City, that’s one of the things that’s very strong. So New York City has traditionally been a strong home of media, but also advertising, and that’s closely coupled to social media. So that’s one particular economic area that’s really strong in New York City.
Steven Cherry: New York’s also the financial center of the United States. Do you see a lot of students possibly going on to Wall Street or maybe the other way—students coming from Wall Street to ramp up their data-science skills?
Chris Wiggins: These days, in 2013, it’s more the latter. So, I’ve been on the faculty at Columbia for a long time—well, since 2001—and that economy from 2001 to 2008, that sort of rapidly and consistently growing sort of Madoff-like economy, just getting higher and higher economic indicators, was a time when many of the students that I taught went down to Wall Street after they graduated from Columbia, and that was sort of the dominant narrative about what you do with quantitative training in New York City at that time.
Things have changed dramatically in the last five years, and they’re changing faster and faster all the time. So with the economic upheavals of 2008, and in the quantitative world, there were some indicators even before that, in 2007. People started thinking more seriously about their opportunities post-Columbia, and more and more of those people every year started going into start-ups. And now this is a well-documented trend of people leaving finance, either to finance start-ups, or people going from financial services into venture capital or other ways of investing in start-ups, or technologists leaving financial services to go and create their own start-ups or to join excellent start-ups that we have here in New York City that need their skills.
Steven Cherry: There’s quite a lot of data coming from the government now with the promise of much more to come. We had on the show recently Jim Hendler of Rensselaer Polytechnic Institute, which is also in New York state, upstate. He plans to harness IBM’s Watson software to link government databases to one another and to mine them. That’s a computer science application specifically in artificial intelligence. Is AI a big part of this, or will it be?
Chris Wiggins: So AI certainly is a historical part of machine learning, which is a field that is closely related to data science. Many of the tasks in machine learning are of the form of artificial intelligence, that is, trying to get a computer to do the kind of thing that a human being might do if she or he were infinitely patient and were labeling data and things like that.
But many of the things that you do in data science are not like that at all. Exploratory data analysis, for example, is the kind of thing that you do which really requires you to provide an interpretation of data, which is difficult to turn into an algorithm. And also, many of the tools that are used in data science are recognized as coming more from machine learning or statistics than from traditional AI. I should say, though, that if you look at AI, machine learning, data science, applied mathematics, computer science, all of these fields have nonzero overlaps. So part of what’s happening in data science is a very selective choosing of tools from these different fields that turn out to be useful.
Steven Cherry: The show we had about Kaggle suggested that the field is changing even as people are studying to enter it. For one thing, it may be more freelance-driven than it used to be. Do you see that happening?
Chris Wiggins: Actually, yes. I do know many people who have a strong academic training and are doing freelance work in data science, and I don’t know if that’s sustainable or transient. I mean, in the applied mathematics world, for example, part of being a broad applied mathematician is that you have a set of tools that are useful in many different fields and you can publish in many different fields.
In statistics this is true as well. John Tukey famously said the great thing about being a statistician is you get to play in everybody else’s backyard. The tool set of data science is extremely broad, and so it’s precisely the kind of thing that you can imagine somebody doing as a freelance way of life, in which you work on different problems.
And, again, this is in many ways not novel. Again, coming back to John Tukey, he did plenty of traditional mathematical statistics, but he also did lots of consulting for technology companies, the Educational Testing Service in Princeton, the United States government. I mean, statistics and statistical analysis of data are tool sets that really lend themselves to application in a broad variety of fields.
Another sort of anointed name in applied computational statistics is Leo Breiman, who, like John Tukey, was trained as a mathematical statistician, left academia for several years and just worked on freelance and consulting projects, and came back to academia, where he was a statistics professor at Berkeley—one of the oldest and most venerated statistics departments in the States.
Steven Cherry: So what’s your picture of a career as a data scientist, say, five years from now?
Chris Wiggins: Well, I think five years from now there will be much more of a credentialing process in place. People will have much more agreement on what it means to have a background in data science. The funding situation is constantly improving, so from the perspective of an academic, doing research where you are advancing both machine learning and a domain expertise at the same time, which is another way of saying applications of data science in the natural sciences, has been difficult to fund relative to simply proposing to do things that are already recognized as core skills of that field. I think that that is changing all the time, and there’s more and more funding opportunities in academia for faculty and graduate students to apply data science in the natural sciences.
I think more and more companies are going to think more seriously about how to use data science as part of their core product. So there’s a feeling now, there’s attention by the press to data in product and data in companies, so there’s a sense now that we should do something with data, and I think in another five years it will be much more clear what is the something that people in different industries should be doing with data to try to put data science to work. It’s still sort of early days, though, for people to say, you know, “What is a data science training?” and “How is data science different from both existing fields in academia and existing jobs within industry? How is data science different from business intelligence or other fields that you might see in industry?”
Steven Cherry: So just to sum up, do you see data science as a good career move for a certain kind of engineer?
Chris Wiggins: Yeah, absolutely. So, I will agree with what Jim Gray of Microsoft said years ago, which is that every field of human endeavor is being progressively transformed by abundant data. So both because there are more data available, the data are cheaper to store and transmit, and there’s easy access to data, and that’s changing every field of human endeavor, both in the academy and industry. So fluency in data science is an actual investment.
Steven Cherry: Well, very good, Chris. You know, my B.A. is in mathematics, half of it, actually, so I think if this had existed back then, I would have been very attracted to it. Thanks for being in the forefront of it, and thanks for joining us today.
Chris Wiggins: Sure. Thank you for making the time and having me on.
Steven Cherry: We’ve been speaking with Columbia University professor of applied mathematics Chris Wiggins about the emergence of data science as a distinct academic discipline.
For IEEE Spectrum’s “Techwise Conversations,” I’m Steven Cherry.
This interview was recorded Tuesday, 21 May 2013.
Segment producer: Barbara Finkelstein; audio engineer: Francesco Ferorelli
NOTE: Transcripts are created for the convenience of our readers and listeners and may not perfectly match their associated interviews and narratives. The authoritative record of IEEE Spectrum’s audio programming is the audio version.
To Probe Further
IBM’s Watson Tries to Learn…Everything: What happens when Watson learns a million databases? RPI students and faculty hope to find out
Data Science Is Now a Job Market Based Entirely on Merit: A start-up ranks data scientists and creates competitions between them for specific consulting projects
Feeding the World With Big Data: Agriculture experts say that open data could lift people out of poverty
IT Has 26 Words for Data Mining: As data proliferate, so do words for handling them (Technically Speaking, IEEE Spectrum, December 2011)