Top Programming Languages Trends: The Rise of Big Data
Languages like Go, Julia, R, Scala, and even Python are riding the number-crunching wave
Now that IEEE Spectrum is into the third year of annually ranking languages, we can start looking at some trends over time. What languages are ascendant? Which are losing traction? And which of the data sources that we use to create our rankings are contributing the most to these shifts?
In this article I’m going to focus on so-called big-data languages, such as Julia, Python, R, and Scala. Most of these are purpose-built for handling large amounts of numeric data, with stables of packages that can be tapped for quick big-data analytic prototyping. These languages are increasingly important, as they facilitate the mining of the huge data sets that are now routinely collected across practically all sectors of government, science, and commerce.
The biggest mover in this category was Go, an open source language that Google began developing in 2007 to help solve the company’s issues with scaling systems and concurrent programming. In the default Spectrum ranking, it’s moved up 10 positions since 2014 to settle into 10th place this year. Other big-data languages that saw moves since 2014 in the Spectrum ranking were R and Scala, with R ascending 4 spots and Scala moving up 2 (although down from 2015, when it was up 4 places from its 2014 position). Julia was added to the list of languages we track in 2015, and in the past year it’s moved from rank 40 to 33, still a marginal player but clearly possessing some momentum in its growth.
The chief reason for Go’s quick rise in our ranking is the large increase in related activity on the GitHub source code archive. Since 2014, the total number of repositories on GitHub that list Go as the primary language went up by a factor of more than four. If we look at just active GitHub repositories, then there are almost five times as many. There’s also a fair bit more chatter about the language on Reddit, with our data showing a threefold increase in the number of posts on that site mentioning the language.
Another language that has continued to move up the rankings since 2014 is R, now in fifth place. R has been lifted in our rankings by racking up more questions on Stack Overflow—about 46 percent more since 2014. But even more important to R’s rise is that it is increasingly mentioned in scholarly research papers. The Spectrum default ranking is heavily weighted toward data from IEEE Xplore, which indexes millions of scholarly articles, standards, and books in the IEEE database. In our 2015 ranking there were a mere 39 papers talking about the language, whereas this year we logged 244 papers.
In contrast to the substantial gains in the rankings seen by open source languages such as Go, Julia, R, and Scala, proprietary data-analysis languages such as Matlab and SAS have seen a drop-off: Matlab fell four places in the rankings since 2014 and SAS has fallen seven. However, it’s important to note that both of those languages are still growing; it’s just that they’re not growing as fast as some of the languages that are displacing them.
When we weight the rankings toward jobs, we continue to see heavily used languages like Java and Python dominate. But recruiters are much more interested in R and Scala in 2016 than they were in 2014. When we collected data in 2014, there were only 136 jobs listed for Scala on CareerBuilder and Dice. But by 2016 there was more than a fourfold increase, to 631 jobs.
This growth raises the question of whether R can ever unseat Python or Java as the top languages for big data. But while R has seen huge gains over the last few years, Python and Java really are 800-pound gorillas. For instance, we found roughly 15 times as many job listings for Pythonistas as for R developers. And while we measured about 63,000 new GitHub repositories in the last year for R, there were close to 458,000 for Python. Although R may be great for visualization and exploratory analysis and is clearly popular with academics writing research papers, Python has significant advantages for users in production environments: It’s more easily integrated into production data pipelines, and as a general purpose language it simply has a broader array of uses.
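For readers who want to check the arithmetic behind these comparisons, here is a minimal Python sketch that turns the counts quoted in this article into ratios. The helper name `growth_factor` and the variable names are illustrative only, and the sketch is not the method used to compute the rankings themselves; it simply divides one quoted count by another.

```python
def growth_factor(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


# Counts as quoted in the article (names are hypothetical, for illustration).
scala_jobs = growth_factor(136, 631)                # Scala job listings, 2014 vs. 2016
r_papers = growth_factor(39, 244)                   # IEEE Xplore papers on R, 2015 vs. 2016
python_vs_r_repos = growth_factor(63_000, 458_000)  # new GitHub repos, R vs. Python

print(f"Scala job listings grew {scala_jobs:.1f}x")
print(f"Papers mentioning R grew {r_papers:.1f}x")
print(f"Python had {python_vs_r_repos:.1f}x as many new repositories as R")
```

By this rough calculation, Scala job listings grew about 4.6-fold, papers mentioning R grew about 6.3-fold, and Python logged about 7.3 times as many new repositories as R.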
These data illustrate that despite the desire of some coders to evaluate languages on purely internal merits—the elegance of their syntax, or the degree and nature of the abstractions used—a big driver for a language’s popularity will always be the domains that it targets, either by design or through the availability of supporting libraries.