The computer industry has been busy in recent years trying to figure out how to speed up the calculations needed for artificial neural networks—either for their training or for what’s known as inference, when the network is performing its function. In particular, much effort has gone into designing special-purpose hardware to run such computations. Google, for example, developed its Tensor Processing Unit, or TPU, first described publicly in 2016. More recently, Nvidia introduced its V100 Graphics Processing Unit, which the company says was designed for AI training and inference, along with other high-performance computing needs. And startups focused on other kinds of hardware accelerators for neural networks abound.
Perhaps they’re all making a giant mistake.
That’s the message, anyway, in a paper posted last week on the arXiv pre-print server. In it, the authors, Beidi Chen, Tharun Medini, and
Anshumali Shrivastava of Rice University, argue that the specialized hardware that’s been developed for neural networks may be optimized for the wrong algorithm.
You see, operating these networks typically hinges on how fast the hardware can perform matrix multiplications, which are used to determine the output of each artificial neuron—its “activation”—for a given set of input values. This involves matrix multiplication because each of the values input to a given neuron gets multiplied by a corresponding weight parameter before being summed together—this multiplying and summing being the basic manipulation in matrix multiplication.
The Rice University researchers, like others before them, realized that many of the neurons in a given layer of a neural network have very small activations, meaning that their outputs don’t contribute to what gets computed by subsequent layers. So if you knew which neurons those were, you could just ignore them.
It might seem, though, that the only way to know which neurons in a layer don’t get activated is first to perform all the many matrix multiplications for that layer. The Rice researchers, led by Anshumali Shrivastava, realized that you can, in fact, make this determination more efficiently by looking at the problem very differently. “We treat this as a search problem,” says Shrivastava.
That is, instead of performing all those matrix multiplications to see which neurons become active for a given input, simply look up which they are in a database. The advantage of treating the problem this way is that you can make good use of a general strategy that computer scientists long ago perfected for speeding database lookups: hashing.
Hashing allows you quickly to check whether something is in a database table without having to look sequentially through each entry in turn. Instead, you use some easy-to-compute hash function on the value of interest, which points to where that value would be stored in the database. You can then just check that one location to see if it’s there.
The Rice researchers do something similar for their neural network calculations. A hypothetical example helps to illustrate their approach:
Suppose you created a neural network intended to recognize handwritten digits. Suppose further that the inputs are the grayscale pixel values of a 16 x 16 array, so 256 numbers in all. Let’s imagine that those inputs feed into a single hidden layer of, oh, 512 neurons, whose activations in turn feed into an output layer of 10 neurons, one for each possible digit.
Nets to tables: Before calculating activations for the neurons in the hidden laters, hashes are used to identify which neurons will be active. Here a hash of the input values (H1) is used to look up which neurons in the first hidden layer are relevant—in this case neurons 2 and 4. Another hash (H2) reveals which neurons in the second hidden layer will contribute. This strategy reduces the number of activations that need to be computed. Image: Rice University/arXiv
Training this network is complicated process, but for the moment, let’s gloss over how that is done and say that we have already adjusted the weights on each neuron so that this network performs brilliantly at recognizing handwritten digits. When a legible digit is presented as the input, one of the output neurons (the one corresponding to the digit that was written) will have an activation near 1. The other nine output neurons will have activations near 0. Normally, operation of this network requires one matrix multiplication for each of the 512 hidden neurons plus one for each output neuron—so a lot of multiplying.
The Rice researchers would handle this differently. The first step would be to hash the weights of each the 512 neurons in the hidden layer using what’s called “locality sensitive hashing,” which has the property that similar inputs produce similar hash values. They could then group neurons that had similar hashes, meaning that these neurons must have similar sets of weights. Each group would be stored somewhere in a database table determined by the hash of the input values that would lead those particular neurons to become activated.
Once all that hashing is done, it’s easy to tell which of these hidden neurons will become activated by some new input presented to the network. Just run the 256 input values through the easy-to-compute hash function and use the result to look up in the database which neurons will become activated. This way, you need only compute the activation values of the few neurons that count. You don’t have to compute the activations of all the other neurons in the layer only to discover that they don’t contribute.
Applying an input to this neural network can be thought of as submitting a query to a database, one that says, “Find all neurons that, if things were actually computed, would become active.” Because you are looking things up with hashes, you get an answer fast. You can then just compute the activations for the small number of neurons that actually matter.
The Rice researches implemented this technique, which they call SLIDE (for Sub-LInear Deep learning Engine), for training a neural network, a process that is more computationally difficult than inference. They then compared the performance of their training algorithm with the more traditional approach of training a neural network with a powerful graphics processing unit—in this case on an Nvidia V100 GPU. What they report was pretty stunning: “Our results show that, SLIDE on a modest CPU can be orders of magnitude faster, in wall clock time, than the best possible alternative with the best possible choice of hardware, at any accuracy.”
It’s too early to know whether these results (which have not yet been peer reviewed) will hold up well enough to get chipmakers re-thinking how to design special-purpose hardware for deep learning. But it certainly highlights the danger of committing to a particular kind of hardware when it’s always possible that a new and better algorithm for making neural-network calculations will come along