16 July 2008—As both computer chips and supercomputers grow more powerful by linking together more and more processors, they risk wasting money, energy, and time by sending data among processors over inefficient routes. The amount of time a supercomputer spends shuttling data around can vary dramatically but averages between 10 percent and 30 percent. Now one Stanford engineer says he and his colleagues at supercomputer maker Cray have the most efficient scheme yet for directing all that traffic—an architecture he calls the ”flattened butterfly.”
According to William Dally, chairman of Stanford’s computer science department, the flattened butterfly cuts the cost of building a supercomputer in half, compared with the cost of using a standard architecture known as a Clos network. Dally’s simulations show that in multicore microprocessors, the flattened butterfly can increase data throughput by up to 50 percent over a standard mesh network, reduce power consumption by 38 percent, and cut latency—the time a data packet spends waiting to be forwarded—by 28 percent.
The flattened butterfly is an update of an architecture known as a butterfly, which has been around since the 1960s. The name comes from the pattern of inverted triangles created by the interconnections, which looks like butterfly wings. Dally flattens the butterfly by combining columns of routers and linking each router to more processors. The new configuration halves the number of router-to-router connections. Data traveling between the processors can now get to any other processor in fewer hops, even though the physical route may be longer, and that eliminates considerable latency.
The original butterfly moved data by the most efficient route, but it couldn’t handle a conflict between two packets trying to use the same connection at the same time. The Clos network overcame that problem by having all the packets overshoot their destination—going to a more distant router and then hopping back to the right location. Unfortunately, most of the time the overshoot isn’t necessary. ”The problem with the Clos network is that it takes twice as many hops as you really need,” Dally says.
The flattened butterfly adaptively senses congestion and overshoots only when it needs to. Dally compares it to deciding which road to take when driving from San Jose to Palo Alto. If an online map tells him there’s a lot of traffic on the shorter Route 101, he’ll take the longer Route 280. If traffic is the same, he’ll choose the shorter way. It’s this adaptive routing that makes the flattened butterfly work.
Though most of today’s multicore chips have no more than eight processors and can simply use a bus to communicate, emerging chips with far larger numbers of cores use what’s called a mesh network, which looks like a Manhattan street grid. Packets might go south down an avenue for a block, east for two blocks, then south again for three, with each intersection of street and avenue representing a router. In a flattened butterfly, there are longer wires connecting distant destinations so the data can move between points as the crow flies rather than as a taxi drives. ”Every packet goes through one or two routers instead of 12 or 15,” Dally says.
Timothy Pinkston, who heads a group at the University of Southern California that’s trying to develop more efficient architectures for parallel computers, says the flattened butterfly ”does indeed have significant benefits in terms of efficiency as compared to meshes in the multicore processor context.” For the same cost in network resources, Dally’s design provides higher throughput, lower latency, and reduced power consumption, he says.
Dally presented the flattened butterfly at the International Symposium on Computer Architecture (ISCA) in 2007. Late last month, at this year’s ISCA, Dally's team introduced an updated architecture called Dragonfly, a scalable version for very-high-density networks, including supercomputers with one million nodes. For example, the architecture can group 64 routers connected via a flattened butterfly into one virtual router and then connect that to other virtual routers using another flattened butterfly. The grouped routers are interconnected electrically, via wires, but each virtual router is connected to every other one via an optical link. That means for long routes, the system can take advantage of the speed of optical interconnects, whereas for short routes, it relies on lower-cost electrical connections. Such a setup, says Dally, would be ideal for connecting data centers. In fact, after completing the Dragonfly work one of the authors took a position at Google, whose data centers are thought to be enormous. A Google spokesman says the company won’t discuss specifics about its infrastructure.
Steve Scott, chief technology officer at Cray, is also a coauthor on the Dragonfly paper. He says Cray has worked with Dally’s group before and has integrated some of his research into their designs, but he can’t say yet whether Cray will adopt the flattened butterfly or the Dragonfly.
Dally continues to work on better algorithms for sensing and balancing the traffic across various routes but says that until someone figures out something better, the flattened butterfly is the ideal architecture for tying together computing networks, whether on a single chip, in a supercomputer, or across a computer cluster. ”I think this is going to see quite wide use,” he says.