New System Could Break Bottleneck in Microprocessors

Turning burdensome software into simple hardware speeds core-to-core communication in microprocessors

Intel and other multicore processor makers want cores to be able to communicate with each other faster. More cores (the Haswell-EX Xeon E7-8890 V3 shown here has 18) typically means much more time coordinating communications.
Photo: Intel

Engineers at North Carolina State University and at Intel have come up with a solution to one of the modern microprocessor’s most persistent problems: communication between the processor’s many cores. Their answer is a dedicated set of logic circuits they call the Queue Management Device, or QMD. In simulations, integrating the QMD with the processor’s on-chip network at a minimum doubled core-to-core communication speed, and in some cases boosted it much further. Even better, as the number of cores increased, the speedup became more pronounced.

In the last decade, microprocessor designers started putting multiple copies of processor cores on a single die as a way to sustain the performance gains computer makers had long enjoyed without forming chip-killing hot spots on the CPU. But that solution comes with complications. For one, it meant that software had to be written so that work was divided among the processor cores. The result: Sometimes different cores need to work on the same data, or have to coordinate passing data from one core to another.

“We have to improve performance by improving energy efficiency. The only way to do that is to move some software to hardware”

To prevent the cores from wantonly overwriting each other’s information, processing data out of order, or committing other errors, multicore processors use lock-protected software queues. These are data structures that coordinate the movement of and access to information according to software-defined rules. But all that extra software comes with significant overhead, which only gets worse as the number of cores increases. “Communications between cores is becoming a bottleneck,” says Yan Solihin, a professor of electrical and computer engineering who led the work at NC State.
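A lock-protected software queue of the kind described above can be sketched in a few lines of Python. This toy version, with threads standing in for cores, is only an illustration of the idea, not the researchers' implementation:

```python
import threading
from collections import deque

class LockedQueue:
    """Minimal lock-protected software queue. Every operation must
    acquire the lock, so threads (standing in for cores) serialize on
    it; that serialization is the overhead the article describes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = deque()

    def enqueue(self, item):
        with self._lock:  # contention on this lock grows with core count
            self._items.append(item)

    def dequeue(self):
        with self._lock:
            return self._items.popleft() if self._items else None
```

In a real multicore workload, producer cores call `enqueue` while consumer cores call `dequeue`; as the thread count rises, an increasing share of time is lost waiting on the lock rather than doing useful work.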

The solution—born of a discussion with Intel engineers and carried out by Solihin’s student, Yipeng Wang, at NC State and at Intel—was to turn the software queue into hardware. This effectively collapsed three multistep software queue operations into three simple instructions: add data to the queue, take data from the queue, and place data near where it will be needed next. Compared with the software-only approach, the QMD sped up a sample task such as packet processing—the kind of work network nodes do on the Internet—by a growing margin as more cores were involved. With 16 cores, the QMD worked 20 times as fast as the software could.
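Those three operations can be pictured as a narrow API. The sketch below is purely illustrative: the class, its method names, and the staging model are assumptions for exposition, not the QMD's actual instruction set, which is implemented in logic circuits rather than software.

```python
class ToyQMD:
    """Illustrative software model of the three operations the article
    says the QMD reduces to single instructions. All names here are
    hypothetical; the real device implements them in hardware."""

    def __init__(self):
        self._queue = []
        self._near = {}  # models data placed near a consuming core

    def enq(self, item):
        """Add data to the queue."""
        self._queue.append(item)

    def deq(self):
        """Take data from the queue."""
        return self._queue.pop(0) if self._queue else None

    def place_near(self, core_id, item):
        """Put data near where it's needed next (e.g. a core's cache)."""
        self._near.setdefault(core_id, []).append(item)

    def staged_for(self, core_id):
        """Return the data staged for a given core."""
        return self._near.get(core_id, [])
```

The point of the hardware version is that each of these becomes one instruction with no lock acquisition, so cores no longer serialize on shared queue state.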

Once they realized this result, the engineers reasoned that the QMD might be able to do a few other tricks—turning more software into hardware. They added more logic to the QMD and found that it could speed up several other core communications-dependent functions, including MapReduce, a technology Google pioneered for distributing work to different cores and collecting the results.
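The MapReduce pattern the QMD was extended to accelerate can be illustrated with a minimal word-count sketch in plain Python; this shows only the communication structure of the pattern, not anything about the hardware itself:

```python
from collections import Counter

def map_phase(chunks):
    """Each worker (a core, in the article's setting) counts words in
    its own chunk independently, with no communication needed."""
    return [Counter(chunk.split()) for chunk in chunks]

def reduce_phase(partials):
    """Partial results are merged; collecting them is the
    core-to-core communication step an accelerator would speed up."""
    total = Counter()
    for partial in partials:
        total += partial
    return total
```

The map phase parallelizes cleanly, but the reduce phase forces cores to exchange their partial results, which is why queue-management overhead shows up here.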

They aren’t done yet. “The next step is to figure out other types of hardware accelerators that would be useful,” says Solihin. “We have to improve performance by improving energy efficiency. The only way to do that is to move some software to hardware. The challenge is to figure out which software is used frequently enough that we could justify implementing it in hardware. There is a sweet spot,” he says.

Intel engineer Ren Wang is presenting the QMD speed-up results at the 25th Annual Conference on Parallel Architectures and Compilation Techniques, in Haifa, Israel, this week.
