# MRAM-like Device Could Make Logic Run Backwards

## Magnetic tunnel junctions like the ones that form MRAM could form invertible logic—multiplication circuits that factor or adders that subtract

Engineers at Purdue University and the University of California at Berkeley have discovered that a simple combination of commonly available devices creates a computing element that could lead to some truly strange circuits: logic that can perform its inverse operation. Some important aspects of modern computing—notably encryption—depend on there being a significant difference in difficulty between multiplication and its inverse, factoring. But the engineers are cautious about the technology’s potential for code breaking. They describe the device in this month’s issue of IEEE Electron Device Letters.

Purdue electrical engineering and computer science professor Supriyo Datta and his colleagues had been exploring new ways of computing using simulated networks of devices that have a “tunable” randomness. That is, the devices output a string of bits that are random, but can be tuned to produce more of one bit than the other. Such a device, which they called a p-bit,  could then represent a probability—40 percent say—as a string of bits of whatever length as long as for every 100 bits or so 40 of them were 1.

Many computing problems can be streamlined by computing with probabilities; Datta’s group showed their use in solving the travelling salesman problem and other important problems of optimization and inference. But Datta and post-doctoral researcher Kerem Yunus Camsari and some colleagues discovered something new about using p-bits: They could be combined in such a way that they form logic and arithmetic circuits that can perform their inverse operations—adders can subtract, multipliers can factor, and potentially stranger things can happen.

The inputs to such a logic gate would be random streams of bits, but they still obey the logic of that gate. So whenever a 2-input AND gate saw a 1 on both inputs, it would output a 1; everything else would generate a 0. But what p-bit-based gates can do that others can’t is run in reverse. Put a stream of random bits through what was the AND gate’s output and its input will produce two random streams that satisfy the AND gate’s logic:  Whenever you put a 1 on the output, both inputs would generate a 1. But if you put a 0 on the output the inputs generate a random sequence of 00, 01, and 10.

The problem was this: there was no such thing as a p-bit. “Six-months ago I’d have said we’re not close to building even one, because we’d have to build a whole new device,” says Datta.

Sayeef Salahuddin at UC Berkeley suggested to Datta and Camsari that a magnetic RAM cell combined with a transistor might be the solution. Datta and Camsari have now shown it to be true. Just those combination of things recently became available on processors and systems-on-a-chip from GlobalFoundries and TSMC.

An MRAM cell is basically a two-terminal nanoscale device called a magnetic tunnel junction. It’s made up of two ferromagnetic layers sandwiching a non-ferromagnetic layer. If the two magnetic layers have magnetic fields oriented in the same direction, current can tunnel across them with little resistance. If they’re pointing in the opposite directions from each other, the resistance spikes. MRAM stores a bit by making one of the layers changeable by a particular type of current. They’ve now been engineered so well that the flippable magnetic layer will stay stable for a decade or more.

To make a p-bit, however, you need to basically make a bad MRAM cell. Instead of needing a high current to flip the magnetic layer, you need to make it so unstable that random thermal fluctuations will knock its magnetic field back and forth at gigahertz rates. The transistor component of the p-bit is there to pull the output towards either the one or zero states as needed—tuning the randomness to encode a probability.

Datta’s group hasn’t built p-bits yet, but they’ve simulated them and are seeking access to a foundry that can do the job.

The Conversation (0)

## 3 Ways 3D Chip Tech Is Upending Computing

### AMD, Graphcore, and Intel show why the industry’s leading edge is going vertical

Vertical
Intel; Graphcore; AMD
DarkBlue1

A crop of high-performance processors is showing that the new direction for continuing Moore’s Law is all about up. Each generation of processor needs to perform better than the last, and, at its most basic, that means integrating more logic onto the silicon. But there are two problems: One is that our ability to shrink transistors and the logic and memory blocks they make up is slowing down. The other is that chips have reached their size limits. Photolithography tools can pattern only an area of about 850 square millimeters, which is about the size of a top-of-the-line Nvidia GPU.

For a few years now, developers of systems-on-chips have begun to break up their ever-larger designs into smaller chiplets and link them together inside the same package to effectively increase the silicon area, among other advantages. In CPUs, these links have mostly been so-called 2.5D, where the chiplets are set beside each other and connected using short, dense interconnects. Momentum for this type of integration will likely only grow now that most of the major manufacturers have agreed on a 2.5D chiplet-to-chiplet communications standard.

But to sling truly huge volumes of data around as if it were all on the same chip, you need even shorter and denser connections, and that can be done only by stacking one chip atop another. Connecting two chips face-to-face can mean making thousands of connections per square millimeter.

It’s taken a lot of innovation to get it to work. Engineers had to figure out how to keep heat from one chip in the stack from killing the other, decide what functions should go where and how they should be manufactured, keep the occasional bad chiplet from leading to a lot of costly dud systems, and deal with the resulting added complexities of figuring all that out at once.

Here are three examples, ranging from the reasonably straightforward to the confoundingly complicated, that show where 3D stacking is now:

### AMD’s Zen 3

AMD's 3D V-Cache tech attaches a 64-megabyte SRAM cache [red] and two blank structural chiplets to the Zen 3 compute chiplet.

AMD

PCs have long come with the option to add more memory, giving extra-large applications and data-heavy work greater speed. Thanks to 3D chip stacking, AMD’s next-generation CPU chiplet, comes with that option, too. It’s not an aftermarket add on, of course, but if you’re looking to build a computer with some extra oomph, ordering up a processor with an extra-large cache memory could be the way to go.

Even though Zen 2 and the new Zen 3 processor cores are both made using the same Taiwan Semiconductor Manufacturing Corp. manufacturing process—and therefore have the same size transistors, interconnects, and everything else—AMD made so many architectural alterations that even without the extra cache memory, Zen 3 provides a 19 percent performance improvement on average. One of those architectural gems was the inclusion of a set of through-silicon vias (TSVs), vertical interconnects that burrow straight down through most of the silicon. The TSVs are built within the Zen 3’s highest-level cache, blocks of SRAM called L3, which sits in the middle of the compute chiplet and is shared across all eight of its cores.

In processors destined for data-heavy workloads, the Zen 3 wafer’s backside is thinned down until the TSVs are exposed. Then a 64-megabyte SRAM chiplet is bonded to those exposed TSVs using what’s called hybrid bonding—a process that’s like cold-welding the copper together. The result is a dense set of connections that can be as close together as 9 micrometers. Finally, for structural stability and heat conduction, blank silicon chiplets are attached to cover the remainder of the Zen 3 CPU die.

Adding the extra memory by setting it beside the CPU die was not an option, because data would take too long to get to the processor cores. “Despite tripling the L3 [cache] size, 3D V-Cache only added four [clock] cycles of latency—something that could only be achieved through 3D stacking,” John Wuu, AMD senior fellow design engineer, told attendees of the IEEE International Solid State Circuits Conference.

The bigger cache made its mark in high-end games. Using the desktop Ryzen CPU with 3D V-Cache sped games rendered at 1080p by an average of 15 percent. It was good for more serious work as well, shortening the run time for a difficult semiconductor design computations by 66 percent.

The industry’s ability to shrink SRAM is slowing compared to how well it can shrink logic, Wuu pointed out. So you can probably expect future SRAM expansion packs to continue to be made using more established manufacturing processes while the compute chiplets are driven down to Moore’s Law’s bleeding edge.

### Graphcore’s Bow AI Processor

The Graphcore Bow AI accelerator uses 3D chip stacking to boost performance by 40 percent.

Graphcore

3D integration can speed computing even if one chip in the stack doesn’t have a single transistor on it. United Kingdom–based AI computer company Graphcore managed a huge increase to its systems’ performance just by attaching a power-delivery chip to its AI processor. The addition of the power-delivery silicon means the combined chip, called Bow, can run faster—1.85 gigahertz versus 1.35 GHz—and at lower voltage than its predecessor. That translates to computers that train neural nets up to 40 percent faster with as much as 16 percent less energy compared to its previous generation. Importantly, users get this improvement with no change to their software at all.

The power-management die is packed with a combination of capacitors and through-silicon vias. The latter are just to deliver power and data to the processor chip. It’s the capacitors that really make the difference. Like the bit-storing components in DRAM, these capacitors are formed in deep, narrow trenches in the silicon. Because these reservoirs of charge are so close to the processor’s transistors, power delivery is smoothed out, allowing the processor cores to run faster at lower voltage. Without the power-delivery chip, the processor would have to increase its operating voltage above its nominal level to work at 1.85 GHz, consuming a lot more power. With the power chip, it can reach that clock rate and consume less power, too.

The manufacturing process used to make Bow is unique but not likely to stay that way. Most 3D stacking is done by bonding one chiplet to the other while one of them is still on the wafer, called chip-on-wafer [see “AMD’s Zen 3” above]. Instead Bow used TSMC’s wafer-on-wafer, where an entire wafer of one type is bonded to an entire wafer of the other, then diced up into chips. It’s the first chip on the market to use the technology, according to Graphcore, and it led to a higher density of connections between the two dies than could be achieved using a chip-on-wafer process, according to Simon Knowles, Graphcore chief technical officer and cofounder.

Although the power-delivery chiplet has no transistors, those might be coming. Using the technology only for power delivery “is just the first step for us,” says Knowles. “It will go much further than that in the near future.”

[More here]

### Intel’s Ponte Vecchio Supercomputer Chip

Intel’s Ponte Vecchio processor integrates 47 chiplets into a single processor.

Intel

The Aurora supercomputer is designed to become one of the first U.S.-based high-performance computers (HPCs) to pierce the exaflop barrier—a billion billion high-precision floating-point calculations per second. To get Aurora to those heights, Intel’s Ponte Vecchio packs more than 100 billion transistors across 47 pieces of silicon into a single processor. Using both 2.5D and 3D technologies, Intel squeezed 3,100 square millimeters of silicon—nearly equal to four Nvidia A100 GPUs—into a 2,330-square-millimeter footprint.

Intel Fellow Wilfred Gomes told engineers virtually attending the IEEE International Solid State Circuits Conference that the processor pushed Intel’s 2D and 3D chiplet-integration technologies to the limits.

Each Ponte Vecchio is really two mirror-image sets of chiplets tied together using Intel’s 2.5D integration technology Co-EMIB. Co-EMIB forms a bridge of high-density interconnects between two 3D stacks of chiplets. The bridge itself is a small piece of silicon embedded in a package’s organic substrate. The interconnect lines on silicon can be made twice as densely as those on the organic substrate. Co-EMIB dies also connect high-bandwidth memory and an I/O chiplet to the “base tile,” the largest chiplet upon which the rest are stacked.

The base tile uses Intel’s 3D stacking technology, called Foveros, to stack compute and cache chiplets atop it. The technology makes a dense array of die-to-die vertical connections between two chips. These connections can be 36 micrometers apart from short copper pillars and a microbump of solder. Signals and power get into this stack by means of through-silicon vias, fairly wide vertical interconnects that cut right through the bulk of the silicon.

Eight compute tiles, four cache tiles, and eight blank “thermal” tiles meant to remove heat from the processor are all attached to the base tile. The base itself provides cache memory and a network that allows any compute tile to access any memory.

Needless to say, none of this was easy. It took innovations in yield management, clock circuits, thermal regulation, and power delivery, Gomes said. For example, Intel engineers chose to supply the processor with a higher-than-normal voltage (1.8 volts) so that current would be low enough to simplify the package. Circuits in the base tile reduce the voltage to something closer to 0.7 V for use on the compute tiles, and each compute tile had to have its own power domain in the base tile. Key to this ability were new high-efficiency inductors called coaxial magnetic integrated inductors. Because these are built into the package substrate, the circuit actually snakes back and forth between the base tile and the package before supplying the voltage to the compute tile.

It’s taken 14 years to go from the first petaflop supercomputer in 2008 to exaflops this year, Gomes said. Advanced packaging, such as 3D stacking, is among the technologies that could help shorten the next thousandfold computing improvement to just six years, Gomes told engineers.

[More here]

### 3D Technologies

Hybrid bonding starts by forming recessed copper ports at the top “face” of the chip [top]. The surrounding oxide dielectric is chemically activated, so when the two chips are pressed together at room temperature they instantly bond [middle]. The bound chips are annealed, expanding the copper to form a conductive connection [bottom].

Hybrid bonding binds copper pads at the top of a chip’s interconnect stack directly to the copper pads on a different chip. In hybrid bonding, the pads are in small recesses, surrounded by oxide insulator. The insulator is chemically activated and instantly bonds when pressed to its opposite at room temperature. Then, in an annealing step, the copper pads expand and bridge the gap to form a low-impedance link.

### Density per millimeter

 3D microbumps 3D hybrid bonding 400–1,600 >10,000

Hybrid bonding offers a high density of connections—in the range of 10,000 bonds per square millimeter, many more than in microbump technology, which offers about 400–1,600/mm2 [chart, above]. The pitch—the distance from the edge of one interconnect to the far edge of the next—achievable today is about 9 micrometers, but tighter geometries are in the works. The technology will likely hit its limits around a pitch of 3 µm or so, says Lihong Cao, director of engineering and technical marketing at packaging-technology company ASE Group. The most critical step to improve hybrid bonding is keeping wafers from warping and reducing the surface roughness of each side to nanometer-scale perfection, she says.

Through-silicon vias [pillars, bottom half] and hybrid bonding [rectangles, center] connect an AMD compute chiplet [bottom] to an SRAM chiplet [top].

Intel

Microbumpsare essentially an extremelyscaled-down version of a standard packaging technology called flip-chip.In flip-chip, bumps of solder are added to the end points of theinterconnects at the top (face) of a chip. The chip is then flipped onto a package substrate with a matching set of interconnects, and the solder is melted to form a bond. To stack two chips with this technology, first onechip must have short copper pillars built so that they protrude from the surface. These are then capped with a “microbump” of solder, and the two chips are joined face-to-face by the meltingthe solder.

The minimum distance from the start of one connection to the far edge of the next, the pitch, can be less than 50 micrometers when using microbumps. Intel used a 36-µm-pitch version of its Foveros 3D integration technology in the Ponte Vecchio supercomputer chip. Samsung says its microbump technology, called 3D X-Cube, is available with a 30-µm pitch. The technology cannot match hybrid bonding’s density [above]. However, its requirements for alignment and planarization are not as strict as hybrid bonding’s, making it easier to stack multiple chips that are made using different manufacturing technologies onto a single base chip.

Intel’s Ponte Vecchio uses microbumps [short pillars, top left] to connect a compute chiplet to a base die. Through silicon vias [thin vertical lines, center] send signals from both the base die and the compute tile to the package. The base die connects horizontally to a second base die using a silicon bridge [horizontal bar, right].

Intel

Through silicon-vias (TSVs) are interconnects that descend vertically down through a chip’s silicon. They don’t extend through a wafer’s entire depth, so the backside of the silicon has to be ground away until the TSV is exposed. They are often necessary in 3D stacked chips because the chips are bonded together so that their interconnects are face-to-face. In that case, the TSVs provide the stack with access to power and data. They have been in wide use for years in high-bandwidth dynamic RAM memory, which stacks multiple memory chips vertically. But with 3D chip stacking, the technology has come to logic chips, too.

Keep Reading ↓Show less
{"imageShortcodeIds":[]}