A sneak peek inside the handheld of the future
Like the idea of a handheld device that can be any of 10 different gizmos, depending on your mood? You could soon have it if the ideas described in this two-part report become reality. In this article, Diederik Verkest of the Interuniversity MicroElectronics Center in Leuven, Belgium, describes the chameleon-like handheld his lab has been working on.
By year-end 2005, U.S. consumers will have trashed some 130 million cellphones and another mountain of old PDAs, MP3 players, and game consoles. We could, of course, build bigger landfills to accommodate the billions of obsolete gizmos we throw away each year. But here's a much better idea: building a wireless multimedia device whose hardware and software can be easily altered or upgraded so it never becomes obsolete. When a new communications standard or multimedia format comes along, the device could be made to conform to it simply by downloading circuit and software modifications.
Let's say your Wi-Fi-enabled device supports IEEE 802.11b but not IEEE 802.11g? No problem. Just download the new standard and have your device reconfigure itself. Fixing bugs and installing security patches to guard against the virus du jour would be just as easy.
That's the vision we're working toward at the Interuniversity MicroElectronics Center (IMEC) in Leuven, Belgium, as we develop a multimedia platform that users can not only upgrade, but reconfigure into a PDA, an MP3 player, a game console, or a cellphone. With one handheld device, then, you could surf the Web, transfer documents, listen to music, watch a movie, play a game, or make a phone call, switching modes as the mood strikes.
We're in the final year of project Gecko, a three-year research effort to develop a prototype of such an all-purpose device. We've been working with four commercial partners, including Xilinx Inc. (San Jose, Calif.), a maker of reconfigurable field-programmable gate arrays (FPGAs), and IC-maker Infineon Technologies AG (Munich).
Gecko, named after a colorful genus of lizard, consists of three basic elements. It has an FPGA to provide high performance for certain applications, fixed application-specific integrated circuits (ASICs) to handle functions like wireless communication, and a microprocessor to run software applications and the real-time operating system that manages the whole show.
Ultimately, we want to shrink this prototype into a commercial system that will fit on a single, custom chip suitable for use in consumer devices, a goal we think we can achieve in five to seven years.
Right now, the computational power you need to play games, watch movies, listen to music, or make phone calls can be supplied only by ASICs. Unfortunately, while you can reach the speed you need with an ASIC, you cannot use a single ASIC to perform all these different functions. General-purpose microprocessors can provide a degree of flexibility by running some functions as software, but that flexibility comes at the cost of lower speed and higher power dissipation than with ASICs.
Enter the field-programmable gate array. FPGAs fill the gap between custom, high-speed, low-power ASICs and flexible, lower-speed, higher-power microprocessors. At IMEC, we're using reconfigurable FPGAs to handle the high speeds that a fast and flexible multimedia platform requires. Field programmability means the logic function of the device can be modified after the device is manufactured, enabling a whole new species of gadgets that combine the flexibility of software with the speed and power advantages of hardware.
In effect, an FPGA is an IC consisting of an array of programmable logic cells [see "Go Reconfigure"]. Each cell can be configured to perform any logic function of, typically, four inputs and one output. The logic cells are connected using configurable, electrically erasable, static random-access memory (SRAM) cells that change the FPGA's interconnection structure.
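To make the logic-cell idea concrete, here is a minimal sketch in Python. It assumes a cell can be modeled as a 16-entry truth table, one output bit for each combination of the four inputs. The class name `LogicCell` and its methods are hypothetical and are not taken from any FPGA vendor's tools.

```python
class LogicCell:
    """Hypothetical model of a 4-input, 1-output programmable logic cell."""

    def __init__(self, truth_table=0):
        # 16 configuration bits: bit i holds the output for input pattern i.
        self.truth_table = truth_table & 0xFFFF

    def configure(self, func):
        """Load a new configuration by tabulating any 4-input Boolean function."""
        table = 0
        for pattern in range(16):
            a, b, c, d = [(pattern >> k) & 1 for k in range(4)]
            if func(a, b, c, d):
                table |= 1 << pattern
        self.truth_table = table

    def evaluate(self, a, b, c, d):
        """Look up the output bit for the given inputs (each 0 or 1)."""
        pattern = a | (b << 1) | (c << 2) | (d << 3)
        return (self.truth_table >> pattern) & 1


cell = LogicCell()
cell.configure(lambda a, b, c, d: a and b and c and d)  # 4-input AND
and_table = cell.truth_table
and_result = cell.evaluate(1, 1, 1, 1)

cell.configure(lambda a, b, c, d: a ^ b ^ c ^ d)        # same cell, now parity
parity_table = cell.truth_table
parity_result = cell.evaluate(1, 1, 1, 0)
```

The same cell becomes an AND gate or a parity checker purely by changing its 16 configuration bits, which is the sense in which downloading a new configuration changes the function of the hardware.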
Just as a software program determines the functions executed by a microprocessor, the configuration of an FPGA determines its functionality. Novel functions can be programmed into it by downloading a new configuration, similar to the way a microprocessor can be reprogrammed by downloading new software code. However, in contrast to a microprocessor's functions, the FPGA's run in hardware; there is none of software's relatively slow instruction fetch-decode-execute cycle. The result is higher speed and lower power dissipation for FPGAs than for microprocessors.
While FPGAs have been around for more than 20 years, only recently have they come with the equivalent of up to 10 million gates. This makes it possible to run several applications, such as video processing and display, games, and digital cellphone functions all at once. Our research has focused on reconfiguring part of the FPGA on the fly—for example, as it is executing a program that runs a modem function to handle an incoming call—while other parts of it perform other functions, such as decoding texture for a surface in a three-dimensional game.
Reconfigurable hardware by itself, however, cannot provide the best solution for all situations. A flexible, future-proof platform must contain multiple microprocessors, ASICs, and memories cheek by jowl with an FPGA. Software tasks run either on the microprocessors or directly in FPGA hardware, with a specially tuned real-time operating system managing the resources.
Our Gecko prototype relies on a Compaq iPAQ pocket PC that displays video, games, and other applications and accepts user input. The iPAQ is connected to a generic prototyping board that hosts two Xilinx Virtex-II FPGAs, both clocked at 30 MHz [see figure, "Prototype of the Gecko," above].
The FPGA nearest the iPAQ is dynamically reconfigurable to host various applications, such as game and video decoders, that we want to accelerate in hardware. It boasts the equivalent of six million logic gates. The other FPGA has three million logic gates and is used to implement auxiliary functions, such as the clock control and the protocol for reconfiguring the first FPGA. We never reconfigure this second FPGA; it is left alone to run these auxiliary functions.
The board connects to the expansion bus on the iPAQ, which runs on a 200-MHz Intel StrongARM SA-1110 processor, with 64MB of RAM and 32MB of flash memory. The StrongARM hosts the operating system that runs the communications stack and orchestrates the maneuvering of applications from flash memory into the reconfigurable FPGA and back again.
One instance where FPGA hardware acceleration comes into play is when the user wants to view a video at maximum frame rate (25 frames per second) and maximum resolution (320 by 240 pixels for the iPAQ screen). At the user's command, the FPGA takes over the video decoding calculations that were running on the StrongARM processor, speeding the video decoding from six frames per second to 25 frames per second, with lower power dissipation.
Tasks can be moved from FPGA hardware to processor software and vice versa, depending on requirements. Let's say you're in your living room, and your infant is asleep upstairs in her crib. You're watching streaming video from a Web cam in the baby's room, and the FPGA is running a video decoding task. Now you want to play a 3-D game while keeping an eye on the baby. But for high-quality play, the game requires hardware acceleration of texture decoding.
Because you want to keep an eye on the video, you do not want to stop the computation-intensive video decoding. However, you can degrade it in quality to a lower frame rate and scale it down to a smaller screen-within-a-screen image that is placed in a corner of the iPAQ display [see figure (PDF)]. The resulting reduction in computational load for the video decoding allows it to run as a software task on the StrongARM processor. The FPGA is now free to run the 3-D game texture decoding in hardware.
This scenario requires the ability to run multiple computational tasks (video decoding, texture decoding, and so on) on the platform, both in hardware and in software. These tasks must also move seamlessly between the FPGA and the StrongARM processor, depending on the required quality of service for the different applications.
Until recently, FPGAs were used mainly for glue logic to connect chips on a printed-circuit board and for rapid prototyping of IC designs. Because of the high cost and small size of the circuits that could be programmed on FPGAs, makers of mass-market products didn't see them as economically feasible. But new devices from the likes of Xilinx and Altera offer the equivalent of millions of system gates, which translates to a relatively small cost per gate, making FPGAs much more attractive to manufacturers looking to add flexibility to their products at a reasonable cost. Just as FPGA makers now embed microprocessor cores in their devices, within the next five years we will be able to do the reverse: embed FPGAs in system-on-chips for use in consumer electronic devices.
In the United States alone, 130 million cellphones weighing about 65 000 tons will be retired in 2005, according to the independent research organization INFORM Inc., in New York City.
Most research into dynamically reconfigurable systems concentrates on automatically recasting the hardware very frequently—say, at audio-sample or video-frame rates, to speed up signal-processing applications. Gecko approaches reconfiguration from a different angle. With this device, tasks are moved, and reconfiguration invoked, when the user decides that she wants another application to start.
Because user interaction typically occurs infrequently, at intervals of tens of seconds, minutes, or longer, the leisurely pace of reconfiguration on the Gecko's FPGA (1 to 10 milliseconds) is less of an issue than it would be for applications that require reconfiguration at frame rates, roughly every 40 milliseconds. If the user reconfigures Gecko every five minutes, then the reconfiguration overhead of 10 ms is negligible.
In the case where a 3-D game is played, the user interaction will be more frequent, perhaps five times per second. However, the user inputs that require a task to move from running on the FPGA to running in software on the StrongARM processor will be a small fraction of the total number of user inputs. So, as in the video decoding scenario, the interval between game player inputs that require reconfiguration is large compared with the time it takes to reconfigure the device.
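The arithmetic behind that comparison can be worked through in a few lines of Python. This is only a sketch using the figures quoted above (a worst-case reconfiguration time of 10 ms, a five-minute user interval, and a 40 ms frame period); the function name is illustrative.

```python
def reconfiguration_overhead(reconfig_ms, interval_ms):
    """Fraction of each interval that is spent reconfiguring."""
    return reconfig_ms / interval_ms


# User-driven reconfiguration: 10 ms once every five minutes.
user_driven = reconfiguration_overhead(10, 5 * 60 * 1000)

# Frame-rate reconfiguration: 10 ms once every 40 ms frame.
frame_rate = reconfiguration_overhead(10, 40)

print(f"user-driven: {user_driven:.4%} of the time")
print(f"frame-rate:  {frame_rate:.0%} of the time")
```

At five-minute intervals the device spends about 0.003 percent of its time reconfiguring, whereas reconfiguring on every frame would consume a quarter of each frame period.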
Even though we believe Gecko's users can tolerate a negligible amount of reconfiguration time, we want to make multitasking on the device as fast and smooth as possible. For that reason we devised a way to reconfigure only a part of the FPGA to create a new task while letting the rest of the FPGA run other hardware tasks.
Reconfiguring one tile at a time
Any application is composed of a number of communicating tasks. Some of these tasks are not computationally intensive and can be executed in software on a general-purpose microprocessor, such as the StrongARM SA-1110 processor in our prototype. The more computationally intensive tasks, such as the motion estimation of what's shown in each frame in an MPEG-2 encoder, can be accelerated in hardware by configuring the FPGA to handle them. Considering that we may need to accelerate several tasks at once, the Gecko divides the reconfigurable hardware into three independent zones of equal size, called tiles.
As long as the number of tasks needing acceleration does not exceed the number of available tiles, the tasks can be distributed over the different tiles. But when the number of tasks exceeds the number of tiles, some tiles will have to be shared by a number of tasks or some less critical tasks will have to be switched over to a microprocessor, such as the StrongARM in our prototype.
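One simple way to picture that placement decision is sketched below in Python. It covers only the fallback-to-software option, not tile sharing, and it assumes each task carries a numeric criticality ranking; that ranking, the task names, and the function `place_tasks` are all hypothetical and do not describe the actual scheduling policy.

```python
NUM_TILES = 3  # the prototype divides the reconfigurable FPGA into three tiles


def place_tasks(tasks, num_tiles=NUM_TILES):
    """Give tiles to the most critical tasks; the rest fall back to software.

    `tasks` is a list of (name, criticality) pairs. Criticality is an assumed
    ranking in which a higher number means a greater need for acceleration.
    """
    ranked = sorted(tasks, key=lambda task: task[1], reverse=True)
    placement = {}
    for index, (name, _criticality) in enumerate(ranked):
        if index < num_tiles:
            placement[name] = f"tile{index}"
        else:
            placement[name] = "cpu"  # runs as software on the microprocessor
    return placement


placement = place_tasks([
    ("video_decoder", 2),
    ("texture_decoder", 3),
    ("modem", 4),
    ("audio_decoder", 1),
])
```

With four tasks and three tiles, the three highest-ranked tasks each get a tile and the least critical one, here the audio decoder, runs in software.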
The time it takes to change a tile's function ranges from 1 to 10 ms, depending on the size of the tile. While this means that hardware tasks cannot be created at sample or frame rates, it's fast enough for a user commanding the Gecko to switch from a video to a game. As a comparison, creating a task in software requires about 100 microseconds.
Partial reconfiguration, though, is not enough: tasks also must be able to communicate with each other, whether on the FPGA or on the processor. Say you want to play a game while keeping an eye on a smaller screen-within-a-screen on the iPAQ displaying streaming video coming from the Web cam in your baby's bedroom. When you launch the game, a newly created game decoder task spawned on the FPGA must communicate with a display task running on the iPAQ's StrongARM to get the game shown on the iPAQ screen.
The video decoder, also running on the FPGA, uses a channel to communicate with the display task to show the baby cam video on the screen. When the game decoder starts running, however, it must take over that channel. Normally these communications channels are created as applications begin running, which would force us to reroute all connections between the configurable cells of the complete FPGA every time the user fired up a new hardware-accelerated application. But that's just not feasible for our purposes: recalculating the routes for all these connections would take hours of central processing unit time on a powerful computer-aided design workstation.
So, instead of rewiring the whole FPGA, we decided to route packets over a fixed interconnect network, letting the tiles on the chip communicate much as computers do over the Internet. Tasks run in the tiles and communicate with other tasks on other tiles, on the microprocessor, or on an ASIC, by sending messages via the network. Since the interconnect network and the tile interfaces are fixed, tasks can be dynamically created and deleted without affecting those running in other tiles.
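The following Python sketch illustrates the idea of message passing over fixed endpoints. It is a toy model, not the real on-chip network: the `Interconnect` class, the endpoint names, and the payload strings are assumptions made for illustration, and routing is reduced to dropping a message into the destination's inbox.

```python
from collections import deque


class Interconnect:
    """Hypothetical fixed network: the endpoints never change, only their occupants."""

    def __init__(self, endpoints):
        # One inbox per fixed endpoint (a tile, the processor, or an ASIC).
        self.inboxes = {name: deque() for name in endpoints}

    def send(self, source, destination, payload):
        """Deliver a message to the destination endpoint's inbox."""
        if destination not in self.inboxes:
            raise KeyError(f"no such endpoint: {destination}")
        self.inboxes[destination].append((source, payload))

    def receive(self, endpoint):
        """Return the oldest pending (source, payload) pair, or None if empty."""
        inbox = self.inboxes[endpoint]
        return inbox.popleft() if inbox else None


net = Interconnect(["tile0", "tile1", "tile2", "cpu"])
net.send("tile0", "cpu", "video frame")  # video decoder to the display task
net.send("tile1", "cpu", "game frame")   # game decoder, created later
first = net.receive("cpu")
second = net.receive("cpu")
```

Because a new task only needs to know the name of the endpoint it is talking to, placing a task in one tile requires no change to the wiring or to the tasks already running in other tiles.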
Keeping tabs on tasks
The key to a great user experience with a device like Gecko lies in making the transitions from one function to another as smooth as possible. This responsibility falls to the real-time operating system, which manages all these complex transitions.
We based Gecko's Operating System for Reconfigurable Systems (OS4RS) on a real-time version of Linux. The OS4RS manages the dynamic creation of hardware tasks and handles communications among them. It also determines when and on which resource to schedule newly created tasks. When switching between tiles on the FPGA or between the FPGA and the StrongARM microprocessor, OS4RS must suspend certain tasks that are running so that other tasks can take a turn. To do so, it must remember the state each task was in when it stopped so that each task can restart from the same state.
To find a way to seamlessly and automatically switch a task running in software on the microprocessor to the FPGA tiles, we looked at traditional microprocessors and operating systems. These solved the software half of the problem a long time ago.
When multiple tasks need to run on a microprocessor, an OS grants each task a time slot on the processor. A running task, X, sometimes needs to be suspended temporarily so that another task, Y, can be run for a time, after which task X is resumed. Handling this suspension is a function called a context switch, which requires the operating system to save the context of the task in the processor's memory at a predefined location.
The context of a task denotes its state: all the information required to resume the task at the point where it was interrupted. For a task running in software on a microprocessor, context includes what's in the processor's registers, the data in the memory on which the task is operating, and information regarding the current state of execution of the task, such as the program counter.
While such a software context switch will work on Gecko's processor, the device's reconfigurable hardware requires special handling. In particular, not only the software but also the hardware states of the same task must be represented consistently.
That consistency is supplied by providing the system designer with a selection of objects that represent tasks and contain timing information. When the code that defines the objects is generated for both hardware and software, the tasks will behave uniformly, regardless of where they are running. By guaranteeing uniformity, we can write chunks of code, called switching points, that will work when the task runs in the FPGA or on the StrongARM. When a running task hits a switching point, it will stop and pass its state information to the OS4RS, which will store it in a defined format in the processor's memory.
Besides switching points, we need one more thing for dynamic reconfiguration. To move a task from the FPGA to the StrongARM microprocessor and back, the operating system needs to know where each task is at any given time. So it assigns every task a logical address. Whenever the operating system schedules a task on the FPGA, an address translation table is updated. This table lets the operating system translate a logical address located in the registers of the StrongARM microprocessor into a physical address based on the location of the task in the FPGA's interconnect network. With switching points embedded in each task and the operating system aware of each task's location, we're ready to reconfigure.
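A logical-to-physical lookup of this kind can be sketched in Python as a small table that is rewritten whenever a task is placed or moved. The class and method names, the logical address 7, and the location strings are hypothetical; the real table's format is not described here.

```python
class AddressTranslationTable:
    """Hypothetical map from a task's logical address to where it currently runs."""

    def __init__(self):
        self._table = {}

    def schedule(self, logical_address, physical_location):
        # Called whenever the operating system places or moves a task.
        self._table[logical_address] = physical_location

    def translate(self, logical_address):
        """Return the current physical location for a logical address."""
        return self._table[logical_address]


table = AddressTranslationTable()
table.schedule(7, "tile1")           # video decoder starts on an FPGA tile
before_move = table.translate(7)
table.schedule(7, "cpu")             # the same task is moved to software
after_move = table.translate(7)
```

Other tasks keep addressing the video decoder as logical address 7 throughout; only the table entry changes when the task moves.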
When the user decides to play a game while keeping an eye on the baby, the operating system will signal the video decoder task that it should relocate from the FPGA to the microprocessor, to free the FPGA for the game decoder task. As the video decoder task reaches a switching point, it is interrupted and transfers all of its state information to the operating system, which saves the task's context in the memory of the microprocessor. The operating system then resumes the relocated video decoder task on the microprocessor, where it will start up again right where it left off, but in software running on the microprocessor instead of as a circuit running on the FPGA. Now the FPGA is free to start the game decoder task.
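The relocation sequence can be mimicked in software with the Python sketch below. It is a simplification under stated assumptions: the task's entire state is taken to be the index of the next frame, every frame boundary is treated as a switching point, and the class `VideoDecoderTask`, the `switch_requested` callback, and the frame labels are all invented for illustration.

```python
class VideoDecoderTask:
    """Hypothetical task whose whole state is the index of the next frame."""

    def __init__(self, frames):
        self.frames = frames

    def run(self, state, location, switch_requested):
        """Run from `state` until finished or until a switch is requested.

        Returns (decoded, saved_state); saved_state is None when finished.
        """
        decoded = []
        next_frame = state["next_frame"]
        while next_frame < len(self.frames):
            decoded.append((location, self.frames[next_frame]))
            next_frame += 1
            # Switching point: a frame boundary, where the state is small and
            # means the same thing in hardware and in software.
            if switch_requested(next_frame):
                return decoded, {"next_frame": next_frame}
        return decoded, None


task = VideoDecoderTask(["f0", "f1", "f2", "f3"])

# The task starts on the FPGA; the OS asks it to move after two frames.
on_fpga, saved = task.run({"next_frame": 0}, "fpga", lambda n: n == 2)

# The OS hands the saved state to the software version, which carries on.
on_cpu, finished = task.run(saved, "cpu", lambda n: False)
```

The second call picks up at frame `f2`, exactly where the first call stopped, which is the property the switching points are meant to guarantee.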
In addition to allowing a user to run several programs on the same device simultaneously, the Gecko concept also gives us that glimpse of a future where a device's longevity can be extended almost indefinitely.
Every PC user is familiar with the scenario where application software autonomously checks for the availability of upgrades or patches on the Internet. Installing such an upgrade is as easy as clicking "yes" on a dialog box. Connecting the reconfigurable hardware to the Internet extends this software upgrade scenario to hardware.
When a new video compression standard emerges, the compute-intensive tasks that need to be implemented in hardware to obtain acceptable performance can be downloaded over the Internet, together with the less compute-intensive tasks of the video compression standard. The Gecko's operating system will take care of running the compute-intensive tasks in one of the tiles of the reconfigurable architecture while keeping the less compute-intensive tasks on the processor.
IMEC isn't the only research group developing this kind of reconfigurable system. Similar research is under way in labs at the University of California at Berkeley, the Imperial College of Science, Technology and Medicine in London, and the Massachusetts Institute of Technology, where researchers are working on a comparable architecture as part of the RAW (Raw Architecture Workstation) processor project. RAW has 16 identical tiles containing programmable microprocessors, floating-point arithmetic units, and memories that communicate over an interconnect network that supports both compile-time and run-time routing.
Clearly, Gecko is far from the final word on reconfigurability. Rather, it is the first step toward a future flexible system-on-chip platform. Such a platform will integrate the discrete components used in the Gecko platform into a single piece of very flexible silicon. This next-generation Gecko will be built in a 45-nm technology, starting in the 2008-to-2010 time frame (we're at 90 nm today). It will consist of a regular array of FPGA tiles, each approximately 2 mm by 2 mm, connected by a packet-switched network. Some tiles will contain instruction-set processors, others FPGA hardware, and still others dedicated custom hardware.
Whatever this chameleon of devices will be called once it makes it to market is anyone's guess, but its heart, the hardware task concept, and the operating system technology, will be pure Gecko.