Sometime next year, if all goes as planned, the largest scientific instrument ever built will come to life in a labyrinthine underground complex in Switzerland, near Geneva. Buried more than 100 meters down, the Large Hadron Collider (LHC) will send two beams of protons in opposite directions around a 27-kilometer-long circular tunnel. The beams, whizzing at nearly the speed of light, will collide head-on, producing a shower of subatomic fragments that scientists expect will include exotic, never-before-seen particles that could change our fundamental knowledge of the universe.
That’s the hope, anyway. Researchers at the European Organization for Nuclear Research (CERN), which will operate the LHC, know that spotting the elusive bits of matter they are looking for will be a daunting task. To find them, the researchers will have to sift through a colossal haystack of collision data: the LHC is expected to spew out some 15 million gigabytes a year—on average, that’s more than enough to fill six standard DVDs every minute.
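The DVD comparison is easy to verify with back-of-envelope arithmetic. The sketch below assumes the figures quoted above (15 million GB per year) plus the standard 4.7-GB capacity of a single-layer DVD:

```python
# Sanity check of the LHC data rate quoted in the text.
# Assumptions: 15 million GB/year of output; a standard
# single-layer DVD holds 4.7 GB.
ANNUAL_OUTPUT_GB = 15_000_000
DVD_CAPACITY_GB = 4.7
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

gb_per_minute = ANNUAL_OUTPUT_GB / MINUTES_PER_YEAR
dvds_per_minute = gb_per_minute / DVD_CAPACITY_GB
print(f"{gb_per_minute:.1f} GB/min, about {dvds_per_minute:.1f} DVDs/min")
```

That works out to roughly 28.5 GB per minute, or just over six DVDs, matching the figure in the text.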
Storing and analyzing the mountain of data, it turns out, is a task that no supercomputer in the world can handle. So while the LHC team rushes to finish the mammoth subterranean machine, above ground another group of physicists and computer scientists has been solving a problem of its own: assembling a computing infrastructure able to handle LHC’s data deluge. Their solution? A vast collection of high-powered computer systems scattered in nearly 200 research centers around the world, networked and configured to function as a single parallel processing system. This type of infrastructure is known as a computing grid.
Computing grids emerged in the late 1990s as an alternative to traditional supercomputers to solve certain problems demanding powerful number crunching and access to larger amounts of distributed data. The idea was that with sufficiently fast networks and the right software, multiple and geographically dispersed research groups could pool their computing and data management resources into a unified system capable of tackling problems that would be out of reach for any of them alone. Such grids, those early researchers hoped, would do for serious computing power what electricity grids did for electricity: make it available everywhere. Just plug your PC into a computing grid and you’d have instant access to supercomputing power at an affordable cost.
We are not quite there yet. Today, although grids have sprung up all over the place, most of them are specialized systems available to only a small cadre of researchers in fields such as high-energy physics, genome research, and earthquake monitoring. How, then, can we turn grids into an everyday research tool that can energize a wider range of scientific and technical pursuits?
That is the question CERN and its partner universities, research agencies, and companies—most of them in Europe but some in the United States, Asia, and Latin America—hope to answer by building on the experience of the LHC grid to create a massive global grid infrastructure. Led by CERN, the group wants to transform this new global grid into a tool capable of solving a great variety of problems in science, engineering, and industry.
The initiative, funded by the European Union, is called Enabling Grids for E-sciencE (EGEE). Behind the awkward acronym lies an ambitious effort [see illustration, “Going Global”]. The EGEE grid now combines the processing power of more than 20 000 CPUs, a storage capacity of about 5 million GB—growing rapidly in anticipation of the LHC data—and a global network connecting some 200 sites in such places as Paris, Moscow, Taipei, and Chicago. The grid is already crunching test data for the LHC experiments [see sidebar, “The Big Data Bang”] and also for dozens of applications in such areas as astrophysics, medical imaging, bioinformatics, climate studies, oil and gas exploration, pharmaceutical research, and financial forecasting. It’s now the world’s largest general-purpose scientific computing grid, and it’s getting bigger every month.
Grids seem to be a natural evolution in the history of distributed computing. Some of the early supercomputers were refrigerator-size cabinets that divided the computing workload among multiple processors. Then came clusters: groups of relatively cheap, off-the-shelf computers, typically PCs running Linux, that formed large parallel processing systems sprawling across entire rooms, buildings, or even campuses. But as computer networks became faster and cheaper, some researchers figured that an even more diverse and dispersed union of machines would be possible.
These researchers envisioned an infrastructure that, unlike supercomputers or clusters, would be owned, managed, and used by multiple organizations. And instead of a monolithic hardware and software design, this infrastructure would run on a mix of operating systems, file systems, and networking technologies. Thus emerged the idea of grid computing, and a bunch of grid pioneers went to work on realizing that vision. Ian Foster of the Argonne National Laboratory, in Illinois, and Carl Kesselman of the University of Southern California, in Los Angeles, were among the pioneers, and in 1998 they published The Grid: Blueprint for a New Computing Infrastructure (Morgan Kaufmann), a book that became an instant bible for the new field.
At least in theory, grids would include all kinds of systems: supercomputers, giant clusters, and desktop PCs, as well as storage devices, databases, sensors, and scientific instruments. But although many grid projects are evolving in that direction, most still amass a more homogeneous collection of systems.
The EGEE grid consists mainly of multiple clusters of PCs—some institutions have a dozen machines, others thousands—connected to “farms” of disk servers and specialized magnetic tape silos that are used for backup and long-term storage of data. The grid relies on the Internet and on high-speed, dedicated research networks to distribute computing tasks among the clusters owned by different parties, just as if the machines were sitting in the same room. Of the EGEE’s networks, the most important is called Géant2, an academic fiber-optic backbone that links 34 European countries and has connections to similar research networks elsewhere in the world.
The sheer computational power of some supercomputers today still exceeds EGEE’s, although that could change someday, depending on how fast this grid grows and how quickly it can join forces with other major national and international grid efforts. For example, the Open Science Grid, a U.S. initiative that already links a large number of data centers in more than 50 institutions, mirrors in many ways the European Union’s grid initiative and should be interoperable with EGEE soon. In Japan, a project called the National Research Grid Initiative is developing a grid infrastructure for science very similar to EGEE, and collaboration between the projects has already started.
But the fact is that such grids, even joined together, won’t replace supercomputers quite yet. Some problems—certain types of climate simulations, for instance—involve calculations that are so intertwined that a supercomputer’s multiple processors need to exchange data at dazzlingly fast speeds, a capability difficult to achieve with grids. Grids like EGEE aim instead at what’s called high-throughput computing—handling large numbers of similar but independent calculations. In other words, the applications that benefit most from grids are those that can be chopped into many smaller pieces and processed in parallel.
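The high-throughput model described above can be sketched in a few lines: each piece of work depends only on its own input, so the pieces can be computed in any order, on any machine, and recombined afterward. This is a toy illustration only (the `analyze` task is a placeholder, not grid code):

```python
# Toy illustration of "embarrassingly parallel" work: every job
# depends only on its own input, so jobs can be chopped up, run
# anywhere in any order, and the partial results merged at the end.
import random

def analyze(event):
    """Stand-in for one independent calculation, e.g. one collision event."""
    return event * event  # placeholder computation

events = list(range(100))                   # the full workload...
jobs = random.sample(events, len(events))   # ...the order doesn't matter

partial_results = {e: analyze(e) for e in jobs}
total = sum(partial_results.values())
print(total)  # same answer a single machine would get
```

A tightly coupled simulation, by contrast, would need each `analyze` call to exchange intermediate values with its neighbors at every step, which is exactly what grids handle poorly.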
Although grids remain mostly an academic research tool, they are making strides into the corporate world. Many large firms, after investing in technology for e-business, customer relationship management, and supply chain systems, are now putting computing grids high on their acquisition lists. Early adopters include financial institutions, some of which are using grids to perform sophisticated risk analysis, and pharmaceutical companies, which are using grids to study the effects of new drugs.
With an eye on this market, all the major computer vendors, including Hewlett-Packard, IBM, Microsoft, and Sun, now offer hardware and software that enable servers, PCs, and mainframes to tap into the power of a grid. Useful though such commercial offerings are, they can’t yet manage the number of computers, networks, and systems in a massive grid like the EGEE. What’s more, different vendors ended up developing different grid technologies. So unlike the Web—developed at CERN, incidentally—which is based on a common set of standards, existing grids are based on a wide variety of technologies. The field, it seems, is too immature for any one company to risk launching products and services on a large scale. Industry has in this case adopted a wait-and-see policy, letting the academic community take the lead, as has often happened with emerging technologies.
Indeed, this is why major grid projects still need public funding. International grid initiatives such as EGEE aim to take the grid model one step further by developing and testing computing, networking, and security technologies and pushing for common standards. This complements the efforts of international standards bodies, such as the Global Grid Forum, which champions technological convergence and interoperability.
But despite such efforts toward standardization, the term grid computing—as is often the case with technological buzzwords—has come to mean different things to different people. A source of confusion is the concept of on-demand or utility computing created by computer vendors. The idea is that a customer needing extra processing power can tap into a vendor’s data center, paying for what it uses. This is an interesting concept, but it doesn’t strictly require grid technology.
And then there are systems for scavenging computing power such as SETI@home, the popular screen saver that relies on ordinary people’s PCs to search radio-astronomy data for signs of extraterrestrial intelligence. Considering that SETI@home has been downloaded onto more than 5 million PCs around the globe, the ambitions of a project such as EGEE to federate 100 000 computers seem modest by comparison. But the similarities belie major differences. The sort of scientific software running on the EGEE grid involves complex configurations and requires reliable and secure data transfers that casually connected PCs cannot provide.
CERN is the setting for Dan Brown’s blockbuster novel Angels & Demons. But Brown’s CERN, a shiny space-age institute with a Mach 15 plane and a futuristic machine that produces antimatter in bulk, is a far cry from the real one, with its amalgam of disheveled offices erected as temporary constructions in the 1970s. Still, appearances can deceive: inside the CERN buildings are some of today’s most advanced science and technology projects.
In an unassuming bungalow appended to CERN’s main computing facility, the EGEE project has established its nerve center. When the project began, in April 2004, its goal was to create a grid connecting institutions in Europe, as well as some in the United States and Russia. But the initiative quickly evolved into a truly global effort as additional groups in Asia and the Middle East joined in. The project completed its first two-year phase in March, and a second two-year phase has begun. The goal is to expand the grid to more sites, eventually reaching over 100 000 CPUs and tens of petabytes of storage by 2008 (a petabyte equals about a million gigabytes).
Perhaps the most important component in a grid is what’s called middleware. This is a set of programs that each institution needs to run on its system to make it part of a grid. Middleware checks what resources—processing power, storage, databases—are available and decides where and when to process the tasks submitted by users. It works as an intermediary software layer between the grid’s computers and networks and the users’ applications. Thanks to middleware, the complex underlying structure of a grid is made transparent to researchers, who see it as a single virtual machine.
Rather than trying to develop middleware from scratch, EGEE adopted software components from a host of different grid-related projects. To manage all grid functions, EGEE’s middleware, called gLite, relies on some components from the Globus Toolkit, a widely used grid software package developed by Foster, Kesselman, and colleagues. And to perform its distributed computing tasks, gLite relies on Condor, a popular computing workload management system developed by Miron Livny’s team at the University of Wisconsin–Madison. Despite its name, gLite comprises over a million lines of code in a mixture of programming languages, and the package is larger than 100 megabytes. The name instead reflects the project’s pragmatic approach: get a basic set of grid services running on a global scale first, and worry about more advanced features later.
To get an idea of how the grid works, consider a recent effort involving Asian and European laboratories that used EGEE to analyze 300 000 potential drug compounds against the lethal strain of bird flu virus, H5N1. About 2000 computers on the grid ran a molecular simulation program for four weeks in April—a task that would take a single computer 100 years. The effort relied on an existing program that verifies the ability of a given drug molecule to “dock” into key proteins on the surface of the H5N1 virus, thereby blocking the action of those proteins and disrupting the ability of the virus to spread. Six EGEE participating institutions in France, Italy, Spain, and Taiwan selected the drug candidates, and each submitted its simulations to the grid.
The simulations are received by a middleware component called a resource broker, which evaluates the total computing power needed to process all the simulations—“jobs,” in computer jargon—and then checks the resources available throughout the grid. Not all EGEE members participate in a given effort, so the resource broker needs to find out which can be used. One way this is done is by defining a “virtual organization,” a subset of all participating institutions with a common scientific interest. In the flu simulation, an EGEE virtual organization called BioMed, with some 60 sites in Europe, Russia, Israel, and Taiwan, was the major player.
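One way to picture the virtual-organization step is as a simple filter over the grid's site catalog: only sites registered for the relevant virtual organization are eligible to receive jobs. The site names and data layout below are hypothetical, not actual EGEE records:

```python
# Hypothetical sketch of narrowing a grid to a virtual organization (VO):
# a site is eligible for a job only if it belongs to the job's VO.
sites = [
    {"name": "Clermont-Ferrand", "vos": {"biomed", "lhc"}},
    {"name": "Taipei",           "vos": {"biomed"}},
    {"name": "Chicago",          "vos": {"lhc"}},
]

def eligible(sites, vo):
    """Return the names of sites registered for the given VO."""
    return [s["name"] for s in sites if vo in s["vos"]]

print(eligible(sites, "biomed"))  # only the two biomed sites qualify
```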
Next, the resource broker dispatches the jobs to different places using the Internet or dedicated networks. A piece may go to the Corpuscular Physics Laboratory of Clermont-Ferrand, in France, while another flows to the Genomics Research Center, in Taiwan; a third may find its way to the University of Birmingham, in England, while yet another ends up at the Institute for Biomedical Technologies, in Italy. Those and many other jobs are processed, and the results are sent back to the resource broker, which then reassembles them into a complete solution. Bit by bit and in sites separated by continents and oceans, the problem is solved.
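The dispatch-and-reassemble pattern just described is a classic scatter-gather. The minimal sketch below uses threads to stand in for remote sites; the site list, the `run_job` task, and the round-robin placement are illustrative assumptions, not gLite code:

```python
# Minimal scatter-gather sketch of a resource broker: split the
# workload into jobs, dispatch each to a site, reassemble the results.
from concurrent.futures import ThreadPoolExecutor

SITES = ["Clermont-Ferrand", "Taipei", "Birmingham", "Milan"]  # illustrative

def run_job(job):
    """Stand-in for one simulation executed at a remote site."""
    compound, site = job
    return compound, f"result-from-{site}"

def broker(compounds):
    # Simple round-robin placement; a real broker also weighs queue
    # lengths, available storage, and site policies before dispatching.
    jobs = [(c, SITES[i % len(SITES)]) for i, c in enumerate(compounds)]
    with ThreadPoolExecutor(max_workers=len(SITES)) as pool:
        return dict(pool.map(run_job, jobs))  # gather into one answer

results = broker([f"compound-{n}" for n in range(8)])
print(len(results), results["compound-0"])
```

The key property, as in the avian-flu run, is that the broker can reassemble a complete solution no matter which site processed which piece.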
The output data from each calculation were stored at three sites, two in France and one in Taiwan, to ensure redundancy. More than 60 000 output files, with a total data volume of 600 GB, were created and stored on the EGEE grid. Potential drug compounds against avian flu are now being identified and ranked according to how successfully they seem to block the action of the virus in the simulations.
Other applications running on the EGEE grid work similarly [see sidebar, “Problem Solver”]. Some researchers are using EGEE to analyze medical images such as positron emission tomography (PET) scans stored in hospitals. Engineers at Compagnie Générale de Géophysique, in Paris, are using the grid to process geophysical data for oil, gas, mining, and environmental studies. And UNOSAT, a U.N. initiative, is experimenting with a grid application to compress raw high-resolution satellite data stored in repositories worldwide, so that the imagery can be transmitted faster to its field staff working in disaster recovery and post-conflict relief.
We are not at the tipping point where grid usage for science becomes as commonplace as PCs and the Web are today, but EGEE users are relying on it more and more. Last year, more than a thousand researchers from five continents submitted about 2 million computing tasks to the EGEE grid; the jobs ranged from relatively small simulations to huge number-crunching problems. Plans for EGEE’s second phase call for the project to continue expanding its infrastructure and number of applications; as of the start of this second phase, 25 000 jobs per day were being processed. Also, the global scope of the project will be important for the expansion, with efforts under way to extend the reach of the EGEE grid to the Baltic states, the Mediterranean basin, China, India, and Latin America.
Still, challenges loom. One is maintaining the grid’s reliability. EGEE needs to complete its evolution from a research tool to a real production infrastructure. So a major component of the EGEE project—nearly half, in budgetary terms—is making sure that all participating institutions have well-trained staff, as well as establishing national and regional grid-service centers to coordinate upgrades and monitor performance, along with call centers to answer questions from harried scientists in umpteen languages.
At the end of the day, the grid will catch on only if researchers—who are, in general, fussy customers—are convinced that it gives them big gains and little pain. Botched processing jobs or poor user interfaces could easily turn off such users, who would probably stick with their local resources even if those were no match for the grid. With that in view, EGEE has just launched version 3.0 of gLite, which improves its resource allocation and data transfer capabilities so it can cope with the diverse demands of different scientific communities.
As for the future of the EGEE grid and others, experts believe that, as grids become routinely used, they will become indistinguishable from the Web and other Internet applications. That is, using a grid to access computing power or storage capacity should be no different than booking a flight or downloading music online. Like many technologies, the grid’s success will be established when it fades into the background of our technological lives.
About the Authors
FABRIZIO GAGLIARDI, a senior staff member at CERN, near Geneva, was the director of the EGEE project in 2004 and 2005. He is now with Microsoft Corp. as director for technical computing in Europe, the Middle East, Africa, and Latin America.
FRANÇOIS GREY manages communications and outreach activities for the information technology department at CERN.
To Probe Further
For more on EGEE’s technology and applications, visit: http://www.eu-egee.org.
The Open Science Grid is one of the largest grid projects in the United States: http://www.opensciencegrid.org.