Helping Computers Help Themselves

The IT world's heavy hitters--IBM, Sun, Microsoft, and HP--want computers to solve their own problems

PHOTO: STEVE STANKIEWICZ

This is part of IEEE Spectrum's special R&D report: They Might Be Giants: Seeds of a Tech Turnaround.

If you're being chased by a big snarling dog, you don't have to worry about adjusting your heart rate or releasing a precise amount of adrenaline. Your body automatically does it all, thanks to the autonomic nervous system, the master-control for involuntary functions from breathing and blood flow to salivation and digestion.

At IBM Corp. [ranked (5) among the Top 100 R&D Spenders] and elsewhere, researchers are mimicking that model by developing the components and feedback loops necessary for computer systems to run themselves. The hope is that the constant and costly intervention of database and network administrators trying to figure out what must be done will soon be a thing of the past. Among the first of these projects are some that enable computer systems to optimize computing resources and data storage on their own.

Farther off in the future are other components of this autonomic vision, like maintaining ironclad security against unwanted intrusion, enabling fast automatic recovery after crashes, and developing standards to ensure interoperability among myriad systems and devices. Systems should also be able to fix server failures, system freezes, and operating system crashes when they happen or, better yet, prevent them from cropping up in the first place.

Extricating the human from the loop is all the more urgent because of the outlook for the next decade. By some estimates, 200 million information technology (IT) workers might be needed to support a billion people using computers at millions of businesses that could be interconnected via intranets, extranets, and the Internet.

To compete successfully in so large a market [see "IT Services Keep Climbing," below] calls for heavy investment up front: IBM has earmarked "the majority" of its US $5.3 billion annual R&D budget for autonomic-related research.

IT Services Keep Climbing: IBM, Sun, H-P, Microsoft, and their competitors are working to automate various information technology (IT) services, notably hardware and software maintenance and support, where autonomic-like products could grab a share of a large and growing market.

Success will also require cooperation among corporate, academic, and government research labs. Big Blue, for one, is at the early stages of developing such relationships and hosted an autonomic computing symposium this past April. The event, held at the IBM Almaden Research Center (San Jose, Calif.), attracted more than 100 attendees from various universities and companies, including Stanford, Columbia, Cornell, and the University of California, Berkeley, and rivals Sun Microsystems (42), Hewlett-Packard (30), and Microsoft (12).

Overwhelming complexity

The network challenge today is far tougher than in the days of the big networks in the 1980s, notes Alan Ganek, vice president of autonomic computing, who oversees all of IBM's autonomic-oriented work from the Thomas J. Watson Research Center (Hawthorne, N.Y.). "Now we have PCs, laptops, smart phones, and PDAs all running different operating systems," he points out. The scale and complexity of these networks are strikingly different from, say, a bank's automated-teller-machine network that links 20 000 ATMs in a private network to a single data center, he says.

An early target of the R&D aimed at creating adaptive computer systems underscores the complexity issue: the allocation of computing power within grids comprising often heterogeneous distributed CPUs and data storage devices. As defined by Wolfgang Gentzsch, Sun's director of grid computing, a grid is a "hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to computational capabilities" by connecting distributed computers, storage devices, mobile devices, databases, and software applications.

Sun's own N1 (or Network One) project is a long-term plan to combine a number of different technologies, including grid computing and high-performance file systems, so that a pool of resources may be allocated dynamically to meet a range of user needs. Following similar lines, H-P's Planetary Computing is an architecture for networked data centers. It enables each center to automatically reconfigure its software infrastructure and allocate data storage and server resources wherever demand indicates. H-P's first Planetary Computing product, the Utility Data Center, was launched late last year to "provide automated infrastructure on demand with little or no operator intervention."

IBM's vision for future computer systems is somewhat broader than its competitors'. The new capabilities it is developing were mapped out in its Autonomic Computing Manifesto, released last year. The document describes an open standards-based system, one that: tracks its own resources, so it can share them with other systems; repairs itself; and maximizes how it uses its resources so that workloads can be moved to whichever servers and data storage facilities will process them most efficiently. The system also protects itself from viruses and hackers and anticipates users' information needs. The trick is to embed a layer of smart middleware deep in the system. This middleware will monitor system performance and execute repairs, resource allocations, and applications as necessary, without barraging network administrators with new parameters to set and operational decisions to make.

IBM Research has already started to guide the company down the autonomic path. Offerings in its family of relational databases called Database 2 (DB2) maintain themselves and optimize querying. Servers that automatically allocate storage and computing resources are moving to market, thanks to the Server Group's Project eLiza.

Still in prototype is a redundant server system based on storage and compute "bricks" called the IceCube server. Comprising an array of Collective Intelligent Bricks (CIBs), IceCube is the smarter son of industry-standard Redundant Array of Inexpensive Disks (RAID) systems. The IceCube server tracks its own components, health, maximum capacity, and relationship to other systems [see illustration]. It can distribute data among many CIBs even though your computer thinks they're all in one place. The system works around any failed node, negating the urgency, or even the need, for repair. Failed bricks are left in place until there comes a convenient time to fish out the dead soldier.
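
The idea of spreading data across bricks and reading around a dead one can be sketched in a few lines of Python. This is only an illustration, not IBM's placement scheme: the hash-based placement, the two-copy default, and the names place_replicas and read_block are all assumptions made for the example.

```python
import hashlib

def place_replicas(key, bricks, copies=2):
    """Pick `copies` distinct bricks for a block by hashing its key (assumed scheme)."""
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(bricks)
    return [bricks[(start + i) % len(bricks)] for i in range(copies)]

def read_block(key, bricks, failed, copies=2):
    """Return the first healthy brick holding the block, or None if all copies are lost."""
    for brick in place_replicas(key, bricks, copies):
        if brick not in failed:
            return brick
    return None

# 27 bricks, matching the size of the prototype described in the article.
bricks = [f"brick-{i}" for i in range(27)]
primary, backup = place_replicas("invoice-42", bricks)

# With the primary brick marked as failed, the read is served by the backup copy.
served_by = read_block("invoice-42", bricks, failed={primary})
```

Because the caller asks only for a key, it never needs to know which brick answered, and a failed brick can stay in place until someone finds time to pull it.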

Illustration: Steve Stankiewicz
Building Storage, Brick by Interchangeable Brick: IBM is building a self-managing, modular data storage server, known as IceCube, using Collective Intelligent Bricks (CIBs). Each CIB, 15 cm on a side, is outfitted with twelve 100-GB hard disks, a Pentium microprocessor running Linux, and six 10-Gb/s capacitive couplers to interface each face of the brick with the adjacent one. A prototype containing 27 bricks will be completed by year's end.

A faster way to run a database

Database administration is one of the biggest time-sinks of IT department brainpower. A decade ago, according to Ganek, hardware soaked up 80 percent of the cost of a data center. Now half the money spent goes for the people to run the center. And that share is rising. Why so? Quite simply, databases are very complicated and require constant care and feeding.

Imagine that a salesperson wants to know how many widgets his company sold in Africa in 2001. In response to this query, a system searches all 2001 global sales data in its database for sales related to Africa. Alternatively, it might search only data relating to transactions in Africa in 2001. Obviously, the second alternative is the more efficient, because it searches the shorter index (a list of keywords, each identifying a unique record). Choosing the best approach is the job of query optimizers. Now IBM is beefing up its optimizer to adjust and optimize queries on the fly, automatically.
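
A toy version of that choice, in Python, looks like this. The row counts and the one-unit-per-row cost model are invented for illustration; a real optimizer's cost model is far richer.

```python
def scan_cost(rows_examined, cost_per_row=1.0):
    """Toy cost model (an assumption): cost grows linearly with rows examined."""
    return rows_examined * cost_per_row

# Hypothetical row counts for the two alternatives in the widget example.
plans = {
    "scan all 2001 sales, then filter on Africa": scan_cost(1_000_000),
    "search only the Africa 2001 index entries": scan_cost(40_000),
}

# The optimizer's job, reduced to its core: pick the cheapest estimated plan.
best_plan = min(plans, key=plans.get)
```

The choice is only as good as the estimates behind it, which is why stale statistics are a problem.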

All major database products based on the popular Structured Query Language (SQL), including Oracle, Microsoft SQL Server, Informix, and Sybase, have such optimizers, explains Guy Lohman, manager of advanced optimization research in Almaden's advanced databases solutions department. The optimizer determines the best way to answer a given query by modeling alternative ways to process the SQL statement.

More specifically, it determines different ways to access and join records listed as rows in tables, and it also decides the order in which those access-and-join operations are applied to the rows of records. Queries are made up of key words and fields that together form an element called a predicate. To plow through even two tables with a handful of predicates, the current version of IBM's DB2 optimizer might consider over a hundred different plans, each a particular sequence of operators (mathematical actions that, for instance, compare one value to another).
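
To see why the number of plans grows so quickly, consider a simplified enumeration in Python. The table names, access methods, and join methods below are generic textbook choices assumed for the example, not DB2's actual search space, which has more dimensions.

```python
from itertools import permutations, product

tables = ["sales", "regions"]                           # hypothetical tables
access_methods = ["table scan", "index scan"]           # assumed options per table
join_methods = ["nested loop", "hash join", "merge join"]  # assumed options per join

def enumerate_plans(tables, access_methods, join_methods):
    """List every (join order, access method per table, join method per join) combination."""
    plans = []
    for order in permutations(tables):
        for access in product(access_methods, repeat=len(tables)):
            for joins in product(join_methods, repeat=len(tables) - 1):
                plans.append((order, access, joins))
    return plans

two_table_plans = enumerate_plans(tables, access_methods, join_methods)
three_table_plans = enumerate_plans(tables + ["customers"], access_methods, join_methods)
```

Even this stripped-down model yields 24 candidate plans for two tables and 432 for three, so an optimizer has to prune its search rather than cost every alternative.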

"This is all done automatically by DB2's query optimizer, under the covers. The user isn't even aware it's going on," says Lohman, whose group is developing the Learning Optimizer, LEO. LEO is part of IBM's SMART (self-managing and resource-tuning) database technology that will be integrated into future versions of the company's DB2 product. "That's autonomic and we've been doing it for over 20 years in DB2," he says.

One of the challenges LEO overcomes is the possibility that the statistics on the database, such as the number of rows in each table, could be out of date.

"If you've got 10 000 transactions coming into Amazon every minute, you don't want to lock up part of the database that stores those stats," says Lohman. "Instead we collect statistics periodically." LEO codes a counter into the software that tabulates the number of rows the database processed in each step. After the query is completed, the system compares the estimate made by the query optimizer's internal mathematical model with the actual number of rows processed, and updates the statistics accordingly. The next time the user executes a query that must traverse similar terrain in the database, LEO will know a much better path to the data.
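
The feedback loop can be sketched as follows. This is a minimal illustration of the compare-and-correct idea, not LEO's implementation: the FeedbackStats class, its per-predicate correction factor, and the sample numbers are assumptions.

```python
class FeedbackStats:
    """Keeps a correction factor per predicate, learned from past executions."""

    def __init__(self):
        self.adjustment = {}  # predicate text -> actual/estimated ratio

    def estimate(self, predicate, model_estimate):
        """Scale the optimizer model's row estimate by what was learned so far."""
        return model_estimate * self.adjustment.get(predicate, 1.0)

    def record(self, predicate, model_estimate, actual_rows):
        """After a query runs, compare the counted rows with the model's estimate."""
        if model_estimate > 0:
            self.adjustment[predicate] = actual_rows / model_estimate

stats = FeedbackStats()
predicate = "region = 'Africa'"

first_estimate = stats.estimate(predicate, 1000)    # nothing learned yet
stats.record(predicate, 1000, actual_rows=4000)     # the row counter saw 4000 rows
second_estimate = stats.estimate(predicate, 1000)   # corrected on the next query
```

No lock on the statistics is needed while queries run; the correction is applied after the fact, using counts the query gathered anyway.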

Currently, LEO, which is still in prototype, can detect and correct errors in the statistics or the estimates for a single predicate. Ultimately, IBM hopes LEO will be advanced enough to detect correlations among multiple predicates and to adjust a query's plan as it's executed. The magnitude of correlation among predicates will affect how many rows must be traversed and thus affect which plan is deemed optimal.
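
A small worked example shows why correlation matters. It assumes the common textbook simplification that an optimizer treats predicates as independent and multiplies their selectivities; the table, the two predicates, and every number here are hypothetical.

```python
total_rows = 100_000
sel_make = 0.10    # assumed fraction of rows where make = 'Honda'
sel_model = 0.02   # assumed fraction of rows where model = 'Accord'

# Independence assumption: multiply the two selectivities.
independent_estimate = total_rows * sel_make * sel_model

# In this example every Accord is a Honda, so the two predicates together
# select exactly the rows that the model predicate selects alone.
correlated_actual = total_rows * sel_model

# How far off the independence-based estimate is.
error_factor = correlated_actual / independent_estimate
```

Here the estimate is about 200 rows while roughly 2000 actually qualify, a tenfold error that could easily push the optimizer toward the wrong plan.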

IBM faces stiff competition in this space from a company with pockets as deep as its own. Microsoft Corp., which plans to increase its R&D spending 20 percent this year to $5.3 billion, already has a technology similar to LEO on the market.

The Index Tuning Wizard was introduced in 1998 as part of the Redmond, Wash., company's SQL Server 7.0 database. The Wizard tracks the server's behavior in response to queries, gathering information about the workload. By examining this data, it detects potential performance problems, notifies the database administrator, and recommends fixes, such as creating an index to speed up slow-performing queries. The Wizard also provides the administrator with an estimate of expected performance improvement if its suggested changes are executed—up to 50 percent over a basic database design, according to an industry decision support benchmark.

When first released, the Tuning Wizard was the only commercial tool to advise the administrator on how to make changes to a database's design so as to speed up a query and calculate potential performance improvements, says Surajit Chaudhuri, a senior researcher in the Data Management, Exploration and Mining group at Microsoft Research and manager of the AutoAdmin team that developed the Wizard. The current version of the Tuning Wizard included in SQL Server 2000 is unique in that, for a given workload, it can recommend a judicious combination of indexes as well as materialized views. (Such a view is a cache of pre-computed answers that the database searches when it recognizes a certain kind of question.) The result: an estimated 80 percent performance improvement over a basic design.
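
The general shape of workload-driven index advice can be illustrated with a short Python sketch. It is not Microsoft's algorithm: the greedy ranking, the recommend_indexes function, and the workload figures are assumptions, and the sketch ignores materialized views and the storage cost of an index.

```python
def recommend_indexes(workload, max_indexes):
    """Rank candidate index columns by total estimated savings across the workload.

    Each workload entry is a dict with keys: column, frequency, cost_now, cost_indexed.
    Returns the chosen columns and the estimated fractional improvement.
    """
    savings = {}
    for query in workload:
        saved = query["frequency"] * (query["cost_now"] - query["cost_indexed"])
        savings[query["column"]] = savings.get(query["column"], 0) + saved

    chosen = sorted(savings, key=savings.get, reverse=True)[:max_indexes]
    baseline = sum(q["frequency"] * q["cost_now"] for q in workload)
    improvement = sum(savings[c] for c in chosen) / baseline
    return chosen, improvement

# Hypothetical observed workload: how often each query runs and what it costs.
workload = [
    {"column": "customer_id", "frequency": 500, "cost_now": 100, "cost_indexed": 10},
    {"column": "order_date", "frequency": 100, "cost_now": 200, "cost_indexed": 50},
    {"column": "customer_id", "frequency": 50, "cost_now": 80, "cost_indexed": 20},
]

chosen, improvement = recommend_indexes(workload, max_indexes=1)
```

With these made-up numbers, a single index on customer_id is recommended, along with an estimate of the share of total workload cost it would save; the administrator still decides whether to apply it.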

The AutoAdmin researchers continue to develop databases that track their own usage and adapt accordingly to remain at the same level of efficiency and reliability under changing workloads and constant expansion. "You don't have to understand the machinery of a car to drive it," says Chaudhuri. "You also don't need to understand how a database runs—you just need a steering wheel to set policy and manage information."

Managers of the service

"While I do find the biological metaphor of autonomic computing very cute, it falls apart," declares the new CTO of Sun's N1 program, Yousef Khalidi.

Instead of autonomic computing, Sun prefers the metaphor on which it rode to prominence in the 1980s: the network is the computer. A leader in commercializing grid computing, Sun's recently unveiled N1 initiative combines the company's grid efforts and so-called network virtualization technology for the dynamic allocation of resources. For instance, Sun has already started to offer customers of its 6900 back-end storage arrays the ability to dynamically assign CPU and cache resources.

"We want humans to manage the services, not to manage the servers," says Khalidi. "N1 models application needs. Once you have the models, then you can optimize, you can automate, you can set policy."

Sun is leveraging its grid computing expertise into the network administration domain. The company now supports customers with some 5000 grids with a total of 220 000 CPUs, with new grids sprouting at the rate of 70 per week.

According to Peter Jeffcock, marketing manager for Sun's client and technical market products group, companies like Motorola (13) are already exploiting some of the self-adapting features of the company's open-source Grid Engine and its proprietary version, Grid Engine Enterprise edition software. This software keeps tabs on a list of tasks as well as the list of CPUs available to the grids. Depending on priorities set by the user, the program will assign tasks to CPUs automatically.

For instance, Motorola's wireless products group has 250 CPUs on its grid and uses Grid Engine to establish a policy that resources during normal use are split evenly among, say, two groups of chip designers. As one group approaches a deadline to finish a chip design and requires a lot of extra computational power to run simulations that test the design's functional accuracy, the software assigns more resources to those tasks. The idea is to increase CPU utilization from the typical 10 to 30 percent used when CPUs are off the grid, to 98 percent or more when they are on the grid, while putting the resources at the disposal of the most mission-critical project.
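
A share-based policy of this kind can be sketched in Python. This is not Grid Engine's configuration syntax or scheduler; the allocate_cpus function, the group names, and the 4-to-1 deadline weighting are assumptions chosen to mirror the Motorola example.

```python
def allocate_cpus(total_cpus, shares):
    """Split CPUs among groups in proportion to each group's share weight."""
    total_weight = sum(shares.values())
    allocation = {group: total_cpus * weight // total_weight
                  for group, weight in shares.items()}
    # Hand out any CPUs lost to integer rounding, highest weight first.
    leftover = total_cpus - sum(allocation.values())
    for group in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        allocation[group] += 1
    return allocation

# Normal use: two chip-design groups share 250 CPUs evenly.
normal = allocate_cpus(250, {"design_a": 1, "design_b": 1})

# Group A nears a deadline, so the policy raises its weight (assumed 4 to 1).
deadline = allocate_cpus(250, {"design_a": 4, "design_b": 1})
```

The administrator changes only the weights; the assignment of individual tasks to individual CPUs follows from the policy.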

IBM's first product aimed at making network administration less of a burden on human administrators rode to market last year on the back of a lizard: Project eLiza, an effort to create self-managing servers. Liza is short for lizard, a name chosen after researchers estimated the processing power of Deep Blue (the famous chess-playing supercomputer that beat world champ Garry Kasparov) to be on a par with a lizard's. Project eLiza's goal, then, is to bring a lizard's level of autonomy into the server arena so that servers can survive and adapt in unpredictable environments. So far, eLiza has spawned IBM's eServer product line, the first commercial products from autonomic R&D, and, most recently, Enterprise Workload Management (eWLM) software based on work at IBM's Thomas J. Watson Research Center.

Because today's server farms harbor as many as several thousand computers, it's nearly impossible for an administrator to keep abreast of the topology of a server network: the physical servers, storage, and network capacity, the machines where databases and applications reside, and the path that an incoming transaction takes. Configuring each server by hand to properly process transactions can take weeks.

"Any [server] customer would tell you that when the number of servers they have to manage goes beyond 20, they lose track of what kind of servers are on which floor or what operating system is running on them," says Donna Dillenberger, a senior technical staff member at Watson.

IBM's eWLM software automatically discovers what server resources, including computers and applications, are on a server farm; what demands the incoming transactions place on applications; and which path those transactions take through the farm. eWLM then identifies delays associated with CPUs, I/O, memory, network, applications, and even particular machines. Using self-learning algorithms, eWLM dynamically routes a transaction around an overburdened portion of the network. The current eLiza element is undergoing tests at several insurance and financial-service firms before full release later this year.
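
Routing around a slow spot can be illustrated with a simple delay-tracking router. This is a sketch only: eWLM's actual algorithms are not described here, and the DelayAwareRouter class, its exponential smoothing, and the server names are assumptions.

```python
class DelayAwareRouter:
    """Tracks a smoothed delay per server and routes to the least-delayed one."""

    def __init__(self, servers, smoothing=0.5):
        self.delay = {server: 0.0 for server in servers}
        self.smoothing = smoothing

    def observe(self, server, measured_delay):
        """Blend a new delay measurement into the running estimate."""
        old = self.delay[server]
        self.delay[server] = (1 - self.smoothing) * old + self.smoothing * measured_delay

    def route(self):
        """Choose the server with the lowest current delay estimate."""
        return min(self.delay, key=self.delay.get)

router = DelayAwareRouter(["app-1", "app-2", "app-3"])
router.observe("app-1", 0.9)   # seconds, hypothetical measurements
router.observe("app-2", 0.2)
router.observe("app-3", 0.4)
target = router.route()        # app-2 looks fastest so far

router.observe("app-2", 3.0)   # app-2 becomes overburdened
rerouted = router.route()      # traffic shifts to the next-best server
```

The smoothing factor controls how quickly the router reacts to a spike versus how much it trusts its history.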

Dillenberger and her team are also hard at work closing the feedback loop so the software, based on the information it collects about its resources, can self-optimize the entire system to meet specific performance goals. For example, a system administrator at a money management firm could issue a policy that all stock trade transactions must be completed in one second, whereas other tasks may be permitted to take longer. eWLM, Dillenberger says, should be able to look at the topology of the network and predict whether it can do this. If not, eWLM could alter the topology, calling for, perhaps, another server on the farm to provide temporary relief.
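
One way to picture that prediction step is with a deliberately crude queueing model in Python. The model is an assumption (load split evenly across servers, response time rising as service_time / (1 - utilization)), as are the function names and the traffic figures; it is not how eWLM makes its predictions.

```python
def predicted_response_time(arrival_rate, service_time, servers):
    """Estimate response time, assuming load is split evenly across servers."""
    utilization = arrival_rate * service_time / servers
    if utilization >= 1:
        return float("inf")  # the farm cannot keep up at all
    return service_time / (1 - utilization)

def servers_needed(arrival_rate, service_time, goal, servers, max_servers=100):
    """Add servers one at a time until the predicted response time meets the goal."""
    while servers <= max_servers:
        if predicted_response_time(arrival_rate, service_time, servers) <= goal:
            return servers
        servers += 1
    return None  # goal cannot be met within the allowed farm size

# Policy: stock trades must complete within one second.
# Hypothetical load: 30 trades per second, 0.5 second of work each, 16 servers today.
needed = servers_needed(arrival_rate=30, service_time=0.5, goal=1.0, servers=16)
```

Under these assumptions the current 16 servers would miss the one-second goal, and the model calls for growing the farm to 30.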

"You should be able to set high-level goals in plain English and the computer learns how to adjust the knobs," she says.

Fast recovery

While allocating resources and optimizing performance are part and parcel of the autonomic vision and are already showing up in products, automatic recovery from crashes is a feature that requires more applied research.

One ongoing effort is led by the University of California, Berkeley's David Patterson, notable for, among other feats, co-inventing the RAID approach to storage. The year-old Recovery-Oriented Computing (ROC) project he has undertaken in collaboration with Stanford's Armando Fox shifts away from the common failure-avoidance-based approach to system design and toward an emphasis on speedier recovery. ROC has funding from the National Science Foundation, Microsoft, Hewlett-Packard, Velocity, and IBM.

A common failure-avoidance approach is preventive maintenance. Every week, huge commercial Web sites, like eBay, become unavailable for a couple of hours in the middle of the night, so administrators can mend any parts on the verge of failure. Rather than taking the entire site offline, the ROC approach is to repair individual components while the site is live.

Another ROC intention is to add systemwide support for the Undo command. Almost every desktop software package, from word processors to e-mail clients, makes a user's mistake easy to correct. Simply hitting a couple of keys brings the system back to its state before the incorrect command was executed. The ROC project is adding this feature to entire systems, like e-mail servers. That way, if your entire e-mail folder is accidentally trashed, the computer can locate the folder and bring it back.

"Undo is a three-step process of rewinding time, untangling problems, then replaying the system back to the current time," Patterson says.

Still, challenges arise when applying Undo to more complicated problems. For example, if an e-mail message is unwittingly downloaded and not filtered for viruses, Patterson says, ROC would allow a network administrator to go back to the state of the e-mail server before the e-mail arrived, install the filter, and then execute the operation again with the filter in place. Things get stickier, though, if someone had already opened the virus-carrying e-mail before the Undo was executed.
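
The rewind, repair, and replay steps can be sketched for the e-mail case in Python. This is an illustration of the idea, not the ROC project's code: the snapshot-plus-log design, the function names, and the toy virus test are assumptions.

```python
def deliver(mailbox, message, virus_filter=None):
    """Append a message to the mailbox unless the filter rejects it."""
    if virus_filter is not None and virus_filter(message):
        return mailbox              # message dropped by the filter
    return mailbox + [message]

def undo_and_replay(snapshot, log, virus_filter):
    """Rewind to the snapshot, then replay the logged messages with the repair in place."""
    mailbox = list(snapshot)        # 1. rewind time to the saved state
    for message in log:             # 3. replay forward to the present...
        mailbox = deliver(mailbox, message, virus_filter)  # ...with 2. the fix installed
    return mailbox

snapshot = ["welcome"]                                   # state before the bad e-mail
log = ["lunch?", "VIRUS: open me", "status report"]     # everything received since

# What happened originally, with no filter installed.
infected = snapshot
for message in log:
    infected = deliver(infected, message)

# What the server looks like after Undo with a (toy) filter installed.
repaired = undo_and_replay(snapshot, log, lambda m: m.startswith("VIRUS"))
```

The sketch also makes the sticky case visible: replay can fix the server's state, but it cannot un-open a message a user has already read.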

Grids and standards

Ultimately, an autonomic architecture is only as good as its ability to deal with heterogeneity—the diversified platforms, operating systems, devices, and software in today's large networks.

According to IBM, open standards are not only essential to the deployment of autonomic technology, but they also level the playing field for the companies doing the innovating. "We want to sell our middleware based on fair competition with an equal set of standards," says Almaden Research Center director Robert Morris. "People should buy our toaster because it toasts bread the best, not because it has the only plug that fits in the outlet."

Already, IBM, H-P, Microsoft, Sun, and other IT firms and universities are laying down the ground rules for sharing applications and computing resources over the Internet. The Globus Project is a multi-institutional R&D effort to establish computational grids of shared resources. Working with the project, the Global Grid Forum in February introduced the Open Grid Services Architecture, a set of specifications and standards applicable to resource sharing and Web services.

"There are no secrets in autonomic computing technology," Morris adds. "We already know how to do many things, we just haven't applied them. But the application is as important as the invention."

About the Author

DAVID PESCOVITZ is the writer-in-residence at the University of California at Berkeley's College of Engineering (david@pesco.net) and contributes to Wired.

To Probe Further

For what IBM Corp. envisions for computer systems of the future, read its Autonomic Computing Manifesto at http://www.research.ibm.com/autonomic/.

Wolfgang Gentzsch of Sun Microsystems Inc. describes the evolving corporate IT infrastructure in "Grid Computing, A Vendor's Vision," Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002, pp. 272-77.

Follow David Patterson's recovery-oriented computing (ROC) project at http://roc.cs.berkeley.edu/.
