Widespread electric outages are not merely a fluke but a symptom—of poor policy and management
One Smart Operator: Alert controllers at the New England Independent System Operator in Holyoke, Mass., saw the August 2003 outage coming and were able to disconnect their region in time to keep the lights on. Photo: Mitch Epstein
In his classic book Normal Accidents: Living With High-Risk Technologies (Basic Books, 1984), the sociologist Charles Perrow describes a day when everything just goes wrong. You lock yourself out of your house, leaving your car keys inside. The spare, normally hidden, house keys have been lent to a friend, and your alternate ride, a neighbor’s car, is in the shop. A strike has shut down metropolitan bus services, swamping the taxi fleet. On and on goes the litany, and in the end, you miss an important meeting. Perrow then challenges the reader to name the cause of the foul-up. The answer, of course, is everything and yet no one thing.
Perrow calls this a “normal accident,” because there’s no way to foresee all the things that might go wrong. One could resolve never to let the same mishaps strike again, but the safeguards would come at a cost and could not protect against all possibilities. You could go ahead and stash extra keys, yet still miss your meeting because your neighborhood is cut off by mud slides. Systems are prone to such accidents, Perrow argues, in proportion to their complexity and “tight coupling”: one problem causes others, like the bus strike’s overburdening the taxi service.
Arguably, the most complex and tightly coupled systems ever constructed for use in daily life are those making up the interconnected electric power grid, which is by its nature vulnerable to system accidents. When such accidents are rife, they must be regarded as the symptom of inadequate grid design and management, itself a product of a bad system of incentives. Tellingly, expert groups investigating Italy’s nationwide blackout of 28 September 2003 and the northeastern North American blackout of 14 August 2003 reached very similar conclusions about their underlying causes. What is more, if recommendations made following the three major western North American blackouts between 1994 and 1996 had been followed, the effects of the 2003 outages would have been far less severe.
These events underscore the urgency of adopting enforceable measures to reduce the frequency and impact of massive grid outages. The U.S. Energy Policy Act of 2005 gives a self-regulating electric power organization, subject to review by the Federal Energy Regulatory Commission, the authority to enforce reliability rules and regulations. That organization will be the North American Electric Reliability Council (NERC), a multinational council based in Princeton, N.J. Separately, Mexican and Canadian authorities have promised to back NERC’s regulations with the force of law.
Long overdue, the action by the U.S., Canadian, and Mexican governments will turn NERC into an organization almost like the mighty U.S. Securities and Exchange Commission, with the power to impose penalties on member companies that fail to abide by its rules. The action is especially significant in light of a U.S.-Canadian team’s findings after the 2003 Northeast outages that NERC’s policies and planning standards had been violated, and that mistakes already identified after the mid-1990s western North American blackouts had been repeated.
Since August 2003, NERC has established a multidisciplinary process involving experts from the regional reliability councils to assure that planning and operating standards are developed, or reexamined, and adhered to. The goal is to guarantee that imbalances of load and supply will be isolated faster after outages.
At the same time, in the last six to seven years, U.S. investor-owned power companies have boosted investment in electricity transmission infrastructure to US $4.6 billion in 2004 from $2.6 billion in 1999. Similar infrastructure investments are being made in other deregulated electricity markets around the world. Still, there is a long way to go.
Some of the blame for grid maladies must be placed on the deregulation of the electric power industry. Utility industry restructuring has led to an unbundling of generation, transmission, and distribution activities, with coordination of systems put in the hands of entities called Independent System Operators (ISOs) and Regional Transmission Organizations (RTOs).
Deregulation erred, however, by freeing only the price of energy while continuing to fix the price a company could charge for the use of its transmission infrastructure. By encouraging companies to trade masses of power over long distances and to build new generators, the law increased demand for transmission systems but did little to enhance the grid to handle the increased energy flows. Carrying capacity did not keep up with needs, as power companies lacked the means of recovering investments in grid expansion. The risk of grid breakdowns caused by “normal” accidents increased accordingly.
More than 50 million people learned this lesson the hard way in August 2003, when a chain of seemingly preventable electrical events crashed the northeastern grid, darkening millions of customers’ homes and businesses in the United States and Canada. The first event—though as Perrow would maintain, not the cause—was a failure in vegetation management: an Ohio power company did not keep up with its tree trimming, allowing highly loaded transmission lines to sag into the branches and short out, causing heavy power surges in other lines. The company’s monitoring equipment wasn’t working properly, and so it was unaware of the escalating crisis. Even after the stress on the grid was identified, rather than attempt to isolate the problem by shutting down local service, the decision was made to ride through the escalating crisis. Prudent operating guidelines—and indeed recommended NERC procedures—would have required the affected part of the system to be “islanded,” that is to say, electrically isolated.
The sequence of outages that led to the violent power surges was not anticipated on that historic August day, and the interconnected system load was much higher than the remaining grid could withstand. Power transfers from a distance ensued, overstressing neighboring grids and sparking a cascade of failures that blasted critical nodes like firecrackers on a string.
The sequence of “normal” events that led to the August 2003 blackout is far from unique. In fact, such sequences seem to be getting more frequent. In recent years, wide-area outages have affected upward of 140 million customers worldwide. Blackouts not caused by natural calamity have also struck Australia, Croatia, Denmark, England, Greece, India, Italy, New Zealand, Russia, and Sweden.
Cascading failures are almost intrinsic to electrical power, which is not linear in all its functions. Transmission lines are like water pipes, facilitating transmission of power. As water needs pressure to move, power systems need reactive power to maintain the flow of energy. Reactive power arises from voltage being out of phase with current: it is consumed by inductive loads like coils and the transmission lines themselves, and added by capacitive loads, often consisting of sophisticated, state-of-the-art equipment relying on large semiconductor devices and power electronics technology.
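To make the idea of reactive power concrete, here is a minimal Python sketch based on the standard single-phase relations P = V·I·cos φ and Q = V·I·sin φ, where φ is the angle by which current lags voltage. The function name and the sample load are illustrative assumptions, not figures from any utility.

```python
import math

def power_components(v_rms, i_rms, phase_deg):
    """Return (real_watts, reactive_vars, apparent_va) for a single-phase load.

    phase_deg is the angle by which current lags voltage: positive for an
    inductive load that consumes reactive power, negative for a capacitive
    load that supplies it.
    """
    phi = math.radians(phase_deg)
    apparent = v_rms * i_rms
    return apparent * math.cos(phi), apparent * math.sin(phi), apparent

# Hypothetical inductive load: 120 V, 10 A, current lagging by 30 degrees.
p, q, s = power_components(120.0, 10.0, 30.0)
print(f"real {p:.0f} W, reactive {q:.0f} var, apparent {s:.0f} VA")
```

In this sketch only the real component does useful work at the load, yet the line must carry the full apparent power, which is why a shortage of reactive support shows up as a voltage problem.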
When there is net consumption of reactive power in a section of the grid, it must somehow be added to maintain voltage. Without reactive power, quality electricity cannot be delivered to all customers equally in the same circuit. But customers less affected by the problem don’t like having to pay to guarantee delivery to those more affected, and power companies—including the one in Ohio where the August 2003 blackout started—often have been remiss in holding enough reactive capacity in reserve. Making matters even more difficult, electric power—unlike water—cannot be stored in large quantities. Instead, there must be a constant balancing act between supply and demand of all available resources: power, reactive power, and transmission capacity. When this balance is pushed to the limit, the transmission system is at full capacity, and even a small interruption of power transmission or generation can cascade and lash out violently and unpredictably.
It can all happen in seconds. Operating the system in areas not thoroughly studied or outside of reliable limits endangers the grid’s remaining parts. At that point, an overloaded transmission line may cause cascading line tripping; the resulting outage can force generators and transformers to trip off line, widening the area affected. Because the generators still on line are no longer working in perfect phase with one another, oscillatory instability builds. As the power system separates into islands, the harmonious distribution of load and generation becomes unstable. System frequency starts to deviate from the nominal value. This imbalance pushes still more equipment to trip off line, causing more separation of generation, reactive power, and transmission resources that are delivering energy.
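The tripping sequence described above can be caricatured with a toy model. The sketch below is a deliberately crude assumption: it treats the grid as a set of parallel lines and splits a tripped line's flow equally among the survivors, whereas real power flows follow network physics and protection settings. It shows only why a system run close to its limits cascades while a lightly loaded one absorbs the same first failure.

```python
def simulate_cascade(flows, limits, first_trip):
    """Toy cascade: when a line trips, its flow is split equally among the
    surviving lines, and any line pushed past its limit trips next.

    Returns the indices of tripped lines in the order they tripped.
    """
    flows = list(flows)
    in_service = set(range(len(flows)))
    tripped = []
    pending = [first_trip]
    while pending:
        line = pending.pop(0)
        if line not in in_service:
            continue
        in_service.remove(line)
        tripped.append(line)
        if not in_service:
            break  # nothing left in service to pick up the load
        share = flows[line] / len(in_service)
        flows[line] = 0.0
        for other in sorted(in_service):
            flows[other] += share
        for other in sorted(in_service):
            if flows[other] > limits[other] and other not in pending:
                pending.append(other)
    return tripped

# Hypothetical four-line system, each line rated at 100 units.
print(simulate_cascade([50, 50, 50, 50], [100] * 4, first_trip=0))
print(simulate_cascade([90, 90, 90, 90], [100] * 4, first_trip=0))
```

With the lines at half their rating, the loss of one line is absorbed by the others; at 90 percent loading, the same single failure takes out every line in turn.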
It would all be much more bearable if the pain lasted for just a few minutes or hours. But such has not been the case for many of the recent wide-area cascading events, including the August 2003 outage. It damaged custom-built equipment that in most cases could not quickly be brought back to full operating conditions. Workers often had to navigate many bottlenecks to restore power, patching together one abnormally isolated area at a time, lighting up pieces of the grid while leaving other areas dark, sometimes for days. Recovery times often exceeded the operating lives of the batteries and generators that private companies had installed as backups against just such an emergency. The total cost to the economy cannot be calculated accurately, but it must have been in the tens of billions of dollars. And it could have been worse: the Northeast outage could have struck earlier in the workweek, rather than late on Thursday afternoon in the summer.
Afterward, everyone asked Perrow’s trick question: What was the cause? All too many were quick to recite the details. Untrimmed tree branches, unintelligent electromechanical switches and instrumentation, broken monitoring systems, and bad communications all got their share of the blame. Few people, however, got to the gist—that the grid had been designed and managed for one purpose (reliability) but used for another (economy).
How did it happen? Perhaps human brains are poorly tuned for weighing the odds of rare but serious events. A person who will carry an umbrella when there’s a 30 percent chance of rain will neglect to schedule a medical test for a cancer that occurs in 1 percent of the population. Maybe that is why so many countries have starved their electric grids of the investment needed to ward off catastrophic outages. How many people are willing to put up with rotating outages at the time of day when electricity is most needed?
The original reasoning on infrastructure went as follows: if a power company were to offer its local customers all the power they demanded at all times, relying solely on its own resources, then it would need to stock up on reserve generation and transmission. Yet sometimes even that backup fails—through “normal” accidents—and the power company would thus wish to have recourse to the backup reserves of neighboring power companies. By linking local networks into wider grids, the power industry would be able to pool its reserve capacity.
But there is an inherent trade-off of reliability against revenue because electricity, unlike raw materials, cannot be stored in appreciable quantities. If power companies have to maintain more generation capacity than the local customers can use most of the time, and more than neighboring systems need under normal conditions, this represents a waste of capital. After all, it is costly to build and maintain generating plants to protect against just-in-case scenarios.
It was partly to correct this seeming waste, and partly to encourage investment in more efficient generators, that the U.S. Energy Policy Act of 1992 deregulated the price of power, allowing owners of generating facilities to charge whatever the market would bear for energy. Power wholesalers quickly responded to the incentive, as expected, by buying power where it was temporarily cheap and selling where it was dear. By trading power over interstate—indeed transcontinental—distances, power traders were able to squeeze out every last dollar from their reserves.
There is, however, no such thing as a free lunch. If we insist on making money out of all our fixed capital, all of the time, we will have nothing left to draw on in a pinch. We will have too little reserve capacity. In a deregulated power system, monetary concerns are dominant, inadvertently causing the degradation of reliable power delivery.
Before deregulation, investor-owned utilities could expect a fixed rate of return on investment in infrastructure expansion, in return for their “obligation to serve,” as standard utility commission language put it. Now, having no guaranteed return on investment and saddled with a requirement to open their transmission grids to competitors without much compensation, power companies see too little incentive to strengthen grid resources and set aside reserves. Research funding by utilities and power companies has suffered for the same reasons. Surely this is one reason that so many of the conclusions and recommendations made following the 1994 and 1996 western states outages were so widely ignored in the Northeast. The same has been true of findings from other blackouts around the world.
Perverse incentives have had their effect all along the power-industry food chain. Municipal governments, for instance, are prey to the not-in-my-backyard sentiment: their constituents consider transmission lines and transformers eyesores; some people even see them as health hazards. At the federal level, the reliability of the grid has not received the heightened attention that might have been expected after the 9/11 attacks put new emphasis on the vulnerability of the United States to all disasters. Legislation giving grid regulators more authority languished in Congress from 1998 until this year, when it finally was enacted as part of a comprehensive national energy bill.
That’s a start, but we must do much more to restore incentives needed for a stable power system. Let us begin with transparency. Had neighboring power companies or controllers been able to look straight into the data screens of network operators in Ohio, they would have had time to protect themselves against the threatened surge by building the electrical equivalent of a sandbag levee. But real-time sensors and sophisticated monitoring software either had not been installed or were not operational during the August 2003 disturbance. Fully maintained and operational supervisory control and data acquisition (SCADA) systems provide the basic operating data that support such operational needs.
More sophisticated systems can be deployed to give automated advance warning, to maintain the integrity of the grid between neighboring power companies, and to aid in restoration when needed. The technology for such intelligent warning and restoration systems is available today.
Defensive systems can be put in place to confine the impact of stressed system conditions to local areas, and visualization systems can be implemented to let neighboring grids know of impending outages and of the restoration progress that follows. We have the technology and know-how to monitor and synchronize points in the power grid using global positioning satellites to obtain early warning of potential problems.
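One way to picture satellite-synchronized monitoring is to compare voltage phase angles measured at two distant points against the same time reference. A widening angle between them is one sign of a stressed system. The following Python sketch assumes that approach. The function names, the sample readings, and the 30-degree alarm threshold are all hypothetical, chosen only for illustration.

```python
def angle_difference_deg(angle_a_deg, angle_b_deg):
    """Signed difference a - b, wrapped into the range [-180, 180)."""
    return (angle_a_deg - angle_b_deg + 180.0) % 360.0 - 180.0

def stress_alerts(samples, threshold_deg=30.0):
    """Return timestamps at which two sites' voltage angles diverge too far.

    samples: iterable of (timestamp, angle_site_a_deg, angle_site_b_deg),
    all stamped against the same satellite-derived clock.
    threshold_deg is a hypothetical alarm level, not an industry setting.
    """
    return [t for t, a, b in samples
            if abs(angle_difference_deg(a, b)) > threshold_deg]

# Hypothetical readings from two substations at three instants.
readings = [(0, 10.0, 5.0), (1, 40.0, 5.0), (2, 170.0, -170.0)]
print(stress_alerts(readings))
```

The wrap-around step matters because angles near +180 and -180 degrees are physically close. Without it, the third reading would raise a false alarm.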
Technical standards for grid infrastructure existed, and they were studied deeply in the years before deregulation began. For instance, standards called for at least 6 or 7 percent of peak generating capacity to be held in reserve as spare capacity so that the red line would never be approached. More transmission capacity also was to be held in reserve, and transmission equipment was to be operated with spare capacity. Event analyses of the wide-area cascading outages seen in recent years show that power systems are more stressed than in the past and that capacity reserves are inadequate. Also, parts of the interconnected systems have been operated outside of limits, and prudent reliability measures have been compromised, making grids more sensitive to any contingency.
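The reserve standard can be read as a simple ratio. The sketch below assumes that reserve is measured as the share of installed generating capacity left unused at peak load. That definition, the function names, and the sample figures are assumptions made for illustration, and planners may define the margin differently.

```python
def reserve_fraction(capacity_mw, peak_load_mw):
    """Share of installed capacity still unused at the moment of peak load."""
    return (capacity_mw - peak_load_mw) / capacity_mw

def meets_reserve_standard(capacity_mw, peak_load_mw, required=0.06):
    """True if spare capacity is at least the required share (6 percent here)."""
    return reserve_fraction(capacity_mw, peak_load_mw) >= required

# Hypothetical system with 10 000 MW installed.
print(meets_reserve_standard(10_000, 9_300))  # 7 percent held in reserve
print(meets_reserve_standard(10_000, 9_600))  # 4 percent held in reserve
```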
Perrow points out that accidents will happen. As blackouts are caused by a sequence of low-probability accidents, it may not be possible to completely avoid outages on parts of the grid. There is no reason, however, for outages to extend to large areas with the tremendous impact seen in the 2003 events in North America and Europe. What is more, it should be possible to restore electricity to affected customers faster, without imposing rolling blackouts.
To stop disturbance propagation and speed up restoration, many other steps ought to be taken, in instrumentation and power electronics. Today, too many nodes in the network depend on electromechanical switches and instruments installed decades ago, which cannot be reprogrammed when a system falters. They also cannot be programmed to restore the system after a crash, segment by segment. We should move to the latest microprocessor-based switches, which act much more quickly to isolate equipment and, like the compartments in a submarine, prevent the entire vessel from being flooded.
Another benefit of modern power electronics is the relative ease of swapping new modules for broken ones, which affords great flexibility in maintenance. There is also much that can be done to improve operator standards. We should establish standards for the screening and training of personnel that match or exceed those of any other mission-critical profession, such as pilots, air-traffic controllers, and nuclear power plant managers. We must not only impart technical skills but also provide mechanisms for operators to make decisions without fear of repercussions from within their corporations, the media, or external supervisors.
From recent blackouts, we also have learned that designs and applications should prohibit automated systems from unnecessarily disconnecting equipment, exacerbating disturbances.
Some of the key elements for responsive restoration are:
- Well-defined procedures that require overall coordination within the restoring area, as well as with neighboring electric networks.
- Reliable and efficient restoration-software tools that can significantly aid operators and area coordinators to execute operating procedures and to make proper decisions.
- Regular training sessions to assure effectiveness of the process. The sessions should include practice-drill scenarios incorporating all requirements for regional reliability and governmental policy.
We have technology in hand today that would allow us to restore power in any system much faster than we did in 2003: in a few hours, at least for most human-caused, wide-area cascading events. One problem, in 2003 and today, is that too many generators trip off line after just a small deviation from normal conditions. Another is that many units lack what is called black-start capability, the ability to start up and go on line without drawing power from the grid or requiring elaborate preparation. Other units take still longer to return: nuclear power plants (for reasons of security) and steam turbines (because of their allowable ramp-up rates). Then again, some local features designed to guard against local outages, like the multiple lines of supply to Manhattan, end up making it hard to restore power after a general blackout.
We must also inform the public of the standards the power companies are held responsible for. People have a working knowledge of the reliability of their local road systems, and that lets them plan their commutes. Yet even the most sophisticated consumers of electricity, such as manufacturers, do not have an accurate understanding of the reliability of their power supply. We must work to model our system more effectively, to estimate the chance of local and regional blackouts and brownouts more accurately. We should invest in developing on-the-fly systems to model and predict such recovery periods, so that we can tell consumers what to expect, enabling them to plan their own operations accordingly.
It is more important than ever to find ways to project transmission and distribution growth, to identify cost-effective solutions, and to determine criteria to guide prudent investment decisions. Among the principal areas to address are:
- Streamlining the right-of-way access for transmission, vegetation management, environmental impact, and recovery of “stranded investments,” like those in nuclear power plants that no longer are profitable.
- Asset management and maintenance—we should design the system to facilitate maintenance and assure the dependability of the constituent elements.
- The price of reliability—that is, the costs and risks that transmission owners and customers are willing to assume. The power industry is accustomed to optimizing investments and evaluating return on expenditures based primarily on financial aspects of trading energy and serving load within certain reliability criteria.
- The scope of transmission planning and decision making—large regional geographic areas should be included; true beneficiaries should be identified, and allocation of costs determined.
- Distributed generation—“local” or “micro” generation should be used where appropriate, as well as longer-distance transmission.
The challenges are immense, and they would be hard to meet even if the world stood still. A recent report by the U.S. Department of Energy had demand for transmission infrastructure in the United States growing by 45 percent in the next 20 years, if only to service the estimated 1300 to 1900 new generating plants that are scheduled to be built in that period. We still lack any effective strategy to carry energy from those plants to users.
The findings from investigations of major disturbances around the world highlight the need for cross-regional grid reforms, so that the best available technology is promptly put to use, without lengthy delays arising from the legislative or regulatory process. The reform process must recognize squarely that all interconnected systems will experience major blackouts if operated outside intended design limits.
These principal observations are not new findings. However, as end users we must accept the price for reliability or the consequence of unavailability.
About the Authors
Vahid Madani, an IEEE senior member, is a principal protection engineer at Pacific Gas and Electric Co., in San Francisco. He has served as chairman of the Remedial Action Reliability Subcommittee of the Western Electricity Coordinating Council. Damir Novosel is division president of the international consulting firm KEMA T&D, in Raleigh, N.C.
To Probe Further
For a more extensive treatment of these authors’ views, see “Taming the Power Grid,” IEEE Spectrum Online. See also V. Madani, D. Novosel, A. Apostolov, S. Corsi, “Innovative Solutions for Preventing Wide Area Cascading Propagation,” IREP Symposium, Cortina d’Ampezzo, Italy, August 2004 at http://www.polito.it/irep2004. The official assessments of the 2003 U.S. and Italian blackouts are at https://reports.energy.gov and http://www.ucte.org/pdf/News/20040427_UCTE_IC_Final_report.pdf.