Space Station: Internal NASA Reports Explain Origins of June Computer Crisis
A mistake like that on the way to Mars would be fatal
4 October 2007--Aboard the International Space Station, the three Russian computers that control the station's orientation have been humming away happily for several weeks now. That is proof that the June crisis, which crippled the ISS and bloodied the U.S.-Russian partnership that supports it, has been solved.
But the technological--and diplomatic--lessons of that crisis need to be fully understood and appreciated. Because if the failure had occurred on the way to Mars, say, it probably would have been fatal, and it will likely be the same international partnership that builds the hardware for a future Mars mission.
The critical computer systems, it turned out, had been designed, built, and operated incorrectly--and the failure was inevitable. Only the station's relative closeness to Earth, within range of resupply and support missions, saved the spacecraft from catastrophe.
During the first days of the computer failure in June, the station's atmosphere control system seized up. The failure also knocked out the autopilot's ability to fire maneuvering thrusters to hold the station steady during the undocking of the space shuttle, which had arrived on 10 June. The terse description in the NASA internal technical report on the crisis, obtained by IEEE Spectrum, put it this way: “On 13 June, a complete shutdown of secondary power to all [three] central computer and terminal computer channels occurred, resulting in the loss of capability to control ISS Russian segment systems.”
Russian officials were quick to blame NASA for “zapping their computers” with “dirty” 28-volt power from a newly installed solar power wing. Another Russian explanation was that the expanded station structure (the main purpose of the shuttle visit) might be excessively charging up due to its orbital speed through Earth's magnetic field. These were the first of many bad guesses by top Russian program managers that would distract engineers trying to get at the real problem.
The initial assumption was that some external interference, such as noise on the power supply, was responsible for generating false commands inside the computer system. On the assumption that the bad commands were coming from inside a power-monitoring device, the crew bypassed it on two of the three downed computers, using jumper cables. By the time the shuttle undocked on 19 June, the computers had begun to function normally--or so it seemed. Replacement parts were quickly manifested on a robot supply ship, while ground engineers wrestled with the fundamental question of cause and effect.
Analysis teams still had to determine why the computers failed, and why the jumper cables seemed to fix the problem. More important, they needed to know whether the problem really was fixed, or whether something could again trigger the systemwide crash of the supposedly triply redundant architecture.
In the weeks that followed the crisis and apparent recovery, station commander Fyodor Yurchikhin and his fellow cosmonaut Oleg Kotov disassembled the boxes and cabling and inspected every angle of the hardware, occasionally assisted by their American crewmate, Clayton Anderson. Multiple scopes and probes had failed to find the flaw, but their eyes and fingers eventually did.
The connection pins from the power-monitoring device they'd bypassed earlier, they found, were wet--and corroded. The final report described the “change in appearance” of fasteners on one box's connectors and noted “the presence of deposits and residue on the housings, and residue and spots on the contact surfaces.”
Continuity checks found that specific wires, called command lines, in the cable coming out of the device had failed, and one of those lines had short-circuited. Also, in a shocking design flaw, a single “power off” command line led to all three of the supposedly redundant processing units. The line was designed to protect the main computers, which are downstream of the power monitor, from power glitches too great for normal power filters to handle. It does so by turning the computers off when the monitor senses trouble. But in a failure unanticipated by its designers, this one command path was itself able to kill all three processing units because of a single corrosion-induced short.
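The shared "power off" path can be illustrated with a small model. This is only a sketch under stated assumptions: the class, function, and field names are hypothetical, and the logic is a deliberate simplification of the behavior described in the report, not the actual station avionics.

```python
from dataclasses import dataclass


@dataclass
class PowerMonitor:
    """Hypothetical stand-in for the power-monitoring device."""
    surge_detected: bool = False  # a genuine power glitch on the supply
    line_shorted: bool = False    # corrosion-induced short on the command line
    bypassed: bool = False        # jumper cable installed around the device

    def power_off_asserted(self) -> bool:
        # With the jumper in place the command line is held dormant,
        # so neither a real surge nor a short can trip it.
        if self.bypassed:
            return False
        return self.surge_detected or self.line_shorted


def running_units(monitor: PowerMonitor, n_units: int = 3) -> int:
    """Count the processing units still running.

    Assumption: every unit listens to the same command line, so
    redundancy in the units does not help if that one line misfires.
    """
    return 0 if monitor.power_off_asserted() else n_units


if __name__ == "__main__":
    print(running_units(PowerMonitor()))                                   # healthy system
    print(running_units(PowerMonitor(line_shorted=True)))                  # one short, all units down
    print(running_units(PowerMonitor(line_shorted=True, bypassed=True)))   # jumper masks the short
    print(running_units(PowerMonitor(surge_detected=True, bypassed=True))) # but surges go unnoticed
```

In this toy model the number of processing units never matters once the shared line is asserted, which is the sense in which the architecture was only "supposedly" triply redundant.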
That discovery was a great relief to spacecraft controllers in Houston and Moscow. The bypass jumper cables were exactly what was needed to circumvent the false “power off” command, because they forced that command line to remain dormant. Using the cables did expose the computers to damage from real power surges, but by then the power system had settled into a benign and steady state.
But what caused the corrosion? The source was quickly identified: water condensation, one of the most frequent culprits in avionics problems. The NASA report says the damage “presumably” was “the result of repeated emissions of condensate from the air separation lines” of a nearby dehumidifier. Air flow and power usage were supposed to keep the computer cables warm enough to prevent water from condensing on them, but the dehumidifier had been malfunctioning, and its frequent on-off cycles led to surges of water vapor. Also, a stream of cold air from another location on the dehumidifier helped drive the cable temperatures occasionally below the dew point.
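The dew-point condition described above can be made concrete with a short calculation. The sketch below uses the Magnus approximation, a standard textbook formula rather than anything taken from the NASA report; the function names and the temperatures and humidity in the usage lines are illustrative assumptions, not measurements from the station.

```python
import math


def dew_point_c(air_temp_c: float, relative_humidity_pct: float) -> float:
    """Approximate dew point in degrees Celsius (Magnus formula).

    The coefficients are one commonly used pair for water vapor over
    liquid water; other sources quote slightly different values.
    """
    a, b = 17.62, 243.12
    gamma = math.log(relative_humidity_pct / 100.0) + a * air_temp_c / (b + air_temp_c)
    return b * gamma / (a - gamma)


def condensation_risk(surface_temp_c: float, air_temp_c: float,
                      relative_humidity_pct: float) -> bool:
    """True if a surface is at or below the dew point of the surrounding air."""
    return surface_temp_c <= dew_point_c(air_temp_c, relative_humidity_pct)


if __name__ == "__main__":
    # Illustrative cabin conditions only (assumed, not from the report).
    print(round(dew_point_c(25.0, 50.0), 1))
    print(condensation_risk(surface_temp_c=10.0, air_temp_c=25.0,
                            relative_humidity_pct=50.0))
```

A cable chilled by a stream of cold air can fall below this threshold even when the cabin air itself is comfortable, and a surge of water vapor raises the threshold further.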
During the August shuttle visit, the Russians were able to turn stabilization control over to the American spaceship and tear down their old computer network. The boxes and cables were replaced with fresh units, built and supplied by the European Space Agency and sent up inside a recently launched robot supply ship.
“Upon removal of the old unit, the crew reported that there was cold condensate behind it,” notes an internal NASA ISS status report for 12 August obtained by IEEE Spectrum. “Drops of humidity and mold were discovered. The unit itself is humid.”
To add to their headaches, the cosmonauts discovered that one of the new cables was about 40 centimeters shorter than the one it was supposed to replace--and it wouldn't reach. After careful visual inspection of the original cable, the cosmonauts decided there were no signs of corrosion, so there was no need to replace it. They also decided to rig a thermal barrier out of a surplus reference book and all-purpose gray tape. As a last step, they removed the jumper cables, verified the system was functional, and closed the access panels.
It is dismaying that after decades of experience with manned space stations, Russian space engineers still couldn't keep unwanted condensation at bay. But what's worse is that they designed circuitry that would allow one spot of corrosion to fell a supposedly triply redundant control computer complex. Another cause for dismay is that when trouble did develop, the Russians' first instinct was to blame their American partners. Such deficiencies need to be worked out in the years ahead, on the space station, before both the technology and the diplomacy can be thought reliable enough for far-ranging missions that replacement shipments wouldn't be able to reach.
About the Author
JAMES OBERG, a 22-year veteran of NASA mission control, is now a writer and consultant in Houston.