NASA's Big Push for the Space Station
But glitches persist, and with no time for delay, 'workarounds' are the accepted procedure
To the relief of space workers in Russia and the United States, Russia's critical Service Module and its life-support hardware were linked successfully to the International Space Station on 26 July. Within two weeks, a robot supply ship docked to the still-unoccupied station, followed on 12 September by a week-long unpacking and outfitting expedition by the seven astronauts and cosmonauts sent up to ready the station for its permanent crew.
Now it is NASA's turn on center stage. Using its space shuttle, it must deliver a series of three critical missions, designated 3A, 4A, and 5A, to the now-livable station [see "Space station missions"]. In parallel with these deliveries, the first permanent crew is due to be sent to the station [Fig. 1, above] aboard a Russian Soyuz spacecraft by early this month.
As with any highly complex project, different kinds of problems continue to pop up even this late in the game. But NASA managers are taking them in stride. "In a program this big, hardly a week goes by that we don't have a situation we need to work on and make sure that everything is going to be OK," explained Tommy Holloway, NASA's space station program manager, at a press conference in Houston on 26 September. "At this time, we don't have anything we would classify as a show-stopper."
Others are less upbeat. For example, even optimists admit the development of space station software has been difficult. Software glitches in the hundreds have cropped up, so much so that pessimists believe delaying the NASA missions should be considered. If NASA presses on, a source told IEEE Spectrum, "we may very well spend a lot of time putting our fingers in an ever-increasing number of holes in the dike."
One problem can lead to another. Inadequate software has led to late delivery of training simulators for both the ground teams and the astronauts. Without the software, NASA's plan to test the space station hardware as effectively as possible on the ground could not be fully carried out in time for the planned launches. Such testing would have revealed many potential problems in the equipment soon enough to fix them, circumvent them, or learn to live with them.
Of course, the space station itself was designed to have backup systems, margins, and reserves for unexpected stresses. Above all, NASA wanted the crew to have the flexibility to work around last-minute problems manually. That last feature is one of the often advertised key benefits of a manned space vehicle. But though NASA planners are counting on trained crews and work-around procedures to overcome known deficiencies in space station hardware and software, both those crews and those procedures are less prepared for the expected challenges than at any other time in the history of U.S. manned space flight.
"We launched Space Shuttle when we were 90 percent ready," one veteran space worker told Spectrum, "but we're launching Space Station at only 50 percent."
While concerned about the risks, most space experts who talked privately with Spectrum agreed that further waiting would be unlikely to reduce risks; it was time to do a shakedown under real flight conditions. Because of the orbital stability of the space station and the presence of the fully functioning Russian modules, the actual threat of vehicle or crew loss due to hardware and software problems is lower than for any previous U.S. manned space mission. Frequent problems, including major failures, are expected during the next few months of assembling the station, but experts believe it is an ordeal that must be overcome.
The space station thus far
NASA's upcoming missions to the International Space Station will add to the three modules already in orbit [Fig. 1, top of page]. The Russian-built, U.S.-financed FGB (Funktsionalno-Gruzovoy Blok in Russian, or Functional Cargo Block), code-named Zarya, was launched in November 1998. It provides power from its solar arrays and propulsion for orbital boosts. Zarya has been serviced and stocked with supplies by three subsequent visiting space shuttle missions.
The U.S. Node module, code-named Unity, carried up on a space shuttle mission in December 1998, has six attachment ports (one at each end, four around the waist) for mating with other sections. The Russian Service Module (SM), code-named Zvezda and launched last July, provides life support for crew members who will remain permanently aboard the complex, beginning early in November.
What cropped up
The most serious problems in this stage of the project include:
- Electric power shortages caused by a last-minute need to run heaters to warm electronic components vulnerable to cold (Mission 3A).
- Concerns over electric shock danger to space-walking astronauts from failure of a device to equalize voltages between the station and the surrounding space plasma (Mission 4A).
- Anxiety over unexpected degradation of crucial optical-fiber data lines (Mission 5A).
- Frustration with very late and error-filled control software (also Mission 5A).
- Concern over inadequately trained personnel, both on the ground in Mission Control and in space, and over poorly checked flight procedures (all missions).
For each hardware or software problem, NASA and its prime space station contractor, Boeing Co., Seattle, Wash., reported they had developed backup techniques that they expected would allow the station and its crew to function safely and effectively, or to recover smoothly. But Boeing and NASA are finding that a "fix" can exact its price.
Some unintended consequences have occurred, for instance, because of a fix made for the control moment gyroscopes (CMGs). Key components, CMGs are needed to orient the space station--that is, to control its attitude in space. By using electric motors to vary the speed and orientation of massive flywheels, control computers can cause the station to turn in desired directions without the use of steering rockets. Four of these CMGs are included in the Mission 3A package launched in mid-October that brought up two space station modules [Fig. 2]. But they will not be activated until the full computer suite arrives on Mission 5A.
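The idea behind the CMGs, trading flywheel spin for station rotation, can be sketched as a toy conservation-of-angular-momentum calculation. The moments of inertia and speeds below are invented for illustration, not actual station or CMG figures:

```python
import math

# Toy momentum-exchange sketch: spinning a flywheel one way turns the
# station the other way. All numbers below are invented for illustration.

def station_rate_change(i_wheel, d_omega_wheel, i_station):
    """Station angular-rate change when a flywheel changes speed.

    Conservation of angular momentum:
        I_station * dw_station = -I_wheel * dw_wheel
    """
    return -i_wheel * d_omega_wheel / i_station

# Hypothetical values: a 100 kg-m^2 flywheel spun up by 600 rpm against
# a station with a 10-million kg-m^2 moment of inertia.
d_omega_wheel = 600 * 2 * math.pi / 60          # rpm -> rad/s
d_omega_station = station_rate_change(100.0, d_omega_wheel, 1.0e7)
# The station turns slowly in the opposite sense, with no propellant spent.
```

In flight, the control computers run this exchange in reverse: they compute the wheel-speed and orientation changes needed to produce a commanded station rotation.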
During low-temperature testing on similar hardware earlier this year, a sensor that measures the flywheel spin rate failed. Engineers concluded that the epoxy mounting of the Hall-effect sensor had cracked. Replacing these components on the flight hardware might have taken more than a year, because an improved version would have to be developed and tested, and a delay that long was unacceptable.
Instead, NASA opted to raise the operating range of thermostats controlling heaters in the CMG units. This would keep the hardware much warmer than the fatal level in the initial tests. (Later testing indicated that the failed unit had been much more susceptible to thermal stress damage than were higher-quality flight-qualified units manufactured subsequently.)
The electric power budget for the station in the "post-3A" interval was already extremely tight. The newly added U.S. hardware would thus have to depend on the solar arrays of the Russian segments for all its power. It turned out that if NASA allocated enough power to run the heaters to keep the CMGs safe, there would not be enough power to heat the U.S. Node module. Unheated, that module would be exposed to water condensation from the air when astronauts were inside it, a situation that could damage electrical components.
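The tradeoff comes down to simple budget arithmetic: once the CMG heaters are funded from a fixed Russian-segment supply, the Node heaters no longer fit. A toy sketch of that priority-ordered allocation, with all wattages invented for illustration rather than taken from actual station figures:

```python
# Toy power-budget sketch of the tradeoff described above; the wattages
# are invented for illustration, not actual station figures.

def allocate(supply_w, loads):
    """Fund loads in priority order; return what gets powered and the margin."""
    powered, remaining = [], supply_w
    for name, draw_w in loads:
        if draw_w <= remaining:
            powered.append(name)
            remaining -= draw_w
    return powered, remaining

loads = [
    ("Service Module life support", 2000),
    ("CMG survival heaters", 600),   # the newly added requirement
    ("Node heaters", 500),
]
powered, margin_w = allocate(2800, loads)
# With the CMG heaters ranked ahead of it, the Node drops off the list,
# so the module stays sealed until Mission 4A adds U.S. solar arrays.
```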
As a consequence, when the permanent astronaut crew arrives this month, they will find the U.S. Node, Unity, sealed off. Replacing it will be the more cramped, though still adequate, living quarters in the two Russian modules, the FGB Zarya and the Service Module Zvezda.
Playing down the impact of closing the U.S. Node, NASA experts admitted that power had already been too tight. "Even before the CMG problem, we probably didn't have the power to keep the Node open," astronaut Robert Cabana, now director of space station operations, told Spectrum.
Still, adding the CMG heater requirements makes tight power budgets even worse ("severe power concerns" is how one NASA expert described it). If there are further failures of the recently replaced nickel-cadmium batteries in the FGB and the Service Module, there could be brownouts throughout the station. Flight experience has shown that a high failure rate can be expected, and, in fact, a second Service Module battery failed shortly before Mission 3A was launched. It might even become necessary for the crew to power down the FGB and retreat into the Service Module until Mission 4A brings up U.S. photovoltaic arrays that will fix the power deficit.
"We will need to have good FGB batteries operating and spares when the crew arrives," a NASA planning memo stressed. But extra batteries have been needed before.
"One of the biggest challenges with the FGB is keeping the batteries functional," a top space station operations official told Spectrum earlier this year. "NASA thinks the failure rate is high, but during [the Shuttle-Mir missions in 1996-98] we delivered twenty-six or twenty-seven 800-A batteries and 15 [power controller] modules to Mir, so the battery failures shouldn't be as much a surprise as they are. It's just that the people who are working the related issues within the space station are not the same people who worked the issues on [Shuttle-Mir]."
Meanwhile, the CMG-induced solutions continue to ripple into new problems with their own costs. A NASA Mission Control specialist told Spectrum that one of the conditions for tolerating the notorious "noise problem" in the Russian segment (noise levels that could reach physically damaging ranges) was that "the crew would always be able to retreat to the Node and get away from it." But the solution to one problem invalidated a previous solution to another problem. Now, at least for the near term, the Node would be unavailable.
Faulty fix leads to shocking hazard
Space vehicle designers generally have preferred to use power systems that operate in the 24- to 28-V range. But because the space station would use much more power and would be so much larger physically, NASA engineers decided to operate the buses from the solar arrays at a heftier 130 to 180 V. This option reduced current, wiring, component weight, and, ultimately, cost.
But designers soon realized they had created a new problem for the station. Orbiting Earth, the station flies at 8 km/s through a very thin plasma, not a total vacuum. Since the power system was grounded to the structure, the outer skin of the station developed a relative potential of 100 to 160 V and up with the surrounding plasma. According to a NASA training manual on the station's power system, "40 to 60 Vdc has been observed to be the minimum required for arcing." With its outer skin at 100 to 160 V, the station would be way over that threshold.
Such a high voltage would have resulted in continuous "mini-arcing" across the skin of the station. It would damage the station's thermal coating as well as viewing ports, and would degrade the solar cells. More dangerously, it would be a shock hazard to crewmen on space walks.
To resolve this unacceptable situation, NASA built a Plasma Contactor Unit (PCU) to ground the station to the plasma by shooting a stream of ions into space [Fig. 3]. A hollow-cathode emitter fires ionized xenon; each emitter operates at 435 °C and is surrounded by a protective metal cage to prevent contact with space-walking astronauts. Two of these 170-kg units will be carried on Mission 3A, so that a backup is available if one fails, but they will be needed only when the high-voltage solar arrays from Mission 4A go into operation.
Methods of controlling the buildup of charge other than the PCU were investigated, a NASA reference manual states, "but would require major Electrical Power System redesign." Building the PCUs was "the most cost effective method"--or so it seemed at first.
Recently a NASA safety team reviewing this solution discovered some faulty thinking. What would happen, they asked, if the operating PCU were to break while astronauts were outside the space station doing assembly or repair work? Analysis revealed a shocking answer: without a functioning PCU, the charge differential would exceed the arcing threshold within seconds. A space-suited astronaut would act like a lightning rod and could experience a potentially lethal induced current of up to a full ampere through the suit and into his body.
To repair the faulty "fix," another tiger team of specialists looked for procedural workarounds that would prevent such a hazardous situation from developing. The plan they developed would require that both PCUs be turned on before every space walk; NASA engineers verified that this was technically feasible. Then, if one PCU broke, an emergency procedure would shunt the high-voltage solar arrays off-line and turn them edge on into the space plasma flow, reducing the voltage differential below the arcing threshold. The crew would re-enter the station as quickly as possible.
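The workaround amounts to a go/no-go rule plus a canned emergency sequence. A hypothetical sketch of that logic, with function names and steps paraphrased from the description above rather than taken from actual NASA flight rules:

```python
# Hypothetical sketch of the EVA workaround described above; names and
# steps paraphrase the article, not actual NASA flight rules.

def eva_go_nogo(pcu_a_on, pcu_b_on):
    """Both plasma contactor units must be running before a space walk."""
    return pcu_a_on and pcu_b_on

def pcu_failure_response():
    """Emergency sequence if one PCU fails while the crew is outside."""
    return [
        "shunt the high-voltage solar arrays off-line",
        "turn the arrays edge-on to the space plasma flow",
        "crew re-enters the station as quickly as possible",
    ]

assert eva_go_nogo(True, True)       # nominal: space walk may begin
assert not eva_go_nogo(True, False)  # one contactor down: no-go
```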
But engineers discovered another problem, which kept this workaround from working--any breakdown of the PCU was not automatically "annunciated" to the station's computer or to Earth. Major software changes would be needed to add such a capability.
And there was another catch. The standard procedure for failure of a PCU was a space walk to replace it with a spare. This work-around plan required both PCUs to be functional before allowing the space walk to begin, a Catch-22 in orbit.
But the planners had a second set of procedures. For a space walk to replace a failed PCU, the solar arrays would be taken off-line before the crew exited the station. The station would rely on battery power and the 28-V solar arrays on the Russian modules until the astronauts got the second PCU installed. They then would have to re-enter the station while the new unit went through a days-long conditioning process to activate its hollow-cathode emitter.
These plans have been accepted as the means for getting by with the deficiencies in the existing voltage differential mitigation hardware. They will take effect after Mission 4A, when the first U.S. solar arrays come on-line.
Outgassing affects optical-fiber cables
Early this year, word began spreading of serious problems with the optical-fiber cables within the U.S. Lab, the keystone of the space station, to be delivered on Mission 5A. The fiber system is one of three separate communications systems that transmit payload data within the module. The high-rate (100-Mb/s) fiber lines route the science data to downlink antennas or to other payload instruments. A low-speed line allows for control and monitoring functions by the crew, and a medium rate Ethernet line lets the crew monitor selected subsets of real-time data.
Trouble started when the optical-fiber lines began degrading, becoming more brittle due to the damaging effects of insulation outgassing, one expert informed Spectrum. Another payload engineer told Spectrum the flaws were caused by etching prior to application of a carbon coating necessary to bond to the inner sheath. (It was found that just coiling the fiber during the sheathing process would also break it.) Boeing space station officials declined to comment when approached by Spectrum, and NASA would not provide the identity of the vendor.
To help solve the problem, Boeing is working on ways to replace damaged fiber bundles with new ones in orbit. Doing the job with the Lab still on the ground would take too long and cost too much: the existing fiber would have to be ripped out, replaced, and run through NASA test procedures, Spectrum was told. Another expert reported that the time to replace and recheck all the fibers was estimated at 18 months. Instead, only those that actually break during the stressful rocket ride into orbit will be replaced. Replacing some of the Lab fiber bundles, however, involves depressurizing the Lab, another potential hurdle.
Initial rumors regarding the fiber optics were overblown and the problem has been resolved, NASA officials insist. "We replaced some cables that we found had broken," NASA's Tommy Holloway told Spectrum. "They had been abused, bent too much, beat up during installation."
All the lines have now been thoroughly tested, Holloway reported, and are fully functional. He confirmed that there had been "concern" over unexpected degradation in the lines during fabrication. "We've recreated the process of building them, and during sheath application some chemical exposure reduced the tensile strength of the lines," he said. "There was no impact on transmissivity. We're now confident it's not going to deteriorate any more, so we're ready to launch."
Glitches in the distributed computer system
The computer architecture for the U.S. side of the space station involves a multi-tiered array of distributed computing functions, with crew interfaces provided through plug-in laptops. One set of laptops (IBM 760 Thinkpads) supplies the crew with displays of station status, and relays their commands to the main computer network. A second set gives them access to the library of crew procedures and schedules, the inventory management system, and word processing.
When Mission 5A carries the U.S. Lab up to the space station, it will bring more than a dozen computers (based on an Intel 386 SX chip) and about one million lines of code. The computers control practically everything except locking the hatches and flushing the toilets. They automatically run the power generation and distribution system, the station's attitude control in space, its environmental control systems, and communications management. At the same time, they monitor the status and safety of thousands of components.
Consequently, the threat from software errors can be significant, because it is not merely a matter of the system not doing what it's commanded, or not exchanging required information--sometimes it can do things it was never ordered to do.
In one error discovered earlier this year, the corruption of two adjacent flags (bits in a status word) would command an air valve to open while locking out the "valve close" command; only a power cycle could reset the system and prevent all the air from leaking out. In another dangerous flaw, a pressure-monitoring routine failed to recognize a normal airlock depressurization for a space walk. Assuming the whole station was leaking, it issued orders to dump emergency oxygen into the modules.
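The first hazard is easy to reproduce in miniature. In the sketch below, the bit positions and flag names are hypothetical, but the mechanism is the one described: flipping two adjacent bits in a status word both commands the valve open and inhibits the close command:

```python
# Hypothetical status-word layout; not the actual flight software.
VALVE_OPEN_CMD = 0b01   # bit 0: commands the air valve open
CLOSE_LOCKOUT  = 0b10   # bit 1: inhibits the "valve close" command

def valve_state(status_word):
    """Decode the two flags from a status word."""
    return {
        "open_commanded": bool(status_word & VALVE_OPEN_CMD),
        "close_locked_out": bool(status_word & CLOSE_LOCKOUT),
    }

nominal = 0b00               # valve closed, close command available
corrupted = nominal ^ 0b11   # two adjacent bits flipped by corruption
state = valve_state(corrupted)
# Now the valve is commanded open AND the close command is locked out;
# as described above, only a power cycle could reset such a system.
```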
The laptops themselves were not immune to problems. Although "hit any key" was supposed to wake up the laptops from quiescent mode, hitting the space bar actually froze the display and required re-booting. "There are so many software discrepancies, it's a challenge to list them all and make sure they're in the procedures documents," an astronaut expert on station software told Spectrum. Flight crews, he added, were concerned with the large number of software workarounds and limited testing in high-fidelity environments.
Schedule-obsessed management, one worker told Spectrum, "just doesn't understand that the hundreds of errors found all the time will impact us later. Every time Boeing/NASA won't fix a bug, they write a 'station program note,' or SPN, and for 5A we have hundreds of these things. There is no way to learn all these, and I know somebody will screw up because of it. It is hard because the crew needs to have all these [written] in procedures and it is killing us to write all these workarounds." He added that many of the bugs were interconnected and could lead to additional flaws.
Workers have also told Spectrum that unrealistically early "freeze dates" for software created problems when the need for fixes became apparent. First, there were bureaucratic obstacles to modifying prematurely frozen software. Second, the integrated testing on flight hardware had been done too soon and gave little confidence that highly modified versions of the software would still work properly. "The chance to check out the real software with the real hardware [before it is launched] has been lost," one worker explained.
Premature integrated tests
The distributed nature of the system requires integrated testing, one 20-year NASA veteran told Spectrum. "Once the integrated test is complete, a change to code in any single box invalidates all the code interfaces," he said. Yet the best that could be done was to test only the parts of the code that had been directly modified, even though there were at least 700 major changes in the U.S. Lab software.
NASA project officials remain optimistic. "The software is quite mature and is being exercised with the hardware on a near-continuous basis," former astronaut and now Boeing executive Brewster Shaw told Boeing employees earlier this year. "The problem report rate is significantly reduced," he said. (Before and after figures were not disclosed.) And another expert told Spectrum that "while the software is not perfect by any means and the sheer number of workarounds and software deficiencies is mind-numbing, it should not delay the launch of the U.S. Lab."
Added another expert: "In a program this size, it's just a fact of life it's going to go up imperfect." Software problems will be fixed through patchwork upgrades over a period of at least a year. The first crews have trained with long lists of existing flaws and with advisories about what processes probably will not work right.
The delivery of error-filled control code was also late, a disadvantage that has had an impact on much more than just the space station. In a NASA status report last August, the preparation of the mission operations team for Mission 5A was listed as RED, in the words of the report, a cause for mandatory repairs. The low rating was given because of a "cumulative risk of flight controller certification threat [each flight controller must pass a strictly defined training plan involving complex simulations and study of software documentation]." Adding to that was "the lack of an adequate multi-segment simulator for flight controller training and verification of procedures requiring a high level of integration."
For more than 20 years, NASA flight controllers have trained effectively for shuttle and Spacelab missions using sophisticated computer models that feed simulated spaceflight data to the real Mission Control Center consoles in Houston. The simulations are directed by training personnel who script various contingencies and failures to test the flight team's ability to detect, recognize, and react to them properly.
Flight controllers (and this author was one of them for many years) subscribe to the old Chinese maxim that "the more you sweat in peace, the less you bleed in war." The Russians have a similar proverb, attributed to their general who beat Napoleon: "Training is hard so fighting is easy."
But in describing attempts to hook a space center simulation into the existing setup, more than one NASA veteran has been extremely uncomplimentary.
"The [space station] software is extremely 'brittle'," one expert explained, "and it's hard to tell if it's real, or just a problem with the simulator." In space station stand-alone simulations, he said, the training team "can often put in just one well-placed malfunction at the start, and watch the whole station fall apart while the flight control team flails [NASA jargon for struggling ineffectively]." In fact, some teams never figure out how to solve the simulated problem. Because of that track record, during joint shuttle-station simulations, "the team is often scared to put in any malfunction at all" because a failed run wastes the whole day in useless dithering.
"This has had predictable effects," the expert added, "on flight controller proficiency and procedures verification." Astronauts not only carry massive reference books into space on logistics missions, but also make their own notes on such handy surfaces as their hands [Fig. 4].
Yet, as week followed week in September and October, Spectrum sources began to report a steady day-to-day improvement in the simulation and flight software, as well as slow progress in certifying both the workforce and the workarounds on which the success of this effort will depend.
In large part, NASA's decision officially to fly "with risk," in the words of one official report, is justified because of the nature of the missions. Unlike practically all previous NASA manned space missions, this one involves no life-or-death short-term urgent decision points. The space station will be in a stable orbit, giving engineers time to recover from system crashes and even design flaws--and the three-man first flight crew has bluntly predicted they will spend 80 percent of their time repairing system shortcomings. Reassuringly, the U.S. segments are attached to fully functioning Russian modules so the ultimate down mode for any software crisis is to revert to Russian-only control for the hours or days (or longer) it takes to recover.
This approach has to work, space workers believe. After complaining about Russian delays for more than two years, NASA seems absolutely committed to launching its three big sections on Missions 3A, 4A, and 5A at the promised times. "So right now we are still very optimistic that the schedule we have in front of us is makeable," Holloway concluded at the September press conference.
"And if it's not, it's a matter of days, not months. Things are OK," Holloway promised, as he rushed from the conference to confront another last-minute crisis.
About the Author
James Oberg is a 22-year veteran of NASA Mission Control in Houston and now a writer and consultant. His December 1999 article in IEEE Spectrum, "Why the Mars probe went off course," recently received the silver medal for news articles in an annual competition sponsored by the American Society of Business Press Editors in Washington, D.C.
To Probe Further
Articles about the International Space Station and the NASA missions are available at the NASA Web site.
An informative site about the space station is maintained by The Boeing Co.