Did Bad Memory Chips Down Russia’s Mars Probe?
Moscow blames radiation wreckage on an SRAM chip, but does it add up?
16 February 2012—The failure of Russia’s ambitious Phobos-Grunt sample-return probe has been shrouded in confusion and mystery, from the first inklings that something had gone wrong after its 9 November launch all the way to inconsistent reports of where it fell to Earth on 15 January.
What was never mysterious was how important the mission goal was—to land a probe on the Martian moon Phobos and then return soil samples to Earth. It was to have been the flagship mission that vaulted Russia back into prominence in interplanetary exploration after a quarter century of disappointment and delay, but it quickly turned into a heartbreaking debacle. On the heels of a woeful parade of other space failures, the mission cast an ominous shadow over the entire Russian space industry.
The release of the official accident investigation results on 3 February served only to further rumors of fundamental hardware and software design flaws, and of blatant violations of safety standards. The report blames the loss of the probe on memory chips that became fatally damaged by cosmic rays. The probe died so suddenly that it didn’t even send an error message, but investigators concluded the only plausible failure mechanism was the simultaneous disabling of two identical chips in the dual-computer control system, causing both to restart simultaneously. This in turn led to the autopilot going into “safe mode” while maintaining the spacecraft’s orientation to the sun. (That reorientation was observed in the ensuing days as thruster firings disturbed the probe’s orbit.)
Phobos-Grunt was supposed to await further instructions from Earth, but it never received them; in an incredible design oversight, the probe could receive emergency instructions only after a successful departure from parking orbit.
Section 2.3 of the report provides insight into where the computer malfunction that doomed the probe came from: “The most likely factor which caused a ‘double restart’ was a local influence of heavy charged particles from space.” Known as galactic cosmic rays, these particles are the nuclei of heavy atoms moving at near light speed after being spit from the hearts of supernovas. Earth’s magnetosphere and atmosphere provide protection from such radiation at the planet’s surface.
Press reports suggest that investigators thought the chip failures were a result of counterfeit components—lesser circuits labeled with higher performance qualities than they actually had. But the final report does not mention this possibility. Vladimir Popovkin, head of Roskosmos, the Russian space agency, was careful to say in interviews (such as on the radio show “Echo of Moscow” on 2 February) that although chip counterfeiting was a widespread problem, “we cannot say that the chips there were counterfeit.”
The radiation environment of outer space can certainly be hazardous to space vehicles. To assess the credibility of the Russian conclusions, IEEE Spectrum contacted Steven McClure of NASA’s Jet Propulsion Laboratory (JPL), in Pasadena, Calif. McClure is the supervisor of the Radiation Effects Group, which is NASA’s first line of defense against the threat that Roskosmos says the probe fell victim to.
At Spectrum’s request, McClure read a translation of the official Russian report. He immediately recognized the specific component identified in the report as the likely locus of the double-hardware failure: the WS512K32, which is a single-package assembly of SRAM totaling 512 kilobytes. “There are probably four chips in this bi-32 device,” he explains. “They were identified in a report by Joe Benedetto [an industry specialist in radiation hardening] a few years ago as some of the most sensitive parts to single-event latch-up they had ever seen.” Single-event latch-up occurs when a charged particle passing through a semiconductor causes a high current to flow through it. Generally, the device will be stuck in that state until the chip’s power is cut off and turned back on again, but in some cases, the chip may be permanently damaged.
The WS512K32 is “sold on the aircraft market to a military grade—not the space market,” says McClure. He points out that neither the original fabricators nor the commercial vendors test for radiation, and they would not give radiation specs. If this chip had been proposed for a critical component in a space-probe design at JPL, he assured Spectrum, “it would not likely be approved for use.” McClure says that it would be okay for a space mission of a couple of days or for noncritical applications but not for a years-long mission to Mars and back, which would typically “require a probability of failure of less than 1:10 000 [for the] entire mission.”
While the device is known to be susceptible to radiation, McClure is wary of the assertion that it failed so soon after launch. The chances of two such components experiencing two identical but separate errors in 2 hours, he says, are “pretty unlikely.”
Because the failures happened in such a short time frame, McClure suspects that Phobos-Grunt was done in by software. “Most of the times when I support anomaly investigations, it turns out to be a flight-software problem,” he says. “It very often looks like a radiation problem, [but] then they find out that there are just handling exceptions or other conditions they didn’t account for, and that initiates the problem.”
And indeed there are persistent press reports in Moscow of difficulties in the development of the probe’s flight-control software. The software development was particularly challenging, because the reliable computer that controlled the Fregat upper stage, which is used to push the probe out of parking orbit, had to be replaced. Designers decided to keep that stage attached all the way to Mars—a mission hundreds of times as long as any of the Fregat’s previous missions—and use it for orbital insertion around the destination. Because of the change, the Fregat’s flight-control functions were assigned to the probe’s own computer, and special cabling was installed to connect it to the Fregat.
According to postings on the website of the magazine Novosti Kosmonavtiki and the newspaper Moskovsky Komsomolets, there were last-minute discoveries of flaws in the cable routing that required repairs only days before launch. Reportedly, staff were still loading last-minute code patches while the probe was at the launch site.
Roskosmos’s focus on chips is understandable. In recent years, Russian space officials have lamented the failure of the country’s domestic semiconductor industry to keep up with foreign manufacturers. This has compelled the space program—and also the military forces—to acquire so many foreign electronics components that they make up more than 80 percent of the chips in some space systems. (McClure estimates that non-U.S. chips compose only up to 5 percent of components of NASA’s Curiosity rover, now en route to Mars.)
Popovkin told the Novosti news agency on 2 February that the microchips used to fabricate the Phobos-Grunt flight control computer were not checked for radiation hardness. The probe “was designed and built in 2005 to 2006, before taking a decision on the need to test hardware components on the effects of heavy particles,” he told the agency. Yuriy Koptev was even more blunt when speaking to a reporter for the Russian army newspaper Krasnaya Zvezda. In assembling the probe, he said, 95 000 microchips were used, of which 62 percent were not qualified for spaceflight.
Roskosmos has blamed supposedly counterfeit chips for a plague of on-orbit breakdowns. When a Proton booster carrying three GLONASS navigation satellites crashed late in 2010 due to an astounding fueling error, the costly disaster was widely publicized. But a year earlier, the system was victim to a quieter failure when four GLONASS satellites broke down. The newspaper Izvestiya reported on 12 August 2011 that the Prosecutor General’s Office in Moscow concluded that the failure was due to shoddy chips from Taiwan that weren’t radiation hardened. The newspaper’s veteran aerospace journalist Ivan Cheberko quoted Ivan Moiseyev, scientific head of the Institute of Space Policy, admitting that “the components are a fundamental problem” for Russia’s quest to produce long-lived spacecraft “using foreign components—since there is no place for us to get our own.” Concluded Moiseyev: “One can hope that this will work out for us.”
Considering this string of failures, there has been public skepticism of the Phobos-Grunt report’s conclusion that cosmic rays doomed the mission. Nikolai Vedenkin, a researcher at Moscow State University who supervised development of instruments for space use, couldn’t believe the official reasons: “Either the whole system was configured wrong, or it’s simply not true,” he told Novosti on 3 February. Vedenkin was familiar with the confused and rushed preparations, and it was his opinion that “the mistakes snowballed, and the probe was launched when still not ready.”
NASA’s McClure says that a great number of things can go wrong in a mission as complex as Phobos-Grunt, and he didn’t want to criticize or assign blame: “This is difficult stuff we do,” he says. “Even when the best effort is put in to assure mission success, failures do occur.”
About the Author
James Oberg worked as an aerospace engineer at NASA for 22 years. He switched to journalism in the late 1990s and now makes his living reporting on space for such outlets as Popular Science, NBC News, and of course, IEEE Spectrum. In the June 2011 issue, he looked at some of the gutsier (and goofier) shuttle missions that never flew.