In an IEEE Spectrum webcast, James Oberg reports a very important story on the background to the NASA computer failure that occurred in June. Oberg's story states that "The critical computer systems ... had been designed, built, and operated incorrectly - and the failure was inevitable. Only being so relatively close to Earth, in range of resupply and support missions, saved the spacecraft from catastrophe."
The problem was a cable short-circuit caused by moisture build-up, likely itself caused by a malfunctioning dehumidifier. But as Oberg writes, the short-circuit should not have caused the problems it did. "... in a shocking design flaw, there was a 'power off' command leading to all three of the supposedly redundant processing units. The line was designed to protect the main computers, which are downstream of the power monitor, from power glitches too great for normal power filters to protect against. It does so by turning the computers off when it senses trouble. But in a failure unanticipated by its designers, this one command path itself was able to kill all three processing units due to a single corrosion-induced short."
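To make the single-point-of-failure argument concrete, here is a minimal Python sketch of the idea. It is a toy model, not the station's actual design: the names `ProcessingUnit`, `surviving_units`, and the line labels are hypothetical, and the "independent" layout is an assumed alternative for comparison only, not something Oberg's story describes.

```python
from dataclasses import dataclass


@dataclass
class ProcessingUnit:
    """One of the supposedly redundant computers (hypothetical model)."""
    name: str
    powered: bool = True


def make_units():
    """Create three fresh, powered-on units."""
    return [ProcessingUnit(f"unit-{i}") for i in (1, 2, 3)]


def surviving_units(units, kill_lines, shorted_line):
    """Return the names of units still powered after one line shorts.

    kill_lines maps a command-line label to the units that line can
    power off. A short is modeled as a spurious 'power off' command
    on that line.
    """
    for unit in kill_lines.get(shorted_line, []):
        unit.powered = False
    return [u.name for u in units if u.powered]


# Shared design, as described in the quote: one command path
# reaches all three units.
shared = make_units()
shared_lines = {"line-A": shared}
print(surviving_units(shared, shared_lines, "line-A"))
# []

# Assumed alternative for comparison: one command line per unit.
independent = make_units()
independent_lines = {f"line-{u.name}": [u] for u in independent}
print(surviving_units(independent, independent_lines, "line-unit-1"))
# ['unit-2', 'unit-3']
```

In the shared layout a single short removes every unit, so the triple redundancy offers no protection against that fault. In the assumed per-unit layout, the same fault costs one unit and leaves two running.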
As Oberg noted, if this had happened on the way to Mars, it would likely have resulted in the loss of the crew. What's worse was the instinctive reaction of those involved: to assign blame instead of looking for the root cause of the problem or a means to mitigate it.
Everyone interested in risk assessment, communication, and management should read it.