The Virginia Information Technologies Agency (VITA) server problem that caused havoc across some 26 Virginia state agencies for a week last year was apparently aggravated by human error on the part of prime contractor Northrop Grumman, according to a detailed (and reader-unfriendly) audit report of the incident released by the state this week.
The incident started on 25 August 2010 at 1008 EST, when errors were reported on two memory cards in a data storage system (an EMC DMX-3). A decision was made to replace one of the cards, but the engineer who replaced it did so in the wrong sequence. This mistake soon caused database errors to be reported, which in turn led to other remedial actions that ended up causing the loss or corruption of the backup data. It took another two months to fully recover the lost or corrupted data.
Northrop Grumman's lack of adequate risk management processes in regard to continuity management and data recovery also contributed to the length of the outage, the audit report says.
The audit makes 26 recommendations, not all of which Northrop Grumman agrees with, according to an article in Government Technology. The article says that Samuel Abbate, Northrop Grumman vice president for the VITA Infrastructure Partnership, told the publication that:
"Some of the [audit’s] most important recommendations are matters for the Commonwealth’s policy makers to consider, for example establishing a common definition of critical data and determining what protective measures should be undertaken... Also, the report observes and is critical of the fact that typical enterprise approaches are not always applied in the program. These observations may be fundamentally correct, but individual agencies, not VITA or Northrop Grumman, retain the ability to deviate from centralized enterprise standards."
In other words, don't blame only us; blame those other state agency folks, too.
The audit report also states that:
"The dual memory board failure was reported by EMC to be caused by an electrical over stress condition at the component level. The reason for the over stress is not known."
Back in September of last year, VITA CIO Sam Nixon claimed that the EMC server problem experienced was so rare that it happens only once in every billion hours of computing time (or about 114,000 years). Methinks he, and EMC, need to update that stat a bit.
I wonder if belief in that stat seduced Northrop Grumman into not worrying too much about the effectiveness of its continuity management or data recovery plans.
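For readers who want to check the conversion behind the "one in a billion hours" figure, here is a minimal sketch. The function name and the 365-day year are assumptions made for illustration; neither comes from VITA, EMC, or the audit report.

```python
# Convert the quoted "1 in a billion hours" figure into years.
# Assumption: a 365-day year (leap days ignored), which is enough
# to reproduce the rounded "about 114,000 years" in the text.
HOURS_PER_YEAR = 24 * 365


def hours_to_years(hours: float) -> float:
    """Return the number of 365-day years spanned by the given hours."""
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    years = hours_to_years(1_000_000_000)
    print(f"{years:,.0f} years")
```

The result rounds to roughly 114,000 years, matching the figure quoted above. Note that this only converts units; it says nothing about whether the underlying failure rate is accurate.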
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.