IT glitches, both technical and human-caused, occurred in abundance this week. On the same day it was disclosed that UK regulators at the Financial Services Authority were planning to tell all UK banks that they had better gain control of their IT infrastructure systems and processes in the wake of last month's RBS-NatWest-Ulster Bank system meltdown, word spread that Nationwide, the UK's largest building society, had accidentally double-charged debit card transactions its customers made on Tuesday.
Ironically, Nationwide has been trying to use the problems at RBS Group to persuade those banks' customers to switch accounts. Nationwide admitted that 704,426 accounts (and some 2 million transactions) were affected by the glitch, and that another 50,000 debit transactions were declined on Wednesday because of it. The double-charging was blamed on human error: apparently, someone sent the batch file for Tuesday's debit card payments through the building society's processing system twice.
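A standard safeguard against exactly this failure is an idempotency check: record an identifier for each settlement batch and refuse to post the same batch twice. Below is a minimal sketch in Python; all names are hypothetical, since Nationwide has not disclosed how its processing system actually works.

```python
# Sketch of duplicate-batch protection for a payment settlement pipeline.
# Assumption: each batch file either carries a unique batch ID, or one can
# be derived by hashing the file contents. Names here are illustrative.
import hashlib


class BatchProcessor:
    def __init__(self):
        self._seen = set()  # IDs of batches already settled

    def process(self, batch_id: str, transactions: list) -> int:
        """Settle a batch once; reject any resubmission of the same ID."""
        if batch_id in self._seen:
            raise ValueError(f"batch {batch_id} has already been processed")
        self._seen.add(batch_id)
        # Stand-in for posting each debit to customer accounts.
        return len(transactions)


def batch_fingerprint(raw_file: bytes) -> str:
    """Derive a batch ID from file contents when none is supplied."""
    return hashlib.sha256(raw_file).hexdigest()
```

With this kind of check, running the same file through the system twice would raise an error on the second attempt rather than silently debiting every customer again.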
However, whatever schadenfreude the folks at RBS Group may have felt at Nationwide's embarrassment was short-lived. Yesterday afternoon, NatWest had to admit to yet another banking glitch, this time with its own customers having problems with their debit cards and online banking accounts. NatWest says the glitch has been fixed and attributes it to a hardware error. Even so, the bank's systems are being closely monitored today for signs of further trouble. Given that there are already reports today of some customers still having difficulties, that is probably a very good idea.
Elsewhere in the Commonwealth, there is word via ZDNet Australia today that at the Commonwealth Bank of Australia, an "internal software upgrade fault has caused some [of] the bank's branch offices to operate at limited capacity, and has crashed computers at its national head office." About 95 of the bank's 1000 branches have been affected. CBA has had a recent rash of IT glitches, which I have blogged about previously.
Glitches this week weren't limited to banking. Twitter also suffered a worldwide outage for about two hours yesterday, which was blamed on a double failure in its data center and backup systems, reports the Wall Street Journal. The Journal quotes Mazen Rawashdeh, Twitter's vice president of engineering, from his blog post as saying that, "What was noteworthy about today's outage was the coincidental failure of two parallel systems at nearly the same time. We are investing aggressively in our systems to avoid this situation in the future."
Last month a “cascading bug” affected Twitter for several days.
There were two other high-profile outages yesterday as well: Google Talk went down for about five hours, and Microsoft Azure's cloud service in Western Europe for about two.
Last, but not least, InformationWeek reported that Cerner Corporation's remote hosting service went down for some six hours on Monday, forcing users of Cerner's electronic health record system across the country, and possibly around the world, to revert to paper and pencil. Cerner, which refused to say how many users were affected or the scope of the outage, blamed the incident on human error; it wouldn't disclose exactly what the error involved, either.
Per standard operating procedure, the companies involved expressed their profound apologies for the inconvenience of it all and promised to try to keep it from happening again.
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.