Fat Finger Flub Takes Down Cloud Computing Datacenter

Illustration: Getty Images

IT Hiccups of the Week

A wide variety of IT-related blips, failures, and mistakes occurred last week. However, the most interesting IT Hiccups-related story involved what was described as a “fat finger” error by an operator at the cloud computing service provider Joyent’s US-East-1 datacenter in Ashburn, Virginia. The error disrupted operations for all of Joyent’s datacenter customers for at least twenty minutes last Tuesday. For a small number of unlucky Joyent customers, the outage lasted 2.5 hours.

According to a post-mortem note by a clearly embarrassed Joyent, “Due to an operator error, all us-east-1 API [application programming interface] systems and customer instances were simultaneously rebooted at 2014-05-27T20:13Z (13:13PDT).” The reason for the reboot, Joyent explained, was that a system operator, along with other Joyent team members, was “performing upgrades of some new capacity in our fleet, and they were using the tooling that allows for remote updates of software. The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter. Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay.”

The operator quickly recognized that something was amiss, but by then, the reboot sequence basically had to run its course. The Joyent team took steps to speed up the reboot process, which kept the downtime for 80 percent of its customers to about 32 minutes, the company said. For some customers, however, the downtime was greater because of a “known, transient bug in a network card driver on [Joyent’s] legacy hardware platforms.” If you are interested, you can read the details of what happened and why in Joyent’s post-mortem, which, unlike those of many other companies that suffer an embarrassing outage, was surprisingly and laudably forthright in its explanation.

In addition to apologizing to customers for the outage and the severe inconvenience caused, Joyent said that it was taking several steps to prevent this type of failure from occurring again, including “dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers and control plane servers to be rebooted simultaneously.” 
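To make the idea of stricter input validation concrete, here is a minimal sketch in Python of the kind of check such a tool might run before issuing a reboot. It is not Joyent’s tooling: the names `Server`, `validate_reboot`, and `UnsafeRebootError`, the 25 percent cap, and the rule that control plane and customer servers cannot share one command are all assumptions made for illustration.

```python
from dataclasses import dataclass


class UnsafeRebootError(Exception):
    """Raised when a reboot request looks too broad to be intentional."""


@dataclass(frozen=True)
class Server:
    # Hypothetical inventory record; a real fleet database holds far more.
    name: str
    is_control_plane: bool = False


def validate_reboot(targets, fleet, max_fraction=0.25):
    """Return the sorted target names, or raise UnsafeRebootError.

    targets: server names the operator typed
    fleet: every Server in the datacenter
    max_fraction: assumed cap on the share of the fleet one command may touch
    """
    by_name = {server.name: server for server in fleet}
    requested = set(targets)

    if not requested:
        raise UnsafeRebootError("no servers specified")

    # A mistyped name should stop the command, not be silently ignored.
    unknown = requested - set(by_name)
    if unknown:
        raise UnsafeRebootError(f"unknown servers: {sorted(unknown)}")

    # Refuse requests that cover too much of the datacenter at once.
    if len(requested) > max_fraction * len(by_name):
        raise UnsafeRebootError(
            f"{len(requested)} of {len(by_name)} servers exceeds the per-command cap"
        )

    # Assumed rule: control plane and customer servers go in separate commands.
    has_control = any(by_name[name].is_control_plane for name in requested)
    has_other = any(not by_name[name].is_control_plane for name in requested)
    if has_control and has_other:
        raise UnsafeRebootError(
            "control plane and customer servers must be rebooted separately"
        )

    return sorted(requested)
```

With a check like this in front of the reboot command, a typo that expands the target list to every server in the datacenter would raise an error instead of going ahead without delay.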

Joyent indicated that it wasn’t going to punish the operator who mis-typed the reboot command. Joyent's chief technology officer Bryan Cantrill is quoted in The Register as saying: “The operator that made the error is mortified, there is nothing we could do or say for that operator that is going to make it any worse, frankly.” Cantrill promised that the operator, as well as the whole Joyent technical and management team, was going to fully absorb the lessons of the extremely unpleasant experience.

Joyent isn’t the only cloud provider that has been having technical issues recently. According to a story at InformationWeek two weeks ago, Rackspace admitted that in early May, “a problem occurred with customers seeking to spin up a large volume of solid state disks for Rackspace Cloud Block storage” at its Chicago and Dallas datacenters. Serial ATA disks were used as a workaround. Rackspace indicated to InformationWeek that additional SSD capacity was made available at its Dallas location on 23 May and that the Chicago datacenter should have additional SSD capacity by 6 June.

Also about two weeks ago, beginning at 14:22 Pacific Time on Wednesday, 14 May and lasting until 18:06 Pacific Time the next day, Adobe users were unable to log in to Adobe’s Creative Cloud services. Adobe, in a note extremely thin on details in comparison to Joyent’s, blamed the problems on a “failure [that] happened during database maintenance activity.”

According to a detailed story at TechRepublic, the outage consequences were mixed. Those Adobe customers needing to use Adobe Digital Publishing Suite were dead in the water, as were customers needing Adobe software product updates or authentication as authorized Adobe product users/owners. Most other Adobe customers probably didn’t notice the outage at all—I know I didn’t. However, TechRepublic did say that the digital version of the UK's Daily Mail newspaper, ‘Mail Plus,’ was unavailable on 15 May, as a result of the log-in problem.

Adobe sent its usual regrets and promised that it wouldn’t happen again.

Finally, in what may be the most serious of the recent cloud computing problems, the LA Times reported recently that Dedoose, “one of the most popular cloud collaboration services not only crashed, but also may not be able to restore data added” from mid-April until early May. Later posts about the crash at the Dedoose blog, however, now seem to indicate that all the data entered between 1 April and 5 May are unlikely to be recovered. Dedoose is used by many academics—especially social scientists—around the world, the Times story states. Graduate students and other scientists indicated to the Times they feared that they had lost significant amounts of their research work.

The crash, Dedoose said, occurred on 6 May, at about 16:30 Pacific Time. According to an initial Dedoose blog post, the crash “resulted from a series of events leading to the failure of Dedoose services running on the Microsoft Azure platform. To be clear, Dedoose services failed, not Azure. In short, work done on one aspect of Dedoose led to the failure of another, cascading to pull down all of Dedoose. The timing was particularly bad because it occurred in the midst of a full database encryption and backup. This backup process, in turn, corrupted our entire storage system. Our immediate work with Microsoft support did not result in any substantial recovery.”

Dedoose apologized with deep regret for what it said was “an unprecedented and completely unanticipated event” caused by an inherent technical danger it failed to recognize. The company promised to fix the error, and stated it was “going tour de force” on the future protection of customer data. Going all out on data protection is good advice, as well, for anyone who stores absolutely critical data in a cloud.

In Other News…

Computer Glitch Fakes Out Astronomers

Lenovo Offers $100 Discount to Make Up for Laptop Pricing Error

Computer Problem Hits Scottish Charity Workers’ Pay

UK’s Sainsbury’s Home Grocery Deliveries Delayed by Computer Fault

UK’s Barclays and RBS/NatWest Suffer Mobile Banking Failure

Severe Weather Alerts across U.S. Delayed by Computer Problems

Ford Recalling 900 000 SUVs for Software Steering Fix

 


Risk Factor

IEEE Spectrum's risk analysis blog, featuring daily news, updates and analysis on computing and IT projects, software and systems failures, successes and innovations, security threats, and more.

Contributor
Willie D. Jones
 