IEEE SpectrumFOR THE TECHNOLOGY INSIDER
© Copyright 2024 IEEE — All rights reserved. A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity.

Fat Finger Flub Takes Down Cloud Computing Datacenter

Several other cloud computer providers experience technical tempests, too

IT Hiccups of the Week

A wide variety of IT-related blips, failures, and mistakes occurred last week. However, the most interesting IT Hiccups-related story involved what was described as a “fat finger” error by an operator at the cloud computing service provider Joyent’s US-East-1 datacenter in Ashburn, Virginia. The error disrupted operations for all of Joyent’s customers in that datacenter for at least twenty minutes last Tuesday. For a small number of unlucky Joyent customers, the outage lasted 2.5 hours.

According to a post-mortem note by a clearly embarrassed Joyent, “Due to an operator error, all us-east-1 API [application programming interface] systems and customer instances were simultaneously rebooted at 2014-05-27T20:13Z (13:13PDT).” The reason for the reboot, Joyent explained, was that a system operator, along with other Joyent team members, was “performing upgrades of some new capacity in our fleet, and they were using the tooling that allows for remote updates of software. The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter. Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay.”

The operator quickly recognized that something was amiss, but by then the reboot sequence basically had to run its course. The Joyent team took steps to speed up the reboot process, which kept the downtime for 80 percent of its customers to about 32 minutes, the company said. For some customers, however, the downtime was greater because of a “known, transient bug in a network card driver on [Joyent’s] legacy hardware platforms.” If you are interested, you can read the details of what happened and why in Joyent’s post-mortem, which, unlike the statements of many other companies that suffer an embarrassing outage, was surprisingly and laudably forthright in its explanation.

In addition to apologizing to customers for the outage and the severe inconvenience caused, Joyent said that it was taking several steps to prevent this type of failure from occurring again, including “dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers and control plane servers to be rebooted simultaneously.”
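To make the idea of stricter input validation concrete, here is a minimal Python sketch of a pre-flight check that a reboot tool could run before acting. It is not Joyent's tooling. The names `Server`, `validate_reboot_targets`, and `UnsafeRebootError`, the role labels, and the 10 percent threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


class UnsafeRebootError(Exception):
    """Raised when a reboot request is too broad to run without review."""


@dataclass(frozen=True)
class Server:
    name: str
    role: str  # hypothetical labels: "compute" or "control-plane"


def validate_reboot_targets(targets, fleet, max_fraction=0.1, confirmed=False):
    """Return the de-duplicated targets if the request looks safe.

    Illustrative rules (assumptions, not any vendor's actual policy):
      * an empty selection is rejected rather than read as "everything";
      * every target must be a known server in the fleet;
      * a request covering the whole fleet is always refused;
      * control-plane servers, or more than max_fraction of the fleet,
        require explicit confirmation.
    """
    fleet_by_name = {server.name: server for server in fleet}
    selected = list(dict.fromkeys(targets))  # drop duplicates, keep order

    if not selected:
        raise UnsafeRebootError("no targets given; refusing to default to all")

    unknown = [name for name in selected if name not in fleet_by_name]
    if unknown:
        raise UnsafeRebootError(f"unknown servers: {unknown}")

    if len(selected) == len(fleet_by_name):
        raise UnsafeRebootError("request covers every server in the datacenter")

    touches_control_plane = any(
        fleet_by_name[name].role == "control-plane" for name in selected
    )
    too_broad = len(selected) > max_fraction * len(fleet_by_name)
    if (touches_control_plane or too_broad) and not confirmed:
        raise UnsafeRebootError("broad or control-plane reboot needs confirmation")

    return selected


# Example fleet: 20 compute nodes plus one control-plane node (made-up names).
fleet = [Server(f"node{i:02d}", "compute") for i in range(20)]
fleet.append(Server("api01", "control-plane"))

# A narrow request passes; a mistyped "everything" request would raise instead.
approved = validate_reboot_targets(["node00", "node01"], fleet)
```

The design choice worth noting is that the check fails closed: an empty or fleet-wide selection raises an error instead of proceeding, so a mistyped command has to be corrected and reissued rather than silently expanding to every server.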

Joyent indicated that it wasn’t going to punish the operator who mis-typed the reboot command. Joyent’s chief technology officer Bryan Cantrill is quoted in The Register as saying: “The operator that made the error is mortified, there is nothing we could do or say for that operator that is going to make it any worse, frankly.” Cantrill promised that the operator, as well as the whole Joyent technical and management team, would fully absorb the lessons of the extremely unpleasant experience.

Joyent isn’t the only cloud provider that has been having technical issues recently. According to a story at InformationWeek two weeks ago, Rackspace admitted that in early May, “a problem occurred with customers seeking to spin up a large volume of solid state disks for Rackspace Cloud Block storage” at its Chicago and Dallas datacenters. Serial ATA disks were used as a workaround. Rackspace indicated to InformationWeek that additional SSD capacity was made available at its Dallas location on 23 May and that the Chicago datacenter should have additional SSD capacity by 6 June.

Also about two weeks ago, beginning at 14:22 Pacific Time on Wednesday, 14 May and lasting until 18:06 Pacific Time the next day, Adobe users were unable to log in to Adobe’s Creative Cloud services. Adobe, in a note extremely thin on details in comparison to Joyent’s, blamed the problems on a “failure [that] happened during database maintenance activity.”

According to a detailed story at TechRepublic, the outage consequences were mixed. Those Adobe customers needing to use Adobe Digital Publishing Suite were dead in the water, as were customers needing Adobe software product updates or authentication as authorized Adobe product users/owners. Most other Adobe customers probably didn’t notice the outage at all—I know I didn’t. However, TechRepublic did say that the digital version of the UK's Daily Mail newspaper, ‘Mail Plus,’ was unavailable on 15 May, as a result of the log-in problem.

Adobe sent its usual regrets and promised that it wouldn’t happen again.

Finally, in what may be the most serious of the recent cloud computing problems, the LA Times reported recently that Dedoose, “one of the most popular cloud collaboration services not only crashed, but also may not be able to restore data added” from mid-April until early May. Later posts about the crash at the Dedoose blog, however, now seem to indicate that all the data entered between 1 April and 5 May are unlikely to be recovered. Dedoose is used by many academics—especially social scientists—around the world, the Times story states. Graduate students and other scientists indicated to the Times they feared that they had lost significant amounts of their research work.

The crash, Dedoose said, occurred on 6 May at about 16:30 Pacific Time. According to an initial Dedoose blog post, the crash

resulted from a series of events leading to the failure of Dedoose services running on the Microsoft Azure platform. To be clear, Dedoose services failed, not Azure. In short, work done on one aspect of Dedoose led to the failure of another, cascading to pull down all of Dedoose. The timing was particularly bad because it occurred in the midst of a full database encryption and backup. This backup process, in turn, corrupted our entire storage system. Our immediate work with Microsoft support did not result in any substantial recovery.

Dedoose apologized with deep regret for what it said was “an unprecedented and completely unanticipated event” caused by an inherent technical danger it failed to recognize. The company promised to fix the error, and stated it was “going tour de force” on the future protection of customer data. Anyone else who stores absolutely critical data in a cloud would do well to take the same approach.

In Other News…

Computer Glitch Fakes Out Astronomers

Lenovo Offers $100 Discount to Make Up for Laptop Pricing Error

Computer Problem Hits Scottish Charity Workers’ Pay

UK’s Sainsbury’s Home Grocery Deliveries Delayed by Computer Fault

UK’s Barclays and RBS/NatWest Suffer Mobile Banking Failure

Severe Weather Alerts across U.S. Delayed by Computer Problems

Ford Recalling 900 000 SUVs for Software Steering Fix
