| Failure is not an option: it's a style |
|
"Failure is not an option" is one of the most perplexing business mantras. Failure (in projects, business performance, and professional obligations) might not be a choice we would willingly make, but it is always a possibility. In fact, our unwillingness to contemplate and plan for the contingency of failure can ultimately work against us by making a bad situation worse.

Consider the cascading failures caused by the recent fire at Seattle's Fisher Plaza. An electrical short in the building's switch room took out both the main and backup power feeds. It also triggered the building's sprinkler system, which promptly doused the backup generators with water. Thus, just as many US employees were (at least mentally) checking out for the 4th of July holiday, Fisher Plaza went dark and took several large data centers with it. Many companies that relied on these data centers simply went offline. Microsoft's Bing Travel was DOA for 36 hours; Authorize.Net for more than 10 hours; AllRecipes for more than a day. The list goes on: Redfin, AdHost, Geocaching.com, Popcap Games....

There's been much postmortem tsking about full-redundancy failovers and functional continuity plans. The managerial decisions that allowed Bing Travel and Authorize.Net to abandon customers for so long seem especially curious. Still, while the substance of those operational failures should certainly be parsed and analyzed, the style of those failures also bears attention. The Fisher Plaza fire took out email, Web, and/or phone communications for many organizations. Not all of them handled it gracefully.

Since T2P is based in Seattle, I track a few local technology lists. Over the course of July 3 and 4, it was fascinating to watch how companies communicated (and did not communicate) their response to the crisis. Fisher Plaza's (non)communications, for example, were a source of much frustration. According to Jeremy Irish, the founder of Geocaching.com:
(Irish further noted, in a post to the Seattle Tech Startups list, that the building's physical security was largely abandoned during the crisis.)

One of the staff at Fisher Plaza responded to Irish's allegations, also via the Seattle Tech Startups list, by stating that the Fisher Plaza Network Operations Center was mailing hourly updates to its customers. However, it was up to Internap, on which Geocaching.com and other sites were hosted, and other data centers to pass those updates on to their own customers, a challenge to which some did not fully rise. Irish and others actually on the scene at Fisher Plaza were just as ill-informed as the rest of the global community. Note that this was the third major outage and second fire Fisher Plaza has experienced in the past three years. Apparently, practice doesn't make perfect.

As for the customers of Fisher Plaza's customers, the responses were even more varied. Some simply hunkered down and mutely waited it out. However, a few of the fire's victims were relatively notable for their communications: good, bad, and ugly.

Both Authorize.Net and Geocaching.com lost Web and email, and both turned to Twitter as an ad hoc emergency broadcasting service. About eight hours after Authorize.Net went offline, the company set up a Twitter account to post updates about the outage. Geocaching.com conscripted the existing Twitter account of its co-founder, Jeremy Irish. According to Irish in a post to the Seattle Tech Startups list, he started July 3 with a handful of friends-and-family followers and ended it with more than 800. He now has over 1,900. Irish also posted running on-the-scene updates from his iPhone to his WordPress blog (July 3, 4, 7) and fed his Tweets to his Facebook page.

Allrecipes.com also used Facebook to spread the word. One of the advantages of the Facebook and WordPress announcements was that reader responses were readily visible on both and were generally sympathetic.
Responses on Twitter are not readily visible, requiring some tedious clickthroughs to track down. Since all of the Twitter communications on Authorize.Net's and Irish's pages were outbound, they read much more negatively than pages that included user comments.

Of course, Twitter and Facebook reached only some of both companies' constituents. The majority were simply set adrift. Most of the affected companies could have redirected their main URLs to a simple updates or announcement page, perhaps with links to their Twitter accounts or whatever outside channels they were using for updates. Instead, companies like Authorize.Net displayed an error for most of the outage.

And then there were the phone problems. The fire took out AdHost's phone system entirely, although the company did post an announcement to its Web site. Authorize.Net, by contrast, simply did not answer the phone. According to one TechCrunch blog post from July 3: "Nobody is picking up the phone at the U.S. offices of CyberSource, the holding company of Authorize.net. Someone I talked to at their UK offices couldn’t help me and told me I should keep trying the U.S. office." A commenter on this post further noted, "I called and got a message that the office was closed." Authorize.Net eventually, if belatedly, attempted a telephony make-good by opening its Customer Support lines for unusual Sunday shifts. You can only hope its health plan covers PTSD therapy for the support staff.

The bottom line is this: Failure happens. Servers shut down, applications break, data centers go dark. Do what you will to prevent it, failure (often as a result of improbable, cascading breakdowns that are impossible to predict) is always an option. Prepare for it. Disaster recovery and business continuity plans are only part of the damage control you should have in place.
Communications are the trickier part of the battle: If your data center, Web site, phone, and/or email go down, you should still have a plan in place for keeping your customers informed, calm, and confident in your business.

So, ask yourself: What would your company do if its data center went offline? Your Web site would be down; operations would be crippled. Would you have email? If you did have email, would you have access to your customer list, so that you could send out an initial notice and status reports? Would you have phone service? Could your call center handle the increased (and irate) caller traffic? Whatever channels were available to you, what would you say?

Whatever that is, write it down. Distribute it and, of course, keep hard copies in accessible locations. Make sure your call center has it. Put in place whatever technology processes are necessary to make it possible. Whatever these processes are, remember that they must not rely on the technologies whose failure they are designed to respond to.

Finally, consider what you'll do with those emergency communications channels. Authorize.Net's Twitter page has, since the crisis, become an alternate customer service portal. Suddenly, ongoing service failures and support issues, which in the past would have been hidden in the depths of the company's support pages or taken offline, have a much higher public profile. It would be awkward and unwise for Authorize.Net to suddenly abandon its Twitter following; the better course of action would be to repurpose the page for marketing communications and stiff-arm all new support comments and queries with a referral to its standard support site.
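
For readers who want to pressure-test their own plan, the rule that emergency processes must not rely on the technologies whose failure they respond to can be checked with a few lines of Python. This is only a rough sketch: the channel names, the dependency labels, and the `surviving_channels` helper are all hypothetical placeholders, not a description of how any company mentioned here is actually set up.

```python
# Hypothetical inventory: each communications channel mapped to the
# infrastructure it depends on. Every name here is an illustrative assumption.
CHANNEL_DEPENDENCIES = {
    "company_website": {"primary_data_center"},
    "company_email": {"primary_data_center"},
    "office_phones": {"office_pbx"},
    "status_page_offsite": {"third_party_host"},
    "twitter_account": {"twitter"},
}


def surviving_channels(failed, dependencies=CHANNEL_DEPENDENCIES):
    """Return the channels that depend on none of the failed components."""
    failed = set(failed)
    return sorted(
        name for name, needs in dependencies.items() if not (needs & failed)
    )


if __name__ == "__main__":
    # Simulate losing the data center and the office phone system together.
    print(surviving_channels({"primary_data_center", "office_pbx"}))
```

If a simulated failure leaves the list empty, or leaves only channels your customers do not know about, the plan still depends on the thing that broke.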
|





