Key lessons from Facebook’s worldwide system shutdown

Around the world, Facebook, WhatsApp, and Facebook’s other applications are synonymous with the internet. Shutting Facebook down takes away a treasured service that people use to communicate with friends and family, debate politics, swap therapeutic gossip, and expand their businesses.

For many, Facebook is not just a social media platform. It is also an access key for other websites and apps. Instead of keeping dozens of separate passwords, individuals use Facebook to authenticate their identity.
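To illustrate roughly how that delegated login works, here is a minimal sketch of an OAuth-style authorization-code flow of the kind “Log in with Facebook” is built on. The app ID, secret, redirect URI, and the exact endpoint URLs below are placeholder assumptions for illustration, not production values.

```python
# Minimal sketch of an OAuth 2.0 "Log in with Facebook"-style flow.
# App ID, secret, redirect URI, and endpoint URLs are placeholders, not real values.
import json
import urllib.request
from urllib.parse import urlencode

AUTH_URL = "https://www.facebook.com/v12.0/dialog/oauth"            # assumed login dialog endpoint
TOKEN_URL = "https://graph.facebook.com/v12.0/oauth/access_token"   # assumed token endpoint
APP_ID = "YOUR_APP_ID"                                # hypothetical
APP_SECRET = "YOUR_APP_SECRET"                        # hypothetical
REDIRECT_URI = "https://example.com/auth/callback"    # hypothetical

def build_login_url(state: str) -> str:
    """URL the user is sent to so Facebook can authenticate them on the site's behalf."""
    params = {
        "client_id": APP_ID,
        "redirect_uri": REDIRECT_URI,
        "state": state,                       # per-session CSRF protection token
        "scope": "public_profile,email",
    }
    return f"{AUTH_URL}?{urlencode(params)}"

def exchange_code_for_token(code: str) -> dict:
    """After the user is redirected back with ?code=..., trade the code for an access token."""
    params = {
        "client_id": APP_ID,
        "client_secret": APP_SECRET,
        "redirect_uri": REDIRECT_URI,
        "code": code,
    }
    with urllib.request.urlopen(f"{TOKEN_URL}?{urlencode(params)}") as resp:
        return json.loads(resp.read())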
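```

The fragility is the point: when Facebook’s servers are unreachable, every site that leans on this flow loses its login path at the same time.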

WhatsApp, on the other hand, is an essential utility for commerce, especially for small-scale traders. This week’s six-hour outage caused inestimable social and economic loss to millions of people in Africa, Asia, and Central and South America who rely on it daily for business, advertising, and communication.

Facebook said the problem was caused by a software upgrade that went wrong. The situation was so bad that not only did all customer-facing services go down, but Facebook’s email system and a host of other internal systems, including badge access to Facebook’s office buildings, stopped working as well.

Facebook staff had trouble making calls from their work phones and could not receive emails from outside the company. Facebook’s internal communications platform also broke down, leaving many helpless amid a mounting crisis.

The hours-long outage shouldn’t have been a surprise to anyone who works with technology infrastructure. Although no system is immune to bugs, this cascade of failures exposed Facebook’s lack of a business-continuity and resiliency plan, a must-have for any enterprise.

A cardinal tenet of technology infrastructure management is to segment services such that when one section of the network goes down, the whole network isn’t crippled. Facebook failed this test. Its entire network went kaput.

It’s also troubling to learn that while users were shut out of their social media platforms, Facebook workers could not access their offices either, because their smart badges were unresponsive. The badges relied on the same network that had just been knocked out. That’s a textbook definition of a disaster!

Segmenting systems is not just good practice; it is a safety measure. If a hacker cracked an unsegmented Facebook network, the whole thing would collapse. But if different services, geographical regions, or functions were segmented, some sections would continue running while engineers fixed the outage.
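To make the idea concrete, here is a minimal “bulkhead”-style sketch of a dispatcher that fences off a failing region while the others keep serving. The region names, handlers, and simulated failure are invented for illustration and are not a description of Facebook’s actual architecture.

```python
# Illustrative sketch of segmenting a service by region so that one failure
# does not take everything down. Region names and handlers are hypothetical.
from typing import Callable, Dict

class RegionUnavailable(Exception):
    """Raised when a request targets a region that has been fenced off."""

class SegmentedRouter:
    def __init__(self, handlers: Dict[str, Callable[[str], str]]):
        self.handlers = handlers   # one independent handler per region
        self.unhealthy = set()     # regions currently fenced off

    def handle(self, region: str, request: str) -> str:
        if region in self.unhealthy:
            raise RegionUnavailable(f"{region} is fenced off")
        try:
            return self.handlers[region](request)
        except Exception:
            # Isolate only the failing region; the others keep serving.
            self.unhealthy.add(region)
            raise RegionUnavailable(f"{region} marked unhealthy")

def broken_eu(request: str) -> str:
    raise RuntimeError("simulated outage")   # stands in for a regional failure

router = SegmentedRouter({
    "us": lambda req: f"us handled {req}",
    "eu": broken_eu,
    "apac": lambda req: f"apac handled {req}",
})

for region in ("us", "eu", "apac"):
    try:
        print(router.handle(region, "timeline request"))
    except RegionUnavailable as err:
        print(f"degraded: {err}")
```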

Facebook says the problem arose when its engineers upgraded their networking devices, or routers, and that the upgrade triggered a monumental glitch. That raises serious questions. Why was the network upgraded without first being tested? The common practice is to simulate software upgrades in a lab and to deploy them to the live network only once they have proved glitch-free, and even then gradually, a few devices at a time.
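That gradual approach is often called a canary or staged rollout: upgrade a small slice of the fleet, verify it is healthy, and stop before touching the rest if anything looks wrong. A rough sketch, with hypothetical device names, batch sizes, and a stubbed health check, might look like this:

```python
# Rough sketch of a staged ("canary") rollout for router upgrades.
# Device names, batch sizes, soak time, and the health check are hypothetical.
import time

DEVICES = [f"router-{i:02d}" for i in range(1, 21)]
BATCHES = [DEVICES[:1], DEVICES[1:5], DEVICES[5:]]   # canary, small wave, the rest

def upgrade(device: str) -> None:
    print(f"upgrading {device} ...")   # stand-in for the real upgrade step

def healthy(device: str) -> bool:
    # Stand-in for real checks: sessions up, routes advertised, traffic flowing.
    return True

def staged_rollout() -> bool:
    for batch in BATCHES:
        for device in batch:
            upgrade(device)
        time.sleep(1)                   # soak time before judging the batch
        if not all(healthy(d) for d in batch):
            print("health check failed; halting rollout before the remaining devices")
            return False
    print("rollout complete")
    return True

if __name__ == "__main__":
    staged_rollout()
```

The point of the batching is that a bad upgrade hurts one router, not twenty.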

This setback left Facebook’s leadership with egg on their faces. It opened the door for its users to try other, less popular social media platforms. Will they go back?