Take heart; even the Facebook behemoth is not foolproof

Mark Zuckerberg announces plans for a Facebook dating app at F8 Developers Conference in San Jose, California on Tuesday. PHOTO | AFP
A little before 7pm on Monday evening, I was sending some messages on WhatsApp. A moment earlier the app had shown the messages as sent, but now it showed that they could not be delivered to the intended recipient.
I did what many people do in that situation: check internet connectivity, check Yahoo Mail, check other apps – all were working fine. But the undelivered messages in WhatsApp kept staring at me.
So, I closed and restarted WhatsApp several times – same story. I restarted the phone, twice – same story. What could be the problem? Surely it couldn’t be WhatsApp itself, so there was no need to check that, was there?
The assumption that the problem was local – within my phone – kept me pointlessly searching for a fault where there was none. Only when I decided to find out what was happening to others did I realise that Facebook and its associated apps, including WhatsApp and Instagram, were in the middle of a global outage.
Despite my background in internet systems, had you given me money to bet on Facebook going down, I would have bet against it. You hear of those technology companies that hire top graduates from the best schools in the world to work in futuristic offices that look like the inside of Las Vegas casinos? That is Facebook. You would have to be unhinged to bet against them.
But on that evening, this organisation that serves some 3.5 billion users around the world suffered a worldwide outage lasting six hours. Six solid hours!
It is difficult for anyone unfamiliar with the standards expected at this level of service provision to appreciate what that means. These are companies with layers upon layers of protection and redundancy to ensure practically 100 percent uptime. Six hours of downtime is an apocalyptic scenario, the end of the world as we know it.
Think of what this outage did to the world. For many people, the picture of the outage is limited to a naïve teen who failed to upload her selfies to Instagram. That picture leaves out the thousands and thousands of businesses in India, Brazil and Africa that rely on WhatsApp for orders, deliveries and payments; they were all brought to a complete halt. The idea of “Log in with Facebook” must once have looked truly inspired, but it became a nightmare for businesses that depended on their clients being authenticated that way.
So, what happened?
According to Facebook’s VP of Infrastructure, Santosh Janardhan, the outage was the result of a “faulty configuration change”.
“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication,” he said. “This disruption to network traffic had a cascading effect…bringing our services to a halt.”
In technical terms, experts deduced that changes to BGP (Border Gateway Protocol), the internet protocol that tells networks how to route traffic from one autonomous system to another, made Facebook’s DNS (Domain Name System), the directory the internet uses to find the address behind a name, unreachable. That made my attempt to send a WhatsApp message akin to posting a letter addressed simply to “uncle”. The internet had no idea who this uncle was or how to find him, so the letter would have been discarded or returned to the sender.
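To see what that unavailability looks like from the outside, here is a minimal, purely illustrative Python sketch – not anything WhatsApp actually runs. It simply tries to resolve a hostname, the very first step any app must complete before it can deliver a message to a server; during the outage, lookups for Facebook-owned names failed at exactly this step.

```python
import socket

def resolve(hostname: str) -> str:
    """Try to resolve a hostname to an IP address, the first step any
    app must complete before it can talk to a server."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror as err:
        # During the outage, lookups for Facebook-owned domains failed
        # roughly like this: the name servers themselves were unreachable.
        return f"lookup failed: {err}"

# Illustrative only: today this prints an address, but on the evening of
# the outage a lookup for "whatsapp.com" returned an error instead.
print(resolve("whatsapp.com"))
```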
One should really spare a thought for that poor engineer whose blunder disconnected a multi-billion-dollar network infrastructure from the internet. That has got to hurt. However, while that explanation may rule out some other causes, say a DDoS attack, it does not tell us how anyone entrusted with managing such a system could make such a mistake. The possibility therefore remains that this was an act of sabotage or an outside attack.
Moreover, even assuming this was an ordinary human error, it looks strange that a company as sophisticated as Facebook (with its own proprietary network algorithms) would allow such an error to remain possible in its system. If one faulty configuration cost Mark Zuckerberg alone $6 billion, not counting what his clients lost, that is a mistake whose very possibility should have been eliminated through planning and design.
There is one thing, though, which appears to have escaped the attention of all the analysts I have read on this subject. The New York Times reported that Facebook had to physically send engineers to its data centres to manually restore its servers. That Facebook’s own internal systems were so completely cut off that the only way to resolve the problem was to send someone to the site in person is unfathomable. What it means is that Facebook, metaphorically speaking, sawed off the branch on which it was sitting.
This points to an extraordinarily serious infrastructure design flaw. Somehow, despite Facebook’s sophistication, it failed to foresee an eventuality that probably cost it billions of dollars’ worth of valuable minutes while people were dispatched to site. In network design and management, one lesson usually learned from painful experience is to maintain that one last option, a back channel or a completely isolated link, which ensures you never cut yourself off from the trunk, because there is no coming back from that.
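The point about a back channel can be made concrete with a toy sketch. This is emphatically not how Facebook manages its backbone; it is a minimal illustration, using a hypothetical out-of-band console at a placeholder address, of the habit described above: before pushing a risky configuration change, confirm that an isolated path back into the network still works.

```python
import subprocess

# Placeholder address for illustration only (RFC 5737 test range), standing
# in for an out-of-band console that does not depend on the production
# network's own routes.
OOB_CONSOLE = "198.51.100.20"

def reachable(host: str) -> bool:
    """Return True if a single ping to `host` succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        capture_output=True,
    )
    return result.returncode == 0

def safe_to_push_change() -> bool:
    """Crude pre-flight check: only touch the backbone configuration if
    the independent back channel is alive, so a bad change can still be
    rolled back from outside the network being changed."""
    if not reachable(OOB_CONSOLE):
        print("Out-of-band path is down: do not push the change.")
        return False
    print("Back channel confirmed: proceed, and keep it open.")
    return True

if __name__ == "__main__":
    safe_to_push_change()
```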
The fact that Facebook can have such issues in its infrastructure should let many a network engineer who beats themselves up over serious blunders breathe a sigh of relief. Apparently, for all its sophistication, Facebook is flesh and blood too.