The Most Shocking Thing About Facebook Group's Outage

by Iwan Price-Evans on High Availability • October 5, 2021

What's the most shocking thing about Facebook Group's outage, which took Facebook, Instagram, and WhatsApp offline on October 4, 2021? The millions of missed messages? That it took around seven hours to identify and resolve the problem? The estimated $50 billion wiped off Facebook's value?

No, the most shocking thing is that it took until now for this to happen.

Am I a Facebook doom-monger? Have I been predicting their downfall for years? Am I here to tell you this wouldn't have happened on my watch? Far from it. I've spent the last fifteen years living in the same world as everyone else – a world where Facebook is ubiquitous and simply always there. We have come to take Facebook's presence and its resilience for granted. Facebook going dark for hours on end was unimaginable.

So I'm here bringing the benefit of hindsight. Big deal, right? But following an unprecedented disaster, analysis in hindsight rather than in the heat of the moment is our best tool for understanding what happened and planning to stop it from happening again.

Why Did Facebook Go Down?

Wired has a great summary of the Facebook outage, with expert opinion, and I recommend you read it for yourself. I'll share the key parts here, with all credit to Wired.

DNS and Border Gateway Protocol

“Facebook’s outage appears to be caused by DNS; however that’s just a symptom of the problem,” says Troy Mursch, chief research officer of cyberthreat intelligence company Bad Packets. The fundamental issue, Mursch says—and other experts agree—is that Facebook has withdrawn the so-called Border Gateway Protocol route that contains the IP addresses of its DNS nameservers. If DNS is the internet’s phone book, BGP is its navigation system; it decides what route data takes as it travels the information superhighway.

You might be wondering why Facebook is doing anything with the Border Gateway Protocol (BGP). You can see a good summary of BGP over on Cloudflare's website, but the main thing to understand is that BGP is the protocol autonomous systems use to tell each other which blocks of IP addresses they can reach - including the addresses of DNS nameservers. In other words, it's the plumbing that powers the Internet's backbone, normally the business of ISPs and large network operators rather than something a single enterprise like Facebook runs for itself.
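To make that dependency concrete, here's a minimal toy sketch (all prefixes, addresses, and names below are invented, and this is nothing like Facebook's real topology): once the advertised route covering the nameservers' addresses disappears, name resolution fails for everyone, even though the web servers themselves are still running.

```python
# Toy model (invented prefixes and names, nothing like Facebook's real
# topology): DNS resolution only works while a BGP route to the authoritative
# nameservers' prefix is being advertised to the rest of the internet.
from ipaddress import ip_address, ip_network

advertised_routes = {
    "dns-prefix": ip_network("192.0.2.0/24"),    # block hosting the DNS nameservers
    "web-prefix": ip_network("198.51.100.0/24"), # block hosting the web servers
}
nameserver_ip = ip_address("192.0.2.53")

def nameserver_reachable() -> bool:
    """A resolver can only query a nameserver if some advertised prefix covers its address."""
    return any(nameserver_ip in prefix for prefix in advertised_routes.values())

def resolve(domain: str) -> str:
    if not nameserver_reachable():
        raise RuntimeError(f"no route to the authoritative nameservers for {domain}")
    return "198.51.100.10"  # pretend the nameserver answered with a web server address

print(resolve("example.com"))        # works while the route is advertised

# Withdrawing the route for the nameserver prefix breaks resolution, even
# though the web servers themselves are still up and running.
del advertised_routes["dns-prefix"]
try:
    print(resolve("example.com"))
except RuntimeError as err:
    print(err)                       # the site now looks "down" to everyone
```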

What went wrong with BGP?

What makes Facebook so different? Why is it using BGP at all?

Answer: size. Facebook is one of only a few enterprises in the world big enough that it benefits from replicating some of the attributes of the external Internet within its own infrastructure, whether for performance, customizability or (ironically) resilience. For example, by having its data centers function as autonomous systems with external routing via BGP, each data center can theoretically remain accessible to the Internet even if the rest of Facebook goes down.
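A rough sketch of that idea, with invented site names (and none of Facebook's actual traffic engineering): as long as each data center advertises its own routes independently, traffic can be steered to whichever sites are still reachable when one of them drops off the Internet.

```python
# Rough sketch with invented site names (not Facebook's actual traffic
# engineering): each data center is its own autonomous system advertising its
# own routes, so traffic can be steered to whichever sites are still reachable.
advertising = {"dc-europe", "dc-us-east", "dc-asia"}   # sites currently announcing BGP routes

def pick_datacenter(preference: list[str]) -> str:
    """Send a user to the first preferred data center that is still advertising routes."""
    for dc in preference:
        if dc in advertising:
            return dc
    raise RuntimeError("no data center is advertising routes -- a total outage")

advertising.discard("dc-europe")                        # one site withdraws its routes
print(pick_datacenter(["dc-europe", "dc-us-east"]))     # falls back to 'dc-us-east'
```

The irony, of course, is that the same machinery cuts the other way when every site withdraws its routes at once.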

Okay, so what happened to interfere with their use of BGP?

The internet infrastructure experts who spoke to WIRED all suggested the likeliest answer was a misconfiguration on Facebook’s part. “It appears that Facebook has done something to their routers, the ones that connect the Facebook network to the rest of the internet,” says John Graham-Cumming, CTO of internet infrastructure company Cloudflare, who stressed that he doesn’t know the details of what happened.

How did it spread?

Could this have all started with misconfiguration? Possibly one person making a mistake? But Facebook is enormous, with multiple layers of resilience: multiple data centers, multiple locations, and a plethora of application delivery controllers (ADCs) and content delivery networks (CDNs), all with redundant backups. How can one error bring it all crashing down?

Answer: size (again). Even mid-sized enterprises need automation to make IT operations scalable. Automation massively increases the efficiency of repetitive tasks and cuts costs for organizations that would otherwise be hopelessly overstretched. And since we can't say it often enough, Facebook is huge. With data centers everywhere, managing configuration changes manually would require thousands of staff on standby all around the world. Automation can take a configuration change and propagate it to every network node with near-instant results.

The automation that enables Facebook's incredible efficiency at scale also makes it vulnerable to a configuration error. Without automation, a configuration error could cause a localized outage. With automation propagating every configuration change globally, an error can bring the house down.
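Here's a hedged sketch of that amplification effect, using an imaginary config-push pipeline rather than anything Facebook actually runs: without staging, the blast radius of one bad change is every site the automation can reach; with even a crude canary step, the damage stops early.

```python
# Sketch of the failure mode described above, with a hypothetical config-push
# pipeline (none of this is Facebook's actual tooling). A naive pipeline pushes
# one change everywhere at once; a staged one halts as soon as a canary fails.

def apply_config(site: str, config: dict) -> bool:
    """Pretend to apply the change; returns False if the config is bad."""
    return config.get("valid", True)

def push_everywhere(sites: list[str], config: dict) -> list[str]:
    """Naive automation: the blast radius of a bad change is every site."""
    return [site for site in sites if not apply_config(site, config)]

def push_staged(sites: list[str], config: dict, canary_count: int = 2) -> list[str]:
    """Staged automation: try a few canary sites first and stop on failure."""
    failed = [site for site in sites[:canary_count] if not apply_config(site, config)]
    if failed:
        return failed            # halt the rollout: only the canaries were affected
    return [site for site in sites[canary_count:] if not apply_config(site, config)]

sites = [f"site-{n}" for n in range(100)]
bad_change = {"valid": False}

print(len(push_everywhere(sites, bad_change)))   # 100 sites broken
print(len(push_staged(sites, bad_change)))       # 2 sites broken, rollout halted
```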

Has this happened before?

Most people have never heard of BGP. How significant is it, really? Has bad BGP configuration caused outages before? Actually, yes. That same Cloudflare article gives a few examples.

In 2004 a Turkish Internet service provider (ISP) called TTNet accidentally advertised bad BGP routes to its neighbors. These routes claimed that TTNet itself was the best destination for all traffic on the Internet. As these routes spread further and further to more autonomous systems, a massive disruption occurred, creating a 1-day crisis where many people across the world were not able to access some or all of the Internet.

Similarly, in 2008 a Pakistani ISP attempted to use a BGP route to block Pakistani users from visiting YouTube. The ISP then accidentally advertised these routes with its neighboring AS’s and the route quickly spread across the Internet’s BGP network. This route sent users trying to access YouTube to a dead end, which resulted in YouTube being inaccessible for several hours.

Incidents like these can happen because the route-sharing function of BGP relies on trust, and autonomous systems implicitly trust the routes that are shared with them.

The key phrase here bears repeating: "incidents like these can happen because the route-sharing function of BGP relies on trust". Under the "rules" of BGP, a system receiving a message trusts the sender by default, with no further authentication. When Facebook's configuration error started propagating to all their data centers, the protocol at the heart of it was fundamentally unsuited to stopping it.
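To illustrate the point, here's a deliberately naive toy model, not a real BGP implementation: each peer simply believes whatever its neighbours announce or withdraw and passes it along, which is roughly why both legitimate updates and mistakes spread so quickly.

```python
# Minimal sketch of the trust problem, not a real BGP implementation: a peer
# accepts whatever routes its neighbours announce or withdraw, with no
# authentication step in the protocol itself.

class ToyBgpPeer:
    def __init__(self, name: str):
        self.name = name
        self.routes: dict[str, str] = {}        # prefix -> next hop (the announcing peer)
        self.neighbours: list["ToyBgpPeer"] = []

    def receive(self, prefix: str, origin: str, withdraw: bool = False) -> None:
        # No verification that `origin` is entitled to announce or withdraw this
        # prefix -- the update is trusted and passed on to every neighbour.
        already_known = prefix in self.routes
        if withdraw:
            self.routes.pop(prefix, None)
        else:
            self.routes[prefix] = origin
        if withdraw == already_known:           # only propagate real changes
            for peer in self.neighbours:
                peer.receive(prefix, self.name, withdraw)

a, b, c = ToyBgpPeer("AS-A"), ToyBgpPeer("AS-B"), ToyBgpPeer("AS-C")
a.neighbours, b.neighbours = [b], [c]

a.receive("192.0.2.0/24", "AS-A")                  # a legitimate announcement ripples outwards
print(c.routes)                                    # {'192.0.2.0/24': 'AS-B'}

a.receive("192.0.2.0/24", "AS-A", withdraw=True)   # so does an erroneous withdrawal
print(c.routes)                                    # {} -- everyone has forgotten the route
```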

Why did a configuration error happen?

What was the IT engineer responsible (and hey, you've got to feel for this person today) trying to achieve? Facebook's IT engineers aren't making global config changes to essential infrastructure for fun. Perhaps they were under pressure to cut costs, increase performance, or tighten security. Understanding Facebook's outage begins with understanding that their IT team is under perhaps more commercial pressure than any other IT team around.

How was the configuration change managed? Typically, IT and DevOps teams will ensure configuration changes are tested in a secure staging environment before deploying them in a live environment. If the change will have a significant impact, it will normally require authorization from a senior manager, with verification that it passed the testing phase. Why didn't that help Facebook yesterday? Changes affecting BGP can't be tested easily in a staging environment. This makes success dependent on careful planning and forward-thinking, which itself has limits as you cannot fix problems that you cannot predict.

Why did it take so long to fix it?

Ordinarily, when a data center loses its connection to the external Internet, an organization's IT engineers connect to it remotely over the internal network to diagnose and fix the problem. In this case, with Facebook's data centers managed centrally over connectivity that itself depended on the withdrawn BGP routes, those data centers were unreachable even to the internal IT team. They had gone completely dark, as though someone had severed the cables connecting them to the outside world.

Facebook needed on-site engineers to correct the problems in each data center. This took time. As we've said already, Facebook has a lot of data centers in a lot of locations. Furthermore, Facebook needed old-fashioned network engineers – people who understand the underlying protocols, not software developers or AI visionaries. One has to wonder how many of these engineers Facebook has available at a moment's notice.

OK, So Why Did This Really Happen?

These technical causes might be confusing to some, and they only go so far in answering the question of why Facebook went down. The implied question is not so much "how did this happen?" but "how was this allowed to happen?".

Doesn't Facebook have systems and processes in place to stop failures like this from ever happening? After all, Facebook is too big and too important to too many people for failures to be a tolerable risk. So, again, why was this allowed to happen?

Well, we've said it twice already, so let's make it three-for-three. Answer: size.

Facebook is operating network infrastructure, applications, and a commercial enterprise at a scale few have ever attempted until now. Only a handful of other companies (like Google) are in the same sentence as Facebook when it comes to scale. It's for this reason that anyone responding to this outage with "well, you could have avoided this if you'd only done X, Y, Z" is missing the point.

There is no set of instructions for what Facebook is doing. You can't just take the enterprise IT playbook for the average business and apply it to Facebook, and say "why didn't you have a redundant active backup?".

Facebook is straining at the limits of what's possible and what's even imaginable in systems architecture, power and traffic management, data storage, security and compliance, machine learning, performance and latency, team structure and communication, and more. This has never been done before.

The imperative now is not to look backward and point fingers but to consider: how can Facebook adapt and become even more resilient to failure, and how can any growing organization plan for resilience at scale if even Facebook is vulnerable?

Facebook Outage Was Inevitable

The combination of scale, lack of precedent, and insanely high expectations (from users, advertisers, shareholders, and political groups) puts Facebook under pressure to continually push at the boundary, which introduces risk. 

Facebook's scale means it also attracts the attention of people who might want to do it harm. On the same day as the outage, Facebook whistleblower Frances Haugen laid bare some of the issues. Commenting on the company's approach to free speech and hosting hate speech, she alleged "conflicts of interest between what was good for the public and what was good for Facebook." Facebook's position as the default social media platform makes it a target for those complaining their user accounts were banned and for those complaining that Facebook permitted those accounts in the first place. Facebook is caught in the middle of a culture war and is undoubtedly under continuous attack from people trying to disrupt its operations to make a political point. 

Under such immense pressure, it's amazing that Facebook has proven to be a resilient platform for so long. It has become as essential to many people's lives as utilities like telephony and energy. Even those of us who don't use Facebook often or at all take its persistent presence for granted. But yesterday's outage shows the folly of believing in the permanence of anything – even giants of the modern age like Facebook.

Facebook has been criticized for many things; reliability has never been one of them. Facebook's IT engineers and operations team have accomplished great things. We only stopped to think about them on the one day when they failed. Perhaps we should raise a toast to the other ~4,500 days when they didn't.