On Monday, October 4, 2021, we experienced a global Facebook outage that also disrupted Instagram and WhatsApp. According to CNBC, the incident was the longest stretch of downtime for Facebook since 2008. The services were down for about six hours, leaving users scared and confused.
On Tuesday, Facebook posted an official statement explaining that the outage was caused by a configuration change to the backbone routers coordinating network traffic between the company’s data centers. This change had a cascading effect and brought all Facebook services to a halt.
In this article, the Rocketech experts look at possible explanations for what happened to the social media giant and whether such incidents can be prevented.
How Do Internet Providers Exchange Data?
If we want to understand why a service like Facebook doesn’t work, it’s better to start with the basic concepts. How does the Internet work? In simple words, the Internet is a network of networks. It’s divided into hundreds of thousands of subnetworks. Large companies, like Facebook, have larger networks, called autonomous systems.
An autonomous system (AS) is a network, or a group of networks, with its own set of Internet Protocol (IP) addresses, run by a single organization with a large IT infrastructure. By creating such systems, organizations can provide fault-tolerant access to their resources and define their own routing policies.
When users visit Facebook (or Instagram or WhatsApp), their traffic reaches its network along routes established with the so-called Border Gateway Protocol, a kind of Internet “post” service. This protocol lets networks learn all the possible routes and select the best one to take users to the page they want to visit.
Border Gateway Protocol (BGP) is a standardized dynamic routing protocol intended to exchange routing and reachability information between autonomous systems. Internet service providers (ISPs) run the protocol between their border routers.
Continuing the previous example, when users log in to Facebook or open the app to load their newsfeed, their data “packages” travel along the best route that BGP has selected to the Facebook servers. If we compare each AS to a post office processing those data “packages,” BGP is the protocol regulating how each ISP tells other providers which destinations it can reach.
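The route selection described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not real BGP: actual best-path selection weighs several attributes, while this toy version only prefers the shortest AS path. The `Route` class, the `best_route` helper, the AS numbers, and the prefix are all hypothetical, documentation-only values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # destination address block, e.g. "203.0.113.0/24"
    as_path: tuple   # AS numbers the announcement passed through


def best_route(routes, prefix):
    """Return the route to `prefix` with the shortest AS path, or None.

    Assumption: real BGP compares more attributes; path length alone
    is used here only to keep the idea readable.
    """
    candidates = [r for r in routes if r.prefix == prefix]
    if not candidates:
        return None  # nobody announced this prefix: destination unreachable
    return min(candidates, key=lambda r: len(r.as_path))


# Hypothetical announcements heard by one ISP for the same destination.
announcements = [
    Route("203.0.113.0/24", (64496, 64498, 64500)),
    Route("203.0.113.0/24", (64497, 64500)),
]

chosen = best_route(announcements, "203.0.113.0/24")
print(chosen.as_path)                                 # (64497, 64500)
print(best_route(announcements, "198.51.100.0/24"))   # None
```

The second call returns `None` because no post office has told this ISP how to reach that block of addresses.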
For successful “package” deliveries, ISPs need both the BGP mail service and the destination IP addresses. This is where the Internet needs the Domain Name System.
The Domain Name System (DNS) translates human-readable domain names (like www.rocketech.it) into IP addresses understood by machines (like 192.158.1.38).
In other words, DNS works as an address book, and BGP is a mail service that delivers mail to these addresses. Naturally, the system cannot deliver a package if there’s no route to the address.
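To make the address book and mail service analogy concrete, here is a minimal sketch that models both steps. The domain name, the IP address, the prefix, and the `deliver` function are hypothetical placeholders chosen for illustration; real resolvers and routers are far more involved.

```python
import ipaddress

# Hypothetical "address book": domain name -> IP address.
dns_records = {"www.example.com": "203.0.113.10"}

# Hypothetical routing table: address blocks this network has a route to.
routed_prefixes = [ipaddress.ip_network("203.0.113.0/24")]


def deliver(name, records, prefixes):
    """Look the name up (DNS step), then check for a route (BGP step)."""
    ip_text = records.get(name)
    if ip_text is None:
        return "no address: DNS lookup failed"
    ip = ipaddress.ip_address(ip_text)
    if not any(ip in network for network in prefixes):
        return "no route: address known but unreachable"
    return f"delivered to {ip}"


print(deliver("www.example.com", dns_records, routed_prefixes))
print(deliver("www.example.com", dns_records, []))   # routes withdrawn
print(deliver("unknown.example", dns_records, routed_prefixes))
```

The second call shows the case that matters for this article: the address is still in the book, but with no route to it the package cannot be delivered.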
What Went Wrong?
According to Cloudflare, a few minutes before the outage, Facebook stopped announcing the routes to its DNS prefixes. As a result, Facebook’s DNS servers became unreachable: with no routes to them, DNS resolvers around the world couldn’t answer queries asking for the IP address of facebook.com.
Here, we need to remember that Facebook is a content provider itself. This means the IP addresses owned by its AS belong to its servers, not to its users. Officially, the engineering team was accidentally disconnected from the company’s data centers during maintenance. Facebook’s DNS servers sit in a separate network and are designed to withdraw their BGP routes if they lose connection with the data centers. That withdrawal took the social network offline for the rest of the Internet.
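The withdrawal behavior described above can be modeled as a simple health check. This is a sketch under stated assumptions, not Facebook’s actual implementation, which is proprietary: the `DnsNode` class, its `health_check` method, and the prefixes are hypothetical names and documentation-only values.

```python
class DnsNode:
    """Toy model of a DNS server that announces its own prefix via BGP."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.announced = True

    def health_check(self, can_reach_data_center):
        # Assumed rule: keep announcing only while the data centers are
        # reachable; otherwise withdraw so traffic is not sent here.
        self.announced = can_reach_data_center
        return self.announced


def visible_prefixes(nodes):
    """Prefixes the rest of the Internet can currently see."""
    return {node.prefix for node in nodes if node.announced}


nodes = [DnsNode("192.0.2.0/24"), DnsNode("198.51.100.0/24")]
before = visible_prefixes(nodes)

# The backbone goes down: every node fails its check at the same time.
for node in nodes:
    node.health_check(can_reach_data_center=False)
after = visible_prefixes(nodes)

print(sorted(before))  # ['192.0.2.0/24', '198.51.100.0/24']
print(sorted(after))   # []
```

A rule like this is sensible when a single node fails, because traffic moves to healthy nodes. When every node loses the backbone at once, the same rule removes all of them from the map.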
Within two hours of Facebook and the associated services going offline, Twitter’s post welcoming “literally everyone” got more than three million likes. Having no access to WhatsApp, people switched to alternative messengers like Telegram and Signal.
Naturally, such a shift in traffic had a great impact on other systems and services globally. The functionality of some apps, pages, or widgets directly depends on third-party services. In this case, systems requiring a Facebook login may have had trouble for the same reason: they could not reach the servers with the necessary data.
Some large Internet providers, like Vodafone in Ireland, may also have gone through hard times due to excessive traffic. Imagine millions of devices trying to get a response directly from facebook.com. When they got no response, they kept retrying through the DNS servers of their local providers.
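A rough back-of-the-envelope calculation shows why retries add up so quickly. All the numbers below are made-up assumptions for illustration, not measurements from the outage, and `queries_per_minute` is a hypothetical helper.

```python
def queries_per_minute(clients, lookups_per_client, retries_per_failure, failing):
    """Estimate DNS queries per minute hitting a provider's resolvers.

    Assumption: when a lookup fails, each client repeats it
    `retries_per_failure` more times within the same minute.
    """
    attempts = (1 + retries_per_failure) if failing else 1
    return clients * lookups_per_client * attempts


# Hypothetical provider: one million devices, two lookups a minute each,
# four retries after every failed lookup.
normal = queries_per_minute(1_000_000, 2, 4, failing=False)
outage = queries_per_minute(1_000_000, 2, 4, failing=True)

print(normal)           # 2000000
print(outage)           # 10000000
print(outage / normal)  # 5.0
```

Under these assumed numbers the resolver load grows fivefold without a single new user appearing, which is the kind of pressure a local provider would feel.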
Moreover, Facebook employees were locked out of the data centers. Usually, large companies have emergency access systems that include access to a console port. The latter requires the engineers’ physical presence near the routers. Ironically, that was impossible because the internal ID badges could not be verified while the Facebook servers were offline.
Is There a Way to Prevent Network Outages?
There’s no definite answer. In many cases (and possibly in this case, too), the reason for a faulty configuration or a failed update is simple human error. And the best way to deal with human error is to learn from it. This includes a thorough investigation of each incident, detailed documentation, and sharing of the experience.
In fact, all regulations are based on incidents. Analyzing each error and malfunction helps network engineers understand the root causes of such episodes. This understanding, in turn, enables specialists to predict similar problems and think through possible ways to either prevent them or solve them faster and more efficiently.
Large companies with extensive IT architectures apply testing practices to their AS. It’s comparable to software testing for bugs but on a larger scale. Quite commonly, network engineers use testing labs equipped with the same software and hardware that run the actual network. Whether it’s a software update, a new program, or new hardware, engineers can test it before launch.
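As an illustration of what such a pre-launch test might check, the sketch below applies a proposed change to a copy of the lab state and rejects it if a critical prefix would disappear. It is only an assumption about how a check like this could look. The function names, the shape of the `change` dictionary, and the prefixes are hypothetical and do not describe any company’s real tooling.

```python
def apply_change(announced, change):
    """Return the set of announced prefixes after a proposed change."""
    withdrawn = change.get("withdraw", set())
    added = change.get("announce", set())
    return (announced - withdrawn) | added


def review_change(announced, change, must_stay_reachable):
    """Approve the change only if every critical prefix survives it."""
    after = apply_change(announced, change)
    missing = must_stay_reachable - after
    return {"approved": not missing, "missing": missing}


# Hypothetical lab state and the prefixes that must never disappear.
lab_state = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}
critical = {"192.0.2.0/24", "198.51.100.0/24"}

risky = {"withdraw": {"192.0.2.0/24", "198.51.100.0/24"}}
harmless = {"withdraw": {"203.0.113.0/24"}}

risky_result = review_change(lab_state, risky, critical)
harmless_result = review_change(lab_state, harmless, critical)

print(risky_result["approved"])     # False
print(harmless_result["approved"])  # True
```

A check like this catches only the failure modes someone thought to encode, which is why it complements lab testing and human review rather than replacing them.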
Sometimes, however, it’s impossible to conduct full-scale testing. These scenarios require a detailed, strictly regulated action plan. Such plans are based on various operating procedures, instructions, and leading specialists’ expertise and experience. They also involve multiple stages of review by different professionals before they get approval.
Although many possible explanations remain speculative, as everything about Facebook’s IT infrastructure is proprietary information, here’s what we can say for certain. Major incidents of this kind have two kinds of causes: strictly technical ones and those related to human error. And even though large systems rely on automation to minimize the human error factor, an unfortunate combination of circumstances may still cause great disruptions of global impact.