BGP and The Role It Played in Facebook’s October 2021 Outage
On 4 October 2021, Facebook went completely offline for about six hours, along with Instagram and WhatsApp. Users around the world were affected, and it became one of the biggest outages of 2021. Before an official statement was released, many experts speculated that the outage had something to do with the Border Gateway Protocol (BGP). But what is BGP, and what does it have to do with the outage?
What is BGP?
BGP is a routing protocol that helps any data you send over the internet reach its intended destination efficiently. When data is sent via the internet, BGP’s job is to look at the available paths the data could travel and pick the best route, which frequently involves hopping between autonomous systems.
When talking about BGP, the internet is divided into networks known as autonomous systems (ASs). An AS is a large network or group of networks that shares a unified routing policy. Each AS is owned and managed by a single entity, such as an ISP, a company, a government, or a major university. BGP is the protocol used between ASs to determine which route your data takes to reach its destination.
BGP makes routing decisions based on paths, rules, or network policies configured by a network administrator. When it comes to data routing, however, the shortest path isn’t always the best one. There are many reasons why a routing algorithm would choose one path over another. Cost can be a factor, for example, because some networks charge others to carry their traffic.
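To make the idea of "policy before distance" concrete, here is a minimal sketch in Python. It is an illustration only, not how any real router is implemented: the Route class, the best_route function, and the AS numbers and address block (taken from ranges reserved for documentation) are all assumptions made for this example. Real BGP applies a longer list of tie-breakers than the two shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of one candidate route."""
    prefix: str            # destination address block
    as_path: tuple         # AS numbers the route passes through
    local_pref: int = 100  # operator-set preference; higher wins


def best_route(candidates):
    """Prefer the highest local_pref, then the shortest AS path."""
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))


routes = [
    # Shorter path, default preference.
    Route("203.0.113.0/24", (64500, 64510), local_pref=100),
    # Longer path, but the administrator prefers it (for example, it is cheaper).
    Route("203.0.113.0/24", (64501, 64502, 64510), local_pref=200),
]

chosen = best_route(routes)
print(chosen.as_path)
```

In this sketch the longer path wins because the administrator's preference is checked before path length, which is the sense in which the shortest way is not always the best.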
How does BGP work?
BGP has been compared to various things. Cloudflare described BGP as the postal service of the Internet. In mail delivery, a postal service system involves post office branches and mailboxes — public boxes where mail is placed to be collected by the post office. The mail put into each mailbox must go through the local postal branch before being routed to another destination.
ASs are like individual post office branches, and the internal routers within an AS are like mailboxes. The routers forward their outgoing transmissions to the AS, which then uses BGP to send these transmissions to their destinations.
Meanwhile, The Verge described BGP as a map provider. We can imagine BGP as a bunch of people making and updating maps that show your data how to get to its destination site. With new routes popping up and existing routes becoming unavailable, the maps have to be constantly updated.
The structure of the internet is also constantly changing. Therefore, every AS must be kept up to date on new routes it can use and on old routes that are no longer available. As it would be a hassle for each AS to map the entire internet by itself, ASs share their maps. They communicate with neighboring ASs and copy any updates those neighbors have made to their maps.
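The map-sharing idea can be sketched as a small table that changes whenever a neighbor announces or withdraws a route. This is a simplified model written for this article, not a real BGP implementation: the RouteTable class, its method names, and the AS numbers and address block (from ranges reserved for documentation) are assumptions.

```python
class RouteTable:
    """Hypothetical 'map' kept by one AS: prefix -> {neighbor AS: AS path}."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, neighbor_as, as_path):
        """A neighbor tells us it can reach the prefix via as_path."""
        self.routes.setdefault(prefix, {})[neighbor_as] = tuple(as_path)

    def withdraw(self, prefix, neighbor_as):
        """A neighbor tells us it can no longer reach the prefix."""
        paths = self.routes.get(prefix, {})
        paths.pop(neighbor_as, None)
        if not paths:
            # No neighbor offers a path any more: the prefix is off the map.
            self.routes.pop(prefix, None)

    def reachable(self, prefix):
        return prefix in self.routes


table = RouteTable()
table.announce("203.0.113.0/24", 64500, [64500, 64510])
table.announce("203.0.113.0/24", 64501, [64501, 64502, 64510])

table.withdraw("203.0.113.0/24", 64500)
still_reachable = table.reachable("203.0.113.0/24")

table.withdraw("203.0.113.0/24", 64501)
now_reachable = table.reachable("203.0.113.0/24")

print(still_reachable, now_reachable)
```

As long as one neighbor still offers a path, the destination stays on the map. Once every path has been withdrawn, the table has no way to reach it at all.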
What really happened to Facebook?
Facebook representatives have shared their version of why the outage occurred, but The Verge has simplified it for us.
According to The Verge, the outage occurred during Facebook’s routine maintenance. A command issued as part of the maintenance accidentally disconnected all of Facebook’s data centers. When the company’s DNS servers saw that they could no longer reach the data centers over the network backbone, they stopped sending out BGP advertisements, because the lost connection was a sign that something had gone wrong. To the rest of the internet, this looked like Facebook asking to have its servers taken off the maps. In fact, Cloudflare reported that it had seen BGP updates from Facebook in the form of route withdrawals shortly before Facebook went offline.
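The behavior described above, where a failed health check leads to route withdrawals, can be sketched roughly as follows. This is a toy model based only on the description in this article, not Facebook's actual code: the function names are invented, and the address blocks are documentation ranges rather than Facebook's real prefixes.

```python
def desired_advertisements(prefixes, backbone_reachable):
    """Hypothetical rule: only advertise routes while the backbone is healthy."""
    return set(prefixes) if backbone_reachable else set()


def diff_updates(current, desired):
    """Work out which BGP updates are needed to move from current to desired."""
    announce = sorted(desired - current)
    withdraw = sorted(current - desired)
    return announce, withdraw


# Placeholder prefixes for the DNS servers (not real Facebook address blocks).
dns_prefixes = ["192.0.2.0/24", "198.51.100.0/24"]

# Before the maintenance command: the backbone is up, so both are advertised.
advertised = desired_advertisements(dns_prefixes, backbone_reachable=True)

# After the command: the health check fails, so nothing should be advertised.
wanted = desired_advertisements(dns_prefixes, backbone_reachable=False)
announce, withdraw = diff_updates(advertised, wanted)

print(announce, withdraw)
```

Under these assumptions, a single failed health check turns into a withdrawal of every advertised prefix, which matches the wave of route withdrawals that outside observers reported.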
In summary, BGP did play a part in the Facebook outage, but it wasn’t the root cause. Facebook’s BGP advertisements took its services off the map. However, that only happened because the company’s infrastructure was already down for other reasons, so the Facebook servers shown on the maps could no longer be reached.