CenturyLink and Level 3’s North American network outage possibly affected up to 3.5% of global web traffic. Agora’s SD-RTN™ is designed to ensure low impact from such outages. This article outlines the challenges and how Agora solves them.
Level 3 Network Outage
On August 30, 2020, CenturyLink, one of the largest telecommunications operators in the United States, suffered an outage in its Level 3 network due to Border Gateway Protocol (BGP) routing errors. The problem set off a chain reaction that paralyzed many services, including Twitter, Xbox Live, Garmin, Steam, Discord, and Blizzard. According to statistics from Cloudflare, a US CDN provider, the problem ultimately caused a 3.5% drop in global web traffic.
CenturyLink stated that the outage was caused by a problematic Flowspec announcement issued by the Mississauga data center. This announcement prevented BGP sessions from being established across the entire CenturyLink/Level 3 network. Flowspec is an extension of BGP that allows operators to quickly distribute firewall rules throughout their networks. It is often used to respond to security incidents, such as BGP hijacking or DDoS attacks, because it can push updates across an entire network and mitigate an attack within seconds. However, an abnormal Flowspec announcement can cause BGP routing to fail, as it did in this case.
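To make the idea of a Flowspec rule concrete, here is a minimal Python sketch of a rule as a set of match fields plus a rate-limit action, where a rate of zero discards the traffic. The class and function names are hypothetical, the model is heavily simplified, and the second rule in the usage note is only an illustration of how a rule that is too broad could block BGP itself (BGP runs over TCP port 179). It is not the actual announcement from the incident.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class FlowspecRule:
    """Simplified, hypothetical model of a Flowspec rule."""
    dst_prefix: str            # destination prefix to match
    protocol: Optional[int]    # IP protocol number (6 = TCP, 17 = UDP); None matches any
    dst_port: Optional[int]    # destination port; None matches any
    rate_limit_bps: float      # 0 means "discard all matching traffic"

    def matches(self, dst_ip: str, protocol: int, dst_port: int) -> bool:
        if ip_address(dst_ip) not in ip_network(self.dst_prefix):
            return False
        if self.protocol is not None and protocol != self.protocol:
            return False
        if self.dst_port is not None and dst_port != self.dst_port:
            return False
        return True


def action_for(rules: List[FlowspecRule], dst_ip: str, protocol: int, dst_port: int) -> str:
    """Return the action of the first matching rule, or "forward" if none matches."""
    for rule in rules:
        if rule.matches(dst_ip, protocol, dst_port):
            return "drop" if rule.rate_limit_bps == 0 else "rate-limit"
    return "forward"
```

With a narrow rule such as FlowspecRule("203.0.113.0/24", 17, 53, 0), only UDP traffic to port 53 in that prefix is dropped. A rule such as FlowspecRule("0.0.0.0/0", 6, 179, 0) would instead drop every BGP session that the router applying it tries to establish, which is the kind of self-inflicted failure an abnormal announcement can cause.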
CenturyLink manages the world’s largest and most complex autonomous system (or AS). An AS refers to the entirety of all IP networks and routers under the jurisdiction of one or more entities. Together the world’s autonomous systems coordinate to manage the global internet’s routing strategy, and CenturyLink manages a huge number of routing tables as part of that strategy. Consequently, the problematic Flowspec announcement within their network resulted in widespread global network outages.
Due to the nature of the BGP protocol, it often takes several hours or more to gradually recover after the failure has spread. In this case, it took about 7 hours for CenturyLink to resume service. Network failures of this type are difficult to eliminate completely, and they are unpredictable.
Agora immediately detected the network abnormality in North America on August 30 and responded automatically, switching lines and disabling the impacted data centers to minimize the impact on our customers' users. The exception was users who could reach the Internet only through Level 3.
How Did We Do It?
The Agora SD-RTN™ has been designed to shield our services from network failures of this type. As part of our network design, a real-time dynamic allocation strategy is built into the Agora SDK. Once a network failure is detected, the user’s connection is allocated to available nodes and the user’s normal network experience is restored within seconds.
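The allocation step described above can be sketched as follows. This is a minimal illustration rather than the Agora SDK's actual logic: the EdgeNode type, the health flag, and the lowest-RTT selection rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class EdgeNode:
    """Hypothetical view of an access node as seen by a client."""
    name: str
    healthy: bool
    rtt_ms: float  # last measured round-trip time from the client


def pick_node(nodes: Dict[str, EdgeNode]) -> Optional[EdgeNode]:
    """Return the healthy node with the lowest RTT, or None if none is healthy."""
    healthy = [n for n in nodes.values() if n.healthy]
    return min(healthy, key=lambda n: n.rtt_ms) if healthy else None


def reallocate(current: str, nodes: Dict[str, EdgeNode]) -> Optional[str]:
    """Keep the current node while it is healthy; otherwise fail over."""
    node = nodes.get(current)
    if node is not None and node.healthy:
        return current
    best = pick_node(nodes)
    return best.name if best else None
```

In this sketch a client stays on its current node until that node is marked unhealthy, which avoids needless switching, and only then moves to the best remaining node.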
In addition to the real-time dynamic allocation strategy, Agora has implemented other strategies to defend against network failures, including multiple transmission and dynamic transmission strategies. When users access our SD-RTN™ virtual network, we automatically generate a transmission strategy for the entire network. Using this Level 3 network failure as an example, if there is traffic in the Agora SD-RTN™ network that passes through Level 3 nodes, our service automatically senses the faulty line, dynamically updates the transmission strategy, and automatically selects the optimal transmission link. The faulty route is seamlessly avoided.
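One way to picture the dynamic transmission strategy is as a shortest-path search that is rerun with the faulty nodes excluded. The sketch below uses Dijkstra's algorithm over a small latency-weighted graph. The graph, node names, and latency figures are invented for illustration, and the source does not state which algorithm the SD-RTN™ actually uses.

```python
import heapq
from typing import Dict, Iterable, List, Optional, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: latency in ms}


def best_path(
    graph: Graph, src: str, dst: str, faulty: Iterable[str] = ()
) -> Optional[Tuple[float, List[str]]]:
    """Dijkstra's shortest path from src to dst that skips every faulty node."""
    blocked = set(faulty)
    if src in blocked or dst in blocked:
        return None
    heap: List[Tuple[float, str, List[str]]] = [(0.0, src, [src])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, latency in graph.get(node, {}).items():
            if nxt not in visited and nxt not in blocked:
                heapq.heappush(heap, (cost + latency, nxt, path + [nxt]))
    return None  # no route avoids the faulty nodes


# Hypothetical topology: the fastest route normally crosses a Level 3 node.
example_graph: Graph = {
    "client": {"level3": 10, "other": 25},
    "level3": {"server": 10},
    "other": {"server": 15},
    "server": {},
}
```

Calling best_path(example_graph, "client", "server") selects the route through "level3". Passing faulty={"level3"} makes the same call return the slower route through "other", which mirrors how a faulty line is avoided once it has been detected.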
With these strategies, Agora can keep most of our customers' services available even when the Level 3 network is temporarily unavailable.
At Agora, we understand that reliability is important to our customers, and that no technology can be 100% reliable. We are committed to continuing our investment in technology, people, and processes to provide our customers with the best possible service.