At 6:27 AM Eastern Time this morning, 20% of the internet went dark.
The cause: a configuration file, designed to manage threat traffic, grew beyond its expected size on Cloudflare’s servers.
That’s it. No sophisticated attack, no hardware failure, no natural disaster.
That one file cascaded into an outage that Cloudflare’s Chief Technology Officer Dane Knecht summed up bluntly:
“We failed our customers and the broader Internet.”
For the next three hours, roughly a fifth of the world’s websites went dark. X couldn’t load. ChatGPT stopped responding. Spotify fell silent. Discord, League of Legends, Zerodha, and thousands of other services displayed nothing but cryptic 500 errors.

Even Downdetector — the site millions reflexively check when the internet breaks — couldn’t serve its own outage reports. The irony landed hard on social media once services returned.
Yet during those same three hours, real-time video calls, live streams, and interactive broadcasts powered by Agora continued without interruption. While one of the internet’s largest infrastructure providers experienced cascading failures affecting an estimated 20% of websites worldwide, applications built on Agora’s Software Defined Real-Time Network maintained carrier-grade quality.
When Latency Becomes Lost Time
There’s a category distinction that matters deeply here, one that often gets obscured in discussions of “uptime” and “service availability.”
When a webpage takes extra seconds to load or an email is delayed, the content eventually arrives intact. Real-time communication operates under a fundamentally different contract.
When a live video call stutters, those frames are gone. When an audio stream drops packets, those words can’t be reconstructed. The minutes spent waiting for a call to reconnect or straining to understand garbled audio represent an irreversible expenditure of every participant’s attention and time. That’s not a service degradation — that’s a service failure, regardless of whether the technical systems eventually recover.
This is why telecommunications operators developed the concept of “carrier-grade quality.” It’s not marketing language. It’s an acknowledgment that certain services — emergency communications, financial trading floors, medical consultations — require a qualitatively different approach to reliability than services where retransmission and buffering can paper over network imperfections.
At Agora, the conviction is that real-time engagement services must be designed with carrier-grade quality as a foundational requirement, not an aspirational target. Every architectural decision flows from this premise.
The Single Vendor Problem
The Cloudflare incident and its predecessors illuminate a structural vulnerability in how modern internet infrastructure has evolved.
Cloudflare provides services to an estimated 20% of all websites worldwide. Amazon Web Services hosts an enormous portion of the internet’s applications and data. Microsoft Azure underpins critical enterprise systems across industries. These are excellent services, engineered by talented teams, serving vital functions.
They’re also, as Alp Toker of NetBlocks told the BBC following today’s outage, “one of the internet’s largest single points of failure.”
This isn’t a criticism of these companies’ engineering. It’s a recognition of an architectural pattern that has emerged organically as the internet has consolidated around a small number of hyperscale providers. The economics make sense. The convenience is undeniable. The security capabilities these providers offer are often beyond what individual organizations could develop independently.
But the pattern creates a specific kind of risk: correlated failures that cascade across enormous portions of the internet simultaneously. When a configuration file grows beyond expected size at one of these providers, the consequences don’t affect a single customer or even a handful. They affect thousands of businesses and millions of users at once.
For services where latency tolerance is measured in hundreds of milliseconds and where every disruption directly wastes human attention and time, this concentration risk is unacceptable.
Designing for Resilience: The SD-RTN Architecture
The reason Agora services remained operational while Cloudflare-dependent services failed lies in a fundamental architectural philosophy: never rely on any single vendor’s service reliability. This isn’t distrust of partners. It’s recognition that carrier-grade quality for real-time services requires architectural decisions that no individual vendor can provide alone, regardless of their engineering excellence.
The Agora Software Defined Real-Time Network (SD-RTN) was built from first principles to deliver carrier-grade quality even when portions of the underlying internet infrastructure are degraded or failing. This approach proved its worth today — not in theory, but in practice.
The architecture accomplishes this through three core design principles:
1. Geographic Redundancy Without Single Points of Failure
The SD-RTN maintains globally distributed Points of Presence (POPs) that operate as an interconnected mesh. Every POP measures performance to every other POP continuously, in real time, rather than at periodic intervals. This lets routing respond to actual network conditions as they happen, rather than relying on standard internet routing protocols that optimize for cost rather than latency or reliability.
When Cloudflare’s configuration error propagated through its global network this morning, the SD-RTN simply routed around degraded paths, because the architecture assumes that some portion of the internet will always be experiencing problems.
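As a rough illustration of latency-aware routing over a POP mesh, the sketch below runs a standard shortest-path search (Dijkstra's algorithm) over a table of measured link latencies. The POP names, the latency figures, and the function name `lowest_latency_path` are all hypothetical. The SD-RTN's actual routing algorithm is not described here, so treat this only as one plausible way to picture the idea.

```python
import heapq


def lowest_latency_path(latency_ms, src, dst):
    """Dijkstra's algorithm over a dict-of-dicts of measured link latencies (ms).

    Returns (path, total_ms), or (None, inf) if dst is unreachable.
    """
    best = {src: 0.0}          # lowest known latency to each POP
    prev = {}                  # back-pointers used to rebuild the path
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue           # stale heap entry, a better route was already found
        for nxt, ms in latency_ms.get(node, {}).items():
            new_cost = cost + ms
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if dst not in best:
        return None, float("inf")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], best[dst]


# Hypothetical three-POP mesh; the numbers are invented, not measurements.
mesh = {
    "fra": {"nyc": 80.0, "lon": 12.0},
    "lon": {"nyc": 70.0, "fra": 12.0},
    "nyc": {"fra": 80.0, "lon": 70.0},
}
healthy_path, healthy_ms = lowest_latency_path(mesh, "fra", "nyc")

# Simulate the direct link degrading, then re-run the same search.
mesh["fra"]["nyc"] = 400.0
degraded_path, degraded_ms = lowest_latency_path(mesh, "fra", "nyc")
```

With the healthy table the direct `fra` to `nyc` link wins. Once that link's measured latency jumps, the same search picks the detour through `lon` with no manual intervention, which is the behaviour the text describes.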
2. Redundant Transmission Across Multiple Paths
The SD-RTN sends each packet over several separate optimized paths simultaneously. The copy that arrives first is used; late or lost copies are ignored, because their job has already been done by an alternate path.
This fundamentally changes reliability characteristics: rather than depending on any single path or vendor to perform correctly, the system assumes some paths will fail at any given moment and builds resilience into the transmission strategy itself.
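The receiving side of this idea can be sketched in a few lines: keep the first copy of each sequence number and drop every later duplicate. The class name, the packet fields, and the path labels below are hypothetical, and the loss calculation assumes paths fail independently. Real paths can share infrastructure, so the true figure would be higher.

```python
class FirstArrivalReceiver:
    """Accepts the first copy of each sequence number, ignores later copies."""

    def __init__(self):
        # A production receiver would bound this with a sliding window;
        # an unbounded set keeps the sketch short.
        self.seen = set()
        self.delivered = []

    def on_packet(self, seq, payload, path_id):
        """Return True if this copy was delivered, False if it was a duplicate."""
        if seq in self.seen:
            return False
        self.seen.add(seq)
        self.delivered.append((seq, payload, path_id))
        return True


def all_paths_fail_probability(per_path_loss):
    """Chance that every copy of a packet is lost, ASSUMING independent paths."""
    p = 1.0
    for loss in per_path_loss:
        p *= loss
    return p


# Hypothetical arrival order: copies of the same packet race over paths A, B, C.
arrivals = [
    (1, "frame-1", "path-B"),
    (1, "frame-1", "path-A"),   # duplicate, arrives second
    (2, "frame-2", "path-A"),
    (3, "frame-3", "path-C"),
    (2, "frame-2", "path-B"),   # duplicate, arrives late
]
rx = FirstArrivalReceiver()
accepted = [rx.on_packet(*pkt) for pkt in arrivals]

# Three paths, each losing 5% of packets (an invented figure).
combined_loss = all_paths_fail_probability([0.05, 0.05, 0.05])
```

Under the independence assumption, three paths that each lose 5% of packets lose all three copies only about 0.0125% of the time. That is why a packet delivered by any one path makes the fate of the other copies irrelevant.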
3. End-to-End Quality Management
Beyond the network backbone, Agora’s SDK handles anti-packet-loss measures during the last-mile journey to end users. This layered approach — from backbone to edge — ensures consistent quality even when parts of the underlying internet infrastructure are degraded.
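The specific anti-packet-loss measures used in the SDK are not named here. One widely used technique in this family is forward error correction, and the sketch below shows its simplest form: one XOR parity packet per group, which lets the receiver rebuild a single lost packet without a retransmission. The function names and the sample payloads are hypothetical, and this is an illustration of the general technique, not a description of Agora's SDK.

```python
def xor_parity(packets):
    """XOR a group of equal-length packets into a single parity packet."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild one lost packet (marked None) from the survivors plus parity.

    A single parity packet can repair exactly one loss per group; with zero
    or multiple losses the group is returned unchanged.
    """
    missing = [i for i, pkt in enumerate(received) if pkt is None]
    if len(missing) != 1:
        return list(received)
    survivors = [pkt for pkt in received if pkt is not None]
    rebuilt = xor_parity(survivors + [parity])
    repaired = list(received)
    repaired[missing[0]] = rebuilt
    return repaired


# Sender side: three equal-length payloads plus one parity packet.
group = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(group)

# Receiver side: the middle packet never arrived on the last mile.
repaired = recover_missing([b"abcd", None, b"ijkl"], parity)
```

The trade-off is extra bandwidth (one additional packet per group in this sketch) in exchange for avoiding a retransmission round trip, which matters when late audio or video is as useless as lost audio or video.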
Performance Under Real Conditions
Architecture claims are easy. Measured performance under real-world conditions is what matters. The following data compares public internet routing versus SD-RTN routing across geographic scenarios, examining latency across percentiles — because the tail of the distribution is where reliability reveals itself.
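To see why tail percentiles say more than averages, the sketch below computes p50, p95, and p99 from a set of latency samples using the nearest-rank method. The samples are synthetic and invented purely for illustration. They are not measurements of the public internet or of the SD-RTN, and the helper name `percentile` is hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# SYNTHETIC one-way latencies in ms: mostly fast, with a slow tail.
samples = [40] * 90 + [120] * 8 + [900] * 2

mean_ms = sum(samples) / len(samples)
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)
```

In this made-up distribution the mean and the median both look healthy, while p99 exposes the 900 ms stalls that a participant in a live call would actually notice.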
Why This Matters for High-Value Use Cases
This morning’s Cloudflare incident lasted three hours. For social media platforms or streaming music services, that meant frustrated users and lost engagement — inconvenient, damaging to brand perception, but fundamentally recoverable.
For real-time services, the calculus is different. During those same three hours:
- Live medical consultations between specialists and on-site physicians continued uninterrupted on Agora
- Real-time financial communications kept flowing at their usual low latency
- Emergency response systems coordinating across agencies remained operational
- Live events broadcasting to paying audiences never went dark
The difference wasn’t luck. It was architecture. When portions of the internet failed, systems built on vendor-dependent infrastructure went down. Systems built on resilient, multi-path architectures kept running.
For services where three hours of downtime means failed procedures, missed trades, compromised emergency response, or lost broadcast revenue — the time can’t be recovered. These aren’t degraded experiences; they’re failed experiences.
This is why Agora maintains that only carrier-grade service quality can meet the demands of high-value business and societal needs. The question isn’t whether a provider has “good uptime.” It’s whether the architecture is fundamentally designed around the premise that every second of disruption destroys value that can’t be recovered.
Beyond Uptime: A Philosophy of Resilience
The recent run of infrastructure incidents (Cloudflare this morning, AWS and Azure in October) isn’t evidence of poor engineering. It’s evidence that the internet has evolved architectural patterns that create correlated risk at massive scale. As EMARKETER analyst Jacob Bourne told Business Insider: “We’re seeing outages happen more frequently, and they’re taking longer to fix.”
For real-time communication services, the response can’t be hoping infrastructure providers improve reliability. The response must be designing architectures that never rely on any single vendor’s service reliability in the first place.
Today demonstrated what this means in practice. While services dependent on a single infrastructure provider went dark, Agora’s services remained operational — not because of operational heroics, but because the SD-RTN’s architecture fundamentally assumes that portions of infrastructure will always be degraded somewhere and routes around them automatically.
Globally distributed points of presence eliminating single-site dependency. Real-time routing intelligence responding to actual network conditions. Redundant transmission across multiple optimized paths. These aren’t features layered onto conventional architecture — they’re the architecture, built from first principles around the conviction that real-time engagement requires real-time resilience.
When the next infrastructure incident occurs — and it will — the question for any real-time service will be: Does your architecture assume every component will always work perfectly? Or does it assume something will always be degraded somewhere, and deliver consistent quality anyway?
Carrier-grade resilience isn’t about building systems that never fail. It’s about building systems that deliver carrier-grade quality even when parts of the underlying infrastructure are failing.
For real-time communication, anything less is unacceptable.


