6 Surprising Lessons from a CUBIC Congestion Control Bug in QUIC
In the world of network protocols, even the most robust algorithms can harbor hidden quirks. CUBIC, the default congestion controller in Linux (standardized in RFC 9438), governs how most TCP and QUIC connections probe bandwidth and handle loss. At Cloudflare, our open-source QUIC implementation (quiche) relies on CUBIC for a significant portion of traffic. This article unveils a peculiar bug where CUBIC's congestion window gets permanently stuck at its minimum, never recovering from collapse. The journey began with a Linux kernel update aligning CUBIC with the app-limited exclusion rule—a fix that, when ported to quiche, triggered unexpected failures. The resolution? A near one-line code change that elegantly restored sanity. Here are six key takeaways from this investigation.
1. CUBIC's Core Logic: A Quick Refresher
CUBIC operates by adjusting the congestion window (cwnd), a sender-side cap on how many bytes may be in flight. While no loss is detected, CUBIC grows cwnd along a cubic curve: quickly at first, flattening as it approaches the window size where loss last occurred, then probing aggressively beyond it. Upon detecting loss, it assumes the path's capacity has been exceeded and shrinks cwnd multiplicatively. This loss-based approach treats packet loss as the primary congestion signal, which has limitations, especially in modern networks with diverse traffic patterns. Understanding this foundation is crucial to grasping how a small bug can cause big problems.
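To make the shape of that growth concrete, here is a minimal sketch of CUBIC's window curve as defined in RFC 9438. Units are simplified to segments and seconds; a real implementation such as quiche tracks bytes and considerably more state.

```rust
// Minimal sketch of CUBIC's window growth curve (RFC 9438).

const C: f64 = 0.4; // CUBIC scaling constant from the RFC
const BETA_CUBIC: f64 = 0.7; // multiplicative decrease factor

/// Window size `t` seconds after the last loss event, given the
/// window `w_max` at which that loss occurred.
fn w_cubic(t: f64, w_max: f64) -> f64 {
    // K is the time the curve takes to climb back to w_max.
    let k = (w_max * (1.0 - BETA_CUBIC) / C).cbrt();
    C * (t - k).powi(3) + w_max
}

fn main() {
    let w_max = 100.0; // segments
    let k = (w_max * (1.0 - BETA_CUBIC) / C).cbrt();
    // Immediately after loss the window restarts at beta * w_max...
    println!("t=0s  -> {:.1}", w_cubic(0.0, w_max)); // ~70
    // ...plateaus back at w_max around t = K...
    println!("t=K   -> {:.1}", w_cubic(k, w_max)); // ~100
    // ...then probes aggressively for new capacity beyond it.
    println!("t=K+2 -> {:.1}", w_cubic(k + 2.0, w_max));
}
```

Right after a loss the curve starts at beta × W_max (70% of the old window with the standard beta of 0.7), plateaus near W_max around t = K, and only then probes past the previous loss point.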

2. The App-Limited Exclusion: A Necessary Fix
RFC 9438 introduced the app-limited exclusion principle: when a connection is application-limited (it has no data waiting to send), CUBIC should not grow cwnd based on ACKs from that period, because the window was never actually tested against the network. This prevents cwnd from inflating during idle or underutilized stretches. The Linux kernel implemented this rule to fix a real TCP issue, ensuring CUBIC grows its window only when actual data flow justifies it. The change seemed straightforward, but its implications for QUIC were not immediately obvious.
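The rule amounts to skipping window growth for ACKs received while the sender was not actually using its window. The following Rust sketch illustrates the idea with hypothetical names; it is not the kernel's or quiche's actual code, and the growth step is simplified away from the cubic curve.

```rust
// Hedged sketch of the app-limited exclusion: on each ACK, grow
// cwnd only when the sender was actually constrained by the window.

struct Sender {
    cwnd: u64,         // congestion window, in bytes
    app_limited: bool, // true if we ran out of data to send
}

impl Sender {
    fn on_ack(&mut self, acked_bytes: u64) {
        if self.app_limited {
            // The window was not the bottleneck, so this ACK tells us
            // nothing about available capacity: leave cwnd alone.
            return;
        }
        // Simplified growth; real CUBIC follows its cubic curve.
        self.cwnd += acked_bytes;
    }
}

fn main() {
    let mut s = Sender { cwnd: 10_000, app_limited: true };
    s.on_ack(1_200);
    assert_eq!(s.cwnd, 10_000); // frozen while app-limited
    s.app_limited = false;
    s.on_ack(1_200);
    assert_eq!(s.cwnd, 11_200); // grows once real data flow resumes
    println!("cwnd = {}", s.cwnd);
}
```

In steady state this is exactly the right behavior: an idle sender should not earn a bigger window.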
3. Porting to QUIC: Unexpected Behavior Surfaces
When we ported the app-limited exclusion fix from Linux's CUBIC to quiche, the behavior deviated from expectations. QUIC, unlike TCP, multiplexes streams and acknowledges packets differently. The fix, originally designed for a TCP stack, interacted poorly with quiche's handling of idle periods and loss recovery. This mismatch triggered a cascade of failures, demonstrating how protocol-specific nuances can break a seemingly universal fix.
4. The Symptom: A Test Fails 61% of the Time
Our investigation began with erratic failures in ingress proxy integration tests. The test simulated heavy packet loss early in a connection—a scenario where CUBIC should reduce cwnd and then recover. However, in 61% of runs, the connection never recovered; cwnd stayed at its minimum value. This was not a steady-state issue but a corner case in recovery after congestion collapse. Such bugs are rarely caught by typical throughput tests, highlighting the importance of edge-case testing.

5. Root Cause: Cwnd Pinned at Minimum
Digging into the code, we found that the app-limited exclusion check inadvertently prevented cwnd from ever increasing after collapse. When the connection was in recovery and briefly became app-limited (no data to send for a moment), the check froze cwnd. Since recovery involves multiple rounds of sending small amounts of data, each round trip could hit an app-limited state, permanently locking cwnd at its minimum. The exclusion correctly assumes that an app-limited period carries no congestion information, but applying that assumption during loss recovery is exactly wrong: it suppresses the very growth the connection needs to climb out of collapse.
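This Rust sketch reproduces the interaction in miniature (the names are hypothetical, not quiche's actual code). Because a sender in recovery can only trickle out small amounts of data, it frequently looks app-limited, and the unconditional exclusion check then freezes cwnd at its floor on every round.

```rust
// Illustrative reproduction of the stuck-cwnd interaction.

const MIN_CWND: u64 = 2 * 1_200; // two packets, a typical cwnd floor

struct Cubic {
    cwnd: u64,
    in_recovery: bool, // never consulted below: that is the bug
    app_limited: bool,
}

impl Cubic {
    // Buggy ordering: the app-limited check runs unconditionally,
    // before any consideration of recovery state.
    fn on_ack_buggy(&mut self, acked: u64) {
        if self.app_limited {
            return; // freezes cwnd even while recovering from collapse
        }
        self.cwnd += acked;
    }
}

fn main() {
    let mut c = Cubic { cwnd: MIN_CWND, in_recovery: true, app_limited: true };
    for _ in 0..10 {
        c.on_ack_buggy(1_200); // every recovery round looks app-limited
    }
    assert_eq!(c.cwnd, MIN_CWND); // pinned at the minimum, round after round
    println!("in_recovery = {}, cwnd = {}", c.in_recovery, c.cwnd);
}
```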
6. The Elegant One-Line Fix
The solution was surprisingly simple: a single line that reordered the app-limited check. By ensuring that the app-limited logic only applies when the connection is not in recovery, we allowed CUBIC to grow cwnd normally after congestion collapse. This fix restored recovery behavior without breaking the intended app-limited exclusion. It underscores how a small change can resolve a complex bug, and reminds us that even well-tested algorithms have delicate state interactions.
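Continuing the illustrative sketch from above (again with hypothetical names and a simplified recovery-exit condition, not quiche's actual patch), the reordered check looks like this:

```rust
// Sketch of the reordered app-limited check.

const MIN_CWND: u64 = 2 * 1_200; // two packets, a typical cwnd floor

struct Cubic {
    cwnd: u64,
    in_recovery: bool,
    app_limited: bool,
}

impl Cubic {
    fn on_ack_fixed(&mut self, acked: u64) {
        // Fixed ordering: while recovering from collapse, always let
        // cwnd grow back; only honor the app-limited exclusion in the
        // connection's normal steady state.
        if self.app_limited && !self.in_recovery {
            return;
        }
        self.cwnd += acked;
        if self.cwnd > 4 * MIN_CWND {
            self.in_recovery = false; // simplified exit condition
        }
    }
}

fn main() {
    let mut c = Cubic { cwnd: MIN_CWND, in_recovery: true, app_limited: true };
    for _ in 0..10 {
        c.on_ack_fixed(1_200);
    }
    assert!(c.cwnd > MIN_CWND); // cwnd recovers instead of staying pinned
    assert!(!c.in_recovery);
    println!("cwnd after recovery rounds = {}", c.cwnd);
}
```

With the same ten app-limited ACKs that left the buggy version stuck, the window now climbs out of the floor and the exclusion resumes doing its intended job once recovery ends.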
The story of this bug is a testament to the intricacies of network protocol implementations. What worked for TCP in the Linux kernel needed careful adaptation for QUIC. The fix not only solved the 61% failure rate but also improved the reliability of Cloudflare's QUIC traffic. For developers working on congestion control, this serves as a valuable lesson: always test corner cases, and never assume that a patch from one stack will seamlessly transfer to another.