How Cloudflare Strengthened Its Network Through the 'Code Orange: Fail Small' Initiative


Introduction: Building a More Resilient Network

Over the past several months, Cloudflare embarked on an ambitious engineering initiative internally called Code Orange: Fail Small. The goal was straightforward yet critical: make the network infrastructure more resilient, secure, and reliable for every customer. While resilience is an ongoing priority—never a task to be marked complete—the team has now finished the core work that would have prevented the global outages on November 18, 2025, and December 5, 2025. This article dives into the key improvements delivered and what they mean for your traffic and services.

Source: blog.cloudflare.com

Safer Configuration Changes

One of the primary focus areas was how configuration changes are deployed across Cloudflare’s network. Previously, internal configuration updates could propagate instantly, raising the risk of widespread impact if a change introduced a fault. Now, through a new methodology called health-mediated deployment, configuration changes are rolled out gradually with real-time health monitoring. This allows observability tools to detect problems early and automatically revert changes before they affect customer traffic.
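To make the idea concrete, here is a minimal sketch of a health-mediated rollout loop. It is an illustration only, not Cloudflare's implementation: the function name, the stage layout, and the `apply`, `revert`, and `is_healthy` callbacks are all hypothetical.

```python
from typing import Callable, Sequence


def health_mediated_rollout(
    stages: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    revert: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Apply a change stage by stage; revert everything on the first unhealthy signal.

    All names here are hypothetical. `stages` is an ordered list of target
    groups, typically starting small (a canary) and growing with each step.
    """
    applied: list[str] = []
    for stage in stages:
        # Apply the change to every target in the current stage.
        for target in stage:
            apply(target)
            applied.append(target)
        # Gate on health before widening the rollout.
        if not all(is_healthy(target) for target in stage):
            # Undo in reverse order so the most recent targets recover first.
            for target in reversed(applied):
                revert(target)
            return False
    return True
```

Called with stages such as `[["canary"], ["a", "b"], ["c", "d", "e"]]`, a failed health check in the second stage reverts the three targets already touched and never reaches the third stage.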

Introducing Snapstone: A Unified Health-Mediated Deployment System

At the heart of this improvement is a new internal component named Snapstone. Snapstone bundles configuration changes into packages and releases them step by step, applying health mediation principles. Before Snapstone, teams had to create custom solutions to achieve progressive rollout—a cumbersome process that was inconsistently applied. Snapstone closes this gap by providing a default, unified mechanism for gradual deployment, continuous health checks, and automated rollbacks.

What sets Snapstone apart is its flexibility. It is not a fix for a single past failure but a versatile tool that can handle any configuration unit requiring health mediation. Whether it’s a data file like the one that caused the November 18 outage, or a control flag in the global configuration system like the one involved in the December 5 outage, teams can define these units and let Snapstone manage them. This adaptability ensures that high-risk configuration pipelines are now safer by design.
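As a rough illustration of what defining such units and bundling them into a package might look like, consider the sketch below. Snapstone's real interfaces are not described in this article, so every name here (`UnitKind`, `ConfigUnit`, `Package`, the `version` property) is an assumption made for the example.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class UnitKind(Enum):
    """Hypothetical kinds of configuration unit, mirroring the two examples above."""
    DATA_FILE = "data_file"
    CONTROL_FLAG = "control_flag"


@dataclass(frozen=True)
class ConfigUnit:
    name: str
    kind: UnitKind
    payload: bytes


@dataclass(frozen=True)
class Package:
    """A bundle of units that would be released and rolled back together."""
    units: tuple[ConfigUnit, ...]

    @property
    def version(self) -> str:
        # Content-derived identifier: the same units always yield the same
        # version, regardless of the order in which they were added.
        digest = hashlib.sha256()
        for unit in sorted(self.units, key=lambda u: u.name):
            for part in (unit.name.encode(), unit.kind.value.encode(), unit.payload):
                # Length-prefix each part so adjacent fields cannot blur together.
                digest.update(len(part).to_bytes(8, "big"))
                digest.update(part)
        return digest.hexdigest()[:12]
```

Deriving the version from content is one possible design choice: it lets a rollback target be named unambiguously, since two packages with identical units share an identifier.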

Reducing the Impact of Failure

Beyond configuration changes, Cloudflare focused on limiting the blast radius of any potential failure. The “fail small” philosophy means that if an issue does occur, it should affect only a small portion of the network, not the entire global infrastructure. This involved reworking internal systems to isolate failures and implementing redundancy at multiple layers. The result is a network that can absorb localized problems without cascading into widespread outages.
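The article does not say how failures are isolated internally. One common way to reason about blast radius is to partition nodes into independent cells, so that a fault confined to one cell touches only that cell's share of the fleet. The sketch below assumes such a cell scheme; `assign_cell` and `blast_radius` are hypothetical helpers, not Cloudflare APIs.

```python
import hashlib
from typing import Sequence


def assign_cell(node_id: str, num_cells: int) -> int:
    """Deterministically map a node to one of `num_cells` isolation cells."""
    digest = hashlib.sha256(node_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def blast_radius(nodes: Sequence[str], num_cells: int, failed_cell: int) -> float:
    """Fraction of nodes affected if a failure stays inside `failed_cell`."""
    affected = [n for n in nodes if assign_cell(n, num_cells) == failed_cell]
    return len(affected) / len(nodes)
```

With eight cells and an even spread, a failure contained in one cell would reach roughly one eighth of the nodes rather than all of them.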

Revising Incident Management and Communication

Another critical area was the revision of break-glass procedures (emergency access protocols used during incidents) and of incident management overall. Teams streamlined how they respond to emergencies so that incidents can be contained and resolved faster. Communication with customers during outages was also strengthened: clear, timely updates are now a priority, so that you stay informed about the status of your services and the steps being taken to restore normal operations.

Preventing Drift and Ensuring Long-Term Stability

To prevent regressions over time, Cloudflare introduced measures to guard against configuration drift. This means that as the network evolves, the improvements made under Code Orange remain in place. Automated checks, periodic audits, and strict change control processes ensure that resiliency gains are not eroded by future updates. The goal is to maintain a consistently high level of reliability without requiring constant manual oversight.
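An automated check of this kind could, for example, compare each node's live configuration against the intended one. The sketch below shows one simple way to do that by fingerprinting a canonical JSON form; the function names and the dictionary-based configuration format are assumptions for illustration, not details from Cloudflare.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Hash a canonical JSON encoding so that key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def find_drift(desired: dict, actual_by_node: dict) -> list:
    """Return the sorted names of nodes whose configuration differs from `desired`."""
    want = fingerprint(desired)
    return sorted(
        node for node, cfg in actual_by_node.items() if fingerprint(cfg) != want
    )
```

Running such a check on a schedule would flag nodes that have diverged without anyone having to inspect them by hand.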

What This Means for Cloudflare Customers

For most users, the main visible change is increased stability. Internal configuration changes no longer reach the network instantly; instead, they are rolled out progressively with health monitoring that catches problems early. A faulty change should therefore be detected and reverted while it is still limited to a small part of the network, which reduces the chance of configuration errors affecting your traffic. Furthermore, the enhanced incident management and communication mean that if a problem does occur, you will receive clearer and more frequent updates.

The completion of Code Orange: Fail Small is a major milestone, but Cloudflare acknowledges that improving resilience is an ongoing journey. The tools and processes introduced, especially Snapstone, provide a strong foundation for future enhancements. By making it easier to deploy changes safely and to fail small when things go wrong, Cloudflare has built a network that is better protected against the kind of failures behind the November 18 and December 5 outages.

For more details on Cloudflare’s reliability work, see the Snapstone section above, and watch for future updates.