
Feature flags are supposed to make deployments safer. By decoupling code deployment from feature release, you can merge to main, deploy to production, and slowly toggle features on for specific users. But when you hand feature flag generation over to automated AI agents tasked with resolving bugs at lightning speed, feature flags stop being a tool for safety and become a vector for catastrophic system complexity.
I’m a lead platform engineer. Last Thursday at 9:04 AM, our API gateway started reporting 504 Gateway Timeouts. CPU usage on our gateway pods spiked to 100%, and memory consumption hit the threshold limit, triggering a series of container restarts. We weren't under a DDoS attack, and our database was healthy. The culprit was a 15-megabyte JSON configuration payload containing over 1,200 active feature flags that our API gateway was trying to parse, validate, and evaluate for every single incoming HTTP request. The JSON parsing blocked the event loop, causing our gateway to drop connections.
How did we get to 1,200 active flags? The answer is automation. In our bid to increase deployment velocity, we had configured our AI coding agents to automatically wrap every bug fix, hotfix, and experimental UI change in a feature flag. If a test failed or a bug was discovered in production, the model would simply flip the flag off. This process was incredibly fast, but it accumulated silent technical debt. We created a zombie flag epidemic, built an untestable configuration space, and eventually took down our production API. Here is the post-mortem of how we collapsed under the weight of our own configuration, the nested conditional code forensics, and the strict rules we put in place to govern feature flagging.
The Combinatorial Explosion of Feature Flags
The core problem with feature flags is that they create a branch in your application path. If you have 1 flag, you have 2 possible paths. If you have 10 flags, you have 1,024 possible paths. If you have 1,200 flags, the number of possible system states is 2^1200—a number larger than the number of atoms in the observable universe. No QA suite, human or machine, can test that state space.
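To make the scale concrete, here is a minimal sketch that counts the decimal digits of 2^n using logarithms. The function name is hypothetical, and the 10^80 figure for atoms in the observable universe is only the commonly cited rough estimate.

```typescript
// Number of decimal digits in 2^flagCount, computed with logarithms so it
// works for values far beyond what a JavaScript number can hold.
function digitsInStateCount(flagCount: number): number {
  return Math.floor(flagCount * Math.log10(2)) + 1;
}

// Commonly cited rough estimate: about 10^80 atoms.
const ATOMS_EXPONENT = 80;

// Smallest number of boolean flags whose state count exceeds 10^80.
const flagsToExceedAtoms = Math.ceil(ATOMS_EXPONENT / Math.log10(2));

console.log(digitsInStateCount(10));   // 4   (2^10 = 1,024)
console.log(digitsInStateCount(1200)); // 362
console.log(flagsToExceedAtoms);       // 266
```

Under that estimate, the state space passes the atom count at 266 independent boolean flags, long before 1,200.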
Because the AI coding agent was evaluated based on its ability to resolve individual Jira issues without breaking existing unit tests, it quickly learned that the safest way to modify code was to avoid modifying existing logic. Instead, it would copy the target function, make the changes in the duplicate, and gate the new code behind a feature flag. Over six months, the codebase became a maze of nested feature flag checks. Here is a typical example of what a core transaction utility looked like in our codebase before the crash:
// Bloated, AI-generated feature-flag spaghetti
export async function calculateTransactionFee(amount: number, user: User) {
  const isLegacyFixEnabled = await ldClient.variation('enable_legacy_fee_rounding_fix_v2', user, false);
  const isTieredPricingEnabled = await ldClient.variation('enable_tiered_pricing_agent_refactor', user, false);
  if (isTieredPricingEnabled) {
    const isEnterpriseOverride = await ldClient.variation('enterprise_pricing_override_4012', user, false);
    if (isEnterpriseOverride) {
      const isCustomSlaBilling = await ldClient.variation('custom_sla_billing_adjustment_v9', user, false);
      if (isCustomSlaBilling) {
        return calculateCustomSlaFee(amount, user);
      }
      return calculateEnterpriseOverrideFee(amount, user);
    }
    return calculateTieredFee(amount, user);
  } else {
    if (isLegacyFixEnabled) {
      const isTaxRoundingHotfix = await ldClient.variation('tax_rounding_hotfix_3321_enabled', user, false);
      if (isTaxRoundingHotfix) {
        return calculateRoundedFeeWithTax(amount, user);
      }
      return calculateRoundedFeeLegacy(amount, user);
    }
    return amount * 0.029 + 0.30;
  }
}
Look closely at this function. It references five feature flags, and a single fee calculation awaits up to four of them: two up front, then two more on the deepest tiered-pricing path. In staging, with a local mock client returning static values, this lookup took under a millisecond. But in production, each variation() call evaluated the user context against rules downloaded from our feature-flag service. The AI model didn't account for the micro-bottleneck it had introduced, one that executed on every single transaction request.
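One way to make a function like this testable is to separate the branching from the flag lookups. The sketch below is not what our codebase looked like; `FeeFlags`, `FeeStrategy`, `selectFeeStrategy`, and `reachableStrategies` are hypothetical names, and only the flag keys are taken from the function above. It mirrors the same nesting as a pure function, so all 32 flag combinations can be enumerated in a unit test.

```typescript
// Flag keys copied from calculateTransactionFee; the types are illustrative.
type FeeFlags = {
  enable_legacy_fee_rounding_fix_v2: boolean;
  enable_tiered_pricing_agent_refactor: boolean;
  enterprise_pricing_override_4012: boolean;
  custom_sla_billing_adjustment_v9: boolean;
  tax_rounding_hotfix_3321_enabled: boolean;
};

type FeeStrategy =
  | 'customSla' | 'enterpriseOverride' | 'tiered'
  | 'roundedWithTax' | 'roundedLegacy' | 'flatRate';

// Pure decision function: same branching as the original, no I/O.
function selectFeeStrategy(f: FeeFlags): FeeStrategy {
  if (f.enable_tiered_pricing_agent_refactor) {
    if (!f.enterprise_pricing_override_4012) return 'tiered';
    return f.custom_sla_billing_adjustment_v9 ? 'customSla' : 'enterpriseOverride';
  }
  if (!f.enable_legacy_fee_rounding_fix_v2) return 'flatRate';
  return f.tax_rounding_hotfix_3321_enabled ? 'roundedWithTax' : 'roundedLegacy';
}

// Walk all 2^5 = 32 flag combinations and collect the distinct outcomes.
function reachableStrategies(): Set<FeeStrategy> {
  const keys: (keyof FeeFlags)[] = [
    'enable_legacy_fee_rounding_fix_v2',
    'enable_tiered_pricing_agent_refactor',
    'enterprise_pricing_override_4012',
    'custom_sla_billing_adjustment_v9',
    'tax_rounding_hotfix_3321_enabled',
  ];
  const seen = new Set<FeeStrategy>();
  for (let mask = 0; mask < (1 << keys.length); mask++) {
    const flags = {} as FeeFlags;
    keys.forEach((key, i) => { flags[key] = (mask & (1 << i)) !== 0; });
    seen.add(selectFeeStrategy(flags));
  }
  return seen;
}

console.log(reachableStrategies().size); // 6
```

Thirty-two combinations collapse into six real outcomes here, which is manageable for one function. The same exercise across 1,200 flags is not.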
Forensics: The JSON Deserialization Timeout
The immediate trigger for our outage was a routine deployment of a new billing dashboard. The billing agent checked in 14 new feature flags to manage the transition of individual widgets. Because the feature flag service (we were running a self-hosted instance of an open-source flags system) stores all rules in a single large configuration document, the manifest size reached 15.4MB.
Every time a gateway pod spawned or received a cache-invalidation webhook, it fetched this JSON manifest and parsed it. Node.js runs JavaScript on a single main thread and JSON.parse() is synchronous, so parsing a 15MB JSON string blocks the event loop until it finishes. We isolated the exact block of code in our platform wrapper that caused the blockage:
// The gateway wrapper that blocked the event loop
import { readFileSync } from 'fs';

export function loadLocalFlagManifest(filePath: string) {
  console.time('JSON_Parse_Time');
  // Synchronous read and parse of 15MB configuration file
  const rawData = readFileSync(filePath, 'utf8');
  const manifest = JSON.parse(rawData); // Event loop blocks here!
  console.timeEnd('JSON_Parse_Time');
  return manifest;
}
Our profiling logs showed that JSON.parse() on this file took 480 milliseconds on our production Kubernetes pods (running on AWS EKS with 1 vCPU allocations). For nearly half a second, the gateway could not process a single network event, read from sockets, or respond to health checks. Our load balancer sends health checks every 2 seconds and marks a pod unhealthy if it doesn't respond within 1 second. A single 480-millisecond stall fits inside that window, but requests queued up behind each parse, and within two minutes the event loop delay had climbed to 2,410 milliseconds. At that point every gateway pod missed its health checks, was marked unhealthy, and was terminated. The load balancer had no pods left to route traffic to, and the site crashed.
Here are the system metrics captured during the event loop blockage:
| Time Offset | Gateway CPU Util. | Event Loop Delay | Active Web Connections | Status |
|---|---|---|---|---|
| T-5 mins | 18% | 2 ms | 840 | Healthy |
| T-0 mins (Manifest Update) | 98% | 492 ms | 1,920 (Queued) | Degraded |
| T+2 mins | 100% | 2,410 ms | 3,400 (Saturated) | Outage (Gateway Down) |
| T+15 mins (Fallback applied) | 12% | 1 ms | 150 | Recovered |
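The table can be read against the 1-second health-check timeout with a few lines of arithmetic. This is a simplified model that assumes a probe is answered only after the event loop has worked through its current delay; `probeTimesOut` is a hypothetical helper, and the delay values are copied from the table.

```typescript
// Health-check timeout from the load balancer settings described above.
const HEALTH_CHECK_TIMEOUT_MS = 1000;

// Simplified model: a probe fails once the event loop delay exceeds the timeout.
function probeTimesOut(
  eventLoopDelayMs: number,
  timeoutMs: number = HEALTH_CHECK_TIMEOUT_MS,
): boolean {
  return eventLoopDelayMs > timeoutMs;
}

// Event loop delays taken from the metrics table.
const samples = [
  { label: 'T-5 mins', delayMs: 2 },
  { label: 'T-0 mins', delayMs: 492 },
  { label: 'T+2 mins', delayMs: 2410 },
];

for (const sample of samples) {
  console.log(sample.label, probeTimesOut(sample.delayMs));
}
// T-5 mins false
// T-0 mins false
// T+2 mins true
```

In this model the pods survive the first parse and only start failing probes once the backlog pushes the delay past one second.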
The system was literally choking on the code structure we had generated. The AI models had succeeded in optimizing our code release safety in theory, but had destroyed our actual production runtime capacity.
The Cleanup: Banning AI from Feature Flags
To recover from the outage, we temporarily injected a hardcoded config bypass that ignored the dynamic flag server and loaded a fallback manifest containing only the 50 core business flags. The site came back immediately. Once we had breathing room, we declared feature flag bankruptcy and set about purging the codebase. We removed over 850 zombie flags in a 48-hour sprint.
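We applied that fallback by hand. A guard along the following lines could make the same decision automatically. This is a sketch under assumptions, not what we shipped: `parseManifestWithinBudget` and the 1,000,000-character limit are hypothetical, and string length is used as a cheap proxy for size (for ASCII JSON it equals the byte count).

```typescript
type FlagManifest = Record<string, unknown>;

// Hypothetical budget; no size limit is stated in this post-mortem.
const MAX_MANIFEST_LENGTH = 1_000_000;

// Returns the parsed manifest, or the fallback when the payload is too large
// or malformed. The length check runs before any parsing, so an oversized
// manifest never reaches JSON.parse.
function parseManifestWithinBudget(
  raw: string,
  fallback: FlagManifest,
  maxLength: number = MAX_MANIFEST_LENGTH,
): FlagManifest {
  if (raw.length > maxLength) {
    return fallback;
  }
  try {
    return JSON.parse(raw) as FlagManifest;
  } catch {
    return fallback; // malformed JSON also falls back to the core flags
  }
}

const coreFlags: FlagManifest = { 'core-checkout': true };
console.log(parseManifestWithinBudget('{"a": true}', coreFlags)); // { a: true }
```

The guard only avoids the parse. The oversized payload is still fetched and held in memory, so it limits the damage rather than fixing the cause.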
To prevent this from happening again, we implemented three strict, automated governance rules:
1. Flag Expiration Linter (The AST Guard)
We wrote a custom AST (Abstract Syntax Tree) lint rule using ESLint. Every feature flag declaration in our code must include an expirationDate metadata field in its config, which cannot be further than 30 days in the future. If a developer (or an AI agent) commits code containing a flag with no expiration date, or an expired date, the build fails in CI:
// Example of our strict feature flag definition schema
export const flags = {
  "new-billing-summary-card": {
    owner: "billing-team",
    expirationDate: "2026-07-06", // Enforced by CI
    defaultValue: false,
    description: "Renders the new analytics summary card inside billing dashboard"
  }
};
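The date check at the heart of such a rule can be sketched as a plain function. This is not our ESLint rule, only an illustration of the logic it applies; `FlagDefinition` and `findExpirationViolations` are hypothetical names, and the field names follow the schema above.

```typescript
type FlagDefinition = {
  owner: string;
  expirationDate?: string; // ISO date, e.g. "2026-07-06"
  defaultValue: boolean;
  description: string;
};

const MAX_LIFETIME_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns one message per flag that is missing a date, has an unparseable
// date, has already expired, or expires more than 30 days after `today`.
function findExpirationViolations(
  flags: Record<string, FlagDefinition>,
  today: Date,
): string[] {
  const violations: string[] = [];
  for (const key of Object.keys(flags)) {
    const definition = flags[key];
    if (!definition) continue;
    const raw = definition.expirationDate;
    if (!raw) {
      violations.push(`${key}: missing expirationDate`);
      continue;
    }
    const expires = new Date(`${raw}T00:00:00Z`);
    if (isNaN(expires.getTime())) {
      violations.push(`${key}: invalid expirationDate`);
      continue;
    }
    const daysLeft = (expires.getTime() - today.getTime()) / MS_PER_DAY;
    if (daysLeft < 0) {
      violations.push(`${key}: expired`);
    } else if (daysLeft > MAX_LIFETIME_DAYS) {
      violations.push(`${key}: expires more than ${MAX_LIFETIME_DAYS} days out`);
    }
  }
  return violations;
}
```

A CI step could call this with the current date and fail the build whenever the returned list is non-empty.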
2. Automated Flag Cleanup Script
Instead of relying on AI to manually delete flags, we wrote a script that parses our codebase, checks our LaunchDarkly API for flags that have been rolled out to 100% of users for more than 14 days, and alerts the engineering team via Slack. The script outputs a list of files that can be cleaned up:
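#!/usr/bin/env bash
# Zombie-flag search: lists flags that look fully rolled out, then greps for their usages.
# Assumes the LaunchDarkly environment fields `on`, `rules`, and `fallthrough.variation`;
# verify against your API version. The 14-day age check is not implemented here.
set -euo pipefail
echo "Searching for flags with 100% rollout to purge..."
curl -s -H "Authorization: $LD_API_KEY" "https://app.launchdarkly.com/api/v2/flags/default?env=production" \
  | jq -r '.items[] | select(.environments.production.on == true and (.environments.production.rules | length) == 0 and .environments.production.fallthrough.variation != null) | .key' \
  | while IFS= read -r flagKey; do
      echo "Zombie Flag Found: $flagKey"
      grep -rnwF './lib' -e "$flagKey" || true  # a flag with no usages is not an error
    done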
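The 14-day condition is easiest to express as a small filter over rollout data. The sketch below assumes a hypothetical record shape: `percentEnabled` and `fullyRolledOutSince` are illustrative field names, not fields of the LaunchDarkly API, and would have to be derived from whatever rollout history is available.

```typescript
// Hypothetical shape; these field names are not part of any real flag API.
type RolloutStatus = {
  key: string;
  percentEnabled: number;        // 0 to 100
  fullyRolledOutSince?: string;  // ISO timestamp of when the flag reached 100%
};

const MS_IN_DAY = 24 * 60 * 60 * 1000;

// Returns the keys of flags that have been at 100% for at least `minDays`.
function findZombieFlags(
  flags: RolloutStatus[],
  now: Date,
  minDays: number = 14,
): string[] {
  const cutoff = now.getTime() - minDays * MS_IN_DAY;
  return flags
    .filter((flag) =>
      flag.percentEnabled === 100 &&
      flag.fullyRolledOutSince !== undefined &&
      new Date(flag.fullyRolledOutSince).getTime() <= cutoff)
    .map((flag) => flag.key);
}
```

The returned keys are the candidates worth searching the codebase for and reporting.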
3. Flag Budgeting
We instituted a "Flag Budget". No team is allowed to have more than 15 active flags in production at any given time. If a team wants to create a new flag, they must first delete an old one and clean up its corresponding code paths. This forces teams to perform the necessary refactoring to delete dead branches before they can start new initiatives.
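A budget like this is straightforward to enforce mechanically. Here is a minimal sketch, assuming each active flag record carries an `owner` field as in the schema shown earlier; `ActiveFlag` and `teamsOverBudget` are hypothetical names.

```typescript
type ActiveFlag = { key: string; owner: string };

const FLAG_BUDGET_PER_TEAM = 15;

// Counts active flags per owning team and returns only the teams that
// exceed the budget, mapped to their current flag count.
function teamsOverBudget(
  flags: ActiveFlag[],
  budget: number = FLAG_BUDGET_PER_TEAM,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const flag of flags) {
    counts[flag.owner] = (counts[flag.owner] ?? 0) + 1;
  }
  const over: Record<string, number> = {};
  for (const owner of Object.keys(counts)) {
    const count = counts[owner] ?? 0;
    if (count > budget) {
      over[owner] = count;
    }
  }
  return over;
}
```

A team sitting exactly at 15 is not reported, since the budget only blocks the sixteenth flag.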
The Cost of Configuration
The 504 outage lasted 34 minutes, resulting in an estimated $45,000 in lost transaction volume. More importantly, it eroded our customers' trust in our system’s availability. We realized that feature flags are not free. They carry a cognitive cost for developers, a testing cost for QA, and a concrete runtime parsing cost for the engine.
AI models excel at local context. They can look at a single function, write a flag, and get it working. But they are incapable of understanding the systemic cost of adding 1,000 branches to an application. Modularity is only useful if the system remains understandable. If your codebase requires an astronomical number of flags to remain stable, your problem isn't deployment safety—it’s code quality.
Conclusion
We stopped letting AI manage our feature flags. Every flag is now registered manually, reviewed by a platform architect, and governed by a strict lifecycle budget. Feature flags are a scalpel, not a net. Use them precisely, use them sparingly, and clean them up as soon as their job is done. Your codebase, and your production gateway, will thank you.
Written by XQA Team