Date: Mar 17, 2026
Date of Incident: Mar 12, 2026
Description: RCA for Agent Backend (HAProxy) System Degradation
Summary:
On March 12, 2026, from 10:05 AM to 2:45 PM MDT, JumpCloud experienced a significant service degradation affecting Agent-related activities. During this window, agent updates (including syncing of users, passwords, policies, and other agent data) and new agent installations were unavailable.
This was caused by a "thundering herd" event triggered by a backend traffic-shaping change. We have since identified the root causes and implemented infrastructure changes to prevent a recurrence.
What Happened?
At 10:00 AM MDT, our engineering team enabled a feature flag (a "circuit breaker") designed to protect our System Insights API from high load by returning 503 Service Unavailable responses for certain non-critical requests.
While the flag performed its intended function, it had an unforeseen secondary effect on the JumpCloud Agent’s connection logic. Because the agents could not reuse existing connections for these specific failed requests, hundreds of thousands of agents in our main production environment attempted to establish new mTLS (mutual TLS) connections simultaneously.
This created a "Thundering Herd" event that saturated our HAProxy ingress layer, exhausting CPU resources and causing a cascade of connection failures.
Root Cause:
The prolonged nature of this incident was the result of three distinct, overlapping bottlenecks that our team had to isolate and resolve one by one:
- CPU-Intensive SSL Handshaking: Establishing an mTLS connection is a CPU-intensive process. The sheer volume of simultaneous connection attempts pushed our HAProxy pods to their resource limits. This caused the pods to become unresponsive, leading to "Out of Memory" (OOM) kills and failed health probes.
- Health Check Death Spiral: Our internal health checks initially relied on a Layer 7 SSL validation. Because the CPU was 100% occupied with agent reconnections, the pods couldn't respond to their own health checks in time. This caused the system to erroneously mark healthy pods as "down", removing them from the rotation and further overwhelming the remaining pods.
- Load Balancer Handshake Saturation: As we attempted to scale our infrastructure, the Application Load Balancer (ALB) encountered a throughput bottleneck specifically related to the rate of new connection establishments. The surge of agents attempting to negotiate new SSL handshakes at the same time exceeded the ALB's burst capacity, temporarily preventing even healthy backend pods from receiving and processing traffic.
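One common way to break the health-check death spiral described above is to probe at Layer 4, so that a handshake-saturated but otherwise healthy pod can still answer its liveness check without performing a full TLS negotiation. The fragment below is purely illustrative, assuming the HAProxy pods run on Kubernetes; the port, intervals, and thresholds are hypothetical, not our actual settings:

```
# Hypothetical Kubernetes probe: a plain TCP check avoids the TLS
# handshake a Layer 7 (HTTPS) probe would require, so a CPU-saturated
# pod is less likely to be wrongly marked "down".
livenessProbe:
  tcpSocket:
    port: 443
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6     # tolerate slow responses during handshake storms
```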
Why It Took Time to Resolve:
While reverting the flag was the correct first step, the agents were already in an aggressive retry loop that continued even after the 503 errors stopped. We had to experiment with several configurations (adjusting health check intervals and timeout windows) to find a balance that allowed pods to stay "alive" long enough to process the backlog. Stability was achieved only once we implemented concurrency control: by lowering the maximum number of concurrent connections allowed per pod, we stopped the CPU from over-committing to handshakes, allowing the system to reliably process a controlled flow of traffic until the global queue cleared.
Corrective Actions / Risk Mitigation:
1.) Edge Infrastructure Hardening
We have standardized on a new high-availability configuration for our HAProxy ingress layer.
- Concurrency Governance: We have implemented a strict maxconn limit per pod. This acts as a "pressure valve," ensuring that the CPU remains available to process existing requests rather than becoming saturated by new connection attempts.
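As a sketch of the kind of limit involved, HAProxy expresses this ceiling with the `maxconn` directive; connections beyond the limit wait in the kernel's accept queue rather than all starting CPU-heavy TLS handshakes at once. The values and section layout below are illustrative, not our production configuration:

```
# Illustrative haproxy.cfg fragment; numbers are hypothetical.
global
    maxconn 20000              # process-wide connection ceiling

frontend agents_mtls
    bind :443 ssl crt /etc/haproxy/certs/ verify required ca-file /etc/haproxy/ca.pem
    maxconn 10000              # per-frontend "pressure valve": excess connections
                               # queue instead of handshaking simultaneously
    default_backend agent_api
```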
- Dynamic Capacity Management via Autoscaling: We are implementing Horizontal Pod Autoscaling (HPA) for our HAProxy ingress layer, calibrated to trigger based on both CPU utilization and active connection counts. This ensures we can absorb sudden traffic fluctuations and also maintain a controlled flow of requests to our backend services.
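Scaling on both signals can be expressed in a single autoscaling/v2 HPA manifest. The sketch below is illustrative only: the deployment name, replica bounds, thresholds, and the custom connection-count metric (which would require a metrics adapter) are assumptions, not our actual configuration:

```
# Hypothetical HPA manifest scaling on CPU and active connections.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: haproxy-ingress
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: haproxy-ingress
  minReplicas: 4
  maxReplicas: 40
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Pods                          # needs a custom metrics adapter
    pods:
      metric:
        name: haproxy_active_connections
      target:
        type: AverageValue
        averageValue: "5000"
```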
2.) Agent Connectivity Optimization
We are updating the JumpCloud Agent’s communication layer to be more "network-aware" during degraded states:
- Enhanced Connection Pooling: We are reconfiguring the agent's HTTP transport logic to maximize the reuse of existing idle connections. This significantly reduces the "Connection Tax" on our backend during high-traffic events.
- Streamlined Resource Handling: We are implementing stricter protocols for draining and closing HTTP response bodies, ensuring that pooled connections are returned to the rotation immediately and reliably.
3.) Adaptive Retry Logic (Jitter)
To further break up "synchronized" traffic spikes:
- Introduction of Jitter: While our agents currently use exponential backoff for poll requests, we are adding randomized "jitter" to our retry intervals. This spreads reconnection attempts across a wider window, preventing large blocks of agents from hitting the service at the exact same millisecond.
- Standardizing Resilient Retry Logic: We are transitioning the Agent’s default HTTP client to a unified exponential backoff model for all request types.
- Controlled Rollout: This update will be managed via a staged rollout to monitor for any unforeseen side effects on fleet-wide connectivity patterns.