Date: Nov 7, 2025
Date of Incident: Nov 4, 2025
Description: RCA for Auth Database Degradation
Summary:
On November 4, 2025, a number of customers experienced intermittent failures, timeouts, and increased latency when authenticating to multiple JumpCloud services (including consoles, LDAP, RADIUS, and SAML) or when using Multi-Factor Authentication.
Root Cause:
The incident was triggered by an issue in the deployment process involving a database schema change and a subsequent application code release.
During this deployment, a planned database change unintentionally removed several database indexes required by the existing application code.
The sequence of failure was as follows:
- Deployment Order Error: The database schema change (which removed necessary indexes) was applied to the production database before the new application code (which did not require those indexes) was deployed.
- Performance Collapse: The existing, high-volume authentication code (used for functions like TOTP and push authentication) was forced to run against the now-inefficient database structure. Queries that normally took milliseconds suddenly took several seconds.
- Connection Exhaustion: These slow queries held database connections open for extended periods, quickly overwhelming the database server's available connection pool.
- Full Outage: With no available connections, the main authentication API could not communicate with the database, leading to 100% CPU utilization on the database server and triggering the intermittent timeouts and failures experienced by our customers.
Why Testing Did Not Catch This:
The issue was not identified during testing in our Development or Staging environments because load simulation was insufficient. The resource consumption and connection exhaustion manifest only under peak production traffic volume, and the simulated load profiles in our lower environments were not heavy enough to expose this failure mode.
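A rough way to see why the failure appears only at high volume is Little's Law: the average number of queries in flight is approximately the arrival rate multiplied by the time each query takes. The sketch below is a steady-state simplification, and every number in it (request rates, latencies, pool size) is a hypothetical placeholder, not a figure from this incident.

```python
import math


def connections_needed(requests_per_second, query_ms):
    """Little's Law estimate: in-flight queries = arrival rate x query duration."""
    return math.ceil(requests_per_second * query_ms / 1000)


POOL_SIZE = 200  # hypothetical connection pool limit

# (requests per second, query latency in ms); all values are illustrative.
scenarios = {
    "staging, indexed": (50, 5),
    "staging, no index": (50, 3000),
    "production, indexed": (2000, 5),
    "production, no index": (2000, 3000),
}

results = {
    name: connections_needed(rate, latency_ms)
    for name, (rate, latency_ms) in scenarios.items()
}

for name, needed in results.items():
    status = "exhausted" if needed > POOL_SIZE else "ok"
    print(f"{name}: ~{needed} connections ({status})")
```

Under these assumed numbers, the slow query still fits inside the pool at the staging rate, while the same query at the production rate needs far more connections than the pool can supply.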
Corrective Actions / Risk Mitigation:
- Mandatory schema change review - All database schema changes must now undergo an additional level of review to explicitly assess index dependencies and impact.
- New deployment phasing - We are implementing new tools and checks to enforce that the application code a schema change depends on is fully deployed before that database change is executed.
- Enhanced alerting - We are implementing new monitors and alerts specifically for the Auth-API's database connection pool health and CPU utilization.
- Enhanced load testing - We are revisiting the load profiles used in our staging environments to simulate peak production traffic more accurately.
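As an illustration of the deployment phasing item above, an ordering check could look roughly like the following. This is a minimal sketch, not a description of the actual tooling: the `Migration` type, its fields, the version tuples, and the `can_apply` function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


@dataclass(frozen=True)
class Migration:
    """Hypothetical record describing an index-removing schema change."""
    name: str
    drops_indexes: Tuple[str, ...]
    # First application release that no longer needs the dropped indexes.
    min_app_version: Version


def can_apply(migration: Migration, deployed_versions: Iterable[Version]) -> bool:
    """Allow the migration only if every running instance is new enough.

    An empty fleet report is treated as unknown and therefore blocks the change.
    """
    versions = list(deployed_versions)
    return bool(versions) and all(v >= migration.min_app_version for v in versions)


drop_indexes = Migration(
    name="drop_legacy_mfa_indexes",
    drops_indexes=("idx_mfa_user",),
    min_app_version=(2, 4, 0),
)

# One instance still runs the old code, so the schema change must wait.
mixed_fleet = [(2, 4, 0), (2, 3, 9)]
# Every instance runs code that no longer needs the indexes.
updated_fleet = [(2, 4, 0), (2, 4, 1)]

print(can_apply(drop_indexes, mixed_fleet))
print(can_apply(drop_indexes, updated_fleet))
```

Requiring every instance, rather than a majority, to be on the newer release reflects the failure described above, where any remaining old code path would still issue queries that rely on the removed indexes.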