Date: Apr 23, 2025
Date of Incident: Apr 18, 2025
Description: RCA for ADI Outage
Summary:
On April 18th at 1:27 PM MT JumpCloud experienced an issue affecting our Active Directory Integration (ADI) agents. Some customers using ADI sync and Delegated Auth experienced connectivity issues, resulting in failed directory synchronizations and authentication attempts. New ADI installs would have failed during the incident window, and ADI agents attempting to restart or reconnect during the incident window would have failed.
Root Cause:
The incident was caused by the expiration of a certificate used for mutual TLS (mTLS) authentication between ADI agents and our infrastructure. This certificate is essential for securing the communication channel between customer environments and JumpCloud services.
Why wasn’t this caught before the certificate expired?
In mid-2024 we migrated our global proxy infrastructure from EC2 into Kubernetes. During this transition, monitoring for the proxies that handle mTLS were mistakenly deleted and not recreated on the new infrastructure.
Why did it take so long to recover?
There are two reasons recovery took longer than it should have. The first reason is the missing monitoring above meant this went unnoticed by our operations teams once the cert expired. The second is our teams did not see any instability or outage signals in our internal telemetry, which led them to believe this was customer environment specific. This was because existing connections were un-impacted.
Corrective Actions / Risk Mitigation: