Date of Incident: 2022-05-23
Description: RCA for JumpCloud Service Interruption
At approximately 20:45 MDT on 2022-05-23, JumpCloud customers were unable to access JumpCloud’s Admin and User consoles. The loss of access also affected JumpCloud’s API and lasted until approximately 21:05 MDT on 2022-05-23.
The incident was caused by a shared code base component being released to production inadvertently. This branch of code passed testing and was deployed to production earlier in the day. It failed immediately and was rolled back without issue. During the rollback, however, the version state was not changed to “not approved”, so the deploy mechanism continued to treat this code in the test environment as “passed”. During unrelated maintenance, the deploy mechanism recognized the “passed” value and released this version to production again. This second deployment missed the canary gate, which did not roll back effectively, and a manual rollback was required.
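The failure mode described above can be illustrated with a minimal sketch. It is not JumpCloud's deploy mechanism: every name here (`Version`, `Approval`, `roll_back`, `can_deploy`, and the `set_state` callable) is hypothetical and exists only to show why a rollback that cannot record the “not approved” state should fail loudly rather than leave the version eligible for release.

```python
from dataclasses import dataclass
from enum import Enum


class Approval(Enum):
    PASSED = "passed"
    NOT_APPROVED = "not approved"


@dataclass
class Version:
    name: str
    approval: Approval = Approval.PASSED


class RollbackError(Exception):
    """Raised when a rollback cannot record the version as not approved."""


def roll_back(version, set_state):
    """Roll back a version and record it as not approved.

    set_state is a hypothetical callable that persists the new state and
    returns True on success, False on failure.
    """
    if not set_state(version, Approval.NOT_APPROVED):
        # Surface the failure instead of silently leaving the version "passed".
        raise RollbackError(f"could not mark {version.name} as not approved")


def can_deploy(version):
    """A deploy gate that only releases versions still marked as passed."""
    return version.approval is Approval.PASSED


def persist_ok(version, state):
    """Hypothetical state store that succeeds."""
    version.approval = state
    return True


def persist_fails(version, state):
    """Hypothetical state store that fails and leaves the state unchanged."""
    return False
```

With `persist_ok`, a rolled-back version is no longer deployable. With `persist_fails`, the version would still read as “passed”, which matches the condition described in this report, so the sketch raises `RollbackError` to make that condition visible to whoever ran the rollback.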
Corrective Actions / Risk Mitigation:
Production deployments are temporarily paused until the required changes and coverage are in place.