Date: 2022-05-31
Date of Incident: 2022-05-23
Description: RCA for JumpCloud Agent Service Degradation
Summary:
At approximately 20:20 MDT on 2022-05-23, JumpCloud customers experienced issues adding a new Agent device, slow propagation of user and device changes, and poor agent check in response. This lasted until approximately 21:00 MDT on 2022-05-23.
Root Cause:
In an effort to modernize our container infrastructure for the Agent service layer, a change was deployed to our edge servers containing a bad configuration. This change was not exposed until a load threshold was crossed, which happened at the time of a large Agent version update to many systems. The data at the time showed the likely suspect as being the change in Agent version which was then rolled back. When the Agent version was rolled back, this resulted in lower load to our edge servers by allowing the Agent to ignore certain requests, allowing the servers to perform as expected. Unfortunately, JumpCloud experienced the same issue again on 2022.05.23, as at that time JumpCloud believed the root cause was Agent version and not the change applied to our edge servers. The change to our edge servers was then rolled back.
Corrective Actions / Risk Mitigation:
Production deployments are temporarily paused until we have required changes and coverage in place.