Degraded Agent Service
Incident Report for JumpCloud
Postmortem

JumpCloud Incident Report

Date: 2022-05-31

Date of Incident: 2022-05-19

Description: RCA for JumpCloud Agent Service Degradation

Summary:

At approximately 08:20 MDT on 2022-05-19, JumpCloud customers experienced issues adding new Agent devices, slow propagation of user and device changes, and poor Agent check-in response. These issues lasted until approximately 10:36 MDT on 2022-05-19.

Root Cause:

As part of an effort to modernize our container infrastructure for the Agent service layer, a change containing a bad configuration was deployed to our edge servers. The bad configuration was not exposed until a load threshold was crossed, which happened during a large Agent version update to many systems. The data at the time pointed to the Agent version change as the likely cause, so the Agent version was rolled back. The rollback lowered the load on our edge servers by allowing the Agent to ignore certain requests, which let the servers perform as expected. Unfortunately, because JumpCloud believed at that time that the root cause was the Agent version and not the change applied to our edge servers, the same issue occurred again on 2022-05-23. The change to our edge servers was then rolled back.

Corrective Actions / Risk Mitigation:

Production deployments are temporarily paused until the required changes and coverage are in place.

  • Rollback of the change to JumpCloud edge servers - DONE
  • Better Agent load testing operations in our test environment - Target 06/2022
  • Agent critical path separation / isolation - Target 09/2022
Posted May 31, 2022 - 15:50 MDT

Resolved
This incident has been resolved.
Posted May 19, 2022 - 11:11 MDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 19, 2022 - 10:37 MDT
Investigating
We are currently investigating an issue with adding new Agent devices, and delays in syncing user, command, and policy information with the JumpCloud Devices Agent. We will report back as soon as we have an update. We apologize for any disruption this may cause.
Posted May 19, 2022 - 09:08 MDT
This incident affected: Agent.