Agent propagation degradation

Incident Report for JumpCloud

Postmortem

JumpCloud Incident Report

Date: 2022-05-31

Date of Incident: 2022-05-23

Description: RCA for JumpCloud Agent Service Degradation

Summary:

At approximately 20:20 MDT on 2022-05-23, JumpCloud customers experienced issues adding new Agent devices, slow propagation of user and device changes, and poor Agent check-in response. This lasted until approximately 21:00 MDT on 2022-05-23.

Root Cause:

As part of an effort to modernize our container infrastructure for the Agent service layer, a change containing a bad configuration was deployed to our edge servers. The bad configuration was not exposed until a load threshold was crossed, which happened when a large Agent version update rolled out to many systems. The data at the time pointed to the Agent version change as the likely cause, so the Agent version was rolled back. That rollback lowered the load on our edge servers by allowing the Agent to ignore certain requests, which let the servers perform as expected. Unfortunately, because JumpCloud believed at that time that the root cause was the Agent version and not the change applied to our edge servers, the same issue occurred again on 2022-05-23. The change to our edge servers was then rolled back.

Corrective Actions / Risk Mitigation:

Production deployments are temporarily paused until the required changes and coverage are in place.

  • Rollback of the change to JumpCloud edge servers - DONE
  • Better Agent load testing operations in our test environment - Target 06/2022
  • Agent critical path separation / isolation - Target 09/2022
Posted May 31, 2022 - 15:52 MDT

Resolved

This incident has been resolved.
Posted May 23, 2022 - 00:40 MDT

Identified

All Agent traffic, including events, should be communicating successfully. We are continuing to work on some other changes to add resilience.
Posted May 22, 2022 - 23:57 MDT

Update

Agents are communicating successfully, and new Agent installs and registrations should now succeed. We are still seeing some issues with Agent event traffic, which we are continuing to investigate.
Posted May 22, 2022 - 22:33 MDT

Investigating

We are currently investigating an issue with delays adding new Agent devices, as well as delays in syncing user, command, and policy information with the JumpCloud Devices Agent. We will report back as soon as we have an update. We apologize for any disruption this may cause.
Posted May 22, 2022 - 21:48 MDT
This incident affected: Agent (Agent).