Directory Dispatch Delays

Incident Report for JumpCloud

Postmortem

Date: Apr 7, 2026

Date of Incident: Mar 30, 2026

Description: Root cause analysis (RCA) for Directory Association Processing Delays

Summary:

Starting at approximately 15:40 MDT on March 30, 2026, JumpCloud customers experienced significant delays in directory-related updates. Password changes, user-to-group associations, and outbound provisioning were all slow to appear in downstream systems. The root cause was a code deployment in our Devices service that inadvertently flooded a background processing queue with unpartitioned messages, creating a bottleneck that prevented updates from being processed in real time. The issue was fully resolved by 00:25 MDT on March 31, 2026.

What Happened:

The incident was caused by a change in how the JumpCloud agent retrieves software application configurations.

  1. Traffic Spike: The new code shifted the "source of truth" for these configurations to a new database. If a device polled the system and did not find its record in the new database, the code automatically enqueued a "track collect" request to sync the data.
  2. Unexpected Volume: We anticipated a "lazy backfill" (where records are created over time), but underestimated the number of devices that had no existing software bindings. This resulted in an immediate, massive spike of nearly 280,000 messages.
  3. The Bottleneck (Partitioning): Crucially, these messages were enqueued without a "Partition ID." In our high-scale FIFO (First-In-First-Out) queue architecture, messages without a partition ID are processed one at a time rather than in parallel. This effectively "serialized" the queue: adding workers could not drain the backlog any faster, which caused the observed latency.
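
The effect of the missing partition ID can be illustrated with a small simulation. This is a minimal sketch, not JumpCloud's implementation: it assumes that messages sharing a partition are processed strictly in order, that messages with no partition ID all fall into one shared group, and that each worker handles one message per tick. The names `drain_ticks` and `DEFAULT_GROUP`, the device-ID partition key, and the message counts are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

# Assumption: messages with no partition ID all land in one shared group.
DEFAULT_GROUP = "__unpartitioned__"


def drain_ticks(partition_ids: Sequence[Optional[str]], workers: int) -> int:
    """Count the ticks needed to drain a FIFO queue.

    Per tick, each worker handles one message, and no two workers may
    handle messages from the same partition (that preserves FIFO order
    within a partition).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    backlog = Counter(
        pid if pid is not None else DEFAULT_GROUP for pid in partition_ids
    )
    ticks = 0
    while backlog:
        # Serve the partitions with the deepest backlog first.
        for pid, _ in backlog.most_common(workers):
            backlog[pid] -= 1
            if backlog[pid] == 0:
                del backlog[pid]
        ticks += 1
    return ticks


# 1,000 messages with no partition ID: extra workers do not help.
unpartitioned = [None] * 1000
print(drain_ticks(unpartitioned, workers=1))  # 1000
print(drain_ticks(unpartitioned, workers=8))  # 1000

# The same volume keyed by a (hypothetical) device ID drains in parallel.
by_device = [f"device-{i}" for i in range(1000)]
print(drain_ticks(by_device, workers=8))  # 125
```

In this model the drain time of the unpartitioned queue is fixed by the message count alone, which matches the behavior described above: worker count only matters once messages are spread across partitions.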

Resolution and Recovery:

Rolling back the offending code turned off the "tap": from that point on, no further unpartitioned messages were added to the queue.

Because the bottleneck was caused by the lack of partitioning, scaling horizontally could not speed up processing of the existing backlog. The team monitored queue throughput and determined that the safest and fastest path to recovery was to let the worker process the existing messages sequentially, rather than risk further disruption by manually manipulating the production queue.
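
Judging a recovery path from throughput comes down to simple arithmetic: once no new messages are arriving, the remaining time is the queue depth divided by the observed drain rate. The sketch below shows that calculation under those assumptions. The function name `estimate_drain_eta` and the sample figures are hypothetical and are not measurements from this incident.

```python
from datetime import datetime, timedelta


def estimate_drain_eta(
    earlier: tuple[datetime, int], later: tuple[datetime, int]
) -> datetime:
    """Project when the queue empties from two (timestamp, depth) samples."""
    (t0, depth0), (t1, depth1) = earlier, later
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    rate = (depth0 - depth1) / elapsed  # messages drained per second
    if rate <= 0:
        raise ValueError("backlog is not shrinking; no ETA available")
    return t1 + timedelta(seconds=depth1 / rate)


# Hypothetical samples: 100,000 messages, then 91,000 thirty minutes later.
first = (datetime(2026, 1, 1, 12, 0), 100_000)
second = (datetime(2026, 1, 1, 12, 30), 91_000)
print(estimate_drain_eta(first, second))  # 2026-01-01 17:33:20
```

A linear projection like this only holds while the drain rate stays steady and the inflow stays at zero, so it would need to be recomputed as new samples come in.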

Corrective Actions:

To ensure this type of bottleneck does not occur again, we have committed to the following:

  • Improving pre-production testing to better simulate the scale and conditions that can occur in production queue processing
  • Reviewing other areas of the platform where similar patterns could produce unexpected request spikes
  • Enhancing monitoring and alerting thresholds to enable faster detection and response when queue backlogs begin to form
  • Strengthening our deployment validation process to more thoroughly account for background data migrations before releasing dependent code changes
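
As one illustration of the monitoring item above, a backlog alert can key on sustained growth rather than on depth alone, so it fires while a backlog is still forming. This is a hedged sketch only: the function `backlog_alert`, its parameters, and the threshold values are hypothetical and do not describe JumpCloud's actual alerting.

```python
from typing import Sequence


def backlog_alert(depths: Sequence[int], min_depth: int, growth_samples: int) -> bool:
    """Return True when the queue is deep and has kept growing.

    Fires when the latest depth is at least `min_depth` and each of the last
    `growth_samples` readings exceeded the reading before it.
    """
    if growth_samples < 1:
        raise ValueError("growth_samples must be at least 1")
    if len(depths) < growth_samples + 1:
        return False  # not enough history yet
    recent = depths[-(growth_samples + 1):]
    growing = all(b > a for a, b in zip(recent, recent[1:]))
    return growing and recent[-1] >= min_depth


# Hypothetical depth readings taken at a fixed interval.
print(backlog_alert([120, 900, 4_000, 15_000], min_depth=1_000, growth_samples=3))  # True
print(backlog_alert([5_000, 4_800, 4_900, 4_700], min_depth=1_000, growth_samples=3))  # False
```
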
Posted Apr 07, 2026 - 08:57 MDT

Resolved

This incident has been resolved.
Posted Mar 31, 2026 - 00:24 MDT

Monitoring

The backlog affecting Directory services continues to decrease and we are approaching resolution. Administrators may still experience some delays when updating associations.

The team is continuing to monitor, and our next update will be to confirm full resolution.
Posted Mar 30, 2026 - 23:00 MDT

Update

The backlog affecting Directory services has been significantly reduced and continues to decrease. Administrators may still experience delays when updating associations for users, groups, policies, directories, and commands.

The team continues to work on additional changes to accelerate processing. We will provide further updates in an hour.
Posted Mar 30, 2026 - 21:42 MDT

Update

We have made progress in reducing the backlog affecting Directory services and are continuing to work through the remaining queue. Administrators may still experience delays when updating associations for users, groups, policies, directories, and commands.

The team is actively implementing additional changes to increase processing capacity and accelerate resolution. We will provide further updates in one hour.
Posted Mar 30, 2026 - 20:38 MDT

Identified

JumpCloud is currently experiencing dispatch delays for core Directory services. Administrators may experience delays updating associations for users, groups, policies, directories, and commands. We have identified the cause of the issue and are actively implementing a fix.
Posted Mar 30, 2026 - 18:32 MDT
This incident affected: Admin Console.