Google Workspace Sync issues
Incident Report for JumpCloud
Postmortem

Date: 2024-06-21

Date of Incident: 2024-06-05

Description: RCA for Google Apps Directory Dispatch Delay

Summary:

At approximately 20:30 MDT on 2024-06-05, JumpCloud received a large influx of messages in a Google Apps FIFO queue. By 21:40 MDT on 2024-06-05, the queue exceeded 20,000 messages, causing processing delays for some customers. At 21:24 MDT on 2024-06-05, an alert fired for this delay, but it was incorrectly labeled as a non-production issue and was therefore misrouted. At 07:15 MDT on 2024-06-06, the root cause was identified and a diversion queue was constructed, to which those messages were moved for processing. The processing delay lasted until approximately 11:30 MDT on 2024-06-06.

Root Cause:

A contiguous block of more than 50,000 messages from a single organization was sent to the main Google Apps FIFO queue. Normally this would trigger construction of a diversion queue before the queue reached 20,000 messages, but the trigger was missed because of an incorrect alerting change made during work in the Google Apps application layer. This FIFO queue is partitioned by organization ID, and once the limit of 20,000 messages is reached, only one message at a time can be processed, regardless of the number of available worker threads.
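
As an illustration of the diversion behavior described above, the following is a minimal sketch, not JumpCloud's implementation. It assumes a hypothetical per-organization limit (`per_org_limit`) and a hypothetical message shape of (organization ID, sequence number). Once an organization reaches the limit in the main queue, all of its later messages go to the diversion queue, so that organization's messages stay in order and other organizations are not blocked.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Set, Tuple

# Hypothetical message shape: (organization ID, sequence number).
Message = Tuple[str, int]


def route_messages(
    messages: Iterable[Message], per_org_limit: int
) -> Tuple[Deque[Message], Deque[Message]]:
    """Split incoming messages between a main queue and a diversion queue.

    per_org_limit is an assumed tuning value, not a figure from the report.
    """
    main: Deque[Message] = deque()
    diversion: Deque[Message] = deque()
    pending: Counter = Counter()  # messages per organization in the main queue
    diverted: Set[str] = set()    # organizations already being diverted

    for org_id, seq in messages:
        if org_id in diverted or pending[org_id] >= per_org_limit:
            # Keep diverting once started so this organization's order is preserved.
            diverted.add(org_id)
            diversion.append((org_id, seq))
        else:
            pending[org_id] += 1
            main.append((org_id, seq))
    return main, diversion


if __name__ == "__main__":
    burst = [("org-a", i) for i in range(5)] + [("org-b", 0)]
    main_q, diversion_q = route_messages(burst, per_org_limit=3)
    print(len(main_q), len(diversion_q))
```

In this sketch the burst from `org-a` is capped in the main queue, and the single `org-b` message is still routed to the main queue behind it.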

Contributing Factors

  • Monitoring Gaps: Lack of effective monitoring and alerting for this queue and processing latency.
  • Scaling Issues: Workers were scaled vertically, but not quickly enough to handle the sudden spike in queue size.
  • Delay Building Diversion Queue: Unforeseen issues required manual changes, which extended the time needed to complete the diversion queue.
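
The monitoring gap above can be illustrated with a small sketch of a queue-depth check that warns well before the 20,000-message limit cited in this report. This is an assumption-laden example rather than the alerting JumpCloud uses: the `Alert` type, the `evaluate_queue_depth` function, and the `warn_ratio` value are all hypothetical. The environment label is passed through explicitly so that it stays attached to the alert for routing.

```python
from dataclasses import dataclass
from typing import Optional

HARD_LIMIT = 20_000  # queue limit cited in this report


@dataclass(frozen=True)
class Alert:
    severity: str     # "warning" or "critical"
    environment: str  # label used for routing, e.g. "production"
    message: str


def evaluate_queue_depth(
    depth: int,
    environment: str,
    warn_ratio: float = 0.5,  # hypothetical early-warning fraction of the limit
    hard_limit: int = HARD_LIMIT,
) -> Optional[Alert]:
    """Return an alert if the queue depth crosses the warning or hard threshold."""
    warn_at = int(hard_limit * warn_ratio)
    if depth >= hard_limit:
        return Alert("critical", environment, f"queue depth {depth} reached limit {hard_limit}")
    if depth >= warn_at:
        return Alert("warning", environment, f"queue depth {depth} passed {warn_at}")
    return None


if __name__ == "__main__":
    for depth in (5_000, 12_000, 20_500):
        print(depth, evaluate_queue_depth(depth, "production"))
```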

Immediate Actions Taken

  1. Scaling Up: Increased the number of worker instances to process the backlog of messages.
  2. Configuration Adjustments: Reviewed and adjusted the FIFO queue and worker settings for optimal performance.
  3. Diversion Queue: Created a diversion queue for the influx of messages.

Long-term Corrective Actions

  1. Automate Message Diversion: Review and optimize diversion queue policies to ensure that diversion queues are built rapidly.
  2. Monitoring and Alerts: Review and expand monitoring and alerting mechanisms for early detection of queue build-ups and processing delays.
  3. Documentation and Training: Document the incident and update operational runbooks.
Posted Jun 21, 2024 - 12:20 MDT

Resolved
This incident has been resolved.
Posted Jun 06, 2024 - 12:43 MDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 06, 2024 - 11:34 MDT
Update
We are continuing to work on a fix for this issue.
Posted Jun 06, 2024 - 09:08 MDT
Identified
We have identified the issue and are working on a solution.
Posted Jun 06, 2024 - 06:44 MDT
Investigating
We have received reports that synchronization of user information between JumpCloud and Google Workspace is not occurring as expected. We are investigating these reports.
Posted Jun 06, 2024 - 05:47 MDT
This incident affected: G Suite Integration.