Date: 2024-06-21
Date of Incident: 2024-06-05
Description: RCA for Google Apps Directory Dispatch Delay
Summary:
At approximately 20:30 MDT on 2024-06-05, JumpCloud received a large influx of messages in a Google Apps FIFO queue. By 21:40 MDT on 2024-06-05 the queue exceeded 20,000 messages, causing processing delays for some customers. At 21:24 on 2024-06-05 an alert fired for this delay, but it was incorrectly labeled as a non-production issue and was therefore misrouted. At 07:15 MDT on 2024-06-06 the root cause was identified, and a diversion queue was constructed to which the affected messages were moved for processing. The processing delay lasted until approximately 11:30 MDT on 2024-06-06.
Root Cause:
A contiguous block of more than 50,000 messages from a single organization was sent to the main Google Apps FIFO queue. Normally this would trigger construction of a diversion queue before the queue reached 20,000 messages, but that step was missed because of an incorrect alerting change made during work in the Google Apps application layer. This FIFO queue is partitioned by organization ID, and once the limit of 20,000 messages is reached, only one message at a time can be processed, regardless of the number of available worker threads.
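
The diversion step described above can be illustrated with a minimal Python sketch. It is not JumpCloud's implementation: the function names, the in-memory deque, the (org_id, payload) message shape, and the rule of diverting the organization with the most queued messages are all assumptions made for illustration. Only the 20,000-message limit and the partitioning by organization ID come from this report.

```python
from collections import Counter, deque

# Limit taken from the report; the trigger logic around it is assumed.
QUEUE_LIMIT = 20_000


def pick_org_to_divert(queue, trigger=QUEUE_LIMIT):
    """Return the org ID with the most queued messages once the backlog
    reaches `trigger`, or None while the backlog is still below it.

    `queue` holds (org_id, payload) tuples (hypothetical message shape).
    """
    if len(queue) < trigger:
        return None
    counts = Counter(org_id for org_id, _ in queue)
    org_id, _ = counts.most_common(1)[0]
    return org_id


def divert(queue, org_id):
    """Split `queue` into (remaining, diverted) deques.

    Messages keep their relative FIFO order in both outputs, so per-org
    ordering is preserved for every organization.
    """
    remaining, diverted = deque(), deque()
    for message in queue:
        target = diverted if message[0] == org_id else remaining
        target.append(message)
    return remaining, diverted


# Small-scale usage with a trigger of 10 instead of 20,000.
main_queue = deque(
    [("org-a", i) for i in range(3)]
    + [("org-b", i) for i in range(10)]
    + [("org-a", 3)]
)
noisy_org = pick_org_to_divert(main_queue, trigger=10)
if noisy_org is not None:
    main_queue, diversion_queue = divert(main_queue, noisy_org)
```

In this sketch the trigger is checked against total queue depth. Setting `trigger` somewhat below the hard limit would match the report's statement that diversion should start before 20,000 messages are queued.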