Date: Jan 17, 2025
Date of Incident: Jan 16, 2025
Description: RCA for Apple MDM Un-enrollment
To start, we’d like to acknowledge and apologize for the impact this incident had. We pride ourselves on operating with excellence and have increased our efforts to minimize impact when things go wrong. We missed the mark here.
Summary:
On January 16th at 11:38 AM MT, JumpCloud deployed a new version of Apple MDM. At 11:47 AM, JumpCloud detected macOS devices un-enrolling from Apple MDM and began an investigation. At 12:03 PM, a formal incident was declared and our incident management team came online to coordinate multiple teams in recovery efforts. At 12:14 PM, a feature flag was disabled, which stopped any further un-enrollments from being queued (more on this later). At 12:15 PM, the new code was completely rolled back, and at 12:32 PM JumpCloud deleted the command queue containing any pending un-enrollment directives. From this point, the teams worked to find the quickest way for customers to safely re-enroll systems, and prepared a communications plan to make affected customers aware of the issue.
Root Cause:
First, let's start with some context on how JumpCloud uses feature flags. Our practice is to deliver code to production through small changes. This reduces the risks inherent in large changes and allows us to quickly identify which specific change could be responsible for erroneous behavior. Feature flags are essentially if-statements in the code that determine which path to follow and execute. When a flag is “on”, the new code is executed, and when the flag is “off”, the new code is skipped. Our teams use these often with our deployments, which gives us the ability to turn features on or off based on certain attributes (like organization id) without modifying the source code. This is where things went wrong.
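To make the "if-statement" description concrete, here is a minimal sketch of a flag that gates a new code path by organization id. It is illustrative only and is not JumpCloud's implementation: the FeatureFlag class, the handle_checkin function, and the flag name apple_ddm_support are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    """Hypothetical flag record: a global switch plus an optional org allow-list."""
    name: str
    enabled: bool = False
    allowed_org_ids: frozenset = frozenset()

    def is_on_for(self, org_id: str) -> bool:
        # A disabled flag is off for everyone, regardless of the allow-list.
        if not self.enabled:
            return False
        # An empty allow-list means the flag applies to every organization.
        return not self.allowed_org_ids or org_id in self.allowed_org_ids


def handle_checkin(flag: FeatureFlag, org_id: str) -> str:
    """Choose a code path based on the flag; the return value names the path taken."""
    if flag.is_on_for(org_id):
        return "new_path"       # new behavior, only reachable when the flag is on
    return "existing_path"      # unchanged behavior when the flag is off


if __name__ == "__main__":
    flag = FeatureFlag("apple_ddm_support", enabled=True,
                       allowed_org_ids=frozenset({"org-1"}))
    print(handle_checkin(flag, "org-1"))
    print(handle_checkin(flag, "org-2"))
```

The point of the sketch is that the same deployed code behaves differently depending on flag state alone, so the flag's state in each environment is effectively part of the release.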
The change deployed was part of an effort to introduce Apple Declarative Device Management (DDM) support, which lets devices apply configurations independently based on certain criteria. For an existing device we match on a number of factors, one of which is the unique device identifier (UDID) generated at enrollment. This is a security measure to prevent impersonation by a fake device. Unfortunately, with this new API the UDID did not match for some macOS devices, causing device un-enrollment. This was not caught in our pre-production environments because the intended state for the feature flag controlling this change was off: with the flag off in those environments, the new code path was not active and our testing passed.
Why was the feature flag on? The teams use a rollout status with each flag, along with other identifiers for that code block. What we missed was a validation step ensuring the feature flag was in the expected state before releasing the code to production.
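The missing validation step could look something like the sketch below: before a release, compare the state each flag is expected to have against the state actually found in the target environment, and block the deployment on any mismatch. This is an assumption about the general shape of such a check, not a description of JumpCloud's tooling; the function names and the flag name apple_ddm_support are hypothetical.

```python
def validate_flag_states(expected: dict, actual: dict) -> list:
    """Return a list of mismatch descriptions; an empty list means the check passed."""
    problems = []
    for name, want in expected.items():
        if name not in actual:
            problems.append(f"{name}: missing in target environment (expected {want})")
        elif actual[name] != want:
            problems.append(f"{name}: expected {want}, found {actual[name]}")
    return problems


def gate_deploy(expected: dict, actual: dict) -> None:
    """Raise before deploying if any flag is not in its expected state."""
    problems = validate_flag_states(expected, actual)
    if problems:
        raise RuntimeError("Flag validation failed: " + "; ".join(problems))


if __name__ == "__main__":
    expected = {"apple_ddm_support": False}   # intended state for this release
    production = {"apple_ddm_support": True}  # state actually found (hypothetical)
    try:
        gate_deploy(expected, production)
    except RuntimeError as err:
        print(err)
```

Run as a required step in the release pipeline, a check of this kind fails closed: a flag that is unexpectedly on, or missing entirely, stops the deployment rather than letting an untested code path go live.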
What are you doing to ensure this doesn’t happen again? With every incident, we perform a thorough post-incident review to address gaps in many areas, including process and testing. We take these very seriously and discuss all incidents on a bi-weekly cadence with the entire engineering organization. These reports then get rolled up to our executive staff. This incident clearly exposed a gap in validation, and we already have changes in flight to address it. We’ve also paused all further deployments until this gap (and any others we find during this investigation) is closed and the fix is approved by our SRE team.
Corrective Actions / Risk Mitigation: