Date of Incident: 2023-12-18
Description: RCA for Admin Portal Login with MFA
At 08:26 MST on 2023-12-18, some JumpCloud customers experienced failures while authenticating to JumpCloud’s Admin console using MFA. Two additional issues were present during this time. First, password updates for users failed to complete. Second, if a user or admin changed their TOTP, subsequent attempts to sign in with that TOTP failed. This degradation of service lasted until 09:49 MST on 2023-12-18. Admins who attempted a TOTP reset during the incident experienced lingering issues that required further intervention.
At JumpCloud we make extensive use of the principle of least privilege (PoLP). To accomplish this, our services use IAM policies to access only the resources they need to fulfill requests. We manage those IAM policies with Infrastructure as Code (IaC) so that policy changes are properly reviewed before they are merged into our Git repositories and released. Our release pipelines allow us to validate and test IAM policy changes in our pre-production environment before applying them to production. During the validation step we run what is called a speculative plan, which shows what the changes will look like so the team can do a final review without having to apply them.
On December 18th at 08:07 MST, the plan for the production policy change ran successfully and showed us what the policy would look like. However, because of differences between the test application and the production application, the size of the IAM policy differed between the two, and because we had no test that checked policy size, all tests passed. At 08:26 MST the team moved forward and applied the changes to the production service. Most policy changes applied without issue, but one policy exceeded the character limit and failed to apply. This left the service unable to reach the resources it needed to fulfill requests. At 08:32 MST the team was alerted to elevated failure rates on the admin login window and began investigating. At 08:52 MST the team identified the problematic policy and attempted to fix it and roll forward. The fix was unsuccessful, and the team pivoted to rolling the change back. The rollback completed at 09:47 MST and service was restored.
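The missing policy-size test described above could look roughly like the following. This is a minimal sketch, not the check our pipelines actually run. It assumes the policy is an AWS IAM managed policy, for which AWS documents a quota of 6,144 characters with whitespace not counted, and it assumes the rendered policy is available as a Python dictionary. The names `MANAGED_POLICY_MAX_CHARS`, `policy_size`, and `check_policy_size` are hypothetical.

```python
import json

# Assumption: AWS IAM managed policy quota (6,144 characters, whitespace not counted).
# Other policy types and other providers have different limits.
MANAGED_POLICY_MAX_CHARS = 6144


def policy_size(policy: dict) -> int:
    """Return the length of the policy serialized as compact JSON.

    Compact separators drop all whitespace between tokens. Whitespace inside
    string values is still counted, so the result is a conservative estimate.
    """
    return len(json.dumps(policy, separators=(",", ":")))


def check_policy_size(policy: dict, limit: int = MANAGED_POLICY_MAX_CHARS) -> None:
    """Raise ValueError if the rendered policy would exceed the size limit."""
    size = policy_size(policy)
    if size > limit:
        raise ValueError(f"policy is {size} characters, limit is {limit}")
```

Run against the policy as rendered for production rather than for the test application, a check like this would fail during validation, before any apply is attempted.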
One side effect of this incident was that updates to TOTP devices failed in an unexpected way. End-user devices showed the new TOTP, but our backend was unable to save the change. As a result, users who made updates during the incident window failed subsequent logins with their new TOTP. This failure mode was not apparent to the team investigating the incident, so no updates about it were made to our status page.
Corrective Actions / Risk Mitigation: