Admin Portal MFA Login

Incident Report for JumpCloud

Postmortem

JumpCloud Incident Report

Date: 2023-12-20

Date of Incident: 2023-12-18

Description: RCA for Admin Portal Login with MFA

Summary:

At 08:26 MST on 2023-12-18, some JumpCloud customers began experiencing failures while authenticating to JumpCloud’s Admin console using MFA. Two additional issues were present during this time. First, password updates for users failed to complete. Second, if a user or admin changed their TOTP, subsequent attempts to sign in with that TOTP failed. This degradation of service lasted until 09:49 MST on 2023-12-18. Admins who attempted a TOTP reset during the incident experienced lingering issues that required further intervention.

Root Cause:

At JumpCloud we make extensive use of the principle of least privilege (PoLP). To accomplish this, our services use IAM policies that grant access only to the resources they need to fulfill requests. We manage those IAM policies with Infrastructure as Code (IaC), which ensures that policy changes are properly reviewed before they are merged into our git repositories and released. Our release pipelines allow us to validate and test IAM policy changes in our pre-production environment before applying them to production. During the validation step we run what is called a speculative plan, which shows what the changes will look like so the team can do a final review without applying them.

On December 18th at 08:07 MST, the plan for the production policy change ran successfully and showed us what the policy would look like. However, because of differences between the test application and the production application, the size of the IAM policy differed between the two environments, and because no test checked policy size, all tests passed. At 08:26 MST the team moved forward and applied the changes to the production service. Most policy changes applied without issue, but one policy exceeded the character limit and failed to apply. As a result, the service could not reach the resources it needed to fulfill requests. At 08:32 MST the team was alerted to elevated failure rates on the admin login window and began investigating. At 08:52 MST the team identified the problematic policy and attempted to fix it and roll forward. The fix was unsuccessful, so the team pivoted to rolling the change back. The rollback completed at 09:47 MST and service was restored.
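The missing safeguard described above, a test that checks policy size before a change is applied, could look roughly like the following. This is a minimal sketch, not JumpCloud's actual tooling. It assumes the policies are AWS IAM customer managed policies, for which AWS documents a quota of 6,144 characters with whitespace not counted. The helper names and the example ARNs are hypothetical.

```python
import json

# Assumption: AWS IAM's documented quota for a customer managed policy is
# 6,144 characters, and whitespace is not counted toward that quota.
MANAGED_POLICY_CHAR_LIMIT = 6144


def policy_size(policy: dict) -> int:
    """Return the size of the rendered policy in characters, ignoring whitespace."""
    compact = json.dumps(policy, separators=(",", ":"))
    return sum(1 for ch in compact if not ch.isspace())


def check_policy_size(policy: dict, limit: int = MANAGED_POLICY_CHAR_LIMIT) -> None:
    """Raise ValueError if the rendered policy would exceed the size limit."""
    size = policy_size(policy)
    if size > limit:
        raise ValueError(f"policy is {size} characters, limit is {limit}")


def build_policy(resource_arns) -> dict:
    """Hypothetical helper: render a policy for a given set of resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": list(resource_arns),
            }
        ],
    }
```

The check has to run against the policy as rendered for each environment. A policy built from a handful of pre-production resources can pass while the same template rendered with the production resource list does not.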

One side effect of this incident was that our service failed in an unexpected way when TOTP devices were updated. End user devices showed the new TOTP, but our backend was unable to save the change. As a result, users who made updates during the incident window failed subsequent logins using their new TOTP. This failure mode was not apparent to the team investigating the incident, so no updates to our status page were made.
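One general way to avoid this kind of split state is to hand a new TOTP secret to the client only after the backend has confirmed that it was saved. The sketch below illustrates that ordering. It is not JumpCloud's implementation, and `SecretStore`, `EnrollmentError`, and `rotate_totp_secret` are hypothetical names introduced for illustration.

```python
import base64
import secrets
from typing import Protocol


class SecretStore(Protocol):
    """Hypothetical persistence interface for TOTP secrets."""

    def save(self, user_id: str, secret: str) -> None: ...


class EnrollmentError(Exception):
    """Raised when a new TOTP secret could not be persisted."""


def rotate_totp_secret(store: SecretStore, user_id: str) -> str:
    """Generate a new TOTP secret and return it only after it has been saved."""
    # 20 random bytes, base32-encoded, which is a common shared-secret format.
    new_secret = base64.b32encode(secrets.token_bytes(20)).decode("ascii")
    try:
        store.save(user_id, new_secret)
    except Exception as exc:
        # Fail loudly: the caller must never show the user a secret
        # that the backend did not store.
        raise EnrollmentError(f"TOTP secret for {user_id} was not saved") from exc
    return new_secret
```

With this ordering, a storage failure such as a denied permission surfaces as an explicit enrollment error, and the user's device never receives a secret that the backend did not store.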

Corrective Actions / Risk Mitigation:

1. Revert the offending policy change - DONE
2. Introduce a manual stop-gap process until action 4 is completed - DONE
3. Process updated for Statuspage requiring additional communication on any possible lingering effects - DONE
4. Additional automated safeguards for changes to managed policies - TARGET 01/2024
5. Fix failure mode to prevent unexpected password or device state - IN REVIEW

Posted Dec 20, 2023 - 07:57 MST

Resolved

This incident has been resolved.

Posted Dec 18, 2023 - 09:59 MST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Dec 18, 2023 - 09:53 MST

Identified

The issue has been identified and a fix is being implemented.

Posted Dec 18, 2023 - 08:58 MST

Investigating

We are currently experiencing intermittent failures with MFA login to the Admin Portal, and the inability to change passwords. We are actively working this issue and will update as we know more.

Posted Dec 18, 2023 - 08:54 MST

This incident affected: Admin Console.