3rd Party Provider Operational Issues

Incident Report for JumpCloud

Postmortem

Date: Oct 27, 2025

Date of Incident: Oct 20, 2025

Description: Root Cause Analysis (RCA) for Service Degradation Linked to AWS US-EAST-1 Regional Disruption

Summary:

On October 20, 2025, between approximately 07:00 UTC and 10:00 UTC, the JumpCloud platform experienced significant performance degradation. This primarily affected the responsiveness of our core APIs, administrative console access, user console access, and single sign-on (SSO) functions for customers. Trial creation and user updates were also affected, and some Directory Insights events may have been lost. Residual performance degradation, affecting services such as our Privileged Access Management (PAM) feature and specific Support Portal APIs, persisted until approximately 17:00 UTC, at which point all services were fully restored.

The degradation was not caused by a failure within the JumpCloud platform code or infrastructure configuration; it was a direct consequence of a severe, cascading failure within the Amazon Web Services (AWS) US-EAST-1 region. Although many of our services are deployed across multiple Availability Zones (AZs) within the region for resilience, the AWS issue impacted fundamental regional services and compromised inter-AZ communication, which prevented some of our standard failover mechanisms from operating successfully. Services returned to full operational status after AWS reported that the impacted foundational services were stable.

Root Cause:

Based on AWS's post-incident analysis, the service disruption was not a single event but a sequence of cascading failures across three fundamental AWS services:

1. DynamoDB Failure Due to Latent DNS Race Condition (Initial Trigger)
2. EC2 Launch Congestion and Network Propagation Delays (Sustained Impact)
3. Network Load Balancer (NLB) Health Check Instability (Final Phase)

Our monitoring and alerting systems paged all affected service teams, and our incident management team was engaged to coordinate the response and identify where we could throttle traffic to stabilize remaining capacity.

Corrective Actions / Risk Mitigation:

While our current architecture mitigates some single-Availability Zone (AZ) failures, our primary focus is to eliminate single-region dependency entirely. JumpCloud is actively engaged in strategic engineering initiatives to further strengthen the platform's foundation. We are conducting a thorough review of cross-region dependencies and replication strategies to enhance our service's resilience against widespread environmental disruptions, always striving to meet the highest standards of availability.

Posted Oct 27, 2025 - 13:19 MDT

Resolved

This incident has been resolved.

Posted Oct 20, 2025 - 04:14 MDT

Monitoring

We are continuing to monitor as we see further recovery with all services.

Posted Oct 20, 2025 - 04:02 MDT

Identified

We are monitoring our services and are starting to see some recovery.

Posted Oct 20, 2025 - 03:11 MDT

Investigating

Due to an issue with our Cloud Provider, JumpCloud is experiencing intermittent issues with multiple services. We are investigating this with our provider and will update this incident as we know more.

Posted Oct 20, 2025 - 01:52 MDT
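
The status updates above are stamped in MDT, while the postmortem summary gives the incident window in UTC. The following is a minimal sketch for lining the two up. It assumes MDT is a fixed UTC-6 offset, which holds for October 20, 2025; the names `MDT`, `UPDATES_MDT`, `UPDATES_UTC`, and `to_utc` are illustrative and not part of any JumpCloud tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MDT is a fixed UTC-6 offset (valid for October 20, 2025).
MDT = timezone(timedelta(hours=-6), name="MDT")

# Posting times copied from the status updates, in local MDT.
UPDATES_MDT = {
    "Investigating": "2025-10-20 01:52",
    "Identified": "2025-10-20 03:11",
    "Monitoring": "2025-10-20 04:02",
    "Resolved": "2025-10-20 04:14",
}


def to_utc(local_stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' MDT timestamp and return it in UTC."""
    local = datetime.strptime(local_stamp, "%Y-%m-%d %H:%M").replace(tzinfo=MDT)
    return local.astimezone(timezone.utc)


UPDATES_UTC = {status: to_utc(stamp) for status, stamp in UPDATES_MDT.items()}

if __name__ == "__main__":
    for status, when in UPDATES_UTC.items():
        print(f"{status:<13} {when:%H:%M} UTC")
```

Under that assumption, the four updates were posted at 07:52, 09:11, 10:02, and 10:14 UTC, which is broadly in line with the approximate 07:00 to 10:00 UTC window given in the summary.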

This incident affected: Admin Console (Admin Console, Admin Console - EU Region) and User Console (User Console, User Console - EU Region).