JumpCloud Incident Report
Date: 2023-08-22
Date of Incident: 2023-08-09
Description: RCA for Increased error rates on Console
Summary:
At approximately 15:00 MDT on 2023-08-09, some JumpCloud customers experienced failures requiring a retry while authenticating to JumpCloud’s Admin and User console. The increased error rates were the result of a proxy cluster losing the ability to connect to one production load balancer for our container orchestration. This degradation of service lasted until approximately 16:00 MDT on 2023-08-09.
Root Cause:
At this time, JumpCloud has not been able to reproduce this issue in any of our test environments, but there are two hypotheses:
- A bug in the cache interface for the proxy. This bug may have surfaced during a regular rotation of IPs where the proxy held onto stale connections with a standard rotation of DNS. However, JumpCloud has not witnessed any similar behavior prior or since, and can not replicate this behavior in any testing.
- More likely, there was a failure with our cloud provider where the downstream services were not handling the requests, and the clients timed out for a subset of IPs. Unfortunately, key data was missing from our VPC flow logs to confirm this behavior. If this issue occurs again we will enable the flow logs to ensure we capture the necessary debugging information.
Corrective Actions / Risk Mitigation:
- Add required detail to VPC flow logs - DONE
- Increased logging for proxy cluster - DONE
- Upgrade proxy cluster - Target 10/2023