Admin and User Portal increased error rates
Incident Report for JumpCloud
Postmortem

JumpCloud Incident Report

Date: 2023-08-22

Date of Incident: 2023-08-09

Description: RCA for increased error rates on Console

Summary:

At approximately 15:00 MDT on 2023-08-09, some JumpCloud customers experienced failures that required a retry while authenticating to JumpCloud’s Admin and User consoles. The increased error rates resulted from a proxy cluster losing the ability to connect to one production load balancer for our container orchestration. This degradation of service lasted until approximately 16:00 MDT on 2023-08-09.

Root Cause:

At this time, JumpCloud has not been able to reproduce this issue in any of our test environments, but there are two hypotheses:

  1. A bug in the cache interface for the proxy. This bug may have surfaced during a regular rotation of IPs, in which the proxy held onto stale connections after a standard DNS rotation. However, JumpCloud has not witnessed similar behavior before or since, and cannot replicate it in any testing.
  2. More likely, there was a failure with our cloud provider in which the downstream services were not handling the requests, and the clients timed out for a subset of IPs. Unfortunately, the key data needed to confirm this behavior was missing from our VPC flow logs. If this issue occurs again, we will enable the flow logs to ensure we capture the necessary debugging information.
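
As an illustration of the first hypothesis only, the sketch below shows one way to spot addresses that a proxy is still holding connections to after DNS has rotated away from them. It is not JumpCloud's tooling or the proxy's actual cache logic. The function names and the sample addresses are hypothetical, and the approach (comparing a cached set of IPs against a fresh DNS answer) is an assumption about how such a check might be done.

```python
import socket


def resolve_ips(hostname, port=443):
    """Return the set of IP addresses DNS currently reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def find_stale_ips(cached_ips, resolved_ips):
    """Return cached IPs that no longer appear in the current DNS answer."""
    return sorted(set(cached_ips) - set(resolved_ips))


if __name__ == "__main__":
    # Hypothetical values: addresses the proxy still holds connections to,
    # and the addresses a fresh DNS lookup returns after a rotation.
    cached = {"10.0.0.1", "10.0.0.2"}
    current = {"10.0.0.2", "10.0.0.3"}
    print(find_stale_ips(cached, current))
```

In practice, resolve_ips would supply the current set for the load balancer's hostname, and any address reported by find_stale_ips would be a candidate for a stale connection.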

Corrective Actions / Risk Mitigation:

  1. Add required detail to VPC flow logs - DONE
  2. Increase logging for proxy cluster - DONE
  3. Upgrade proxy cluster - Target 10/2023
Posted Aug 22, 2023 - 12:01 MDT

Resolved
This incident has been resolved.
Posted Aug 09, 2023 - 17:19 MDT
Monitoring
We continue to see recovery and a steady decrease in error rates, and are actively monitoring all services.
Posted Aug 09, 2023 - 17:00 MDT
Update
We are starting to see recovery, but are continuing to investigate this incident.
Posted Aug 09, 2023 - 16:17 MDT
Update
We are continuing to investigate this issue.
Posted Aug 09, 2023 - 15:12 MDT
Investigating
We are currently investigating an increase in error rates for the User and Admin Portals. We will update this incident as we know more.
Posted Aug 09, 2023 - 15:10 MDT
This incident affected: Admin Console, Multi-Tenant Portal (MTP), and User Console.