Increased error rates with some calls to API endpoints
Incident Report for JumpCloud
Postmortem

JumpCloud Incident Report

Date: 2022-06-27

Date of Incident: 2022-06-22

Description: Root cause analysis (RCA) for intermittently increased error rates on some API endpoints

Summary:

At approximately 07:20 MDT on 2022-06-22, some JumpCloud customers began experiencing intermittently reduced success rates on requests to some API endpoints. This behavior lasted until approximately 11:00 MDT on 2022-06-22.

Root Cause:

A sharp increase in container count saturated one host, causing it to intermittently exceed the limits set by specific kernel parameters. As a result, some services dedicated to that host experienced delayed response times, which increased error rates.
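
This report does not name the kernel parameters or thresholds involved. As a purely illustrative sketch, assuming a Linux host and using the system-wide file-handle limit as a stand-in, usage against a kernel limit could be measured and mapped to an alert level roughly as follows. The threshold values and all function names here are hypothetical, not JumpCloud's actual tooling.

```python
from pathlib import Path
from typing import Tuple

# Hypothetical alert thresholds; the report does not state the real values.
WARN_RATIO = 0.70
CRIT_RATIO = 0.90


def usage_ratio(current: int, limit: int) -> float:
    """Return current usage as a fraction of a kernel limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current / limit


def classify(ratio: float, warn: float = WARN_RATIO, crit: float = CRIT_RATIO) -> str:
    """Map a usage ratio to an alert level."""
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


def parse_file_nr(text: str) -> Tuple[int, int]:
    """Parse /proc/sys/fs/file-nr, which holds three integers:
    allocated file handles, allocated-but-unused handles, and the maximum."""
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def file_handle_level(path: str = "/proc/sys/fs/file-nr") -> str:
    """Read the Linux file-handle counters and classify their usage.
    The file-handle limit is only an example; it is an assumption,
    not the parameter identified in this incident."""
    allocated, maximum = parse_file_nr(Path(path).read_text())
    return classify(usage_ratio(allocated, maximum))
```

On a Linux host, calling `file_handle_level()` would return one of `"ok"`, `"warning"`, or `"critical"`. Keeping the ratio and classification logic separate from the `/proc` read makes the same check reusable for any other limit a host might approach.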

Corrective Actions / Risk Mitigation:

  1. Rescale all our hosts horizontally to reduce the impact radius of such issues and the load against host limits - DONE
  2. Deploy tested and tuned kernel parameters across hosts - DONE
  3. Increase alerting around these thresholds - DONE
Posted Jun 22, 2022 - 22:40 MDT

Resolved
This incident has been resolved.
Posted Jun 22, 2022 - 13:04 MDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 22, 2022 - 12:23 MDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 22, 2022 - 10:50 MDT
Investigating
We are currently seeing an increase in error rates to some API endpoints. We are actively investigating and will report back as soon as we have any new information.
Posted Jun 22, 2022 - 09:48 MDT
This incident affected: General Access API.