Degraded RADIUS Service

Incident Report for JumpCloud

Postmortem

Incident Report

Date: Jun 12, 2025

Date of Incident: Jun 10, 2025

Description: RCA for USE1 RADIUS Service Disruption

Summary:

On June 10, 2025, between approximately 13:25 UTC and 15:30 UTC, our RADIUS authentication service experienced an unexpected outage following a planned operating system upgrade in the USE1 region. Service metrics indicated an underlying problem, and our engineering team was engaged immediately. To restore stability, the upgrade was rolled back, which returned all services to their normal state.

Root Cause:

The outage was triggered by a critical application failure that only manifested under the high-concurrency and heavy traffic conditions of our production environment. Our post-incident analysis identified multiple contributing factors:

  • Application Crash: The direct cause was a segmentation fault (SIGSEGV) in the FreeRADIUS application, which terminated with exit code 139 (128 plus 11, the signal number for SIGSEGV). We suspect the memory violation stems from an incompatibility between the application's dependencies and the new operating system, likely in how the new environment handles memory management or system libraries.
  • Load Sensitivity: The incompatibility was not detected during staging and pre-production testing. Post-revert analysis revealed that the application on the upgraded OS experienced significant spikes in CPU and memory utilization. This increased resource consumption became critical only when exposed to the high volume and concurrency of production user traffic, which ultimately triggered the memory access violation.
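
For readers unfamiliar with exit code 139: shells and most process supervisors on Linux report a process killed by signal N as exit status 128 + N, and SIGSEGV is signal 11. The following is a minimal sketch of that decoding, not part of our tooling; the function name describe_exit_code is hypothetical, and it assumes the status was reported using the 128 + N convention.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit status that follows the 128 + N signal convention.

    Hypothetical helper for illustration only. It assumes the status was
    reported by a shell or supervisor that encodes "killed by signal N"
    as 128 + N.
    """
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            # Number does not correspond to a signal known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(139))
```

Decoding the status this way separates a crash (the process was killed by a signal) from a deliberate non-zero exit, which is why exit code 139 points to a memory access violation rather than a configuration error.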

Corrective Actions / Risk Mitigation:

  1. Reverted the operating system upgrade immediately - DONE
  2. Increased alerting and monitoring at this layer - DONE
  3. Update our performance testing environment to more accurately simulate the peak traffic patterns of production with more RADIUS protocol variants - IN PROGRESS
  4. Perform a deeper dependency analysis of the FreeRADIUS application and its libraries on the upgraded OS - IN PROGRESS
  5. Review our maintenance procedures to ensure that traffic volume is analyzed and maintenance windows are scheduled at optimal times - IN PROGRESS
Posted Jun 12, 2025 - 14:41 MDT

Resolved

This incident has been resolved.
Posted Jun 10, 2025 - 10:22 MDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 10, 2025 - 09:57 MDT

Identified

The issue with RADIUS has been identified and we are working on a fix.
Posted Jun 10, 2025 - 09:20 MDT

Investigating

We are investigating reports of issues with JumpCloud's RADIUS-as-a-Service and working to determine the cause.
Posted Jun 10, 2025 - 09:13 MDT
This incident affected: RADIUS.