Date: Jun 12, 2025
Date of Incident: Jun 10, 2025
Description: RCA for USE1 RADIUS Service Disruption
Summary:
On June 10th, between approximately 13:25 UTC and 15:30 UTC, following a planned operating system upgrade in the USE1 region, our RADIUS authentication service experienced an unexpected outage. Metrics indicated there was an underlying problem, and our engineering team was immediately engaged. To ensure service stability, the upgrade was rolled back, successfully restoring all services to their normal state.
Root Cause:
The outage was triggered by a critical application failure that only manifested under the high-concurrency and heavy traffic conditions of our production environment. Our post-incident analysis identified multiple contributing factors:
SIGSEGV
or Segmentation Fault) in the FreeRADIUS application, resulting in its termination with exit code 139
. This memory violation was triggered by a suspected underlying incompatibility between the application's dependencies and the new operating system, likely stemming from differences in how the new environment handles memory management or system libraries.
Corrective Actions / Risk Mitigation: