Degraded PUSH Service iOS
Incident Report for JumpCloud
Postmortem

JumpCloud Incident Report

Date: 2024-08-23

Date of Incident: 2024-08-20

Description: Root cause analysis (RCA) for JumpCloud Push Notifications

Summary:

On August 20th at 3:46 AM Mountain Time, the APNs certificate attached to the AWS SNS Platform Application for Apple expired. Apple devices could no longer receive or acknowledge push notifications, and new device registrations began failing. Devices that were already registered could still authenticate using the TOTP code in the JumpCloud Protect app.

The on-call team was paged, and the on-call engineer began investigating. An internal incident was declared at 4:31 AM, and responders from several teams began triaging MFA failures. The team traced the errors to SNS and escalated for additional support at 4:53 AM. At 5:25 AM, the team attempted to rotate the certificate manually in SNS, but access restrictions blocked the change, so it had to be made through IaC configuration instead. Discovering and working around this limitation lengthened the recovery window. By 5:59 AM, the Notification Service had picked up the new APNs certificate, and a rolling restart was initiated. The APNs certificate was updated in the AWS SNS Platform Application by 6:10 AM, and push notifications began working again.

Root Cause:

The root cause of the incident was the expiration of the APNs certificate attached to the AWS SNS Platform Application for Apple. Once it expired, Apple devices could not receive or acknowledge push notifications, and new device registrations failed.

Corrective and Preventative Actions:

Immediate Corrective Actions:

  • Manual rotation of the expired APNs certificate in the JumpCloud Notification Service.
  • Update of the APNs certificate in AWS SNS Platform Application.

Preventative Actions:

  • Implement monitoring to alert the team well in advance of certificate expiration dates for SNS Platform Applications.
  • Review all critical certificates to ensure monitoring coverage.
  • Automate the certificate renewal process where possible to reduce the risk of human error.
  • Conduct training sessions for the team on handling certificate expirations and renewals.
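
The expiry monitoring described in the first preventative action could be sketched roughly as follows. This is an illustration, not JumpCloud's actual tooling. It assumes the SNS platform application attributes include `AppleCertificateExpiryDate` as an ISO 8601 timestamp, and that `boto3` is installed for the optional fetch step. The names `WARN_DAYS`, `CRITICAL_DAYS`, `expiry_status`, and `fetch_attributes` are hypothetical, as are the threshold values.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Hypothetical alert thresholds; the report does not specify any.
WARN_DAYS = 30
CRITICAL_DAYS = 7


def parse_expiry(value: str) -> datetime:
    """Parse an ISO 8601 timestamp into a timezone-aware UTC datetime."""
    # Older Python versions reject a trailing 'Z', so normalize it first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def expiry_status(attributes: dict, now: datetime) -> Tuple[str, Optional[int]]:
    """Classify a platform application's certificate by whole days remaining."""
    # Assumption: the expiry is exposed under this attribute name.
    raw = attributes.get("AppleCertificateExpiryDate")
    if raw is None:
        return ("unknown", None)
    remaining = parse_expiry(raw) - now
    days_left = remaining.days  # floored; negative once expired
    if remaining.total_seconds() <= 0:
        return ("expired", days_left)
    if days_left < CRITICAL_DAYS:
        return ("critical", days_left)
    if days_left < WARN_DAYS:
        return ("warning", days_left)
    return ("ok", days_left)


def fetch_attributes(platform_application_arn: str) -> dict:
    """Read the platform application's attributes from SNS (requires boto3)."""
    import boto3  # third-party; imported lazily so the check above stays stdlib-only

    client = boto3.client("sns")
    response = client.get_platform_application_attributes(
        PlatformApplicationArn=platform_application_arn
    )
    return response["Attributes"]
```

A scheduled job could call `expiry_status(fetch_attributes(arn), datetime.now(timezone.utc))` for each platform application and page the team on anything other than `"ok"`. Treating a missing attribute as `"unknown"` rather than `"ok"` keeps gaps in monitoring coverage visible.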
Posted Aug 23, 2024 - 10:36 MDT

Resolved
This incident has been resolved.
Posted Aug 20, 2024 - 06:21 MDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 20, 2024 - 06:16 MDT
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 20, 2024 - 05:30 MDT
Investigating
We're currently investigating reports of degraded performance or intermittent connectivity issues with JumpCloud's PUSH Services on iOS.
Posted Aug 20, 2024 - 05:04 MDT
This incident affected: TOTP / MFA / JumpCloud Protect.