
Date: Nov 21, 2025
Date of Incident: Nov 19, 2025
Description: RCA for Admin Portal Login Errors
Summary:
On November 19, 2025, from approximately 04:30 UTC until roughly 06:30 UTC, between 1% and 5% of authentication requests to the Admin Console failed intermittently. Users attempting to authenticate during this window received an “unexpected” error message, although a subsequent retry may have succeeded.
Root Cause:
This issue was triggered during a standard infrastructure update in which traffic was shifted to a new, updated cluster. Two factors combined to cause it: an infrastructure configuration mismatch and gaps in our detection and validation processes.
- Configuration Drift: The new infrastructure cluster (Green Cluster), intended to host the service, was missing a single but essential configuration value used by the control plane’s service mesh. This value had recently been applied to the existing (old) cluster but was inadvertently left out when the new cluster's baseline configuration was created and branched. When production traffic began routing to the Green Cluster, the missing value caused some access components to fail, which produced the login errors.
- Detection Gaps: The application logged the configuration failure at Warning level rather than as a critical error. As a result, our automated monitoring system did not trigger an alert or rollback when the issue first occurred.
The team isolated the issue to the new Green Cluster and initiated an emergency process to revert all production traffic to the stable old cluster.
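To illustrate the configuration drift described above, the following is a minimal sketch of a parity check between two cluster configurations. It is not our production tooling. The configuration layout, the key names (such as `mesh.auth_audience`), and the `diff_configs` helper are hypothetical and chosen only for this example.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} form so keys can be compared."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(old, new):
    """Report keys missing from, added to, or changed in `new` relative to `old`."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        "missing": sorted(old_flat.keys() - new_flat.keys()),
        "extra": sorted(new_flat.keys() - old_flat.keys()),
        "changed": sorted(
            k for k in old_flat.keys() & new_flat.keys() if old_flat[k] != new_flat[k]
        ),
    }


# Hypothetical configurations: the new cluster lacks one service mesh value.
old_cluster = {"mesh": {"mtls_mode": "strict", "auth_audience": "admin"}, "replicas": 3}
new_cluster = {"mesh": {"mtls_mode": "strict"}, "replicas": 3}

report = diff_configs(old_cluster, new_cluster)
print(report)
```

In this sketch, a non-empty `missing` list would be the signal to block a traffic shift. Empty nested sections are ignored by `flatten`, so a real check might need to treat them explicitly.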
Corrective Actions / Risk Mitigation:
- Automated Configuration Diff Check - Implementing an automated process that continuously compares the configuration of the old and new production clusters and verifies 100% parity during all transition phases.
- Clear Rule Enforcement - Reinforcing and automating the process to ensure all configuration changes are applied consistently across all active and future clusters.
- Multi-Layer Error Monitoring - Implementing error rate monitoring at every layer of the network and application stack to ensure no failure goes undetected.
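As an illustration of rate-based error monitoring, the sketch below tracks request outcomes in a sliding window and flags when the failure rate crosses a threshold, regardless of the severity at which the application logged the failure. The `ErrorRateMonitor` class, the window size, the 1% threshold, and the minimum sample count are assumptions made for this example, not the values used in our monitoring system.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the most recent request outcomes and flags elevated failure rates."""

    def __init__(self, window_size=200, threshold=0.01, min_samples=50):
        # Only the latest `window_size` outcomes are kept; older ones drop off.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success):
        """Record one request outcome (True for success, False for failure)."""
        self.outcomes.append(bool(success))

    def error_rate(self):
        """Return the fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self):
        """Alert only once enough samples exist to make the rate meaningful."""
        return (
            len(self.outcomes) >= self.min_samples
            and self.error_rate() >= self.threshold
        )


# Hypothetical traffic: 4 failures in 100 login attempts (a 4% failure rate).
monitor = ErrorRateMonitor(window_size=100, threshold=0.01, min_samples=50)
for i in range(100):
    monitor.record(i % 33 != 0)

print(monitor.error_rate(), monitor.should_alert())
```

A failure rate in the 1-5% range, like the one seen in this incident, would exceed the example threshold even though no critical error was logged. The minimum sample count is there to avoid alerting on a handful of requests.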