On October 20, 2025, Kustomer experienced a significant service disruption affecting all customers on our platform. The incident was triggered by a major AWS service disruption in the us-east-1 region that impacted ~140 AWS services, including core infrastructure components our platform depends on.
The incident occurred in two distinct phases: an initial platform-wide disruption in the early morning that tracked the AWS outage and was resolved by approximately 6:30 AM EDT, and a second wave of elevated error rates beginning at 8:55 AM EDT that was compounded by an authentication issue introduced during our disaster recovery preparations.
Full platform recovery was achieved by 5:00 PM EDT on October 20, with final cleanup operations completed by 7:00 PM EDT. A remaining issue with AIC and AIR observability infrastructure was fully resolved by 11:45 AM EDT on October 21.
Root Cause
Primary Cause: AWS Regional Service Disruption
The incident originated from a widespread AWS service disruption in the us-east-1 region. According to AWS, the root cause was a DNS resolution failure for internal service endpoints, specifically affecting DynamoDB regional endpoints. This DNS issue cascaded into AWS's EC2 internal launch system, creating widespread connectivity problems across the region.
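To illustrate the failure mode AWS described, the sketch below is a hypothetical resolution probe against the DynamoDB regional endpoint; the endpoint name is public, but the retry count and alerting behavior are our own illustration rather than AWS or Kustomer tooling.

```python
import socket

# Hypothetical probe: attempt to resolve the DynamoDB regional endpoint.
# During the outage, this resolution step is what failed for callers in us-east-1.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolve_endpoint(hostname: str, attempts: int = 3) -> list[str]:
    """Return the resolved IP addresses, or raise after repeated failures."""
    last_error = None
    for _ in range(attempts):
        try:
            results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
            return sorted({addr[4][0] for addr in results})
        except socket.gaierror as err:  # DNS resolution failure
            last_error = err
    raise RuntimeError(f"DNS resolution failed for {hostname}") from last_error

if __name__ == "__main__":
    try:
        print(resolve_endpoint(ENDPOINT))
    except RuntimeError as err:
        # In a real monitor this would page on-call rather than print.
        print(f"ALERT: {err}")
```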
Secondary Cause: Disaster Recovery Automation Issue
During our disaster recovery response, an infrastructure configuration issue emerged that prevented proper authentication for approximately 4.5 hours. When disaster recovery preparations were initiated at 11:25 AM EDT, our infrastructure automation discovered that primary and secondary region configurations shared the same management context. As the automation began provisioning resources in the secondary region, it simultaneously triggered scaling changes in the primary region's authentication service, reducing it to a minimal operational state.
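The sketch below shows the kind of safeguard that would have prevented this coupling. It is a minimal illustration, not our actual automation: the function, data structure, and service names are hypothetical, and the point is simply that a provisioning run scoped to the secondary region should refuse to modify primary-region resources.

```python
# Hypothetical guard for disaster recovery automation (illustrative only).
# Intent: a run scoped to the secondary region must never modify resources
# that belong to the primary region's management context.
from dataclasses import dataclass

PRIMARY_REGION = "us-east-1"
SECONDARY_REGION = "us-east-2"

@dataclass
class ScalingChange:
    service: str          # e.g. "auth-service"
    region: str           # region the change will be applied in
    desired_capacity: int

def apply_changes(target_region: str, changes: list[ScalingChange]) -> None:
    # Refuse to proceed if any change would touch a region other than
    # the one this disaster recovery run was scoped to.
    out_of_scope = [c for c in changes if c.region != target_region]
    if out_of_scope:
        names = ", ".join(f"{c.service} ({c.region})" for c in out_of_scope)
        raise RuntimeError(
            f"Aborting: run scoped to {target_region} would modify {names}"
        )
    for change in changes:
        print(f"Scaling {change.service} in {change.region} "
              f"to {change.desired_capacity} instances")

# A shared management context allowed a secondary-region run to carry a
# primary-region scaling change; this guard would reject the whole run.
try:
    apply_changes(SECONDARY_REGION, [
        ScalingChange("auth-service", SECONDARY_REGION, desired_capacity=6),
        ScalingChange("auth-service", PRIMARY_REGION, desired_capacity=1),
    ])
except RuntimeError as err:
    print(err)
```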
Under normal circumstances, an automated deployment would have immediately corrected this configuration. However, this deployment failed due to the ongoing AWS service disruptions, leaving the authentication service's load balancer in an inconsistent state—pointing to a configuration with insufficient capacity while healthy instances remained unreachable.
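A consistency check of the following shape can flag that state: a load balancer whose active target group has little or no healthy capacity. This is a sketch under stated assumptions, not our production tooling; it uses the standard boto3 `describe_target_health` call, and the target group ARN, threshold, and alerting hook are placeholders.

```python
# Sketch: detect an authentication load balancer pointing at a target group
# with insufficient healthy capacity. Assumes boto3 credentials are configured.
import boto3

def healthy_target_count(target_group_arn: str, region: str = "us-east-1") -> int:
    elbv2 = boto3.client("elbv2", region_name=region)
    response = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return sum(
        1
        for description in response["TargetHealthDescriptions"]
        if description["TargetHealth"]["State"] == "healthy"
    )

def check_auth_service(target_group_arn: str, minimum_healthy: int = 2) -> None:
    healthy = healthy_target_count(target_group_arn)
    if healthy < minimum_healthy:
        # In production this would page on-call; printing keeps the sketch simple.
        print(f"ALERT: only {healthy} healthy targets behind the auth load balancer")
    else:
        print(f"OK: {healthy} healthy targets")

# check_auth_service("arn:aws:elasticloadbalancing:...:targetgroup/auth/abc123")
```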
This was an edge case that required a specific combination of factors: (1) initiating disaster recovery operations, (2) the infrastructure management coupling between the secondary region and a primary region that was still partially operational (us-east-1 was significantly degraded but not offline), and (3) simultaneous AWS service failures in us-east-1 preventing the standard recovery mechanism from completing. Had us-east-1 been completely offline, no special handling would have been required for operations in that region.
All times in Eastern Daylight Time
Oct 20, 2025
3:11 AM - Multiple alerts triggered indicating elevated error rates across platform services. Kustomer’s Statuspage was accessible, but on-call responders could not authenticate to provide updates.
3:18 AM - Incident response initiated; engineers begin investigation
4:25 AM - Confirmed correlation with AWS service disruptions in us-east-1 region
5:01 AM - AWS identifies root cause of DNS resolution failures
5:27 AM - 6:03 AM - AWS reports significant recovery; Kustomer platform functionality observed to be restored. Access to Kustomer’s Statuspage was restored at 5:37 AM.
6:30 AM - Initial recovery verified; platform communication channels tested successfully and the Kustomer platform confirmed fully functional
8:55 AM - New wave of elevated error rates detected across multiple services
9:42 AM - AWS announces mitigation in progress that includes EC2 request throttling; provisioning additional compute is significantly limited, which affects autoscaling operations
10:11 AM - Severity escalated; separate incident channel created
10:40 AM - Decision made to initiate disaster recovery preparations in us-east-2 region as contingency
11:02 AM - Engineers begin provisioning infrastructure in secondary region
11:46 AM - Due to authentication issues, Kustomer Technical Support began responding to client inquiries directly via email to maintain correspondence
11:47 AM - 3:48 PM - Multiple customer reports of authentication failures; disaster recovery efforts continue during this window. At 3:48 PM, the root cause of the authentication errors is identified: a load balancer configuration prematurely routing traffic to the secondary region
4:03 PM - AWS reports ongoing recovery across most services
4:15 PM - Load balancer configuration corrected
4:30 PM - Customer authentication restored; users able to access platform
5:00 PM - API traffic normalized; error rates return to baseline
5:48 PM - AWS announces full recovery of regional services in us-east-1
6:01 PM - 6:35 PM - Queues redriven to recover delayed operations
7:00 PM - Secondary region infrastructure fully prepared for future failover needs
8:40 PM - Observability logs for the AIC and AIR features remain impacted due to OpenSearch Serverless recovery mitigations implemented by AWS
Oct 21, 2025
11:45 AM - Full recovery of OpenSearch Serverless, restoring AIC and AIR observability
Lessons/Improvements
Disaster recovery readiness: Our most recent disaster recovery exercise was conducted in July 2025, and the updated documentation proved valuable during the incident. The team was able to reference established runbooks and procedures even as we encountered unexpected challenges.
Cross-team coordination: Engineers across multiple teams collaborated effectively to simultaneously address recovery in the primary region while preparing failover capabilities in the secondary region.
Iterative improvement: The team identified and implemented fixes to disaster recovery automation in real-time, improving our processes even during the incident.
Customer communication: Our status page experienced authentication issues during the incident, limiting our ability to communicate with customers. Additionally, Kustomer’s own access to the platform was impacted by the service disruption. Our technical support and customer success teams kept customers updated by reaching out via email.
Earlier detection of infrastructure anomalies: The authentication service load balancer inconsistency was observed earlier in the incident but was not immediately prioritized for investigation amid numerous other AWS-related issues. We are implementing additional health checks and automated detection to ensure redundant, proactive alerting for this critical functionality (a sketch of one such check follows this list), and we are updating runbooks to reinforce the importance of investigating minor anomalies during major incidents.
Status page reliability: The ability to keep customers updated to the latest status about the Kustomer platform is critical to our operations and maintaining trust. We are exploring options to improve the reliability of this tool.
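As referenced in the earlier-detection lesson above, one form that automated detection could take is a CloudWatch alarm on healthy host count behind the authentication load balancer. The sketch below uses the standard boto3 `put_metric_alarm` call and the AWS/ApplicationELB `HealthyHostCount` metric; the alarm name, resource identifiers, thresholds, and SNS topic are placeholders, not our actual configuration.

```python
# Sketch: alarm when healthy capacity behind the authentication service
# load balancer drops below a floor. Assumes boto3 credentials are configured.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="auth-service-healthy-hosts-low",
    Namespace="AWS/ApplicationELB",
    MetricName="HealthyHostCount",
    Dimensions=[
        # Dimension values are the resource suffixes of the ALB / target group ARNs.
        {"Name": "LoadBalancer", "Value": "app/auth-alb/1234567890abcdef"},
        {"Name": "TargetGroup", "Value": "targetgroup/auth/abcdef1234567890"},
    ],
    Statistic="Minimum",
    Period=60,                 # evaluate every minute
    EvaluationPeriods=3,       # three consecutive breaches before alarming
    Threshold=2,               # fewer than two healthy hosts is an anomaly
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-pager"],
)
```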
Immediate Actions (Completed or In Progress):
- Corrected the authentication service load balancer configuration and restored customer authentication (completed during the incident)
- Fully prepared secondary region infrastructure for future failover needs (completed during the incident)
- Additional health checks and automated detection for the authentication service load balancer
- Runbook updates reinforcing the investigation of minor anomalies during major incidents
Strategic Initiatives:
- Improve the reliability of our status page and customer communication channels during platform-wide disruptions
- Strengthen disaster recovery capabilities and build more robust automation safeguards for failover operations
- Conduct regular testing of failover procedures to the secondary region
This incident reinforced our commitment to building a resilient platform. While we cannot prevent cloud provider service disruptions, we can minimize their impact through better disaster recovery capabilities, more robust automation safeguards, and regular testing of our failover procedures.
We recognize that our customers depend on Kustomer for critical business operations, and we take that responsibility seriously. The improvements outlined above represent concrete steps toward faster recovery times and reduced impact from future infrastructure disruptions.
We appreciate the patience of our customers during this incident and remain committed to continuous improvement of our platform's reliability and resilience.