[AIR and AIC] AI Agent Observability Is Not Ingesting (PROD1)

Incident Report for Kustomer

Postmortem

Summary

On October 20, 2025, Kustomer experienced a significant service disruption affecting all customers on our platform. The incident was triggered by a major AWS service disruption in the us-east-1 region that impacted ~140 AWS services, including core infrastructure components our platform depends on.

The incident occurred in two distinct phases:

  • Phase 1 (3:00 AM - 6:30 AM EDT): Initial AWS service disruption causing widespread connectivity issues and elevated error rates across our platform. Services began recovering as AWS addressed the underlying DNS resolution issues.
  • Phase 2 (9:20 AM - 5:00 PM EDT): Secondary wave of issues, including a critical authentication problem that prevented users from logging into the platform from 11:47 AM to 4:30 PM EDT. This window overlapped with the AWS EC2 service disruption, which lasted from 2:48 AM to 4:50 PM EDT on October 20. The authentication issue was caused by our disaster recovery automation while we attempted to establish failover capabilities during the ongoing AWS EC2 disruption.

Full platform recovery was achieved by 5:00 PM EDT on October 20, with final cleanup operations completed by 7:00 PM EDT. A remaining issue with AIC and AIR observability infrastructure was fully resolved by 11:45 AM EDT on October 21.

Root Cause

Primary Cause: AWS Regional Service Disruption

The incident originated from a widespread AWS service disruption in the us-east-1 region. According to AWS, the root cause was a DNS resolution failure for internal service endpoints, specifically affecting DynamoDB regional endpoints. This DNS issue cascaded into AWS's EC2 internal launch system, creating widespread connectivity problems across the region.

Secondary Cause: Disaster Recovery Automation Issue

During our disaster recovery response, an infrastructure configuration issue emerged that prevented proper authentication for approximately 4.5 hours. When disaster recovery preparations were initiated at 11:25 AM EDT, our infrastructure automation discovered that primary and secondary region configurations shared the same management context. As the automation began provisioning resources in the secondary region, it simultaneously triggered scaling changes in the primary region's authentication service, reducing it to a minimal operational state.

Under normal circumstances, an automated deployment would have immediately corrected this configuration. However, this deployment failed due to the ongoing AWS service disruptions, leaving the authentication service's load balancer in an inconsistent state—pointing to a configuration with insufficient capacity while healthy instances remained unreachable.

This was an edge case that required a specific combination of factors: (1) initiating disaster recovery operations, (2) infrastructure management coupling between the two regions while us-east-1 was only partially operational (significantly degraded but not fully offline), and (3) simultaneous AWS service failures in us-east-1 preventing the standard recovery mechanism from completing. Had us-east-1 been completely offline, the automation would not have needed to account for ongoing operations in that region.
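To make the coupling concrete, below is a minimal sketch, not our actual automation, of the kind of pre-change guard that would have blocked this failure mode: before disaster recovery automation applies any scaling or routing change, confirm that the change is scoped to the intended region and that the primary region's authentication target group still has healthy capacity. The region names, target group ARN, and capacity floor are hypothetical.

```python
# Illustrative pre-change guard for disaster recovery automation (not our
# actual tooling). Region names, the target group ARN, and the capacity
# floor below are hypothetical.
import boto3

PRIMARY_REGION = "us-east-1"   # region that must never be scaled down as a side effect
AUTH_TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/auth/abc123"
)
MIN_HEALTHY_AUTH_TARGETS = 2   # illustrative capacity floor


def assert_change_scoped_to(intended_region: str, resource_arns: list[str]) -> None:
    """Abort if a DR change would touch resources outside the intended region."""
    for arn in resource_arns:
        arn_region = arn.split(":")[3]  # region is the fourth field of an ARN
        if arn_region != intended_region:
            raise RuntimeError(
                f"Change targets {arn} in {arn_region}, expected {intended_region}; aborting"
            )


def assert_primary_auth_capacity() -> None:
    """Abort if the primary region's auth target group lacks healthy capacity."""
    elbv2 = boto3.client("elbv2", region_name=PRIMARY_REGION)
    health = elbv2.describe_target_health(TargetGroupArn=AUTH_TARGET_GROUP_ARN)
    healthy = [
        t for t in health["TargetHealthDescriptions"]
        if t["TargetHealth"]["State"] == "healthy"
    ]
    if len(healthy) < MIN_HEALTHY_AUTH_TARGETS:
        raise RuntimeError(
            f"Only {len(healthy)} healthy auth targets in {PRIMARY_REGION}; aborting change"
        )
```

Failing closed on either check keeps a partially operational primary region from being scaled down as a side effect of provisioning the secondary region.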

Timeline

All times in Eastern Daylight Time

Initial Service Disruption Phase

Oct 20, 2025

3:11 AM - Multiple alerts triggered indicating elevated error rates across platform services. Kustomer’s Statuspage was accessible, but on-call responders could not authenticate successfully to provide updates.

3:18 AM - Incident response initiated; engineers begin investigation

4:25 AM - Confirmed correlation with AWS service disruptions in us-east-1 region

5:01 AM - AWS identifies root cause of DNS resolution failures

5:27 AM - 6:03 AM - AWS reports significant recovery; Kustomer platform functionality observed to be restored. Access to Kustomer’s Statuspage was restored at 5:37 AM.

6:30 AM - Initial recovery verified; platform communication channels tested successfully. Kustomer platform confirmed fully functional.

Secondary Service Disruption Phase

8:55 AM - New wave of elevated error rates detected across multiple services

9:42 AM - AWS announces mitigation in progress, including EC2 request throttling; additional compute capacity is significantly limited, affecting autoscaling operations.

10:11 AM - Severity escalated; separate incident channel created

10:40 AM - Decision made to initiate disaster recovery preparations in us-east-2 region as contingency

11:02 AM - Engineers begin provisioning infrastructure in secondary region

11:46 AM - Due to authentication issues, Kustomer Technical Support began responding to client inquiries directly via email to maintain correspondence.

11:47 AM - 3:48 PM - Multiple customer reports of authentication failures; disaster recovery efforts continue during this time. At 3:48 PM, the root cause of the authentication errors is identified: a load balancer configuration prematurely routing traffic to the secondary region

4:03 PM - AWS reports ongoing recovery across most services

4:15 PM - Load balancer configuration corrected

4:30 PM - Customer authentication restored; users able to access platform

5:00 PM - API traffic normalized; error rates return to baseline

5:48 PM - AWS announces full recovery of regional services in us-east-1

6:01 PM - 6:35 PM - Queues redriven to recover delayed operations

7:00 PM - Secondary region infrastructure fully prepared for future failover needs

Extended Recovery

8:40 PM - Observability logs for the AIC and AIR features remain impacted due to OpenSearch Serverless recovery mitigations implemented by AWS

Oct 21, 2025

11:45 AM - Full recovery of OpenSearch Serverless, restoring AIC and AIR observability

Lessons/Improvements

What Went Well

Disaster recovery readiness: Our most recent disaster recovery exercise was conducted in July 2025, and the updated documentation proved valuable during the incident. The team was able to reference established runbooks and procedures, even as we encountered unexpected challenges.

Cross-team coordination: Engineers across multiple teams collaborated effectively to simultaneously address recovery in the primary region while preparing failover capabilities in the secondary region.

Iterative improvement: The team identified and implemented fixes to disaster recovery automation in real-time, improving our processes even during the incident.

Customer communication: Our status page experienced authentication issues during the incident, and Kustomer’s own access to the platform was also impacted, limiting our ability to communicate through the usual channels. Our technical support and customer success teams were nevertheless able to keep customers updated by reaching out via email.

Areas for Improvement

Earlier detection of infrastructure anomalies: The authentication service load balancer inconsistency was observed earlier in the incident but was not immediately prioritized for investigation amid numerous other AWS-related issues. We are implementing additional health checks and automated detection to provide redundant, proactive alerting for this critical functionality, and we are updating runbooks to reinforce the importance of investigating minor anomalies during major incidents.
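As a rough illustration of the kind of automated detection we are considering (a sketch only; the target group ARN, metric namespace, and alarm threshold are hypothetical), a periodic job could publish the healthy-target count of the authentication service's load balancer as a custom metric, so alerting fires independently of the deployment pipeline:

```python
# Illustrative periodic check (not our production monitoring): publish the
# healthy-target count of the auth load balancer as a custom CloudWatch
# metric so an alarm can page on-call independently of the deployment
# pipeline. The ARN and namespace are hypothetical.
import boto3

REGION = "us-east-1"
AUTH_TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/auth/abc123"
)


def publish_auth_target_health() -> int:
    elbv2 = boto3.client("elbv2", region_name=REGION)
    cloudwatch = boto3.client("cloudwatch", region_name=REGION)

    health = elbv2.describe_target_health(TargetGroupArn=AUTH_TARGET_GROUP_ARN)
    healthy_count = sum(
        1 for t in health["TargetHealthDescriptions"]
        if t["TargetHealth"]["State"] == "healthy"
    )

    # A CloudWatch alarm on this metric (e.g. HealthyAuthTargets < 2 for
    # five minutes) provides alerting that is redundant with the load
    # balancer's own health checks.
    cloudwatch.put_metric_data(
        Namespace="Kustomer/AuthService",  # illustrative namespace
        MetricData=[{
            "MetricName": "HealthyAuthTargets",
            "Value": float(healthy_count),
            "Unit": "Count",
        }],
    )
    return healthy_count
```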

Status page reliability: The ability to keep customers updated on the latest status of the Kustomer platform is critical to our operations and to maintaining trust. We are exploring options to improve the reliability of this tool.

Planned Improvements

Immediate Actions (Completed or In Progress):

  • Exploring redundant customer communication channels beyond our primary status page
  • Separated infrastructure management for secondary region to enable independent deployment and faster disaster recovery
  • Adding safeguards to disaster recovery automation, including protection against race conditions, to prevent unintended traffic shifts and to validate target health before routing changes (a sketch follows this list)
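For illustration of the last item, here is a minimal sketch of a traffic-shift guard, assuming Route 53 weighted records as one common way to move traffic between regions; our actual routing mechanism may differ, and all ARNs, zone IDs, names, and thresholds below are hypothetical:

```python
# Illustrative traffic-shift guard (not our production automation): only
# shift DNS weight toward the secondary region after confirming its auth
# target group has healthy targets to receive the traffic.
import boto3

SECONDARY_REGION = "us-east-2"
SECONDARY_AUTH_TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/auth-dr/def456"
)
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"
RECORD_NAME = "auth.example.com."
SECONDARY_LB_DNS = "auth-dr-123456.us-east-2.elb.amazonaws.com"


def secondary_is_healthy(minimum_targets: int = 2) -> bool:
    """Check whether the secondary region's auth target group has enough healthy targets."""
    elbv2 = boto3.client("elbv2", region_name=SECONDARY_REGION)
    health = elbv2.describe_target_health(TargetGroupArn=SECONDARY_AUTH_TARGET_GROUP_ARN)
    healthy = sum(
        1 for t in health["TargetHealthDescriptions"]
        if t["TargetHealth"]["State"] == "healthy"
    )
    return healthy >= minimum_targets


def shift_weight_to_secondary(weight: int) -> None:
    """Shift DNS weight toward the secondary region, but only after confirming
    the secondary auth target group can actually serve the traffic."""
    if not secondary_is_healthy():
        raise RuntimeError("Secondary auth targets not healthy; refusing traffic shift")

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": "secondary",
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": SECONDARY_LB_DNS}],
                },
            }],
        },
    )
```

The key property is that the routing change fails closed: if the destination cannot demonstrably serve traffic, no weight is shifted.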

Strategic Initiatives:

  • Conducting more frequent disaster recovery exercises to validate our failover procedures under realistic conditions
  • Developing faster, more reliable automation scripts and orchestration tools for disaster recovery operations to reduce manual intervention and accelerate failover execution
  • Establishing disaster recovery readiness as a mandatory requirement in our engineering standards, ensuring all new features and services are designed with multi-region capabilities and tested regularly for failover scenarios

Commitment to Reliability

This incident reinforced our commitment to building a resilient platform. While we cannot prevent cloud provider service disruptions, we can minimize their impact through better disaster recovery capabilities, more robust automation safeguards, and regular testing of our failover procedures.

We recognize that our customers depend on Kustomer for critical business operations, and we take that responsibility seriously. The improvements outlined above represent concrete steps toward faster recovery times and reduced impact from future infrastructure disruptions.

We appreciate the patience of our customers during this incident and remain committed to continuous improvement of our platform's reliability and resilience.

Posted Oct 29, 2025 - 13:28 EDT

Resolved

Kustomer has resolved an event affecting AI Agent Observability (AIC and AIR) on Prod1 that may prevent new observability traces from ingesting into the platform. To resolve this issue, our team has worked with our cloud provider to address the delays.

After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Oct 21, 2025 - 14:33 EDT

Monitoring

Kustomer has worked with our cloud provider to address the delays on AIC and AIR metrics ingestion on Prod1, related to the ongoing AWS OpenSearch Serverless service disruption.

Our Engineering teams are continuing to monitor recovery progress, and we’ll continue to share further updates as new information becomes available.

If you have any questions or concerns, please contact Kustomer Support at support@kustomer.com.
Posted Oct 21, 2025 - 13:39 EDT

Update

Kustomer continues to work with our cloud provider to address the delays on AIC and AIR metrics ingestion on Prod1, related to the ongoing AWS OpenSearch Serverless service disruption.

Our Engineering teams are monitoring recovery progress. We are experiencing incremental recovery, and we should continue to see recovery as time passes. We’ll continue to share further updates as new information becomes available.

If you have any questions or concerns, please contact Kustomer Support at support@kustomer.com.
Posted Oct 21, 2025 - 11:42 EDT

Update

Kustomer continues to work with our cloud provider to address the delays on AIC and AIR metrics ingestion on Prod1, related to the ongoing AWS OpenSearch Serverless service disruption.

Our Engineering teams are monitoring recovery progress, and we’ll continue to share further updates as new information becomes available.

If you have any questions or concerns, please contact Kustomer Support at support@kustomer.com.
Posted Oct 21, 2025 - 06:52 EDT

Update

We’re continuing to work with our cloud provider to resolve the delays affecting AIC and AIR metrics ingestion on Prod1, related to the ongoing AWS OpenSearch Serverless service disruption.

Our Engineering teams are closely monitoring recovery progress and we will continue to share updates as more information becomes available.

If you have any questions or concerns, please contact Kustomer Support at support@kustomer.com.
Posted Oct 21, 2025 - 02:37 EDT

Update

Kustomer is aware of delays affecting AIC and AIR metrics ingestion on Prod1 due to an ongoing AWS OpenSearch Serverless service disruption. We’re working with our cloud provider toward recovery. We continue to monitor the situation and will share additional updates as more information becomes available. For any questions or concerns, please contact support@kustomer.com.
Posted Oct 21, 2025 - 01:02 EDT

Update

Kustomer is aware of delays affecting AIC and AIR metrics ingestion on Prod1 due to an ongoing AWS OpenSearch Serverless service disruption. We’re working with our cloud provider toward recovery, which is currently expected within the next two hours. We continue to monitor the situation and will share additional updates as more information becomes available. For any questions or concerns, please contact support@kustomer.com.
Posted Oct 20, 2025 - 23:08 EDT

Update

Kustomer has observed an issue affecting AI Agent Observability (AIC and AIR) on Prod1, resulting in new observability traces not being ingested into the platform. Existing traces remain accessible.

Our team continues to work toward a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com for any questions or concerns.
Posted Oct 20, 2025 - 22:44 EDT

Identified

Kustomer has identified an event in AI Agent Observability (AIC and AIR) on Prod1 that may prevent new observability traces from ingesting into the platform. Existing traces remain available.

Our team is working to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com for any questions or concerns.
Posted Oct 20, 2025 - 21:57 EDT
This incident affected: Prod1 (US) (Analytics, Channel - Chat).