Platform Latency in Prod1
Incident Report for Kustomer
Postmortem

Summary

On June 13th, 2023 at 3:00 PM EDT, our engineering team was alerted to increased errors in Lambda functions hosted in AWS. AWS published a status update at 3:08 PM EDT regarding a system-wide issue that was causing these errors. These Lambda functions have various responsibilities; their primary user-facing purpose is to index search results and audit log events.

Once the issue was confirmed to be contained to the us-east-1 region, recovery measures were initiated in anticipation of a broader AWS outage. AWS resolved the issue on their end at 4:48 PM EDT. After the resolution, the Lambdas resumed processing data, and all new search data was populated by 5:44 PM EDT.

In addition, AWS reported increased error rates for Amazon Connect during this outage.

Root Cause

Search results were delayed and not populating in prod1 due to an AWS outage affecting Lambdas in us-east-1.

Timeline

June 13, 2023

3:00 PM EDT On-call engineers were paged about increased errors in an AWS Lambda function and began investigating. At this time, the AWS console was also inaccessible, causing a small delay in troubleshooting.

3:08 PM EDT AWS published its first update reporting increased error rates and latencies on its status page.

3:15 PM EDT On-call engineers determined that Lambdas in prod1 were degraded or not functioning.

3:19 PM EDT AWS published an update reporting that Lambda functions were experiencing elevated error rates.

3:21 PM EDT Pre-emptive efforts to transition to a different region began.

3:26 PM EDT AWS reported that they had identified the root cause of the increased errors in AWS Lambda functions and were working to resolve it.

3:55 PM EDT The Kustomer Statuspage update for the incident was published.

4:05 PM EDT AWS status page update regarding Amazon Connect errors: “We are experiencing degraded contact handling in the US-EAST-1 Region. Callers may fail to connect and chats may fail to initiate. Agents may experience issues logging in or being connected with end-customers.”

4:29 PM EDT Impact to Kustomer systems was confirmed to be contained to search results not updating. Efforts shifted to focus on populating search results with new data.

4:40 PM EDT AWS status page update regarding Amazon Connect errors: “We have identified the root cause of the degraded contact handling in the US-EAST-1 Region. Callers may fail to connect and chats and tasks may fail to initiate. Agents may also experience issues logging in or being connected with end-customers. Mitigation efforts are underway.”

4:48 PM EDT AWS reported that a fix had been implemented and services were recovering. Internal metrics showed search results populating.

5:00 PM EDT AWS status page update: "Many AWS services are now fully recovered and marked Resolved on this event. We are continuing to work to fully recover all services."

5:02 PM EDT AWS status page update regarding Amazon Connect errors: “Between 11:49 AM and 1:40 PM PDT, we experienced degraded contact handling in the US-EAST-1 Region. Callers may have failed to connect and chats and tasks may have failed to initiate. Agents may also have experienced issues logging in or being connected with end-customers. The issue has been resolved and the service is operating normally.”

5:29 PM EDT AWS status page update: "Lambda synchronous invocation APIs have recovered. We are still working on processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services (such as SQS and EventBridge). Lambda is working to process these messages during the next few hours and during this time, we expect to see continued delays in the execution of asynchronous invocations."

5:49 PM EDT AWS status page update: "We are working to accelerate the rate at which Lambda asynchronous invocations are processed, and now estimate that the queue will be fully processed over the next hour. We expect that all queued invocations will be executed."

Lessons/Improvements

  • Investigate additional methods and strategies to improve search indexing reliability.
  • Add new alarms to detect lower-than-usual traffic to our indexing services. Although our system alerted us to events building up in our event streams, additional alarms on unusually low event traffic to our indexers would have improved our response time; a minimal sketch of such an alarm follows this list.
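
As a rough illustration of the second improvement, a low-traffic alarm on an indexing Lambda could be expressed with CloudWatch's standard metric-alarm API. The sketch below uses boto3; the function name, alarm name, SNS topic, and threshold are hypothetical and would need to be tuned against our actual baseline invocation rates.

```python
# Minimal sketch of a "lower than usual traffic" alarm, assuming boto3 and
# CloudWatch. Function, alarm, and SNS topic names below are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="search-indexer-low-traffic",  # hypothetical alarm name
    AlarmDescription=(
        "Fires when the indexing Lambda receives fewer invocations than "
        "expected, which may indicate an upstream outage rather than a "
        "genuine lull in events."
    ),
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "search-indexer"}],  # hypothetical function
    Statistic="Sum",
    Period=300,                      # evaluate 5-minute buckets
    EvaluationPeriods=3,             # require three consecutive low buckets
    Threshold=100,                   # tune to normal baseline traffic
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",    # no data at all should also page
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],  # hypothetical SNS topic
)
```

Treating missing data as breaching means the alarm would also fire if the indexer stopped reporting metrics entirely, which matches the failure mode observed in this incident.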
Posted Jun 16, 2023 - 17:45 EDT

Resolved
Kustomer has resolved an event causing platform latency and searches not updating in the platform. After careful monitoring, our team has found that all affected areas are fully restored. Please reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Jun 13, 2023 - 17:56 EDT
Monitoring
Kustomer has received an update that addresses the event affecting searches not updating within the platform. Our team will continue to monitor this update to ensure the issue is fully resolved. Please expect further updates within 1 hour and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Jun 13, 2023 - 17:03 EDT
Identified
Kustomer is aware of an event that may cause searches to not update within the platform. Our team is currently working to implement a resolution. Please expect further updates within 1 hour and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Jun 13, 2023 - 16:27 EDT
Investigating
Kustomer has identified an event that may cause Platform Latency across prod1. Our team is currently working to design and implement a resolution. Please expect further updates within 30 minutes and reach out to support at support@kustomer.com if you have additional questions or concerns. This appears to be related to a performance degradation in AWS: https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 15:55 EDT
This incident affected: Prod1 (US) (Events / Audit Log, Search).