On June 13th, 2023 at 3:00 PM EDT, our engineering team was alerted to increased error rates in Lambda functions hosted on AWS. AWS published a status update at 3:08 PM EDT regarding a system-wide issue that was causing these errors. These Lambda functions have various responsibilities; their primary user-facing purpose is to index search results and audit log events.
Once the issue was confirmed to be contained to the us-east-1 region, recovery measures were initiated in anticipation of a broader AWS outage. AWS resolved the issue on their end at 4:48 PM EDT. After the resolution, Lambdas resumed processing data, and all new search data was populated by 5:44 PM EDT.
In addition, AWS reported increased error rates for Amazon Connect during this outage.
Root Cause
Search results were delayed and did not populate in prod1 due to an AWS outage affecting Lambda functions in us-east-1.
June 13, 2023
3:00 PM EDT On-call engineers were paged about increased errors in an AWS Lambda function and began investigating. At this time, the AWS console was also inaccessible, causing a small delay in troubleshooting.
3:08 PM EDT AWS published its first update reporting increased error rates and latencies on its status page.
3:15 PM EDT On-call engineers determined that Lambdas in prod1 were degraded or not functioning.
3:19 PM EDT AWS published an update reporting that Lambda functions were experiencing elevated error rates.
3:21 PM EDT Pre-emptive efforts to transition to a different region began.
3:26 PM EDT AWS reported that they had identified the root cause of the increased errors in AWS Lambda functions and were working to resolve it.
3:55 PM EDT Kustomer published a Statuspage update for the incident.
4:05 PM EDT AWS status page update regarding Amazon Connect errors: “We are experiencing degraded contact handling in the US-EAST-1 Region. Callers may fail to connect and chats may fail to initiate. Agents may experience issues logging in or being connected with end-customers.”
4:29 PM EDT Impact to Kustomer systems was confirmed to be contained to search results not updating. Efforts shifted to focus on populating search results with new data.
4:40 PM EDT AWS status page update regarding Amazon Connect errors: “We have identified the root cause of the degraded contact handling in the US-EAST-1 Region. Callers may fail to connect and chats and tasks may fail to initiate. Agents may also experience issues logging in or being connected with end-customers. Mitigation efforts are underway.”
4:48 PM EDT AWS reported that a fix had been implemented and services were recovering. Internal metrics showed search results populating.
5:00 PM EDT AWS status page update: "Many AWS services are now fully recovered and marked Resolved on this event. We are continuing to work to fully recover all services."
5:02 PM EDT AWS status page update regarding Amazon Connect errors: “Between 11:49 AM and 1:40 PM PDT, we experienced degraded contact handling in the US-EAST-1 Region. Callers may have failed to connect and chats and tasks may have failed to initiate. Agents may also have experienced issues logging in or being connected with end-customers. The issue has been resolved and the service is operating normally.”
5:29 PM EDT AWS status page update: "Lambda synchronous invocation APIs have recovered. We are still working on processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services (such as SQS and EventBridge). Lambda is working to process these messages during the next few hours and during this time, we expect to see continued delays in the execution of asynchronous invocations."
5:49 PM EDT AWS status page update: "We are working to accelerate the rate at which Lambda asynchronous invocations are processed, and now estimate that the queue will be fully processed over the next hour. We expect that all queued invocations will be executed."
Lessons/Improvements