On Tuesday, December 7th, 2021, the Kustomer application was unavailable on our US Prod 1 POD due to a major AWS incident. Inbound and outbound messages, workflows, business rules, Queues & Routing, and search were heavily impacted for all customers in this environment, rendering Kustomer mostly inoperable.
In terms of impact, from approximately 10:45 AM ET through 11:22 AM ET, customers experienced extreme delays sending and receiving messages and saw stale data across various parts of the platform, such as searches and timelines. From 11:22 AM ET through 5:33 PM ET the system degraded further, preventing all inbound and outbound messages from being delivered, timelines from rendering, and events from processing, and in some cases preventing customers from logging into our web app. After various attempts at restoration and a few partial recoveries, the system was brought back to a healthy state by 8:22 PM ET, and all dead-lettered items were re-driven by 11:17 PM ET. In total, the system was inoperable for about 6 hours.
The incident we experienced was extreme and resurfaced the need to focus our efforts on improving our cross-regional redundancy. We rely heavily on AWS infrastructure and have applied common best practices, running each production POD across multiple Availability Zones (a Multi-AZ architecture) within a region to deliver resilience and ensure continuous availability in the event of an Availability Zone failure. However, we did not have cross-regional redundancy in place to protect against the regional failure we saw on Tuesday, Dec 7th.
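To illustrate the distinction, here is a minimal boto3 sketch (the cluster and service names are hypothetical, not our production configuration) that checks whether a service's running tasks are actually spread across multiple Availability Zones. A setup like this tolerates the loss of a single zone, but every task still lives in one region, which is exactly the gap this incident exposed.

```python
"""Minimal sketch: confirm an ECS service is spread across multiple AZs.
Cluster and service names are hypothetical."""
from collections import Counter

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Look up the running tasks for one (hypothetical) service.
task_arns = ecs.list_tasks(
    cluster="prod1-cluster", serviceName="standard-objects-api"
)["taskArns"]
tasks = ecs.describe_tasks(cluster="prod1-cluster", tasks=task_arns)["tasks"]

# Count tasks per Availability Zone; a healthy Multi-AZ service spans 2+ zones.
per_az = Counter(task["availabilityZone"] for task in tasks)
print(per_az)
assert len(per_az) >= 2, "service is not spread across multiple Availability Zones"
```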
During this incident, multiple AWS APIs in the US-EAST-1 region were impacted by issues with several network devices, affecting multiple redundancy points within the same region. Because our system relies heavily on the affected APIs, this caused moderate and then significant latency that left the Kustomer application unresponsive for customers on our US Prod 1 POD.
We tried to deploy backend services but were unable to do so because our deployment tools depend on AWS ECR, and the ECR issues prevented us from registering and pulling images from the registry. We tried to start new containers to get back to a healthy state, but this was blocked by AWS SSM being down. We then attempted to start containers for the Standard Objects API and other services so they could scale up, but this was impacted by the AWS EC2 issues as well.
By midday we started exploring workarounds for the individual AWS service issues, first enabling SSM in US-EAST-2. Our team also began a manual deployment of the Standard Objects API service to the Prod 1 POD, but we had to reassess our technical options because the AWS APIs were not behaving as expected: newly created ECS instances were stuck in an unhealthy state.
After a few hours of hitting roadblocks while attempting to circumvent the impacted AWS services within US-EAST-1, the team began pursuing a new path: moving application services to an entirely new ECS cluster in the US-EAST-2 region within the Prod 1 POD.
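As a rough sketch of what that path involves (all names, images, and sizes below are hypothetical, and Fargate is used only to keep the example self-contained), moving a service to a new region means creating a cluster there, registering the task definition in that region, and creating the service against it; a real migration also needs networking, load balancing, and DNS in the target region.

```python
"""Rough sketch: stand up a service in a replacement ECS cluster in us-east-2.
All names, ARNs, and sizes are hypothetical."""
import boto3

ecs_east2 = boto3.client("ecs", region_name="us-east-2")

# 1. Create the replacement cluster in the healthy region.
ecs_east2.create_cluster(clusterName="prod1-failover")

# 2. Register the task definition in us-east-2 (task definitions are regional).
task_def = ecs_east2.register_task_definition(
    family="standard-objects-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # hypothetical
    containerDefinitions=[{
        "name": "standard-objects-api",
        # The image must come from a registry reachable outside the impaired
        # region, e.g. an ECR repository replicated to us-east-2.
        "image": "123456789012.dkr.ecr.us-east-2.amazonaws.com/standard-objects-api:stable",
        "essential": True,
        "portMappings": [{"containerPort": 8080}],
    }],
)

# 3. Create the service against the new cluster.
ecs_east2.create_service(
    cluster="prod1-failover",
    serviceName="standard-objects-api",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    desiredCount=6,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-east2a", "subnet-east2b"],  # hypothetical subnets
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```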
By 5:33 PM ET the team noticed that the EC2 APIs were beginning to recover and was able to start the Standard Objects API services again. Conversations began appearing in the web application and searches were returning results. We shifted our focus from the workarounds we had been building to aiding the recovery of the platform, scaling containers and database clusters to keep up with the inbound traffic.
It took a few hours for the system to process the backlog of queued events, which produced a significantly higher load than usual and resulted in additional latency during the recovery. The team increased capacity to handle this increased traffic. By 9:00 PM ET all events in our largest queue had finished processing and latency dropped back to normal thresholds.
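For context on what increasing capacity looks like at the ECS level, here is a minimal sketch (the cluster, service names, and counts are hypothetical, not our actual values) that raises the desired task count on the queue consumers and waits for the services to stabilize:

```python
"""Sketch: temporarily scale out queue consumers while a backlog drains.
Cluster, service names, and task counts are hypothetical."""
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Hypothetical consumer services and the temporary capacity to give them.
consumers = {"event-processor": 40, "workflow-worker": 24}

for service, desired in consumers.items():
    ecs.update_service(
        cluster="prod1-cluster",
        service=service,
        desiredCount=desired,  # well above steady-state while the backlog drains
    )

# Block until the additional tasks are running and the services are stable.
waiter = ecs.get_waiter("services_stable")
waiter.wait(cluster="prod1-cluster", services=list(consumers.keys()))
```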
Once the backlog of events had finished processing, we upgraded our API and Workflow Caching clusters to reduce latency further and began re-driving all dead letter queues. By 11:17 PM ET all events within the dead letter queues had been re-driven.
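For readers unfamiliar with re-driving, the sketch below shows the general pattern for an SQS dead letter queue (the queue URLs are hypothetical): each message is copied back onto its source queue and deleted from the dead letter queue only after the copy succeeds, so nothing is lost if the process is interrupted.

```python
"""Sketch: re-drive messages from an SQS dead letter queue to its source queue.
Queue URLs are hypothetical."""
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events-dlq"
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"

while True:
    resp = sqs.receive_message(
        QueueUrl=DLQ_URL,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=5,       # long-poll so an empty response means the DLQ is drained
    )
    messages = resp.get("Messages", [])
    if not messages:
        break  # dead letter queue is empty; re-drive complete

    for msg in messages:
        # Copy the message back onto the source queue first...
        sqs.send_message(QueueUrl=SOURCE_URL, MessageBody=msg["Body"])
        # ...then delete it from the DLQ so it is not re-driven twice.
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```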
Along the way there were many things that went well that we want to keep doing, and many things that we can and need to improve to ensure that our application remains unaffected by future regional incidents.
What Went Well
Since the regrettable event on Dec 7th, the team has taken the time to understand what went wrong and will fast-track efforts to greatly expand the fail-safes that were already in place. The team is committed to improving the resiliency of the platform.
We’ll begin by initiating a comprehensive review of our systems to identify the shortcomings in our current redundancies. While this is a multi-phase effort that we’ll continue to iterate on over time, we expect to start seeing the benefits over the next few months. The end goal of this effort is a platform that is significantly more resilient and that recovers far more quickly.
Here are a few things we’ll be doing in the short to long term to increase availability during incidents:
All times shown are in Eastern Standard Time.