On Tuesday, December 7th, 2021, the Kustomer application was unavailable on our US Prod 1 POD due to a major AWS incident. Inbound and outbound messages, workflows, business rules, Queues & Routing, and search were heavily impacted for all customers in this environment, rendering Kustomer mostly inoperable.
In terms of impact, from approximately 10:45 AM ET through 11:22 AM ET, customers experienced extreme delays sending and receiving messages and saw stale data across various parts of the platform, such as searches and timelines. From 11:22 AM ET through 5:33 PM ET the system degraded further, preventing all inbound and outbound messages from being delivered, timelines from rendering, and events from processing, and in some cases preventing customers from logging into our web app. After various attempts at restoration and a few partial recoveries, the system was brought back to a healthy state by 8:22 PM ET, and all dead-lettered items were re-driven by 11:17 PM ET. In total, the system was inoperable for about 6 hours.
The incident we experienced was extreme and resurfaced the need to focus our efforts on improving our cross-regional redundancy. We rely heavily on AWS infrastructure and have applied common best practices, running each production POD across multiple Availability Zones (a Multi-AZ architecture) within a region to deliver resilience and ensure continuous availability in the event of an Availability Zone failure. However, we did not have cross-regional redundancy in place to protect against the regional failure we saw on Tuesday, Dec 7th.
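To illustrate the distinction, here is a minimal boto3 sketch (the cluster and service names are hypothetical, not our production configuration) that checks whether a service's running tasks are actually spread across multiple Availability Zones. A setup like this tolerates the loss of a single zone, but every task still lives in one region, which is exactly the gap this incident exposed.

```python
"""Minimal sketch: confirm an ECS service is spread across multiple AZs.
Cluster and service names are hypothetical."""
from collections import Counter

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Look up the running tasks for one (hypothetical) service.
task_arns = ecs.list_tasks(
    cluster="prod1-cluster", serviceName="standard-objects-api"
)["taskArns"]
tasks = ecs.describe_tasks(cluster="prod1-cluster", tasks=task_arns)["tasks"]

# Count tasks per Availability Zone; a healthy Multi-AZ service spans 2+ zones.
per_az = Counter(task["availabilityZone"] for task in tasks)
print(per_az)
assert len(per_az) >= 2, "service is not spread across multiple Availability Zones"
```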
During this incident, multiple AWS APIs in the US-EAST-1 region were impacted by issues with several network devices, affecting multiple redundancy points within the same region. Because our system relies heavily on the affected APIs, this caused moderate and then significant latency that left the Kustomer application unresponsive for customers on our US Prod 1 POD.
We tried to deploy backend services but were unable to do so because our deployment tools depend on AWS ECR, and the ECR issues prevented us from registering and pulling images from the registry. We tried to start new containers to get back to a healthy state, but this was blocked by AWS SSM being down. We then attempted to start containers for the Standard Objects API and other services so they could scale up, but this was impacted by the AWS EC2 issues as well.
By midday we started exploring workarounds for the individual AWS service issues, first enabling SSM in US-EAST-2. Our team also began a manual deployment of the Standard Objects API service to the Prod 1 POD, but we had to reassess our technical options because the AWS APIs were not behaving as expected: newly created ECS instances were stuck in an unhealthy state.
After a few hours of hitting roadblocks while attempting to circumvent the impacted AWS services within US-EAST-1, the team began pursuing a new path: moving application services to an entirely new ECS cluster in the US-EAST-2 region within the Prod 1 POD.
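As a rough sketch of what that path involves (all names, images, and sizes below are hypothetical, and Fargate is used only to keep the example self-contained), moving a service to a new region means creating a cluster there, registering the task definition in that region, and creating the service against it; a real migration also needs networking, load balancing, and DNS in the target region.

```python
"""Rough sketch: stand up a service in a replacement ECS cluster in us-east-2.
All names, ARNs, and sizes are hypothetical."""
import boto3

ecs_east2 = boto3.client("ecs", region_name="us-east-2")

# 1. Create the replacement cluster in the healthy region.
ecs_east2.create_cluster(clusterName="prod1-failover")

# 2. Register the task definition in us-east-2 (task definitions are regional).
task_def = ecs_east2.register_task_definition(
    family="standard-objects-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # hypothetical
    containerDefinitions=[{
        "name": "standard-objects-api",
        # The image must come from a registry reachable outside the impaired
        # region, e.g. an ECR repository replicated to us-east-2.
        "image": "123456789012.dkr.ecr.us-east-2.amazonaws.com/standard-objects-api:stable",
        "essential": True,
        "portMappings": [{"containerPort": 8080}],
    }],
)

# 3. Create the service against the new cluster.
ecs_east2.create_service(
    cluster="prod1-failover",
    serviceName="standard-objects-api",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    desiredCount=6,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-east2a", "subnet-east2b"],  # hypothetical subnets
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```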
By 5:33 PM ET the team noticed that the EC2 APIs were beginning to recover and was able to start the Standard Objects API services again. Conversations began appearing in the web application and searches were returning results. We shifted our focus from the workarounds we had been building to aiding the recovery of the platform, scaling containers and database clusters to keep up with the inbound traffic.
It took a few hours for the system to process the backlog of queued events, which produced a significantly higher load than usual and resulted in additional latency during the recovery. The team increased capacity to handle this increased traffic. By 9:00 PM ET all events in our largest queue had finished processing and latency dropped back to normal thresholds.
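For context on what increasing capacity looks like at the ECS level, here is a minimal sketch (the cluster, service names, and counts are hypothetical, not our actual values) that raises the desired task count on the queue consumers and waits for the services to stabilize:

```python
"""Sketch: temporarily scale out queue consumers while a backlog drains.
Cluster, service names, and task counts are hypothetical."""
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Hypothetical consumer services and the temporary capacity to give them.
consumers = {"event-processor": 40, "workflow-worker": 24}

for service, desired in consumers.items():
    ecs.update_service(
        cluster="prod1-cluster",
        service=service,
        desiredCount=desired,  # well above steady-state while the backlog drains
    )

# Block until the additional tasks are running and the services are stable.
waiter = ecs.get_waiter("services_stable")
waiter.wait(cluster="prod1-cluster", services=list(consumers.keys()))
```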
Once the backlog of events had finished processing, we upgraded our API and Workflow Caching clusters to reduce latency further and began re-driving all dead letter queues. By 11:17 PM ET all events within the dead letter queues had been re-driven.
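For readers unfamiliar with re-driving, the sketch below shows the general pattern for an SQS dead letter queue (the queue URLs are hypothetical): each message is copied back onto its source queue and deleted from the dead letter queue only after the copy succeeds, so nothing is lost if the process is interrupted.

```python
"""Sketch: re-drive messages from an SQS dead letter queue to its source queue.
Queue URLs are hypothetical."""
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events-dlq"
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"

while True:
    resp = sqs.receive_message(
        QueueUrl=DLQ_URL,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=5,       # long-poll so an empty response means the DLQ is drained
    )
    messages = resp.get("Messages", [])
    if not messages:
        break  # dead letter queue is empty; re-drive complete

    for msg in messages:
        # Copy the message back onto the source queue first...
        sqs.send_message(QueueUrl=SOURCE_URL, MessageBody=msg["Body"])
        # ...then delete it from the DLQ so it is not re-driven twice.
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```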
Along the way there were many things that went well that we want to keep doing, and many things that we can and need to improve to ensure that our application remains unaffected by future regional incidents.
What Went Well
Since the regrettable event on Dec 7th, the team has taken the time to understand what went wrong and will fast-track efforts to greatly expand the fail-safes that were already in place. The team is committed to improving the resiliency of the platform.
We’ll begin by initiating a comprehensive review of our systems to identify the shortcomings in our current redundancies. While this is a multi-phase effort that we’ll continue to iterate on over time, we expect to start seeing the benefits over the next few months. The end goal of this effort is a platform that is significantly more resilient and that recovers far more quickly.
Here are a few things we’ll be doing in the short to long term to increase availability during incidents:
All times shown are in Eastern Standard Time.