On November 25, 2020, at approximately 9:00 AM EST, the Kustomer team began receiving multiple reports of searches not updating. We quickly traced the problem to issues with Amazon Kinesis and CloudWatch, stemming from a larger AWS outage.
After determining the immediate impact, we deployed patches so that data could continue to be processed, and restored some functionality by 9:40 AM EST. The platform began to recover at around 10:00 PM EST and fully recovered by November 26, 2020, at 12:30 PM EST.
Services impacted: search, reporting, data syncing, and scheduled conversation updates.
The root cause was identified as an outage of AWS's Kinesis and CloudWatch services.
See the AWS post-mortem: https://aws.amazon.com/message/11201
The incident started on November 25, 2020, at approximately 9:00 AM EST. After assessing the impact on the Kustomer platform, we kept the system operational in a degraded state by deploying several patches to substitute for the functionality lost while Kinesis was down. Additionally, because CloudWatch provides the triggers that allow the platform to scale up based on load, we manually added resources to handle any unexpected spike in traffic.
During the outage, we encountered an additional issue with one of the automated patches we had deployed, the one responsible for updating search indexes. At 5:06 PM EST, we attempted to import a large batch of data that overwhelmed our database, causing high replication lag and preventing database queries from completing and returning data. We identified the source of the problem, aborted the operation at 6:05 PM EST, and immediately restarted the search reindex process.
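For illustration only, a reindex along these lines can be broken into small batches and paced against replication lag so the database is not overwhelmed again. The batch size, lag threshold, and helper functions below are hypothetical placeholders, not our actual implementation:

```python
import time

# Hypothetical tunables; not our actual values.
BATCH_SIZE = 500           # documents per indexing request
MAX_REPLICATION_LAG = 5.0  # seconds of replica lag tolerated before pausing

def reindex_in_batches(fetch_batch, index_batch, get_replication_lag):
    """Reindex documents in small batches, pausing while replicas catch up.

    fetch_batch(offset, limit) -> list of documents (hypothetical helper)
    index_batch(docs)          -> writes docs to the search index (hypothetical helper)
    get_replication_lag()      -> current replica lag in seconds (hypothetical helper)
    """
    offset = 0
    while True:
        docs = fetch_batch(offset, BATCH_SIZE)
        if not docs:
            break  # nothing left to reindex
        index_batch(docs)
        offset += len(docs)
        # Back off while replicas are behind, instead of piling on more load.
        while get_replication_lag() > MAX_REPLICATION_LAG:
            time.sleep(1)
```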
At approximately 10:00 PM EST, we saw that Kinesis had started to recover. We kept the patches in place until AWS confirmed the outage resolved on November 26, 2020, at 4:00 AM EST. We spent the next several hours processing the backlogged data in our queues, and the system was fully operational by 12:30 PM EST.
To keep processing data, we deployed three automated scripts that ran on a schedule. These scripts were responsible for updating search and reporting, keeping data in sync, and triggering the scheduled events that keep conversations updating.
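As a rough sketch of what one such scheduled script can look like, the loop below polls for recently changed records and re-applies them to the search index. The polling interval and helper functions are hypothetical placeholders, not the actual patch code:

```python
import time
from datetime import datetime, timezone

POLL_INTERVAL_SECONDS = 60  # hypothetical schedule; Kinesis normally pushes these updates

def run_search_sync(fetch_changes_since, update_search_index):
    """Periodically pull recently changed records and re-apply them to search.

    fetch_changes_since(ts)      -> records modified after ts (hypothetical helper)
    update_search_index(records) -> writes the records to the search index (hypothetical helper)
    """
    last_run = datetime.now(timezone.utc)
    while True:
        now = datetime.now(timezone.utc)
        changed = fetch_changes_since(last_run)
        if changed:
            update_search_index(changed)
        last_run = now
        time.sleep(POLL_INTERVAL_SECONDS)
```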
We also proactively increased the instance counts of our services by hand to compensate for the loss of auto-scaling.
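For example, a manual capacity bump of this kind can be scripted against the container orchestrator. The sketch below assumes ECS managed through boto3, with placeholder cluster and service names and counts; it is not necessarily how our infrastructure is managed:

```python
import boto3

# Placeholder names and counts; not our actual cluster or services.
CLUSTER = "production"
SERVICES_TO_SCALE = {
    "search-indexer": 12,
    "event-processor": 20,
}

def scale_services_manually():
    """Pin fixed desired counts while CloudWatch-driven auto-scaling is unavailable."""
    ecs = boto3.client("ecs")
    for service, desired_count in SERVICES_TO_SCALE.items():
        ecs.update_service(
            cluster=CLUSTER,
            service=service,
            desiredCount=desired_count,
        )

if __name__ == "__main__":
    scale_services_manually()
```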