Postmortem: Kustomer Web Application Unresponsive on 10/20/19
Beginning at approximately 9:00am EDT on October 20, 2019, the Kustomer application was reported to be unresponsive. The unresponsiveness persisted until 9:14am EDT the same day. Although it was not automatically detected at the time, we have since identified the cause and put additional measures in place to help ensure it does not occur again.
In the minutes leading up to the incident on October 20, 2019, and until it subsided, a significant number of requests to the Kustomer API failed with 5xx internal errors. This impacted all consumers of the Kustomer application APIs, including the Kustomer web application and external API consumers, but not the Kustomer chat or knowledge base products.
The incident was caused by a sudden high volume of requests to the production API gateway service, coupled with a period of low capacity at the start of the incident. The origin of these requests was an API token belonging to a specific organization posting to a specific webhook URL.
The issue was detected when it was reported by an organization experiencing the symptoms described above.
The issue resolved itself once the API gateway service scaled up adequately to handle the traffic.
The vast majority of errors were recorded in load balancer metrics, not by the application metrics where the bulk of our monitoring resided. We have extended our monitoring footprint to include load balancer metrics for the API gateway so that we can address future problems before they lead to performance degradation. [DONE]
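As an illustration of the kind of check this monitoring enables, the following is a minimal sketch of an alert condition over load balancer request counts. It is not our production configuration: the `LoadBalancerSample` type, the 5% threshold, and the minimum request count are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LoadBalancerSample:
    """Hypothetical per-interval counts as reported by a load balancer."""
    total_requests: int
    errors_5xx: int


def error_rate(samples: Sequence[LoadBalancerSample]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    total = sum(s.total_requests for s in samples)
    errors = sum(s.errors_5xx for s in samples)
    return errors / total if total else 0.0


def should_alert(
    samples: Sequence[LoadBalancerSample],
    threshold: float = 0.05,   # assumed value, not a real setting
    min_requests: int = 100,   # assumed value; avoids alerting on tiny samples
) -> bool:
    """Alert when the 5xx rate over the window meets or exceeds the threshold."""
    total = sum(s.total_requests for s in samples)
    if total < min_requests:
        return False
    return error_rate(samples) >= threshold
```

Evaluating the condition on load balancer counts rather than application counts matters here because requests rejected before reaching the application never appear in application metrics.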
We have added more API gateway containers so that the API gateway service will scale out faster to meet demand. [DONE]
We are implementing interfaces that will allow us to proactively prevent traffic like this from causing undue load spikes on our platform. [IN PROGRESS]
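One common technique for this kind of protection is a per-token rate limit. The sketch below shows a token bucket keyed by API token. It is purely illustrative and does not describe the interfaces being built: the class names, the rate, and the capacity are assumptions.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` requests per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerTokenLimiter:
    """Keeps one bucket per API token so one caller cannot starve the others."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_token: str, now: Optional[float] = None) -> bool:
        bucket = self.buckets.get(api_token)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[api_token] = bucket
        return bucket.allow(now)
```

With a limiter like this in front of the gateway, a request for which `allow` returns `False` could be rejected with HTTP 429 (Too Many Requests) instead of consuming gateway capacity. The `now` parameter exists only to make the behavior deterministic in tests.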