Significant Slowness / Blank pages on Kustomer Platform
Incident Report for Kustomer
Postmortem

Postmortem: Kustomer Web Application Unresponsive on 10/20/19

Summary
Beginning at approximately 9:00am EDT on October 20, 2019, it was reported that the Kustomer application had become unresponsive. This persisted until 9:14am EDT of the same day. While the unresponsiveness was not automatically detected at the time, we have since identified the cause and have put additional measures in place to help ensure it does not occur again.

Impact
In the minutes leading up to the incident on October 20, 2019, and until it subsided, a significant number of requests to the Kustomer API failed with 5xx server errors. This affected all consumers of the Kustomer application APIs, including the Kustomer web application and external API consumers, but not the Kustomer chat or knowledge base products.

Root Causes
The incident was caused by a sudden high volume of requests to the production API gateway service, coupled with a period of low capacity at the start of the incident. The requests originated from an API token belonging to a specific organization that was posting to a specific webhook URL.

Detection
The issue was detected when an organization experiencing the symptoms described above reported it.

Resolution
The issue resolved itself once the API gateway service scaled up adequately to handle the traffic.

Action Items
The vast majority of errors were recorded in load balancer metrics, not by the application metrics where the bulk of our monitoring resided. We have extended our monitoring footprint to include load balancer metrics for the API gateway so that we can address future problems before they lead to performance degradation. [DONE]

We have added more API gateway containers so that the API gateway service will scale out faster to meet demand. [DONE]

Implement interfaces that will allow us to proactively prevent traffic like this from causing undue load spikes on our platform. [IN PROGRESS]
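
The report does not say what form these interfaces will take. As an illustration only, one common way to keep a single API token from overwhelming a gateway is a per-token token bucket: each token may burst up to a fixed number of requests and is then held to a steady refill rate. The sketch below is a minimal, in-memory version under that assumption. The class names, the rate and capacity values, and the injectable clock are all hypothetical and are not taken from Kustomer's implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """One bucket: bursts up to `capacity`, refilled at `rate` tokens per second."""
    rate: float
    capacity: float
    tokens: float
    updated_at: float

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        # Spend one token per request; reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerTokenRateLimiter:
    """Hypothetical limiter keyed by API token (in-memory, single process)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be tested
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_token: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(api_token)
        if bucket is None:
            # A new token starts with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[api_token] = bucket
        return bucket.allow(now)
```

In a design like this, a gateway would typically answer a rejected request with HTTP 429 (Too Many Requests), so a burst from one organization's token is shed early instead of consuming capacity shared by all API consumers. A production version would also need shared state across gateway containers, which this sketch leaves out.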

Posted Nov 13, 2019 - 13:43 EST

Resolved
On Sunday, October 20, 2019, beginning at approximately 9:00am EDT, several clients reported significant slowness across the Kustomer platform.

Symptoms included a blank page when attempting to use the Kustomer application. Attempts to refresh yielded no results and agents were unable to reply to conversations for a period of ~15 minutes.

Several clients reported that the blank page persisted from 9:00-9:15am EDT, and that slowness continued from 9:15-9:30am EDT.
Posted Oct 21, 2019 - 09:00 EDT