On April 15, 2020, beginning at 6:16 PM ET and lasting until 7:17 PM ET, several components of the Kustomer platform were unavailable. This was caused by a sustained, unprecedented surge of traffic to the public-facing APIs that power Kustomer Chat, Knowledge Base, and login for the Kustomer Web Application. After identifying the source of the traffic, we implemented a rule in our edge firewall to block traffic from that specific source, resolving the issue.
During the incident window, Kustomer Chat, Knowledge Base sites, and login to the Kustomer Web Application were all unavailable.
The issue was caused by a software bug in a third-party app that integrates with a version of the Kustomer Chat SDK for Android devices. This bug triggered a loop that repeatedly made calls to our public-facing API gateway. Because this bug was present in a mobile Android app with a very large install base, the volume of traffic generated by the bug was several orders of magnitude higher than our normal peak traffic.
Our automated monitoring system alerted us to significant latency at 6:16 PM ET.
We took a two-pronged approach to managing the crisis. One group of engineers began examining logs and metrics to identify the source of the increased traffic, while a second group focused on expanding our API capacity to mitigate the impact of the traffic. Once the first group identified the traffic as coming from a specific version of the Kustomer Chat SDK embedded in a specific client’s Android app, we were able to add a rule to our firewall that excluded traffic from that source. The impacted components returned to normal operation within minutes of deploying the new firewall rule.
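As an illustration of the log-analysis step, the following is a minimal sketch of how requests might be grouped by client source to surface a single dominant sender. It is not the tooling we used during the incident. The record fields `sdk_version` and `app_id`, the function name `dominant_source`, and the 50% share threshold are all assumptions chosen for the example.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional, Tuple

# (sdk_version, app_id): both are hypothetical log fields used for illustration.
Source = Tuple[str, str]


def dominant_source(
    records: Iterable[Mapping[str, str]],
    share_threshold: float = 0.5,
) -> Optional[Tuple[Source, float]]:
    """Return (source, share) for the source sending the most requests,
    or None if no single source reaches share_threshold of all requests."""
    counts: Counter = Counter(
        (r.get("sdk_version", "unknown"), r.get("app_id", "unknown"))
        for r in records
    )
    total = sum(counts.values())
    if total == 0:
        return None  # no traffic to analyze
    source, hits = counts.most_common(1)[0]
    share = hits / total
    return (source, share) if share >= share_threshold else None


if __name__ == "__main__":
    # Synthetic records: one source accounts for 98 of 100 requests.
    sample = (
        [{"sdk_version": "android-x", "app_id": "app-a"}] * 98
        + [{"sdk_version": "web", "app_id": "app-b"}] * 2
    )
    print(dominant_source(sample))
```

A source identified this way gives the attributes that a targeted firewall rule would need to match, which avoids blocking unrelated traffic.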