Kustomer Platform Delays
Incident Report for Kustomer
Postmortem

Kustomer Chat, KB, and Agent Login Unavailable

Summary

On April 15, 2020, from 6:16 PM ET until 7:17 PM ET, several components of the Kustomer platform were unavailable. This was caused by a sustained, unprecedented surge of traffic to the public-facing APIs that power Kustomer Chat, Knowledge Base, and login for the Kustomer Web Application. After identifying the source of the traffic, we implemented a rule in our edge firewall to block traffic from that specific source, which resolved the issue.

Impact

During the incident window, Kustomer Chat, Knowledge Base sites, and login to the Kustomer Web Application were all unavailable.

Root Cause

The issue was caused by a software bug in a third-party app that integrates with a version of the Kustomer Chat SDK for Android devices. This bug triggered a loop that repeatedly made calls to our public-facing API gateway. Because the bug was present in an Android app with a very large install base, the volume of traffic it generated was several orders of magnitude higher than our normal peak traffic.
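
This report does not describe the bug itself. Purely as an illustration of how a client can keep a failing call from turning into an unbounded loop, the sketch below caps the number of attempts and spaces them out with exponential backoff and jitter. The function names, the `request` callable, and the default values are assumptions made for this example; none of them come from the Kustomer Chat SDK.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Return a randomized wait in seconds for the given retry attempt.

    The upper bound doubles with each attempt and never exceeds `cap`.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(request, max_attempts=5, sleep=time.sleep):
    """Call `request()` until it succeeds or the attempt budget is spent.

    `request` is a hypothetical zero-argument callable that raises on failure.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    raise RuntimeError("retry budget exhausted") from last_error
```

With a fixed attempt budget, the worst case per device is `max_attempts` calls per operation, so the total load scales with the install base rather than with how long the failure lasts.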

Trigger

Our automated monitoring system alerted us to significant latency at 6:16 PM ET.

Resolution

We took a two-pronged approach to managing the crisis. One group of engineers began examining logs and metrics to identify the source of the increased traffic, while a second group focused on expanding our API capacity to mitigate the impact of the traffic. Once the first group identified the traffic as coming from a specific version of the Kustomer Chat SDK embedded in a specific client’s Android app, we were able to add a rule to our firewall that excluded traffic from that source. The impacted components returned to normal operation within minutes of deploying the new firewall rule.
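
The log analysis described above amounts to grouping requests by an identifying attribute and looking for a source that dominates. A minimal sketch of that idea follows, assuming each request log record is a dictionary with a `user_agent` field. The record layout, field name, and sample values are hypothetical and are not taken from Kustomer's logging.

```python
from collections import Counter


def top_sources(records, key="user_agent", n=3):
    """Return the n most frequent values of `key` as (value, count, share)."""
    counts = Counter(r.get(key, "unknown") for r in records)
    total = sum(counts.values())
    return [(source, count, count / total)
            for source, count in counts.most_common(n)]


# Made-up sample data: one client accounts for most of the requests.
sample = (
    [{"user_agent": "example-android-app/1.0"}] * 8
    + [{"user_agent": "example-web-client/2.3"}] * 2
)
for source, count, share in top_sources(sample):
    print(f"{source}: {count} requests ({share:.0%})")
```

A single value holding a large share of requests is a candidate for a targeted firewall rule, which leaves other traffic unaffected.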

Lessons Learned & Action Items

  • [IN PROGRESS] Isolate Kustomer Chat, Knowledge Base, and agent login APIs to limit the impact of incidents such as these.
  • [IN PROGRESS] Extend our traffic monitoring so that we can identify sources of bad traffic more quickly.
  • [IN PROGRESS] Integrate our firewall with monitoring so that blocking rules can be added dynamically, before bad traffic has an adverse effect on the system as a whole.
  • [IN PROGRESS] Improve the embedded Kustomer Chat SDK so that it is more resilient in the face of unintentional misuse of its functionality.
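
One way to picture the monitoring and dynamic-blocking items above is a per-source sliding-window counter that flags a source once it exceeds a request budget. The sketch below is a simplified illustration under assumed names and thresholds (`SourceLimiter`, `limit`, `window_seconds`). It is not a description of Kustomer's firewall or monitoring stack.

```python
from collections import defaultdict, deque


class SourceLimiter:
    """Track request timestamps per source and block sources over budget."""

    def __init__(self, limit, window_seconds):
        self.limit = limit            # max requests allowed per window
        self.window = window_seconds  # window length in seconds
        self.hits = defaultdict(deque)
        self.blocked = set()

    def allow(self, source, now):
        """Record a request at time `now`; return False if it is rejected."""
        if source in self.blocked:
            return False
        timestamps = self.hits[source]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        timestamps.append(now)
        if len(timestamps) > self.limit:
            # Over budget: block this source only.
            self.blocked.add(source)
            return False
        return True
```

In this sketch a block is permanent once set. A real deployment would likely expire blocks or route them to a person for review, and would choose thresholds from observed peak traffic.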
Posted Apr 16, 2020 - 15:39 EDT

Resolved
Earlier issues with chat availability, loading issues with knowledge base help sites, and login failures have all been resolved. All systems are performing consistently and as expected. Please reach out to support@kustomer.com with any questions. If you continue to see issues, visit https://help.kustomer.com, select Contact Support, and submit a support request with the details of your issue.

Thank you for your patience.
Posted Apr 15, 2020 - 19:55 EDT
Monitoring
A fix has been deployed and affected components are now working as expected. We are continuing to monitor the situation and will address any lingering issues that may come up.

Please reach out to Support via email at support@kustomer.com if you continue to experience issues.
Posted Apr 15, 2020 - 19:17 EDT
Identified
We've identified the issue causing login attempts to fail, delays in chat, and loading issues on the knowledge base, and we are deploying a solution. Users will be unable to log in for the moment.
Posted Apr 15, 2020 - 19:05 EDT
Investigating
We are investigating an issue with delays on chat, knowledge base pages, and sign-in attempts.
Posted Apr 15, 2020 - 18:51 EDT
This incident affected: Prod1 (US) (API, Channel - Chat).