Postmortem

Incident Report: API Outage - 11/25/19

Summary

At approximately 2:48pm EST on November 25, 2019, an instance of Kustomer’s cloud-managed caching service experienced a sudden hardware failure. While the application is configured to support a failed node and subsequent failover, the caching service did not terminate connections to the failed node as expected, causing pending calls to the caching service to queue in memory. This led to heightened error rates, latency, and timed-out requests across the system, beginning at 2:49pm and lasting until all services recovered by 3:07pm.

Impact

Between 2:48pm and 3:07pm EST, a number of requests to the Kustomer API gateway failed with an internal error, and average latency increased substantially. The endpoints experiencing elevated errors and latency are critical to the operation of the Kustomer application, which is why the degradation was visible both to users of the platform and to integrations relying on APIs and workflows.

For many end users, the Kustomer web app loaded to a blank screen or operations within the app were unresponsive. Integrations may also have experienced higher latency and/or elevated rates of internal errors. As a result, queue routing, workflows, and business rules were impacted during this time as well.

Root Causes

An instance in Kustomer’s caching service experienced a sudden hardware issue at 2:48pm EST. While we have code in place to detect such failures and gracefully connect to a backup instance, no signals were sent to the application to connect to the new, healthy instance. As a result, the application continued to attempt to communicate with the unhealthy server.
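The report does not name the caching technology or client library involved, but the general defense against this failure mode is for the client to treat a silently unresponsive node as failed rather than waiting on it. A minimal sketch, assuming a hypothetical list of cache endpoints, shows how a short connection timeout converts a hung socket into an explicit error that triggers failover to the next candidate:

```python
import socket

# Illustrative sketch only; endpoint names and the helper below are
# hypothetical, not Kustomer's actual code.
CONNECT_TIMEOUT = 0.5  # seconds; fail fast instead of hanging on a dead node


def pick_healthy_endpoint(endpoints):
    """Return the first (host, port) endpoint that accepts a TCP connection.

    A dead-but-not-closed node (as in this incident) never responds, so
    without a timeout the caller would block indefinitely. The short
    timeout surfaces an error quickly, and iteration falls through to
    the next (healthy) endpoint.
    """
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=CONNECT_TIMEOUT):
                return (host, port)
        except OSError:
            continue  # refused or timed out; try the next candidate
    raise ConnectionError("no healthy cache endpoint available")
```

In practice this probe logic usually lives inside the client library's reconnect path rather than in application code, but the principle is the same: never trust a connection to a node that cannot be re-established within a bounded time.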

After several minutes of queuing cache requests, the application was overwhelmed by the number of pending requests and crashed. The application stopped serving requests entirely, thus leading to the high number of errors observed. When the application restarted, it successfully connected to the new instance and resumed serving requests at the normal rate.
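The crash mechanism described above (pending cache calls accumulating in memory without bound) is commonly mitigated by capping the number of in-flight calls and rejecting the excess immediately. The sketch below is illustrative, with hypothetical names, and is not Kustomer's actual implementation:

```python
from collections import deque


class BoundedPending:
    """Cap the number of in-flight cache calls; reject the overflow.

    Rejected callers get an immediate error (and can fall back to the
    primary data store), which keeps memory bounded. The alternative,
    queuing without limit, is what allowed the process in this incident
    to be overwhelmed and crash, dropping every request at once.
    """

    def __init__(self, max_pending=1000):
        self.max_pending = max_pending
        self.pending = deque()

    def submit(self, call):
        if len(self.pending) >= self.max_pending:
            raise RuntimeError("cache overloaded; failing fast")
        self.pending.append(call)

    def complete(self):
        # Finish the oldest in-flight call, freeing a slot.
        return self.pending.popleft()
```

The design trade-off is that some requests fail early during the incident window, but the process itself stays alive and recovers the moment the dependency does, instead of needing a restart.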

Resolution

Kustomer’s application largely recovered without intervention thanks to the failover mechanism in place. Once the root cause was understood, Kustomer engineers restarted all services that communicate with this cache service to ensure the issue had cleared.

The Kustomer team has spent considerable time improving our caching reconnection logic in response to past events and has thoroughly tested it in non-production environments. In practice, though, we have found that there are differences between the manually triggered failover we tested against and the actual hardware failure we experienced. We are working with our hosting provider to understand these differences so that we can address them and test our application against them accordingly.

Action Items

  • [IN PROGRESS] Work with our service provider to understand why connections to the failed instance were kept alive even though failover had occurred. Ensure that the necessary adjustments are made to cover this specific contingency.
  • [IN PROGRESS] We are investigating alternative technologies that can offer easier failover with less impact on the production system in an effort to limit the role of the application in infrastructure failure.
  • [IN PROGRESS] We are exploring adding additional logic to our application that would improve failure detection, self-healing, and provide faster feedback in the event of failure.
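The third action item (failure detection, self-healing, and faster feedback) is often implemented as a circuit breaker. A minimal sketch, with illustrative thresholds and no claim to match Kustomer's eventual design: after repeated failures the breaker opens and fails fast, then periodically allows a trial call so the application heals itself once the dependency recovers.

```python
import time


class CircuitBreaker:
    """Open after repeated failures; retry after a cooldown to self-heal."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means closed (dependency assumed healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fast feedback: don't touch the unhealthy dependency.
                raise ConnectionError("circuit open; failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

Compared with the behavior seen in this incident, the key difference is that callers learn about the failure in milliseconds rather than queuing until the process dies, and recovery requires no restart.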
Posted Nov 26, 2019 - 16:57 EST

Resolved
Full functionality and performance have been restored to the Kustomer platform. Our team is conducting a full investigation of and report on this incident, which will be shared here once complete. If you continue to experience any issues, please contact us at support@kustomer.com.
Posted Nov 25, 2019 - 15:39 EST
Monitoring
Performance across the Kustomer platform and affected components has begun to stabilize. Agents and users of the platform should now be able to work as normal. However, we are continuing to monitor this incident closely and will share more information as it becomes available. Please let us know if you continue to experience any issues.
Posted Nov 25, 2019 - 15:17 EST
Investigating
The Kustomer platform is currently experiencing latency that affects API requests, inbound webhooks, workflows, outbound messaging, and events. Agents and users of the platform will notice error messages and issues with certain events occurring. We are actively working on this issue and will continue to share updates here. Please reach out to support@kustomer.com with additional questions.
Posted Nov 25, 2019 - 15:07 EST
This incident affected: Prod1 (US) (API, Events / Audit Log, Web/Email/Form Hooks, Workflow).