Incident Report: API Outage - 11/25/19
At approximately 2:48pm EST on November 25, 2019, an instance of Kustomer’s cloud-managed caching service experienced a sudden hardware failure. While the application is configured to tolerate a failed node and fail over to a healthy one, the caching service did not terminate connections to the failed node as expected, causing pending calls to the caching service to queue in memory. This led to heightened error rates, latency, and timed-out requests across the system beginning at 2:49pm, until all services recovered by 3:07pm.
Between 2:48pm and 3:07pm EST, a number of requests to the Kustomer API gateway failed with an internal error, and average latency increased substantially. The affected endpoints are critical to the operation of the Kustomer application, which is why the impact was visible to users of the platform as well as to integrations relying on our APIs and workflows.
For many end users, the result was that the Kustomer web app loaded to a blank screen or operations within the app were unresponsive. Integrations may also have experienced higher latency and/or internal error rates. Queue and routing execution, workflows, and business rules were impacted during this window as well.
An instance in Kustomer’s caching service experienced a sudden hardware issue at 2:48pm EST. While we have code in place to detect such failures and gracefully connect to a backup instance, no signals were sent to the application to connect to the new, healthy instance. As a result, the application continued to attempt to communicate with the unhealthy server.
After several minutes of queuing cache requests, the application was overwhelmed by the number of pending requests and crashed. The application stopped serving requests entirely, thus leading to the high number of errors observed. When the application restarted, it successfully connected to the new instance and resumed serving requests at the normal rate.
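For illustration, below is a minimal sketch of the kind of client-side guard that limits this failure mode, assuming a Redis-style cache accessed through the ioredis client; the hostname and exact settings shown are hypothetical, not our production configuration. The idea is that commands sent to a hung node time out and are rejected quickly instead of queuing in memory indefinitely.

```typescript
// Sketch only: assumes a Redis-compatible cache and the ioredis client.
import Redis from "ioredis";

const cache = new Redis({
  host: "cache.example.internal", // hypothetical endpoint
  connectTimeout: 2000,           // give up quickly on an unresponsive TCP connect
  commandTimeout: 500,            // reject commands a hung node never answers
  maxRetriesPerRequest: 1,        // don't retry endlessly against a dead node
  enableOfflineQueue: false,      // fail fast while disconnected instead of buffering in memory
  retryStrategy: (attempts) => Math.min(attempts * 200, 2000), // back off between reconnect attempts
});

cache.on("error", (err) => {
  // Surface cache errors so callers can degrade gracefully instead of blocking.
  console.error("cache error", err);
});
```

With per-command timeouts set and the offline queue disabled, a silent node failure surfaces as fast, bounded errors that callers can handle, rather than an ever-growing backlog of pending requests.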
Kustomer’s application largely recovered without intervention thanks to the failover mechanism in place. Once the root cause was understood, Kustomer engineers restarted all services that communicate with this cache service to ensure the issue had cleared.
The Kustomer team has spent considerable time improving our caching reconnection logic in response to past events and has thoroughly tested it in non-production environments. In practice, though, we have found that the manually triggered failover we tested against behaves differently from the actual hardware failure we experienced. We are working with our hosting provider to understand these differences so that we can address them and test our application against them accordingly.
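As a rough illustration of that difference, the sketch below (ports and behavior are hypothetical test fixtures, not our hosting provider's mechanics) contrasts a "hung" node, which accepts connections but never responds, with a node undergoing a clean failover, which actively closes them. Client code exercised only against the latter may never hit its timeout path.

```typescript
// Sketch only: two fake TCP servers for exercising cache-client failure handling in tests.
import * as net from "net";

// Simulates a hardware-failed node: the TCP connection stays open, but no bytes ever flow,
// so a client without command timeouts will wait (and queue) indefinitely.
const hungNode = net.createServer((socket) => {
  socket.pause(); // accept the connection, then go silent
});
hungNode.listen(6380, () => {
  console.log("hung node listening on 6380");
});

// Simulates a manually triggered failover: the server closes connections immediately,
// giving clients a prompt 'close'/'error' event to react to.
const failingOverNode = net.createServer((socket) => {
  socket.destroy();
});
failingOverNode.listen(6381, () => {
  console.log("failing-over node listening on 6381");
});
```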