On January 23, 2020, beginning at 4:19 PM ET and lasting until 5:19 PM ET, the Kustomer API Gateway that serves as the entry point for Kustomer’s Prod1 (US) environment experienced severe latency and an elevated error rate. This incident was triggered by a code change that impacted the efficiency of the API Gateway service. As a result, the performance of the Kustomer platform, chat, knowledge base, and web application was severely diminished.
During the impact window, average API latency increased to 7 seconds per request. This impacted all consumers of the Kustomer application APIs, including the Kustomer web application, webhooks, and inbound messages. This was resolved by 4:42 PM ET. However, chat and knowledge base continued to experience higher-than-normal latency until 5:19 PM ET.
The incident was the result of a deploy containing both updated application code and altered memory settings for the underlying application runtime. Through testing in an isolated environment, Kustomer engineers traced the issue to the changes made to the application’s memory settings. Although these memory settings are in use elsewhere on the platform, the volume and nature of work performed by the API gateway differ from those of other services. The result was degraded, rather than improved, application performance.
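To illustrate why a settings change like this is worth verifying at startup: the report does not name the runtime or the specific memory settings involved, but assuming a Node.js-style runtime (an assumption on our part, not stated in the report), the effective heap limit can be read at boot and checked against the intended configuration. A minimal sketch:

```typescript
// Minimal sketch, assuming a Node.js-style runtime; the report does not
// name the runtime or the specific memory settings that changed.
import { getHeapStatistics } from "v8";

// Read the heap limit the process actually started with, so a deploy that
// alters memory settings can be checked against the intended value.
const heapLimitMiB = getHeapStatistics().heap_size_limit / (1024 * 1024);
console.log(`Effective heap limit: ${heapLimitMiB.toFixed(0)} MiB`);

// The same heap configuration can behave very differently across services:
// a gateway handling a high volume of short-lived requests allocates in a
// different pattern than a service doing fewer, heavier computations, which
// is consistent with the workload difference described above.
```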
The updated runtime settings mentioned above were deployed to the API gateway service at 4:15 PM ET.
As with all sensitive deployments, our engineering team was proactively monitoring application performance during the release. By 4:19 PM ET, the team began to observe increasing latencies in the application. At 4:23 PM ET, the team issued a rollback to reverse the changes. All signs of elevated latency in the primary API Gateway were gone by 4:41 PM ET. We then followed the same steps for the separate API Gateway cluster that powers chat and knowledge base, resulting in full service restoration at 5:19 PM ET.
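The kind of release monitoring that caught this within minutes can be sketched as a simple latency watchdog. The endpoint URL, thresholds, and sampling interval below are hypothetical placeholders; this is a minimal illustration of the pattern, not Kustomer’s actual tooling:

```typescript
// Minimal release-monitoring sketch, not Kustomer's actual tooling.
// The endpoint URL and thresholds are hypothetical placeholders.
const HEALTH_URL = "https://api.example.com/health";
const LATENCY_THRESHOLD_MS = 1000; // flag when average latency exceeds this
const WINDOW_SIZE = 10;            // rolling window of recent samples

const samples: number[] = [];

async function probe(): Promise<void> {
  const start = Date.now();
  try {
    await fetch(HEALTH_URL);
    samples.push(Date.now() - start);
  } catch {
    samples.push(LATENCY_THRESHOLD_MS * 2); // count failures as slow samples
  }
  if (samples.length > WINDOW_SIZE) samples.shift();

  const avg = samples.reduce((a, b) => a + b, 0) / samples.length;
  if (samples.length === WINDOW_SIZE && avg > LATENCY_THRESHOLD_MS) {
    // In a real pipeline this would page an engineer or trigger an
    // automated rollback of the most recent deploy.
    console.error(`Average latency ${avg.toFixed(0)} ms exceeds threshold`);
  }
}

setInterval(probe, 5_000); // sample every 5 seconds during the release
```

Averaging over a rolling window, rather than alerting on single requests, keeps one slow outlier from triggering an unnecessary rollback.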
Additionally, on the day following the incident we ran numerous tests to identify the root cause. We confirmed the updated memory settings as the culprit by running four different tests in an isolated container in production. These tests revealed that regardless of which version of the application code was running, only the configurations with the updated memory settings exhibited the issue.
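Those four tests amount to a 2x2 matrix, pairing each application code version with each set of memory settings. A sketch of such a test driver follows; the container endpoints and request count are hypothetical placeholders:

```typescript
// Sketch of a 2x2 test matrix: each combination of application code version
// and memory settings runs in its own isolated container, and average
// request latency is compared across them. Endpoints are hypothetical.
const matrix = [
  { name: "old code / old memory settings", url: "http://test-a:8080/ping" },
  { name: "old code / new memory settings", url: "http://test-b:8080/ping" },
  { name: "new code / old memory settings", url: "http://test-c:8080/ping" },
  { name: "new code / new memory settings", url: "http://test-d:8080/ping" },
];

// Issue a fixed number of sequential requests and return average latency.
async function measure(url: string, requests = 500): Promise<number> {
  let totalMs = 0;
  for (let i = 0; i < requests; i++) {
    const start = Date.now();
    await fetch(url);
    totalMs += Date.now() - start;
  }
  return totalMs / requests;
}

async function main(): Promise<void> {
  for (const test of matrix) {
    const avg = await measure(test.url);
    console.log(`${test.name}: ${avg.toFixed(1)} ms average latency`);
  }
}

main().catch(console.error);
```

Under this setup, elevated latency on only the two configurations carrying the updated memory settings, independent of code version, is exactly the signature described above.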