Users unable to load Kustomer application

Incident Report for Kustomer

Postmortem

Post Mortem: Kustomer APIs Unresponsive

Summary

On January 23, 2020, from 4:19 PM ET to 5:19 PM ET, the Kustomer API Gateway that serves as the entry point for Kustomer’s Prod1 (US) environment experienced severe latency and an elevated error rate. The incident was triggered by a deploy that degraded the efficiency of the API Gateway service. As a result, the performance of the Kustomer platform, chat, knowledge base, and web application was severely diminished.

Impact

During the impact window, average API latency increased to 7 seconds per request. This affected all consumers of the Kustomer application APIs, including the Kustomer web application, webhooks, and inbound messages. Latency for these consumers returned to normal by 4:42 PM ET. However, chat and knowledge base continued to experience higher-than-normal latency until 5:19 PM ET.

Root Cause

The incident was the result of a deploy containing both updated application code and altered memory settings for the underlying application runtime. By testing in an isolated test environment, Kustomer engineers traced the issue to the changes made to the application’s memory settings. Although these memory settings are in use elsewhere on the platform, the volume and nature of the work performed by the API Gateway differ from those of the other services. The result was degraded rather than improved application performance.

Trigger

The updated runtime settings described above were deployed to the API Gateway service at 4:15 PM ET.

Resolution

As with all sensitive deployments, our engineering team was proactively monitoring application performance during the release. At 4:19 PM ET, the team began to observe increasing latencies in the application. At 4:23 PM ET, the team issued a rollback to reverse the changes. All signs of latency in the primary API Gateway were gone by 4:41 PM ET. We then followed the same steps for the separate API Gateway cluster that powers chat and knowledge base, resulting in full service restoration at 5:19 PM ET.

Additionally, on the day following the incident we ran numerous tests to identify the root cause. We verified that the updated memory settings were the culprit by running four different tests in an isolated container in production. The results showed that, regardless of which version of the application code was running, only the test runs with the updated memory settings exhibited issues.
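The reasoning behind the four tests can be illustrated with a short sketch. It assumes the tests covered the four combinations of application code version (old or new) and memory settings (old or new); the report does not state this explicitly. The function name, the factor labels, and the outcome values below are hypothetical and do not come from Kustomer's tooling or data.

```python
def isolate_factor(results):
    """Return the factors whose value alone predicts whether a run degraded.

    `results` maps (code_version, memory_settings) -> True if the run degraded.
    """
    factors = ("code_version", "memory_settings")
    culprits = []
    for index, name in enumerate(factors):
        # Group the observed outcomes by this factor's value.
        outcomes_by_value = {}
        for combo, degraded in results.items():
            outcomes_by_value.setdefault(combo[index], set()).add(degraded)
        # The factor explains the outcome if every value of the factor
        # always gives the same result, and the results differ between values.
        consistent = all(len(o) == 1 for o in outcomes_by_value.values())
        distinct = len({next(iter(o)) for o in outcomes_by_value.values()}) > 1
        if consistent and distinct:
            culprits.append(name)
    return culprits


# Hypothetical outcomes matching the pattern described in the report:
# only runs with the updated ("new") memory settings degraded.
observed = {
    ("old", "old"): False,
    ("new", "old"): False,
    ("old", "new"): True,
    ("new", "new"): True,
}

print(isolate_factor(observed))  # ['memory_settings']
```

Because the outcome follows the memory settings and ignores the code version, only the memory settings are reported as the explaining factor.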

Lessons Learned & Action Items

Reverse the modifications made to the API Gateway’s memory settings that triggered this incident [DONE]
Test each service individually in isolation before deploying modified runtime settings that lie outside the application code [ONGOING]
Review and improve our cloud host autoscaling policies to ensure rollbacks can always succeed and occur quickly [IN-PROGRESS]
Provision a non-production environment with comparable traffic to our Prod1 (US) environment to test against realistic load prior to release [IN-PROGRESS]
Improve our deployment process to use other deployment methodologies that facilitate early-warnings as well as faster rollback capabilities [IN-PROGRESS]
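
As one illustration of the early-warning idea in the last action item, the sketch below compares latency samples from a small canary group against a baseline and flags a rollback when the canary is much slower. This is a minimal sketch under stated assumptions, not a description of Kustomer's deployment process: the function name, the 2x threshold, the minimum sample count, and all latency values are hypothetical.

```python
from statistics import mean


def should_roll_back(baseline_ms, canary_ms, max_ratio=2.0, min_samples=5):
    """Return True if the canary's mean latency exceeds the allowed ratio.

    `max_ratio` and `min_samples` are illustrative defaults, not tuned values.
    """
    if not baseline_ms:
        raise ValueError("baseline_ms must contain at least one sample")
    if len(canary_ms) < min_samples:
        # Too few canary samples to make a decision yet.
        return False
    return mean(canary_ms) > max_ratio * mean(baseline_ms)


# Hypothetical latency samples in milliseconds.
baseline = [120, 130, 110, 125, 115]
degraded_canary = [900, 3000, 7000, 6500, 7200]
healthy_canary = [118, 122, 131, 119, 127]

print(should_roll_back(baseline, degraded_canary))  # True
print(should_roll_back(baseline, healthy_canary))   # False
```

A check like this runs while only a fraction of traffic reaches the new release, so a regression is caught before it affects every request.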

Posted Jan 28, 2020 - 18:11 EST

Resolved

This incident has been resolved.

Posted Jan 23, 2020 - 17:56 EST

Monitoring

At this time, Kustomer chat is fully operational. We are continuing to monitor the issue, and will share details as they become available. Please contact support@kustomer.com if you are still experiencing any issues.

Posted Jan 23, 2020 - 17:25 EST

Investigating

The Kustomer application continues to be accessible and functional at this time. However, Kustomer chat is experiencing errors, which may prevent some customers from writing in via this channel. We are investigating the problem and will share updates as soon as possible.

Posted Jan 23, 2020 - 17:11 EST

Monitoring

The Kustomer application has recovered and all users should now be able to access the platform without issue. We will continue to monitor this incident until we are certain that it will not recur. Please contact support@kustomer.com with any questions or concerns.

Posted Jan 23, 2020 - 16:48 EST

Update

We are continuing to investigate this issue.

Posted Jan 23, 2020 - 16:41 EST

Investigating

We are currently investigating this issue and will share information as soon as it is available.

Posted Jan 23, 2020 - 16:35 EST

This incident affected: Prod1 (US) (API, Channel - Chat, Web Client).