Kustomer Platform Latency

Incident Report for Kustomer

Postmortem

Incident Report: API Latency - 11/14/19

Summary

On Thursday, November 14, 2019, from approximately 11:16 to 11:25am EST, Kustomer’s APIs in the prod1 (US) pod experienced increased latency. The latency was caused by a sudden uptick in requests combined with a delay in our auto-scaling in response to that traffic.

The traffic subsided on its own. Our team has since removed the code that could cause such a spike and adjusted our scaling policies to be more sensitive, so that our infrastructure scales out more quickly.

Root Causes

On November 12, 2019, as part of an upcoming feature release, the Kustomer team rolled out an update to request a user’s teams from our API every time any team in the organization was changed.

On November 14, 2019, an increase in the number of team changes revealed that this code could generate a sudden burst of requests against our API. Reporting showed a notable spike in the number of requests to get teams, as well as a notable spike in the overall level of API requests.
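To illustrate why this pattern can produce a burst, here is a minimal sketch. It assumes, for illustration only, that every active user in an organization issues one request for their teams per team change; the function name and all numbers are hypothetical and are not taken from Kustomer's code or traffic data.

```python
def team_fetch_requests(active_users: int, team_changes: int) -> int:
    """Requests generated if every active user refetches teams on every team change.

    Assumption (hypothetical): one refetch per active user per change.
    """
    return active_users * team_changes


# Hypothetical numbers, for illustration only.
quiet = team_fetch_requests(active_users=500, team_changes=2)
busy = team_fetch_requests(active_users=500, team_changes=40)
print(quiet, busy)
```

Under that assumption the request count grows with the product of users and changes, so a modest rise in team changes can multiply into a much larger rise in API requests.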

Separately, in investigating the resulting latency, our team observed that there was a longer-than-expected delay from when the increase in requests began to when our infrastructure scaled to support it. Without proportionate computing resources to handle requests, the increased volume impacted latency for longer than it might otherwise have.
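One way to see how a scaling trigger affects that delay is the rough sketch below. It assumes a simple threshold rule on a utilization metric; the metric, the sample values, and the thresholds are hypothetical and do not describe Kustomer's actual scaling policy.

```python
from typing import Optional, Sequence


def first_breach(utilization: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first sample at or above threshold, or None."""
    for i, value in enumerate(utilization):
        if value >= threshold:
            return i
    return None


# Hypothetical per-minute utilization (%) while traffic ramps up.
samples = [35, 40, 52, 63, 74, 85, 93]

high = first_breach(samples, 80)  # a less sensitive trigger
low = first_breach(samples, 60)   # a more sensitive trigger
print(high, low)
```

With these made-up samples, the lower threshold is crossed two intervals earlier, which is the kind of head start a more sensitive policy is meant to provide.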

Impact

Across the application, the slowest 1% of requests took approximately one minute, as opposed to our typical latency on the order of milliseconds. The increased latency lasted from approximately 11:17 to 11:26am EST. Requests that could not be fulfilled during this window returned a 5xx status code to the user.
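The "slowest 1%" figure corresponds to what is commonly called 99th-percentile (p99) latency. As a minimal sketch, assuming a nearest-rank percentile and entirely hypothetical latency samples, it could be computed like this:

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [50] * 98 + [60_000] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)
```

In this made-up sample the median stays at 50 ms while p99 reaches one minute, which shows why a tail percentile can reveal an impact that an average or median would hide.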

Resolution

The Kustomer engineering team was notified of the latency by our own team and of the errors by our automated alert system. The latency resolved on its own as traffic decreased, and the engineering team immediately reverted the changes that made the API vulnerable to such bursts.

Action Items

[COMPLETED] The code that enabled the spike in API requests was immediately rolled back, removing the source of the request burst.
[COMPLETED] Thresholds for scaling have been lowered, so that infrastructure scales sooner in response to additional activity.
[TO DO] Review and update latency and Apdex monitoring to ensure the team is notified sooner when customers are experiencing issues.
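
For readers unfamiliar with Apdex, the sketch below uses the standard definition: requests at or under a target time T count as satisfied, those up to 4T count as tolerating at half weight, and the rest count as frustrated. The target, the alert threshold, and the sample latencies are hypothetical and are not Kustomer's monitoring settings.

```python
def apdex(latencies_s, t):
    """Apdex score: (satisfied + tolerating / 2) / total, for target time t in seconds."""
    if not latencies_s:
        raise ValueError("latencies_s must be non-empty")
    satisfied = sum(1 for x in latencies_s if x <= t)
    tolerating = sum(1 for x in latencies_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)


# Hypothetical sample in seconds: 90 fast, 6 slow, 4 very slow requests.
sample = [0.1] * 90 + [1.0] * 6 + [60.0] * 4
score = apdex(sample, t=0.5)

ALERT_BELOW = 0.95  # hypothetical alert threshold
should_alert = score < ALERT_BELOW
print(round(score, 2), should_alert)
```

Because frustrated requests contribute nothing to the score, even a small share of very slow requests pulls it down, which makes it a reasonable candidate for the earlier notification this action item describes.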

Posted Nov 25, 2019 - 13:18 EST

Resolved

The intermittent latency incident has been resolved.

If you have any questions, please reach out to support via our web form at help.kustomer.com by selecting the "Submit Issue" link.

Posted Nov 14, 2019 - 14:37 EST

Monitoring

A fix is in progress and we will be monitoring results.

Posted Nov 14, 2019 - 12:05 EST

Update

We are continuing to investigate this issue.

Posted Nov 14, 2019 - 11:40 EST

Investigating

The Kustomer platform is currently experiencing intermittent increased latency. We are researching to determine a root cause and will continue to monitor. We will share updates here on our Status Page.

If you have any questions, please reach out to support via our web form at help.kustomer.com by selecting the "Submit Issue" link.

Posted Nov 14, 2019 - 11:40 EST

This incident affected: Prod1 (US) (API).