Incident Report: API Latency - 11/14/19
On Thursday, November 14, 2019, from approximately 11:16 to 11:25am EST, Kustomer’s APIs in the prod1 (US) pod experienced increased latency. The latency was caused by a sudden uptick in requests combined with a delay before our auto-scaling responded to the traffic.
The traffic subsided on its own, but our team has since removed the code that could cause such a spike and adjusted our scaling policies to be more sensitive, so that our infrastructure scales out more quickly.
On November 12, 2019, as part of an upcoming feature release, the Kustomer team rolled out an update that requested a user’s teams from our API every time any team in the organization changed.
On November 14, 2019, an increase in the number of team changes exposed that this code was vulnerable to generating a sudden burst of requests against our API. Reporting revealed a notable spike in requests to fetch teams, as well as in the overall volume of API requests.
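Our team removed the offending code entirely; more generally, this class of burst can be mitigated by coalescing change-triggered fetches so that many change events within a short window produce a single API request. A minimal sketch of that idea, with entirely hypothetical names and a throttle interval chosen for illustration (this is not Kustomer’s actual client code):

```typescript
// Sketch: coalesce many change events into at most one fetch per interval.
// All names and numbers here are hypothetical, not Kustomer's real code.
type Clock = () => number;

function throttle(fn: () => void, intervalMs: number, now: Clock = Date.now): () => void {
  let last = -Infinity; // timestamp of the last allowed call
  return () => {
    const t = now();
    if (t - last >= intervalMs) {
      last = t;
      fn();
    }
  };
}

let fetchCount = 0;
const fetchTeams = () => { fetchCount += 1; }; // real code would GET the teams endpoint

// Instead of fetching on every team change, fetch at most once per 500 ms.
// A fake clock is injected here so the behavior is deterministic.
let fakeTime = 0;
const onTeamChanged = throttle(fetchTeams, 500, () => fakeTime);

// 100 change events arriving 10 ms apart produce only 2 fetches (at t=0 and t=500).
for (let i = 0; i < 100; i++) { onTeamChanged(); fakeTime += 10; }
```

The injectable clock is only there to make the example testable; in a browser the default `Date.now` would be used.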
Separately, in investigating the resulting latency, our team observed a longer-than-expected delay between when the increase in requests began and when our infrastructure scaled to support it. Without computing resources proportionate to the load, the increased volume affected latency for longer than it otherwise would have.
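The effect of making a scaling policy more sensitive can be illustrated with a toy model: a threshold-based scale-out rule that requires several consecutive above-threshold samples before firing reacts faster when the required streak is shortened. The numbers and names below are hypothetical, not Kustomer’s real configuration:

```typescript
// Toy model of a threshold-based scale-out trigger.
// The policy fires after `samplesRequired` consecutive samples above `threshold`.
// All numbers are illustrative, not Kustomer's actual scaling policy.
function samplesUntilScaleOut(
  load: number[],          // per-minute request-rate samples
  threshold: number,       // rate that should trigger scaling
  samplesRequired: number  // consecutive breaches needed to fire
): number {
  let streak = 0;
  for (let i = 0; i < load.length; i++) {
    streak = load[i] > threshold ? streak + 1 : 0;
    if (streak >= samplesRequired) return i + 1; // minutes elapsed before firing
  }
  return -1; // policy never fired
}

// A sudden burst: load jumps from 100 to 1000 req/min at minute 4.
const burst = [100, 100, 100, 1000, 1000, 1000, 1000, 1000];

const slowPolicy = samplesUntilScaleOut(burst, 500, 5); // fires at minute 8
const fastPolicy = samplesUntilScaleOut(burst, 500, 2); // fires at minute 5
```

Under this model, shortening the required streak from five samples to two cuts three minutes off the reaction time, which is the kind of sensitivity adjustment described above.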
Across the application, the slowest 1% of requests took approximately one minute, as opposed to our typical latency on the order of milliseconds. The increased latency lasted from approximately 11:17 to 11:26am EST. Requests that could not be fulfilled during this window returned a 5xx status code to the user.
The Kustomer engineering team was alerted to the latency by internal reports and to the errors by our automated alerting system. The latency resolved as traffic decreased, and the engineering team immediately reverted the changes that had made the API vulnerable to such bursts.