On April 15, 2024, customers on the Prod2 cluster experienced elevated latency and error rates across multiple features of the Kustomer product.
Root Cause
A bulk operation generated an extremely high number of events in a very short period, and the system was initially unable to scale fast enough to handle the load, resulting in a 2-hour period of instability.
Apr 15, 2024
6:28 AM EDT Our on-call engineers were alerted to an incident of high error rates in the platform, kicking off an investigation
7:51 AM EDT Kustomer’s support team began receiving reports that some agents were unable to access the platform
8:58 AM EDT The bulk operation that caused the issue was disabled by our engineers
9:55 AM EDT Latency recovered and error rates decreased to pre-incident levels
12:08 PM EDT All related services fully recovered
Lessons/Improvements
Bulk Jobs: We identified a bug in our bulk job logic that could cause larger-than-expected jobs to run. We also identified opportunities to improve how we rate limit bulk jobs and isolate them from the rest of the system.
Scaling: We identified some inefficiencies in our scaling strategies related to recent changes in platform usage.
Monitoring: We are evaluating options to improve visibility of automation activity to make it easier to identify automations that are disproportionately impacting the system.