Platform Instability - Prod 2
Incident Report for Kustomer
Postmortem

Summary

On April 15, 2024, customers on the Prod2 cluster experienced elevated latency and error rates across multiple features of the Kustomer product.

Root Cause

A bulk operation generated an extremely high number of events within the system in a very short period of time. The system was initially unable to scale fast enough to handle the load, resulting in a 2-hour period of instability.

Timeline

Apr 15, 2024

6:28 AM EDT Our on-call engineers were alerted to an incident of high error rates in the platform, kicking off an investigation

7:51 AM EDT Kustomer’s support team began receiving reports that a portion of agents were unable to access the platform

8:58 AM EDT The bulk operation that caused the issue was disabled by our engineers

9:55 AM EDT Latency recovered and error rates decreased to pre-incident levels

12:08 PM EDT All related services fully recovered

Lessons/Improvements

  • Bulk Jobs: We identified a bug in our bulk job logic that could lead to larger-than-expected jobs running. We also identified opportunities to improve how we rate limit bulk jobs and isolate them from the rest of the system.

    • We have fixed a bug in our bulk operations that caused the original bulk job to update many more records than expected.
    • We are actively evaluating improvements to the rate limiting of bulk operations and plan to implement changes in the coming weeks.
  • Scaling: We identified some inefficiencies in our scaling strategies related to recent changes in platform usage.

    • We’ve made short-term improvements to our scaling policies to increase platform stability as we investigate longer-term solutions.
    • We are actively planning changes to isolate automation traffic in our APIs from web user traffic to prevent automations from destabilizing our web interface, and we plan to implement these changes in the coming weeks.
  • Monitoring: We are evaluating options to improve visibility of automation activity to make it easier to identify automations that are disproportionately impacting the system.
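
As an illustration of the rate limiting mentioned under Bulk Jobs, the following is a minimal sketch of a token-bucket limiter with a size guard on the job. It is not Kustomer's implementation. All names here (TokenBucket, run_bulk_job, max_records, apply_update) are hypothetical and chosen only for this example. The bucket caps how fast a bulk job can emit updates, and the guard rejects a job that would touch more records than the caller expected.

```python
import time


class TokenBucket:
    """Token-bucket limiter: refills at rate_per_sec, holds at most capacity tokens."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate_per_sec = float(rate_per_sec)
        self.capacity = float(capacity)
        self._clock = clock            # injectable so the limiter can be tested
        self._tokens = self.capacity   # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return False without blocking otherwise."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_sec)
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False


def run_bulk_job(record_ids, apply_update, bucket, max_records=None, sleep=time.sleep):
    """Apply `apply_update` to each record, pacing the work with `bucket`.

    `max_records` is a hypothetical safety limit: the job is rejected up front
    if it would touch more records than expected.
    """
    ids = list(record_ids)
    if max_records is not None and len(ids) > max_records:
        raise ValueError(
            f"bulk job targets {len(ids)} records, limit is {max_records}"
        )
    for record_id in ids:
        # Wait roughly one token's worth of time whenever the bucket is empty.
        while not bucket.try_acquire():
            sleep(1.0 / bucket.rate_per_sec)
        apply_update(record_id)
    return len(ids)
```

With this sketch, a job created as run_bulk_job(ids, update_fn, TokenBucket(rate_per_sec=50, capacity=100), max_records=10_000) would process at most a burst of 100 updates and then about 50 per second, and would fail fast if the selection unexpectedly exceeded 10,000 records. The specific numbers are placeholders, not recommended values.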

Posted Apr 19, 2024 - 15:04 EDT

Resolved
Kustomer has resolved an event affecting Prod 2 that caused platform latency issues. To resolve this issue, our team has pushed out an update to improve the performance of the platform.

Our engineering team has redriven multiple parts of the platform, and after careful monitoring, our team has determined that our systems are now fully restored. Please reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Apr 15, 2024 - 12:08 EDT
Update
Kustomer has pushed out an update to improve the system and the team is currently redriving events that impacted the system.

Systems are operational at this time, and our team is monitoring the system to ensure the issue remains resolved. Please expect further updates, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Apr 15, 2024 - 11:45 EDT
Monitoring
Kustomer is working on implementing a solution to improve the system and an update will be going out shortly. Systems are operational at this time and our team is currently monitoring the system to ensure the issue will be fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Apr 15, 2024 - 10:27 EDT
Update
Kustomer is aware of an event affecting Prod 2 org instances that may cause instability issues within the platform.

Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns. We are continuing to investigate the issue.
Posted Apr 15, 2024 - 08:56 EDT
Update
Kustomer is aware of an event affecting Prod 2 org instances that may cause instability issues within the platform.

Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns. We are continuing to investigate the issue.
Posted Apr 15, 2024 - 08:47 EDT
Investigating
Kustomer is aware of an event affecting Prod 2 org instances that may cause instability issues within the platform.

Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Apr 15, 2024 - 08:20 EDT
This incident affected: Prod2 (EU) (API).