Delays in Timeline Load Times
Incident Report for Kustomer
Postmortem

Summary

On May 16, 2022, an issue in a release of workflow caused a brief period, from 11:08 AM to 11:14 AM, during which events were not processed by workflow. When this release was rolled back, the backlog of queued workflow messages overwhelmed Kustomer’s core API with traffic. This traffic increased latency in the core API and then increased error rates, resulting in a degraded platform experience when loading timelines and conversations.

Root Cause

When an update to workflow was released, it included an upgraded dependency from a third-party service. A method in the upgraded dependency changed its return type, which left the workflow service unable to parse large JSON event bodies pulled from S3.
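To illustrate this class of failure, the following is a minimal sketch, not Kustomer's actual code, of a parser that tolerates a change in the type a storage client returns for an object body. The function name read_json_body and the set of accepted types (bytes, str, or a file-like object with a read method) are assumptions made for illustration only; the report does not name the dependency or the method involved.

```python
import io
import json
from typing import Any


def read_json_body(body: Any) -> Any:
    """Parse a JSON payload regardless of how the client delivered it.

    Hypothetical helper: accepts bytes, str, or a file-like object,
    and fails loudly on anything else instead of mis-parsing it.
    """
    # A stream-like return value is drained first.
    if hasattr(body, "read"):
        body = body.read()
    # Raw bytes are decoded before parsing.
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    if not isinstance(body, str):
        raise TypeError(f"unsupported body type: {type(body).__name__}")
    return json.loads(body)


if __name__ == "__main__":
    print(read_json_body(b'{"event": "created"}'))
    print(read_json_body(io.BytesIO(b'{"event": "created"}')))
```

Normalizing the body at a single boundary like this, and covering it with a test that exercises each accepted type, is one way a dependency upgrade that changes a return type could be caught before release.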

Timeline

05/16 10:44 AM - Deployment of new Workflow code.

05/16 11:15 AM - Rollback of Workflow initiated.

05/16 11:21 AM - Core APIs begin to experience increased latency.

05/16 11:28 AM - Incident declared with workflow processing delays.

05/16 11:52 AM - Engineers manually scale out resources for core API services to handle additional traffic.

05/16 11:53 AM - Latency in core API is resolved.

05/16 1:14 PM - Per the TSE team, new conversations related to the incident were no longer coming in.

05/16 1:40 PM - Events that were not processed during the incident are re-driven.

Lessons/Improvements

  • Formulate a plan to mitigate workflow placing excessive downstream pressure on Kustomer’s core APIs (To Do)
  • Implement a health check to automatically detect and replace services in a bad state (In Progress)
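
As one possible shape for the first item above, the sketch below drains a backlog through a token-bucket rate limiter so that a burst of queued messages reaches a downstream API at a bounded rate. This is an illustrative assumption, not the plan Kustomer adopted; the names TokenBucket and drain_backlog, and the rate and capacity values, are hypothetical.

```python
import time
from typing import Callable, Iterable


class TokenBucket:
    """Hypothetical limiter: allows `rate` sends per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def drain_backlog(messages: Iterable, send: Callable, bucket: TokenBucket,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Send each queued message, waiting whenever the bucket is empty."""
    for message in messages:
        while not bucket.try_acquire():
            sleep(1.0 / bucket.rate)
        send(message)


if __name__ == "__main__":
    # Assumed values: at most 50 downstream calls per second, bursts of 10.
    drain_backlog(range(100), lambda m: None, TokenBucket(rate=50.0, capacity=10.0))
```

The clock and sleep functions are injectable so the limiter can be tested without real delays. A limiter of this kind trades slower backlog recovery for protection of the downstream service.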
Posted May 25, 2022 - 09:12 EDT

Resolved
Kustomer has resolved an event affecting latency and customer timelines within Prod 1 environments. After careful monitoring, our team has found that all affected areas are fully restored. Please reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted May 16, 2022 - 12:12 EDT
Monitoring
Kustomer has implemented an update to address an event affecting latency and timeline load times in PROD 1 environments. If this issue persists on your device, we advise clearing the cache on your machine to resolve the issue locally. Our team will continue to monitor this update to ensure the issue is fully resolved. Please expect further updates within 1 hour and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted May 16, 2022 - 12:03 EDT
Investigating
Kustomer has implemented an update to address an event affecting customer timelines loading in PROD 1 environments. Our team will continue to monitor this update to ensure the issue is fully resolved and Kustomer platforms are stabilized. Please expect further updates within 1 hour or sooner and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted May 16, 2022 - 11:49 EDT
This incident affected: Prod1 (US) (Workflow).