On May 16, 2022, an issue in a workflow release caused events to go unprocessed by workflow from 11:08 AM to 11:14 AM. When the release was rolled back, the backlog of queued workflow messages overwhelmed Kustomer’s core API with traffic. This traffic first increased latency in the core API and then increased error rates, degrading the platform experience when loading timelines and conversations.
The workflow update included an upgraded dependency from a third-party service. The upgrade changed the return type of a method in that dependency, which left the workflow service unable to parse large JSON event bodies pulled from S3.
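To illustrate this class of failure, the sketch below normalizes an event body to text before parsing it as JSON, so that a change in the type a dependency returns (string, bytes, or a readable stream) does not break parsing. It is a minimal example under stated assumptions, not the workflow service's actual code: the report does not name the dependency or the types involved, and `parse_event_body` is a hypothetical helper.

```python
import json
from typing import Any


def parse_event_body(body: Any) -> Any:
    """Parse a JSON event body regardless of how the storage client returns it.

    Hypothetical helper: accepts str, bytes/bytearray, or any object with a
    read() method (for example a file-like stream).
    """
    # Stream-like objects: read the full payload first.
    if hasattr(body, "read"):
        body = body.read()

    # Byte payloads: decode to text (UTF-8 is an assumption here).
    if isinstance(body, (bytes, bytearray)):
        body = bytes(body).decode("utf-8")

    # Anything else is rejected loudly instead of failing deep inside json.
    if not isinstance(body, str):
        raise TypeError(f"unsupported event body type: {type(body).__name__}")

    return json.loads(body)
```

Failing with an explicit `TypeError` on an unexpected type makes a changed return type visible in tests and logs as soon as the dependency is upgraded.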
05/16 10:44 AM - Deployment of new Workflow code.
05/16 11:15 AM - Rollback of Workflow initiated.
05/16 11:21 AM - Core APIs begin to experience increased latency.
05/16 11:28 AM - Incident declared for workflow processing delays.
05/16 11:52 AM - Engineers manually scale out resources for core API services to handle additional traffic.
05/16 11:53 AM - Latency in core API is resolved.
05/16 1:14 PM - Per the TSE team, new conversations related to the incident were no longer coming in.
05/16 1:40 PM - Events that were not processed during the incident are re-driven.
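The timeline shows the core API becoming overloaded once the queued workflow backlog was released after the rollback. One common way to limit that kind of surge is to drain or re-drive a backlog at a bounded rate. The sketch below is only an illustration of that idea, not a description of how Kustomer re-drove events: `paced_batches`, `drain`, and the `max_per_second` limit are all hypothetical names.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def paced_batches(backlog: Sequence[T], max_per_second: int) -> Iterator[List[T]]:
    """Split a backlog into batches of at most max_per_second messages."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    for start in range(0, len(backlog), max_per_second):
        yield list(backlog[start:start + max_per_second])


def drain(
    backlog: Sequence[T],
    max_per_second: int,
    send: Callable[[T], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send every message, pausing one second after each batch.

    Returns the number of messages sent. The sleep function is injectable
    so the pacing can be tested without real delays.
    """
    sent = 0
    for batch in paced_batches(backlog, max_per_second):
        for message in batch:
            send(message)
            sent += 1
        sleep(1.0)  # crude pacing: at most one batch per second
    return sent
```

A fixed one-second pause is the simplest possible pacing; a production system would more likely adapt the rate to downstream latency or error signals.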