Core Systems - Latency Affecting Customer And Conversation Timelines
Incident Report for Kustomer
Postmortem

Summary

On July 22, 2024, customers on Kustomer’s Prod1 environment experienced elevated latency and error rates on multiple features of the Kustomer product.

Root Cause

One of Kustomer’s primary databases experienced a hardware failure, resulting in a switchover to a secondary database. Requests made during the 90-second period between the failure and the successful switchover were unsuccessful. The service responsible for rendering the Kustomer timeline did not switch over to the new primary node for an additional 8 minutes, which extended the time until customer timelines were usable.
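
For illustration only, the behavior described above (a client continuing to use a failed primary after a new one has been promoted) can be sketched as follows. This is not Kustomer's code and does not reflect any particular database driver. `Node`, `NodeDown`, and `FailoverClient` are hypothetical names, and the nodes are simulated in memory. The sketch shows one common mitigation, under those assumptions: drop the cached primary on the first failed request and re-discover it, with a short backoff while no primary is available.

```python
import time


class NodeDown(Exception):
    """Raised by a simulated node that cannot serve requests."""


class Node:
    """Hypothetical stand-in for a database node; not a real driver API."""

    def __init__(self, name, is_primary=False, healthy=True):
        self.name = name
        self.is_primary = is_primary
        self.healthy = healthy

    def query(self, statement):
        # Only a healthy primary can serve requests in this simulation.
        if not self.healthy or not self.is_primary:
            raise NodeDown(f"{self.name} cannot serve {statement!r}")
        return f"{self.name}: ok"


class FailoverClient:
    """Re-discovers the primary on failure instead of reusing a stale node."""

    def __init__(self, nodes, max_attempts=5, base_delay=0.01):
        self.nodes = nodes
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.primary = self._discover()

    def _discover(self):
        # Pick the first healthy node that reports itself as primary.
        for node in self.nodes:
            if node.healthy and node.is_primary:
                return node
        return None

    def query(self, statement):
        for attempt in range(self.max_attempts):
            if self.primary is None:
                self.primary = self._discover()
            if self.primary is not None:
                try:
                    return self.primary.query(statement)
                except NodeDown:
                    # Drop the stale reference so the next pass re-discovers.
                    self.primary = None
                    continue
            # No primary is available yet; back off before retrying.
            time.sleep(self.base_delay * (2 ** attempt))
        raise NodeDown("no primary available after retries")


# Simulated failover: db-a fails and db-b is promoted.
node_a = Node("db-a", is_primary=True)
node_b = Node("db-b")
demo_client = FailoverClient([node_a, node_b])
before = demo_client.query("SELECT 1")   # served by db-a
node_a.healthy = False                   # simulated hardware failure
node_b.is_primary = True                 # simulated promotion of the secondary
after = demo_client.query("SELECT 1")    # served by db-b after re-discovery
```

In this sketch the second request fails once against the stale node and then succeeds on the next attempt, so the client recovers as soon as a new primary exists rather than waiting for a restart or a long-lived connection to time out.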

Timeline

Jul 22, 2024

  • 10:30 AM EDT: A hardware failure occurred on a primary database node.
  • 10:31:38 AM EDT: The failure was detected and traffic began routing to a new primary database node. At this point, non-timeline requests began succeeding.
  • 10:50 AM EDT: The service that renders the Kustomer timeline fully recovered and the platform functioned normally.

Lessons/Improvements

Improved internal monitoring for database failures - Our team was alerted to the failures and began investigating immediately, but did not have immediate visibility into their cause. We’ve improved our database monitoring to allow for quicker response times if a similar hardware failure occurs in the future.

Perform additional failover testing - We intend to perform additional testing of failover scenarios in non-production environments to discover additional opportunities to optimize this process and reduce disruption to customers.

Posted Aug 15, 2024 - 16:28 EDT

Resolved
Kustomer has resolved an event affecting core platform systems that caused latency in searches and timelines.

After careful monitoring, our team has concluded that all affected areas are now fully restored. Please reach out to Kustomer support at support@kustomer.com with any additional questions or concerns.
Posted Jul 22, 2024 - 11:17 EDT
Update
We are continuing to monitor for any further issues.
Posted Jul 22, 2024 - 11:05 EDT
Monitoring
Kustomer has identified an event affecting platform systems that may cause customer and conversation timelines within the platform to have trouble loading.

Our team has implemented an update to address this issue and is monitoring to verify that it is resolved. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Jul 22, 2024 - 11:01 EDT
This incident affected: Prod1 (US) (Channel - Chat, Channel - Email, Channel - Facebook, Channel - Instagram, Channel - SMS, Channel - Twitter, Channel - WhatsApp, Search).