[Platform Latency] [Delay in notifications and chat sending as well as any updates in the platform] [Prods 1 and 2]
Incident Report for Kustomer
Postmortem

Postmortem: Intermittent Search Failures

Summary

On October 24, 2024, customers experienced intermittent failures when loading search and reporting features. Users may have seen a "Search is Unavailable" error when logging in, or received "429 Too Many Requests" responses from the API.
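For API clients that received 429 Too Many Requests during the incident, one common way to cope is to retry with exponential backoff. The following is a minimal sketch of that general technique, not Kustomer's client or a documented Kustomer API: the function name `call_with_backoff`, the `(status, headers, body)` response shape, and the default delays are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for illustration: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry `send` while it returns HTTP 429, waiting longer after each attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for attempt in range(1, max_attempts):
        status, headers, _body = response
        if status != 429:
            return response
        # Prefer the server's Retry-After hint when it is a whole number of seconds.
        # (Assumes the header key is spelled exactly "Retry-After"; the HTTP-date
        # form of the header is not handled in this sketch.)
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            # Exponential backoff with full jitter, capped at max_delay.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
        sleep(delay)
        response = send()
    # Either a non-429 response or the last 429 once attempts are exhausted.
    return response
```

The `sleep` parameter is injectable so the retry schedule can be exercised in tests without real waiting. The jitter spreads retries out so that many clients do not hit an already loaded system at the same moment.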

Root Cause

An internal data-cleaning process generated an overwhelming number of queued items in the search system, which caused high CPU load. As a result, searches were rejected or left unfulfilled.

Timeline

Oct 24, 2024

  • 8:13am: Users began experiencing intermittent “Search is Unavailable” errors upon logging in, along with slow searches. On-call engineers began investigating.
  • 9:14am: Engineers identified high CPU load on the search system and continued investigating the root cause. The team began implementing short-term mitigations.
  • 11:02am: Mitigations to temporarily disable search were rolled out while on-call engineers monitored CPU metrics on the search system.
  • 11:17am: CPU load remained elevated and mitigations were removed.
  • 11:55am: Additional mitigations were implemented to disable search and reporting for customers on affected search nodes.
  • 1:47pm: Root cause identified. A large number of queued data deletion tasks were responsible for the high CPU load and were promptly purged. Once the tasks were purged, CPU metrics returned to expected levels.
  • 2:02pm: All search and reporting functionality restored for affected customers.

Lessons/Improvements

  • [DONE] Fix the defect that resulted in a large number of queued tasks
  • [IN PROGRESS] Add monitors to improve alerting for stale queued tasks in search clusters
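
The kind of check a stale-queued-task monitor might perform can be sketched as follows. This is an illustration under stated assumptions, not the monitor Kustomer is building: the `QueuedTask` fields, the `check_stale_tasks` function, and the 15-minute and 100-task thresholds are hypothetical, and a real search cluster exposes its own task metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable


@dataclass(frozen=True)
class QueuedTask:
    # Hypothetical fields chosen for illustration only.
    task_id: str
    action: str
    enqueued_at: datetime


@dataclass(frozen=True)
class QueueAlert:
    stale_count: int
    oldest_age: timedelta
    should_alert: bool


def check_stale_tasks(
    tasks: Iterable[QueuedTask],
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),  # assumed threshold
    max_stale: int = 100,  # assumed threshold
) -> QueueAlert:
    """Count tasks queued longer than max_age and flag when too many pile up."""
    stale_ages = [now - t.enqueued_at for t in tasks if now - t.enqueued_at > max_age]
    oldest = max(stale_ages, default=timedelta(0))
    return QueueAlert(
        stale_count=len(stale_ages),
        oldest_age=oldest,
        should_alert=len(stale_ages) > max_stale,
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Simulated backlog: 150 deletion tasks that have been queued for an hour.
    backlog = [
        QueuedTask(f"task-{i}", "delete", now - timedelta(hours=1)) for i in range(150)
    ]
    print(check_stale_tasks(backlog, now))
```

Alerting on the age of queued tasks, rather than on CPU load alone, points at a backlog like the one in this incident directly instead of at its downstream symptom.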
Posted Dec 20, 2024 - 15:09 EST

Resolved
Kustomer has resolved an event affecting the platform that caused a delay in notifications, chat sending and any updates in the platform. To resolve this issue, our team worked with our third-party vendor.

After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Oct 24, 2024 - 10:28 EDT
Monitoring
We are currently monitoring a fix implemented by our third-party vendor related to delays in notifications, chat sending and any updates in the platform.

Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Oct 24, 2024 - 09:42 EDT
Identified
Kustomer has identified an event affecting Prods 1 and 2 that may cause a delay in notifications, chat sending and any updates in the platform. We are currently working with a third-party vendor toward a resolution.

Our team is currently working to implement a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Oct 24, 2024 - 09:18 EDT
Update
We are continuing to investigate this issue.
Posted Oct 24, 2024 - 08:49 EDT
Investigating
Kustomer has identified an event affecting Prods 1 and 2 that may cause a delay in notifications and chat sending as well as any updates in the platform.

Our team is currently working to implement a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Oct 24, 2024 - 08:46 EDT
This incident affected: Prod1 (US) (Channel - Chat, Notifications), Prod2 (EU) (Channel - Chat, Notifications), and Third Party (PubNub).