[Prod 1] [Searches and Charts] Currently experiencing errors within the platform for some orgs
Incident Report for Kustomer
Postmortem

Summary

We saw an increase in errors across search and reporting for a subset of customers in Prod1.

Root Cause

Multiple organizations had badly performing search queries that were continuously polling. These queries degraded performance across our cluster: other search queries backed up in the queue and slowed down, and our CPU usage increased. We identified the two orgs whose searches were negatively impacting the cluster and stopped those searches, which resolved the issue.

Timeline

  • 9:35 AM EDT Increased latency began with search and reporting
  • 9:52 AM EDT We began to get alerted to search-related errors in the platform and began investigating the root cause
  • 10:25 AM EDT We identified that the root cause was an abundance of slow search queries backing up in their queue, which increased our CPU usage. We then began investigating the source of the slow queries
  • 11:30 AM EDT We identified an org that we believed to be the source because it had badly performing search queries running. We stopped the searches for that org, which improved performance but didn’t fully resolve the issue, as CPU usage remained high. We continued investigating to find which other org might be contributing to the increased CPU usage
  • 12:56 PM EDT We stopped the badly performing searches for a second org, which returned performance to pre-incident levels

Lessons/Improvements

  • Decrease the impact of badly performing searches - We’ve begun limiting the cases in which we poll for searches and disabling polling for badly performing searches. We are also updating our search builder so that users can consolidate search queries to make them more performant

    • [DONE] Automatically disabling badge count polling on badly performing searches
    • [IN PROGRESS] Upgrading our hardware to a more recent version that is more resilient and more performant
    • [TO DO] Improve the search builder to give it the ability to consolidate queries
  • Improve internal monitoring for search failures - We were alerted to search-related errors, but during the investigation we realized that we could improve our monitoring and alerting so that we are notified of issues earlier and can triage faster

    • [IN PROGRESS] Investigate and implement optimal monitoring and alerting strategies for slow search queries
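
As an illustration of the "automatically disabling polling on badly performing searches" item above, the following is a minimal sketch only, not Kustomer's actual implementation. The class name `SearchPollingGuard`, the 2000 ms threshold, the window of 5 runs, and the limit of 3 slow runs are all hypothetical values chosen for the example. It records recent run durations per saved search and stops polling a search once too many of its recent runs are slow.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Set


@dataclass
class SearchPollingGuard:
    """Hypothetical guard that disables polling for persistently slow searches."""

    slow_threshold_ms: float = 2000.0  # assumed cutoff for a "slow" run
    window: int = 5                    # number of most recent runs considered
    max_slow_runs: int = 3             # slow runs in the window that trigger disabling
    _history: Dict[str, Deque[float]] = field(default_factory=dict)
    _disabled: Set[str] = field(default_factory=set)

    def record(self, search_id: str, duration_ms: float) -> None:
        """Record one run and disable polling if the search is repeatedly slow."""
        runs = self._history.setdefault(search_id, deque(maxlen=self.window))
        runs.append(duration_ms)
        slow_runs = sum(1 for d in runs if d >= self.slow_threshold_ms)
        if slow_runs >= self.max_slow_runs:
            self._disabled.add(search_id)

    def polling_allowed(self, search_id: str) -> bool:
        """Return True if the search may still be polled automatically."""
        return search_id not in self._disabled

    def reenable(self, search_id: str) -> None:
        """Clear history and allow polling again, e.g. after the query is fixed."""
        self._disabled.discard(search_id)
        self._history.pop(search_id, None)


guard = SearchPollingGuard()
for duration in (100.0, 2500.0, 3000.0, 4000.0):
    guard.record("search-a", duration)
guard.record("search-b", 50.0)
```

In this sketch, a disabled search could still be run on demand by the user; only the background polling is skipped, so one slow search cannot keep re-queuing work on the shared cluster.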
Posted Oct 09, 2024 - 15:28 EDT

Resolved
Kustomer has resolved an event affecting Searches and Charts.

After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Sep 30, 2024 - 16:32 EDT
Update
Kustomer has implemented an update to address an event affecting Searches and Charts that may cause issues within the platform for some orgs at this time.

Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Sep 30, 2024 - 13:53 EDT
Monitoring
Kustomer has implemented an update to address an event affecting Searches and Charts that may cause issues within the platform for some orgs at this time.

Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Sep 30, 2024 - 12:56 EDT
Update
Kustomer is aware of an event affecting Searches and Charts that may cause issues within the platform for some orgs at this time.

Our team is still working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Sep 30, 2024 - 12:01 EDT
Update
Kustomer is aware of an event affecting Searches and Charts that may cause issues within the platform for some orgs at this time.

Our team is still working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Sep 30, 2024 - 11:28 EDT
Investigating
Kustomer is aware of an event affecting Searches and Charts that may cause issues within the platform for some orgs at this time.

Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Sep 30, 2024 - 10:44 EDT
This incident affected: Prod1 (US) (Search).