[SEARCHES] Search Retrieval Issues (PROD2)
Incident Report for Kustomer
Postmortem

Summary

An organization’s automated process running a high volume of complex search queries strained our search system, causing reduced search functionality for multiple Prod2 organizations over the span of roughly 3 hours.

Root Cause

A high volume of complex queries in a short amount of time put excessive strain on our infrastructure. This degraded the health of 2 nodes, affecting other organizations that rely on the same infrastructure.

Timeline

Jan 12, 2025

6:00 AM EST - A single organization’s automated user ramps up activity to significantly higher levels than normal

11:38 AM EST - Engineers start receiving alerts that various charts/reporting endpoints are generating errors

12:16 PM EST - Multiple clients report broken search functionality, and we declare an incident

12:22 PM EST - Engineers identify that 2 of our nodes are reaching maximum CPU utilization

12:33 PM EST - Engineers identify the organization, user, and query that are causing the issue

1:15 PM EST - Engineers temporarily block the organization generating the problematic queries, restoring functionality to all other organizations

3:15 PM EST - Engineers remove the block, and full functionality is restored for all organizations

Lessons/Improvements

  • Better communications - This incident highlighted the need for a better protocol for communicating with organizations whose activity may be impacting our infrastructure or other organizations. With such a protocol in place, we hope to reduce the need to block a single organization.
  • Tune our blocking - In response to this incident, we implemented the ability to block a single user (rather than block an entire organization) that is disrupting our systems.
  • Better alerts - By more finely tuning our alerting system, we can get earlier notification of incidents such as this without having to wait for customers to report impaired functionality.
  • Rate limiting machine users - We are exploring a reasonable way to rate-limit activity by machine users so that no single machine user can overwhelm our systems, as seen in this incident.
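
As an illustration of the rate-limiting and per-user blocking ideas above, here is a minimal sketch of one common approach, a token bucket kept per (organization, user) pair. It is not a description of Kustomer's implementation: the class names, the capacity and refill values, and the idea of charging a higher `cost` for complex queries are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_sec` rate."""

    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start with a full bucket
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class SearchLimiter:
    """Hypothetical gate in front of search: one bucket per (org, user)."""

    def __init__(self, capacity=5, refill_per_sec=1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}
        self.blocked = set()  # manually blocked (org, user) pairs

    def block(self, org_id, user_id):
        self.blocked.add((org_id, user_id))

    def unblock(self, org_id, user_id):
        self.blocked.discard((org_id, user_id))

    def allow(self, org_id, user_id, cost=1.0, now=None):
        key = (org_id, user_id)
        if key in self.blocked:
            return False
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now=now)
            self.buckets[key] = bucket
        # `cost` lets a caller charge more tokens for a complex query.
        return bucket.allow(cost=cost, now=now)
```

Because limits and blocks are keyed on the user rather than the organization, a single automated user can be throttled or blocked while other users in the same organization keep searching.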
Posted Jan 16, 2025 - 13:59 EST

Resolved
Kustomer has resolved an event affecting Prod2 instances that caused issues when attempting to retrieve search results.

After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Jan 12, 2025 - 15:41 EST
Update
We have released a fix to restore search retrieval in Prod2 instances, and this issue should now be resolved for all Prod2 clients except one isolated instance, where we're working with the relevant stakeholders to ensure optimal functionality on a specific search query.

Please feel free to reach out to Kustomer Support at support@kustomer.com if you have additional queries or concerns.
Posted Jan 12, 2025 - 14:30 EST
Update
Kustomer has released an update and is seeing indications of recovery on search retrieval in Prod2 instances.

We are monitoring this fix to ensure the issue is fully resolved. Please expect additional details within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com if you have further questions or concerns.
Posted Jan 12, 2025 - 13:52 EST
Monitoring
Kustomer has implemented an update to address the issue impacting search retrieval results in Prod2 instances.

Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com if you have additional questions or concerns.
Posted Jan 12, 2025 - 13:40 EST
Update
We have identified the root cause of the issue retrieving search results in Prod2 instances and continue to work on implementing a solution.

Additional updates will be provided in the next 30 minutes. In the meantime, please reach out to Kustomer Support at support@kustomer.com with any further queries.
Posted Jan 12, 2025 - 13:30 EST
Update
Kustomer has identified the root cause of the ongoing issue retrieving search results in Prod2 instances.

We are focused on resolving this as quickly as possible and will provide updates in the next 30 minutes.

Please reach out to Kustomer Support at support@kustomer.com for any further questions or updates.
Posted Jan 12, 2025 - 13:00 EST
Identified
Kustomer is aware of an event affecting Searches that may cause issues retrieving search results.

Our team has identified the issue and is working to resolve it as soon as possible. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com for any further questions or updates.
Posted Jan 12, 2025 - 12:39 EST
Investigating
Kustomer is aware of an event affecting Searches that may cause issues retrieving search results.

Our team is currently working to identify the cause of this issue so that we can implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com for any further questions or updates.
Posted Jan 12, 2025 - 12:29 EST
This incident affected: Prod2 (EU) (Search).