An organization’s automated process ran a high volume of complex search queries that strained our search system, degrading search functionality for multiple Prod2 organizations for roughly 3 hours.
A high volume of complex queries over a short period put excessive strain on our search infrastructure, degrading the health of 2 nodes and affecting other organizations that rely on the same infrastructure.
Jan 12, 2025
6:00 AM EST - A single organization’s automated user ramps up activity to significantly higher levels than normal
11:38 AM EST - Engineers start receiving alerts that various charts/reporting endpoints are generating errors
12:16 PM EST - Multiple clients report broken search functionality; we declare an incident
12:22 PM EST - Engineers identify that 2 of our nodes are reaching maximum CPU utilization
12:33 PM EST - Engineers identify the organization, user, and query causing the issue (see the diagnostic sketch after the timeline)
1:15 PM EST - Engineers temporarily block the organization generating the problematic queries, restoring functionality for all other organizations (a sketch of this type of block follows the timeline)
3:15 PM EST - Engineers remove the block; full functionality is restored for all organizations
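For context on the 12:22 PM and 12:33 PM diagnostic steps, the sketch below shows one way to surface nodes pinned at high CPU and the long-running search tasks behind them. It is a minimal sketch, assuming an Elasticsearch-style cluster; the incident report does not name the underlying search technology, and the SEARCH_URL and thresholds are placeholders, not our actual tooling.

```python
# Minimal diagnostic sketch. Assumes an Elasticsearch-style cluster;
# SEARCH_URL is a placeholder, not our real endpoint.
import requests

SEARCH_URL = "http://localhost:9200"  # hypothetical cluster address

# 1. Per-node CPU utilization: flag nodes at or near 100% CPU.
nodes = requests.get(
    f"{SEARCH_URL}/_cat/nodes",
    params={"h": "name,cpu,heap.percent,load_1m", "format": "json"},
).json()
hot_nodes = [n for n in nodes if int(n["cpu"]) >= 90]
print("Nodes at high CPU:", [n["name"] for n in hot_nodes])

# 2. In-flight search tasks: long-running tasks point at the expensive
#    queries (and, via the task description, the originating requests).
tasks = requests.get(
    f"{SEARCH_URL}/_tasks",
    params={"actions": "*search*", "detailed": "true"},
).json()
for node in tasks.get("nodes", {}).values():
    for task in node.get("tasks", {}).values():
        runtime_s = task["running_time_in_nanos"] / 1e9
        if runtime_s > 30:  # arbitrary threshold for "suspiciously long"
            print(f"{runtime_s:8.1f}s  {task.get('description', '')[:120]}")
```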
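The 1:15 PM mitigation was a temporary block on the offending organization. The sketch below illustrates that kind of guard at the API layer only in broad strokes; the function name and organization ID are hypothetical and do not reflect our actual request pipeline.

```python
# Hypothetical request guard, not our actual pipeline: reject search
# requests from organizations under a temporary block.
BLOCKED_ORG_IDS = {"org_12345"}  # placeholder ID for the blocked organization

def allow_search_request(org_id: str) -> bool:
    """Return False for organizations under a temporary block."""
    return org_id not in BLOCKED_ORG_IDS

# Example: drop the request before it reaches the search cluster.
if not allow_search_request("org_12345"):
    print("429: search temporarily unavailable for this organization")
```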