On Wednesday, January 12th, 2022, the main database cluster in our US Prod 1 POD experienced a sharply elevated write load that consumed virtually all of its write capacity.
The release of a new feature introduced a database query that performed much worse than expected under production load. Although the query was properly indexed, the sheer number of conversations in production caused it to take an unusually long time to finish. The issue was not reproducible in our testing environments.
From 6:15 PM ET through 7:19 AM ET, customers experienced higher latency when loading conversations and timelines (at times they did not load at all), as a majority of requests were timing out.
The affected database cluster began to reject write requests because the large number of long-running operations was holding most of the available write tickets. This condition poses no threat to data that already exists inside the database.
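The ticket exhaustion described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the names TicketPool and simulate, and the ticket counts, are hypothetical and do not reflect the database's actual implementation or configuration. It shows how a fixed pool of write tickets, once fully held by long-running operations, causes every new write to be turned away without touching data already stored.

```python
import threading


class TicketPool:
    """Toy model of a fixed pool of write tickets (hypothetical, for illustration only)."""

    def __init__(self, tickets):
        self._sem = threading.Semaphore(tickets)

    def try_acquire(self):
        # Non-blocking: returns False immediately when no ticket is free.
        return self._sem.acquire(blocking=False)

    def release(self):
        self._sem.release()


def simulate(total_tickets, long_running_ops, incoming_writes):
    """Return (tickets held by slow ops, writes accepted, writes rejected)."""
    pool = TicketPool(total_tickets)

    # Long-running operations grab tickets and do not give them back
    # for the duration of the simulation.
    held = sum(1 for _ in range(long_running_ops) if pool.try_acquire())

    accepted = rejected = 0
    for _ in range(incoming_writes):
        if pool.try_acquire():
            accepted += 1
            pool.release()  # a short write finishes and frees its ticket
        else:
            rejected += 1   # no ticket available: the write is turned away
    return held, accepted, rejected


if __name__ == "__main__":
    # Assumed numbers: 4 tickets, all held by slow queries, 10 new writes.
    print(simulate(total_tickets=4, long_running_ops=4, incoming_writes=10))
```

In this model, a single free ticket is enough for short writes to keep flowing, which is why the failure only appears once long-running operations hold the entire pool.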
What Went Well
Areas for Improvement