Kustomer is experiencing latency in conversations loading

Incident Report for Kustomer

Postmortem

On Wednesday, January 12th, 2022, the main database cluster in our US Prod 1 POD experienced a sharply elevated write load that used up virtually all of its write capacity.

The release of a new feature introduced a database query that performed much worse than expected under production load. Although the query was properly indexed, the enormous number of conversations caused it to take an unusually long time to finish. The issue was not reproducible in our testing environments.

Impact

From 6:15 PM ET through 7:19 PM ET, customers experienced higher latency when loading conversations and timelines, and in some cases they did not load at all, as a majority of requests were timing out.

The affected database cluster began to reject write requests because the large number of long-running operations used up most of the available write tickets. In this situation, there is no threat to data that already exists inside the database.
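
To illustrate how long-running operations can starve other writes, here is a minimal sketch of a fixed pool of write tickets. It is a toy model only and does not describe the internals of our database: the `TicketPool` class, the `simulate` function, and the ticket counts are hypothetical names and values chosen for illustration.

```python
import threading


class TicketPool:
    """Toy model of a fixed pool of write tickets (hypothetical, for illustration)."""

    def __init__(self, tickets: int) -> None:
        self._sem = threading.BoundedSemaphore(tickets)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when no ticket is free.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()


def simulate(tickets: int, long_running: int, new_writes: int) -> dict:
    """Count accepted and rejected writes while long-running operations hold tickets."""
    pool = TicketPool(tickets)

    # Long-running operations take a ticket and keep it for the whole simulation.
    held = sum(1 for _ in range(long_running) if pool.try_acquire())

    accepted = 0
    rejected = 0
    for _ in range(new_writes):
        if pool.try_acquire():
            accepted += 1
            pool.release()  # a fast write returns its ticket right away
        else:
            rejected += 1  # no ticket free: the write is turned away

    # Hand back the tickets held by the long-running operations.
    for _ in range(held):
        pool.release()

    return {"held": held, "accepted": accepted, "rejected": rejected}


if __name__ == "__main__":
    print(simulate(tickets=4, long_running=1, new_writes=10))
    print(simulate(tickets=4, long_running=4, new_writes=10))
```

In this model, rejected writes are never applied, so nothing already stored is modified. Existing data stays intact while new writes fail.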

What We Learned

What Went Well

Our alerting worked as expected. We were notified right away that there was a problem so we could quickly react to it.
We immediately began rolling back the code release containing the poorly performing query and initiated a failover to a secondary database node.

Areas for Improvement

Despite an immediate response, it took us a bit longer than expected to completely resolve the incident. We are planning to make improvements to our internal tooling to make this process quicker.

Posted Jan 13, 2022 - 18:03 EST

Resolved

The rollback is now completed and systems have normalized. You may need to refresh the page if you continue to see loading issues!

If you continue to see issues with latency and loading, please reach out to support@kustomer.com

Posted Jan 12, 2022 - 19:22 EST

Monitoring

The rollback is complete and you should be seeing conversations load normally now.

Posted Jan 12, 2022 - 18:57 EST

Identified

Kustomer is experiencing latency. We have identified the issue and are rolling back the changes that caused it. We expect a fix within 10 minutes.

Posted Jan 12, 2022 - 18:45 EST

This incident affected: Prod1 (US) (Channel - Chat, Channel - Email, Channel - Facebook, Channel - Instagram, Channel - SMS, Channel - Twitter, Channel - WhatsApp).