On January 17, 2022, at approximately 12:37 PM EST, the Kustomer team identified that customers on the PROD 1 pod were not receiving up-to-date search results. Upon investigating, engineers discovered that a shard containing data for some orgs on PROD 1 had grown dangerously large, which caused the cluster to enter an unhealthy state. Engineers therefore needed to make adjustments so that our system could accommodate more data on the affected shard.
A shard containing data for some orgs on PROD 1 had grown dangerously large, which caused the search cluster to enter an unhealthy state. Once the problem was identified, the engineering team provisioned a new cluster and restored the affected orgs' data to it. A script was then run to recover data that had not been indexed during the incident.
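As an illustration of the kind of per-shard size check this root cause points to, here is a minimal sketch. It is not Kustomer's actual monitor. It assumes the input rows come from Elasticsearch's `GET _cat/shards?format=json&bytes=b` endpoint, and the function name `oversized_shards` and the 50 GiB threshold are hypothetical choices made for the example, not values from this report.

```python
# Hypothetical alert threshold (50 GiB); not a value taken from this incident.
MAX_SHARD_BYTES = 50 * 1024**3


def oversized_shards(rows, limit=MAX_SHARD_BYTES):
    """Return (index, shard, size_in_bytes) for primary shards above `limit`.

    `rows` is assumed to look like the JSON list returned by
    GET _cat/shards?format=json&bytes=b, where values are strings.
    """
    flagged = []
    for row in rows:
        store = row.get("store")
        # Skip replicas, and skip unassigned shards that report no store size.
        if row.get("prirep") != "p" or store is None:
            continue
        size = int(store)
        if size > limit:
            flagged.append((row["index"], int(row["shard"]), size))
    # Largest shards first, so the worst offender is reported at the top.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {"index": "org-a", "shard": "0", "prirep": "p", "store": str(60 * 1024**3)},
        {"index": "org-b", "shard": "1", "prirep": "p", "store": "1024"},
    ]
    for index, shard, size in oversized_shards(sample):
        print(f"{index} shard {shard}: {size} bytes exceeds the limit")
```

A check like this only flags size. Deciding what to do about a flagged shard (reindexing, splitting, or moving data to a new cluster, as was done here) remains a separate operational decision.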
01/17 12:37 PM EST - Multiple alarms signaled issues with the PROD 1 Elasticsearch cluster, and customer reports indicated that some customers on PROD 1 were not receiving up-to-date search results.
01/17 1:33 PM EST - Engineers identified the problematic shard on the cluster and began working on a solution to restore the cluster's health.
01/17 4:27 PM EST - A solution was deployed to production after being tested in lower environments. Search results began catching up to the present.
01/18 1:06 AM EST - All affected customers were receiving new data in search results.
01/19 1:23 AM EST - All affected orgs had search results caught up to reflect system changes that occurred during the incident.
The overall health of the search cluster is monitored carefully, but this incident exposed a weakness in a specific monitor, which led to a prolonged recovery time. Accordingly, the engineering team is responding by: