On Friday, June 9th, 2023, Elastic Cloud clusters in our US Prod1 POD were non-responsive. Events originating from this region were not properly indexed in Elasticsearch, leading to degraded Search and Reporting functionality in Kustomer. This affected a small subset of customers from approximately 7:00 pm EST to 8:40 pm EST.
Elastic Cloud suffered an incident resulting in connectivity loss to clusters in their us-east-1 region, which is where Kustomer’s US Prod1 POD resides. Elastic Cloud has provided a Root Cause Analysis, which indicates the following:
06/09 6:59 pm - Elastic Cloud clusters in the us-east-1 region become unresponsive, leading to failed requests to index data in Elasticsearch
06/09 7:02 pm - Engineers are alerted to Elastic Cloud cluster connectivity issues
06/09 7:05 pm - Incident Engineer begins internal incident process
06/09 7:11 pm - Incident Engineer escalates issue
06/09 7:24 pm - Elastic Cloud updates their StatusPage to indicate an ongoing investigation into connectivity issues with us-east-1 clusters
06/09 7:42 pm - Elastic Cloud confirms proxy incident in us-east-1 via StatusPage, and that they are working towards a solution
06/09 8:20 pm - Elastic Cloud cluster connectivity is restored, and Kustomer resumes indexing events in Elastic Cloud clusters
06/09 8:42 pm - Search and Reporting functionality in Kustomer is fully restored
While Kustomer has multi-region support and region-specific DR strategies for our primary cloud-hosted databases and search clusters, there are additional opportunities to reduce the time to recover in these situations. The engineering team is continually improving our system availability and actively working on projects that will further improve our performance and uptime. Some specific action items include: