Reported Third Party Event - ElasticCloud
Incident Report for Kustomer
Postmortem

Summary

On Friday, June 9th, 2023, Elastic Cloud clusters in our US Prod1 POD were non-responsive. Events originating from this region were not properly indexed in ElasticSearch, leading to degraded Search and Reporting functionality in Kustomer. This affected a small subset of customers from approximately 7:00 pm to 8:40 pm EDT.

Root Cause

Elastic Cloud suffered an incident resulting in connectivity loss to clusters in their us-east-1 region, which is where Kustomer’s US Prod1 POD resides. Elastic Cloud has provided a Root Cause Analysis, which indicates the following:

  • Proxies were deployed to us-east-1 with an invalid configuration, causing all proxies to be non-functional
  • The deployment process did not detect the failure while it was in progress, so the invalid configuration was eventually deployed to all proxies
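
To illustrate the kind of safeguard the second finding points at, here is a minimal sketch of a staged rollout that checks health after each batch and halts on the first failure. It is not Elastic Cloud's actual deployment process; the function `staged_rollout` and its `deploy` and `is_healthy` callbacks are hypothetical names chosen for this example.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    hosts: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Deploy to hosts in batches and stop at the first unhealthy batch.

    Hypothetical helper for illustration only. Returns the hosts that were
    deployed and verified healthy. Raises RuntimeError when a batch fails
    its health check, leaving all remaining hosts untouched.
    """
    pending = list(hosts)
    verified: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for host in batch:
            deploy(host)
        # Check the batch before moving on, so a bad configuration
        # cannot reach every host.
        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            raise RuntimeError(
                f"rollout halted, unhealthy after deploy: {failed}"
            )
        verified.extend(batch)
    return verified
```

With a batch size of one and a configuration that breaks every host, only the first host is affected before the rollout stops, rather than the whole fleet.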

Timeline

06/09 6:59 pm - Elastic Cloud clusters in the us-east-1 region become unresponsive, leading to failed requests to index data in ElasticSearch

06/09 7:02 pm - Engineers are alerted of Elastic Cloud cluster connectivity issues

06/09 7:05 pm - Incident Engineer begins internal incident process

06/09 7:11 pm - Incident Engineer escalates issue

06/09 7:24 pm - Elastic Cloud updates their StatusPage to indicate an ongoing investigation into connectivity issues with us-east-1 clusters

06/09 7:42 pm - Elastic Cloud confirms proxy incident in us-east-1 via StatusPage, and that they are working towards a solution

06/09 8:20 pm - Elastic Cloud cluster connectivity is restored, and Kustomer resumes indexing events in Elastic Cloud clusters

06/09 8:42 pm - Search and Reporting functionality in Kustomer is fully restored

Lessons/Improvements

While Kustomer does have multi-region support and region-specific DR strategies for our primary cloud-hosted databases and search clusters, there are additional opportunities to improve the time to recover in these situations. The engineering team is continually improving our system availability and is actively working on projects that will further improve our performance and uptime. Some specific action items include:

  • Reviewing internal processes surrounding ElasticSearch Disaster Recovery
  • Researching additional mitigation strategies for regional ElasticSearch Disaster Recovery
Posted Jun 16, 2023 - 17:00 EDT

Resolved
Kustomer has received an update from our third-party vendor, ElasticCloud, which has rolled out a fix for the issue affecting searches and reporting in the platform. Our team is seeing the platform recover, though larger searches will take longer to reindex. Duplicating a search should allow you to see correct results immediately. Please reach out to support if you have any further questions.
Posted Jun 09, 2023 - 20:42 EDT
Update
Kustomer has received updates from our third-party vendor, ElasticCloud, which has identified the issue impacting its service and is working toward a resolution. Please expect further updates within 30 minutes.
Link to ElasticCloud Status Page: https://status.elastic.co/
Posted Jun 09, 2023 - 20:13 EDT
Monitoring
Kustomer is aware of an event reported by one of our third-party vendors affecting POD1. The event is causing searches in Kustomer to return no results and is also affecting results in reporting. Our team will continue to monitor the status with our vendor. Please expect further updates within 30 minutes, and reach out to support if you have additional questions or concerns.
Link to ElasticCloud Status Page: https://status.elastic.co/
Posted Jun 09, 2023 - 19:37 EDT
This incident affected: Prod1 (US) (Analytics, Search).