Summary
On October 14th, 2021, from 8:30AM to 8:40AM EST, customers in the Prod2 Pod were unable to access the Kustomer platform due to a failed deployment. During this period, the platform was inaccessible and inbound messages were dropped.
Root Cause
A failed deployment to one of our core services, which handles all traffic originating outside our servers, left that service unavailable. Because all requests from the web platform originate outside our servers, the platform was inaccessible during this time.
Timeline
10/13 11:33AM - Deployment to the core service fails. A notification appears in Slack but is missed. Operations remain normal because the pre-existing healthy instances of the service continue to serve traffic.
10/13 - 10/14 - Overnight, the existing instances are scheduled for routine replacement. Prod2 Pod instances fail to be replaced because of the failed deployment.
10/14 8:30AM - The engineering team is alerted to Prod2 Pod core service errors.
10/14 8:30AM - 8:40AM - With no instances available to accept traffic, the platform is unusable during this time. Our engineers identify the problem and roll back the change, after which the platform is fully available and functional.
Lessons/Improvements