On May 1, 2020, beginning at 4:06 PM ET, Kustomer chat became partially unavailable in both Kustomer production instances. The outage lasted approximately 35 minutes, until a rollback of the offending code was triggered at 4:40 PM ET.
During the incident window, customers were unable to create new chat sessions: attempting to start a new chat produced an error in the Kustomer chat SDK. Both the US and EU data centers were affected. Only new chat sessions were impacted; customers could still send and receive messages in existing chat sessions, i.e., those created before the incident began at 4:06 PM ET.
At 4:04 PM ET, a code change to our chat API was released to our production environments.
Kustomer was alerted to the errors at 4:39 PM ET by reports that customers were unable to begin a new chat session. These reports were immediately escalated to the engineering team.
At 4:40 PM ET, the Kustomer engineering team rolled back the offending code in our production chat services. Once this rollback was complete, full functionality was restored to Kustomer chat.
[IN PROGRESS] Expand alerting escalation policies to more readily surface system-critical errors in the Kustomer chat API.
[IN PROGRESS] Increase automated end-to-end testing coverage to catch similar types of errors in lower environments before production release.