On February 8, 2024 at 1:20pm EST, Kustomer’s engineering team saw a large spike in 502 errors from the KIQ Customer Assist service in prod1. These errors were subsequently tied to customer reports of early transfers occurring in conversational assistants. After the team adjusted the resources available to this service and performed a precautionary rollback of an earlier deployment, the system fully recovered and errors subsided completely by 3:20pm EST.
Root Cause
The spike in errors occurred because the service needed more resources than were available: it had already been operating at maximum capacity. During this incident, inefficiencies were identified in certain high-traffic endpoints, which used more resources than normal.
This issue was only present in the prod1 environment and prod2 was unaffected.
The engineering team has also determined that the earlier deployment, which was rolled back as a precaution, was unrelated to the issue that customers faced during this incident.
Timeline
2/8/24 10:58am EST - Deployment reaches all production environments.
2/8/24 12:29pm EST - Initial spike in 502 Bad Gateway errors in KIQ Customer Assist service for prod1 detected in monitoring systems. Prod2 is unaffected.
2/8/24 12:38pm EST - Received customer reports about early transfers in various conversational assistants.
2/8/24 12:50pm EST - Issue escalated to on-call engineering team.
2/8/24 1:48pm EST - Precautionary rollback of deployment completed.
2/8/24 2:09pm EST - Scaled up resources for KIQ Customer Assist service.
2/8/24 3:20pm EST - System fully recovers with no remaining errors.
Lessons/Improvements