[Conversational Assistants] Conversational Assistants Transferring Early on Prod 1
Incident Report for Kustomer
Postmortem

Post Mortem: Early transfers in KIQ Customer Assist (formerly Conversational Assistant) in prod1

Summary

On February 8, 2024, at 1:20pm EST, Kustomer’s engineering team saw a large spike in 502 errors from the KIQ Customer Assist service in prod1. These errors were subsequently tied to customer reports of conversational assistants transferring early. After the team increased the resources available to this service and, as a precaution, rolled back an earlier deployment, the system fully recovered and errors subsided completely by 3:20pm EST.

Root Cause

The spike in errors occurred because the service needed more resources than it had available: it had already been operating at maximum capacity. During this incident, inefficiencies were identified in certain high-traffic endpoints, which used more resources than normal.

This issue was present only in the prod1 environment; prod2 was unaffected.

The engineering team has also determined that the earlier deployment, which was rolled back as a precaution, was unrelated to the issue that customers faced during this incident.

Timeline

2/8/24 10:58am EST - Deployment reaches all production environments.

2/8/24 12:29pm EST - Initial spike in 502 Bad Gateway errors in KIQ Customer Assist service for prod1 detected in monitoring systems. Prod2 is unaffected.

2/8/24 12:38pm EST - Received customer reports about early transfers in various conversational assistants.

2/8/24 12:50pm EST - Issue escalated to on-call engineering team.

2/8/24 1:48pm EST - Precautionary rollback of deployment completed.

2/8/24 2:09pm EST - Scaled up resources for KIQ Customer Assist service.

2/8/24 3:20pm EST - System fully recovers with no remaining errors.

Lessons/Improvements

  • [DONE] Provisioned more resources for the affected service
  • [IN PROGRESS] Reduce the page size for the /v1/assistants endpoint to 50
  • [PLANNED] Overhaul the service to improve efficiency, scalability, and performance
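
The page-size reduction listed above could be enforced with a server-side cap along the following lines. This is a minimal sketch and not Kustomer's implementation: only the limit of 50 and the /v1/assistants path come from this report, while the function names, the zero-based page parameter, and the in-memory list standing in for the data store are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Limit taken from the improvement item above; everything else is hypothetical.
MAX_PAGE_SIZE = 50


def clamp_page_size(requested: Optional[int], default: int = MAX_PAGE_SIZE) -> int:
    """Return a page size between 1 and MAX_PAGE_SIZE.

    A missing value falls back to `default`; oversized or non-positive
    values are pulled into the allowed range instead of being rejected.
    """
    size = default if requested is None else requested
    return max(1, min(size, MAX_PAGE_SIZE))


def paginate(items: Sequence[T], page: int, page_size: Optional[int] = None) -> List[T]:
    """Return one zero-based page of `items`, never longer than MAX_PAGE_SIZE."""
    size = clamp_page_size(page_size)
    start = max(page, 0) * size
    return list(items[start:start + size])
```

Clamping on the server means a client that asks for a very large page still receives at most 50 records per request, which bounds the work a single call to a high-traffic endpoint can trigger.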
Posted Mar 05, 2024 - 15:29 EST

Resolved
Kustomer has resolved an event affecting conversational assistants on PROD 1 causing assistants to transfer early at higher rates. After careful monitoring, our team has found that all affected areas are operating at normal levels. Please reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Feb 08, 2024 - 16:45 EST
Monitoring
Kustomer has implemented remediation measures to address error rates causing conversational assistants to transfer early. Our team will continue to monitor the situation to ensure the issue is fully resolved. Please expect further updates within 1 hour and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Feb 08, 2024 - 15:44 EST
Update
Kustomer's team is monitoring error rates and investigating issues with conversational assistants transferring early within the platform. Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect further updates within 1 hour and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Feb 08, 2024 - 15:12 EST
Update
Kustomer's team is still investigating the issues that may cause conversational assistants to transfer early within the platform. Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect further updates within 1 hour and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Feb 08, 2024 - 14:10 EST
Investigating
Kustomer is aware of an event affecting Conversational Assistants in PROD 1 that may cause conversational assistants to transfer to agents before fully executing. Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect further updates within 30 minutes and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Feb 08, 2024 - 13:34 EST
This incident affected: Prod1 (US) (Channel - Chat).