Platform latency in prod1 was caused by failures to connect to an external caching service. Misconfigured connection logic in a newly deployed service caused it to exceed the maximum number of simultaneous connections the caching service allows, leaving no connections available for the rest of the platform.
Root Cause
A new service introduced connection logic for our caching service that was initialized incorrectly. Instead of initializing one connection and sharing it across all events processed, the service opened a new connection every time an event was handled. When the service was scaled up and out to process full production traffic, enough connections were opened to reach the maximum allowed by the caching service, so other parts of the platform could no longer connect.
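The report does not name the caching service or client library; the sketch below illustrates the bug and the fix, assuming a Redis-style cache accessed through the ioredis client (the CACHE_URL variable and handler names are placeholders, not the actual code):

```typescript
import Redis from "ioredis";

const cacheUrl = process.env.CACHE_URL ?? "redis://localhost:6379";

// BUG: every event opens a brand-new connection. Under full traffic,
// open connections accumulate until the caching service's connection
// limit is reached and other clients are refused.
async function handleEventBroken(eventId: string): Promise<string | null> {
  const redis = new Redis(cacheUrl); // fresh TCP connection per event
  return redis.get(`event:${eventId}`); // connection is never reused or closed
}

// FIX: initialize the connection once at startup and share it across
// every event the process handles.
const sharedCache = new Redis(cacheUrl);

async function handleEventFixed(eventId: string): Promise<string | null> {
  return sharedCache.get(`event:${eventId}`);
}
```

With a shared client, the connection count scales with the number of service instances rather than with event throughput, which is why the leak only became visible once the service was scaled out.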
Timeline
06/16 4:40 pm - The incorrect cache connection logic was deployed to production. Connections increased, but not enough to cause latency.
06/17 12:40 pm - The service was scaled up and out in production to handle full traffic.
06/17 12:52 pm - The Kustomer engineering team received multiple internal alerts about failures to connect to the caching service in prod1.
06/17 12:54 pm - Platform latency and issues loading customer timelines were reported internally for prod1.
06/17 12:55 pm - Connections to the cache were reset as a first-aid measure (one way to do this is sketched after the timeline). Some services started to recover but then degraded again.
06/17 01:08 pm - The root cause was identified as the misconfigured cache connection logic and a rollback was initiated.
06/17 01:25 pm - Latency was confirmed to have returned to normal.
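The report does not say how connections were reset; one plausible approach, assuming the cache speaks the Redis protocol (CLIENT ID, CLIENT LIST, and CLIENT KILL are standard Redis commands, but this script and its CACHE_URL variable are assumptions, not the actual remediation):

```typescript
import Redis from "ioredis";

const admin = new Redis(process.env.CACHE_URL ?? "redis://localhost:6379");

// Close every other client's connection so healthy services can
// reconnect within the caching service's connection limit.
async function resetCacheConnections(): Promise<number> {
  const selfId = (await admin.call("CLIENT", "ID")) as number;
  const list = (await admin.call("CLIENT", "LIST")) as string;
  let killed = 0;
  for (const line of list.split("\n")) {
    const id = /(?:^|\s)id=(\d+)/.exec(line)?.[1];
    if (id && Number(id) !== selfId) {
      // CLIENT KILL ID closes that client's connection.
      await admin.call("CLIENT", "KILL", "ID", id);
      killed++;
    }
  }
  return killed;
}

resetCacheConnections().then((n) => {
  console.log(`closed ${n} cache connections`);
  admin.disconnect();
});
```

As the timeline shows, a reset like this only buys time: the leaking service reopens connections as it processes events, so services recover briefly and then decline again until the faulty deploy is rolled back.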
Lessons/Improvements