Platform latency in prod1 was caused by failures to connect to an external caching service. Misconfigured connection logic in a newly deployed service caused it to exceed the maximum number of simultaneous connections the caching service allows, leaving no connections available for the rest of the platform.
Root Cause
A new service introduced connection logic for our caching service that was initialized incorrectly. Instead of initializing one connection and sharing it across all events processed, the service opened a new connection every time an event was handled. When the service was scaled up and out to process full production traffic, enough connections were opened to reach the maximum allowed by the caching service, so other parts of the platform could no longer connect.
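The report does not name the caching service or client library; the sketch below illustrates the bug and the fix, assuming a Redis-style cache accessed through the ioredis client (the CACHE_URL variable and handler names are placeholders, not the actual code):

```typescript
import Redis from "ioredis";

const cacheUrl = process.env.CACHE_URL ?? "redis://localhost:6379";

// BUG: every event opens a brand-new connection. Under full traffic,
// open connections accumulate until the caching service's connection
// limit is reached and other clients are refused.
async function handleEventBroken(eventId: string): Promise<string | null> {
  const redis = new Redis(cacheUrl); // fresh TCP connection per event
  return redis.get(`event:${eventId}`); // connection is never reused or closed
}

// FIX: initialize the connection once at startup and share it across
// every event the process handles.
const sharedCache = new Redis(cacheUrl);

async function handleEventFixed(eventId: string): Promise<string | null> {
  return sharedCache.get(`event:${eventId}`);
}
```

With a shared client, the connection count scales with the number of service instances rather than with event throughput, which is why the leak only became visible once the service was scaled out.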
Timeline
06/16 4:40 pm - The incorrect cache connection logic was deployed to production. Connections increased, but not enough to cause latency.
06/17 12:40 pm - The service was scaled up and out in production to handle full traffic.
06/17 12:52 pm - The Kustomer engineering team received multiple internal alerts about failures to connect to the caching service in prod1.
06/17 12:54 pm - Platform latency and issues loading customer timelines were reported internally for prod1.
06/17 12:55 pm - Connections to the cache were reset as a first-aid measure (one way to do this is sketched after the timeline). Some services started to recover but then degraded again.
06/17 01:08 pm - The root cause was identified as the misconfigured cache connection logic and a rollback was initiated.
06/17 01:25 pm - Latency was confirmed to have returned to normal.
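The report does not say how connections were reset; one plausible approach, assuming the cache speaks the Redis protocol (CLIENT ID, CLIENT LIST, and CLIENT KILL are standard Redis commands, but this script and its CACHE_URL variable are assumptions, not the actual remediation):

```typescript
import Redis from "ioredis";

const admin = new Redis(process.env.CACHE_URL ?? "redis://localhost:6379");

// Close every other client's connection so healthy services can
// reconnect within the caching service's connection limit.
async function resetCacheConnections(): Promise<number> {
  const selfId = (await admin.call("CLIENT", "ID")) as number;
  const list = (await admin.call("CLIENT", "LIST")) as string;
  let killed = 0;
  for (const line of list.split("\n")) {
    const id = /(?:^|\s)id=(\d+)/.exec(line)?.[1];
    if (id && Number(id) !== selfId) {
      // CLIENT KILL ID closes that client's connection.
      await admin.call("CLIENT", "KILL", "ID", id);
      killed++;
    }
  }
  return killed;
}

resetCacheConnections().then((n) => {
  console.log(`closed ${n} cache connections`);
  admin.disconnect();
});
```

As the timeline shows, a reset like this only buys time: the leaking service reopens connections as it processes events, so services recover briefly and then decline again until the faulty deploy is rolled back.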
Lessons/Improvements