[NATIVE WHATSAPP] WhatsApp messages delayed sending [PROD1]

Incident Report for Kustomer

Postmortem

Summary

On Friday, June 6th, WhatsApp drafts experienced a significant delay in sending. As a result, some messages were not delivered before their attached media items expired.

Root Cause

Our WhatsApp service was overloaded by a significant surge in WhatsApp messages. The service was scaling, but it could not keep pace with the sudden demand, resulting in elevated latency and timeouts. These timeouts triggered retries within our service, which further increased the message load. In some cases, a request timed out even though the draft had been created successfully, so the same message was retried, which produced duplicate messages. Consequently, both the Drafts service and the WhatsApp service on prod1 experienced considerable spikes in memory and CPU usage.
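The duplicate path described above (a create call times out for the caller but succeeds on the server, and the retry creates a second draft) is commonly guarded against with an idempotency key. The following is a minimal sketch of that general technique, not a description of Kustomer's implementation; `DraftStore`, `create_draft`, `idempotency_key`, and the identifiers used to build the key are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DraftStore:
    """Hypothetical in-memory stand-in for a drafts service."""
    drafts: dict = field(default_factory=dict)

    def create_draft(self, idempotency_key: str, body: str) -> str:
        # If this key was already seen, return the existing draft
        # instead of creating a duplicate.
        if idempotency_key in self.drafts:
            return self.drafts[idempotency_key]["id"]
        draft_id = f"draft-{len(self.drafts) + 1}"
        self.drafts[idempotency_key] = {"id": draft_id, "body": body}
        return draft_id


def idempotency_key(conversation_id: str, client_message_id: str) -> str:
    # Derive a stable key from identifiers that do not change across retries.
    raw = f"{conversation_id}:{client_message_id}".encode()
    return hashlib.sha256(raw).hexdigest()


store = DraftStore()
key = idempotency_key("conv-1", "msg-42")
first = store.create_draft(key, "hello")
# Simulated retry after a caller-side timeout: same key, same draft.
retry = store.create_draft(key, "hello")
```

With a key that stays constant across retries, a retried request maps back to the draft created by the first attempt, so a timeout on the caller's side does not result in a second message.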

In addition, WhatsApp returned errors about media items in some of the messages. Because of the increased latency, the media item in some messages had expired before the message could be sent. These failures caused additional retries and exacerbated the issue.
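Retries that fire immediately add load at the moment a service is slowest, and retrying a message whose media has already expired cannot succeed. One common mitigation is exponential backoff with jitter, combined with treating expired media as a non-retryable condition. The sketch below illustrates that idea under stated assumptions: `backoff_delay`, `should_retry`, the delay parameters, and the attempt limit are hypothetical and are not taken from the incident.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def should_retry(media_expires_at: Optional[datetime], now: datetime,
                 attempt: int, max_attempts: int = 5) -> bool:
    """Stop retrying when attempts are exhausted or attached media has expired."""
    if attempt >= max_attempts:
        return False
    if media_expires_at is not None and now >= media_expires_at:
        # Expired media will fail again; retrying only adds load.
        return False
    return True


now = datetime(2025, 6, 6, 18, 0, tzinfo=timezone.utc)
expired = now - timedelta(minutes=1)
fresh = now + timedelta(minutes=10)
```

Here `should_retry(fresh, now, 0)` permits another attempt, while `should_retry(expired, now, 0)` does not, so messages with expired media drop out of the retry loop instead of adding to the load.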

Timeline

Jun 6, 2025

1:33 PM EDT Incident created.

1:41 PM EDT Began investigating recent releases in WhatsApp and other related services.

1:53 PM EDT Discovered spikes in WhatsApp service that were not related to a code change.

4:03 PM EDT Deployed scaling changes to WhatsApp service; spikes settled down.

4:07 PM EDT Created a change to reduce the rate limit in Drafts service for WhatsApp.

7:59 PM EDT Deployed rate limit change; traffic returned to healthy levels.

Lessons/Improvements

  • Duplicate Drafts Investigation - Understand why duplicate WhatsApp drafts occurred during the incident.

    • Status: Done
  • Scaling Enhancements - Increased scaling for WhatsApp service to better handle message bursts.

    • Status: Done
  • Adjusted Rate Limit - Decreased WhatsApp rate limit from 400/minute to 300/minute in Drafts service.

    • Status: Done
  • Media Expiration - Investigate the expiration time on media items and determine whether it can be extended.

    • Status: In Progress
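
The rate limit adjustment listed above can be illustrated with a token bucket, a common way to cap throughput at a fixed rate per minute. This is a minimal sketch only: the 300/minute figure comes from the list above, while the `TokenBucket` class, its burst capacity, and the choice of a token bucket at all are assumptions rather than a description of the Drafts service.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket allowing up to `rate_per_minute` sends per minute."""
    rate_per_minute: float
    capacity: float
    tokens: float = 0.0
    last: float = 0.0  # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        refill = elapsed * self.rate_per_minute / 60.0
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# 300/minute is the figure from the list above; the burst capacity is assumed.
bucket = TokenBucket(rate_per_minute=300, capacity=300, tokens=300)
# Offer 400 sends at the same instant: only the bucket's capacity gets through.
allowed = sum(bucket.allow(now=0.0) for _ in range(400))
```

In this sketch a burst of 400 simultaneous sends lets 300 through and holds back the rest until tokens refill at 5 per second.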
Posted Jun 13, 2025 - 15:37 EDT

Resolved

Kustomer has resolved an event affecting the Native WhatsApp Channel in PROD1 that caused delays in sending outbound messages.

After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support via chat or email if you have additional questions or concerns.
Posted Jun 06, 2025 - 17:01 EDT

Monitoring

Kustomer has implemented an update to address an event affecting Native WhatsApp that may have caused delays in the delivery of outbound messages.

Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Jun 06, 2025 - 16:10 EDT

Update

Kustomer has identified an event affecting WhatsApp (PROD1) that may cause delayed sending.

Our team continues to work on implementing a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Jun 06, 2025 - 16:01 EDT

Update

Kustomer has identified an event affecting WhatsApp (PROD1) that may cause delayed sending.

Our team is continuing to work on implementing a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Jun 06, 2025 - 15:32 EDT

Identified

Kustomer has identified an event affecting WhatsApp (PROD1) that may cause delayed sending.

Our team is currently working to implement a resolution. Please expect further updates within the next 30 minutes, and reach out to Kustomer support at support@kustomer.com if you have additional questions or concerns.
Posted Jun 06, 2025 - 15:04 EDT

Update

Kustomer is aware of an event affecting Native WhatsApp that may cause delays in sending outbound messages.

Our team is still working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com if you have any further questions.
Posted Jun 06, 2025 - 14:34 EDT

Update

Kustomer is aware of an event affecting Native WhatsApp that may cause delays in sending outbound messages.

Our team is continuing to work to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com if you have any further questions.
Posted Jun 06, 2025 - 14:05 EDT

Update

Kustomer is aware of an event affecting Native WhatsApp that may cause delays in sending outbound messages.

Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com if you have any further questions.
Posted Jun 06, 2025 - 13:42 EDT

Investigating

Kustomer is aware of an event affecting Native WhatsApp that may cause delays in sending outbound messages.

Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support at support@kustomer.com if you have any further questions.
Posted Jun 06, 2025 - 13:41 EDT
This incident affected: Third Party (Channel - WhatsApp).