On Monday, June 6th, WhatsApp drafts were significantly delayed in sending. As a result, some messages were not delivered before their attached media items expired.
Our WhatsApp service was overloaded by a sudden surge in WhatsApp messages. The service was scaling up, but it could not keep pace with the demand, resulting in elevated latency and timeouts. The timeouts triggered retries within our service, which added to the message load. In some cases a request timed out even though the draft had been created successfully, so the retry sent the same message again and produced a duplicate. As a result, both the Drafts service and the WhatsApp service on prod1 saw considerable spikes in memory and CPU usage.
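The duplicate-on-retry behavior described above can be illustrated with a minimal sketch. It assumes draft creation can be keyed by a caller-supplied idempotency key, which the incident notes do not confirm; DraftStore, create_draft, and send_with_retries are hypothetical names, not the real Drafts service API.

```python
import random
import time


class DraftStore:
    """Hypothetical in-memory stand-in for draft creation, deduplicated by key."""

    def __init__(self):
        self._drafts = {}

    def create_draft(self, idempotency_key, body):
        # A repeated key returns the existing draft instead of creating a new one.
        if idempotency_key not in self._drafts:
            self._drafts[idempotency_key] = {"id": len(self._drafts) + 1, "body": body}
        return self._drafts[idempotency_key]

    def count(self):
        return len(self._drafts)


def send_with_retries(send, idempotency_key, body,
                      max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `send` on TimeoutError, reusing the same idempotency key each time."""
    for attempt in range(max_attempts):
        try:
            return send(idempotency_key, body)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter, so retries from many
            # callers do not arrive in lockstep and add to the overload.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Because every attempt carries the same key, a retry after a timeout whose write actually succeeded returns the existing draft rather than creating a second one.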
In addition, WhatsApp returned errors about media items in some of the messages. Because of the increased latency, the media item in some messages had expired before the message could be sent. These failures caused additional retries and exacerbated the issue.
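One way to keep expired media from feeding the retry loop is to check the expiration before retrying. The sketch below assumes the sender knows each media item's expiration timestamp and can tell a timeout apart from a media error; neither is stated in the incident notes, and the failure labels and the 30-second margin are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def media_still_valid(expires_at, now=None, margin=timedelta(seconds=30)):
    """Return True if the media item will still be valid `margin` from now."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now + margin < expires_at


def should_retry(failure_kind, media_expires_at, now=None):
    """Decide whether a failed send is worth retrying.

    failure_kind is a hypothetical label: "timeout" or "media_expired".
    """
    if failure_kind == "media_expired":
        # Resending the same expired media cannot succeed.
        return False
    # For a timeout, retry only while the attached media is still usable.
    return media_still_valid(media_expires_at, now=now)
```

Messages that fail this check would need a different path, such as re-uploading the media or surfacing the failure, rather than another automatic retry.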
Jun 6, 2025
1:33 PM EST Incident created.
1:41 PM EST Began investigating recent releases in WhatsApp and other related services.
1:53 PM EST Discovered spikes in WhatsApp service; not related to a code change.
4:03 PM EST Deployed scaling changes to WhatsApp service; spikes settled down.
4:07 PM EST Created a change to reduce the rate limit in Drafts service for WhatsApp.
7:59 PM EST Deployed rate limit change; traffic returned to healthy levels.
Duplicate Drafts Investigation - Understand why duplicate WhatsApp drafts occurred during the incident.
Scaling Enhancements - Increased scaling for WhatsApp service to better handle message bursts.
Adjusted Rate Limit - Decreased WhatsApp rate limit from 400/minute to 300/minute in Drafts service.
Media Expiration - Investigate the expiration time on media items and determine whether it can be extended.
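The adjusted rate limit can be pictured as a per-minute cap on sends. The following is a minimal sliding-window sketch of a 300/minute limit; how the Drafts service actually enforces its limit is not described here, and SlidingWindowLimiter is a hypothetical name.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` sends within any `window_seconds` interval."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._sent = deque()  # timestamps of sends still inside the window

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            self._sent.append(now)
            return True
        return False


# 300 per minute matches the new WhatsApp limit listed in the action items.
whatsapp_limiter = SlidingWindowLimiter(limit=300)
```

A caller would check try_acquire() before each send and leave the draft queued when it returns False, so a burst is spread out instead of being pushed straight to the WhatsApp service.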