On 13 September 2021, between 3:10 PM and 4:05 PM ET, our Gmail integration saw a burst of traffic from one client. This surge caused the Gmail Service API containers to crash and disrupted two consecutive half-hourly account token refresh polls. Because Gmail auth tokens have a TTL of one hour, a number of accounts were temporarily suspended between 4:00 PM and 4:30 PM ET, until the next poll ran at 4:30 PM ET. After the successful automated re-auth at 4:30 PM, dropped messages were re-driven.
Between 3:10 PM and 4:05 PM ET, the Gmail-message-receiver received an influx of events from one client. The downstream Gmail service was overwhelmed and crashed, returning 502s when the receiver called it. This triggered a vicious cycle that put even more pressure on the Gmail service and made the problem worse. By 4:05 PM ET, inbound traffic had returned to normal levels and the service had stabilized. However, the scheduled 3:30 PM and 4:00 PM ET half-hourly account refresh polls were interrupted. Because Gmail auth tokens have a TTL of one hour, accounts last refreshed at 3:00 PM saw their credentials expire at 4:00 PM and were disabled. At 4:30 PM ET, the automated job successfully refreshed all the previously disabled accounts, and we re-drove the messages that had failed to send in the interim.
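The timing above can be checked with a little arithmetic. The following is a minimal sketch, not our production code: the one-hour TTL, the half-hour poll interval, and the 3:00 PM last refresh come from the description above, while the function name `suspension_window` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

TOKEN_TTL = timedelta(hours=1)          # Gmail auth token TTL, per the timeline above
POLL_INTERVAL = timedelta(minutes=30)   # refresh poll cadence, per the timeline above


def suspension_window(last_good_refresh, missed_polls):
    """Return (token_expiry, next_successful_poll) for one account.

    Hypothetical helper: assumes polls run exactly every POLL_INTERVAL and
    that `missed_polls` consecutive polls fail after `last_good_refresh`.
    """
    expiry = last_good_refresh + TOKEN_TTL
    next_success = last_good_refresh + POLL_INTERVAL * (missed_polls + 1)
    return expiry, next_success


# Accounts last refreshed at 3:00 PM ET, with the 3:30 PM and 4:00 PM polls missed.
last_good = datetime(2021, 9, 13, 15, 0)
expiry, next_success = suspension_window(last_good, missed_polls=2)

# Time during which the token is expired but not yet refreshed.
gap = max(next_success - expiry, timedelta(0))
print(expiry.time(), next_success.time(), gap)
```

Under these assumptions, a single missed poll leaves no gap, because the next poll lands exactly at expiry. Two consecutive missed polls leave a 30-minute window, which matches the suspension between 4:00 PM and 4:30 PM ET.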
While most messages were re-driven, messages for accounts that were suspended or hit rate-limit errors all had to be resent manually. This involved a few hundred messages across a handful of accounts.
Incoming messages for these accounts were automatically re-fetched in the next five-minute or one-hour poll. The net effect was that some outbound and inbound messages were delayed, and some outbound messages had to be resent manually. Affected clients were notified directly.
Our Gmail integration system was not able to handle unexpected surges in traffic. The additional volume of inbound messages placed excess load on the Gmail service, causing it to crash repeatedly. This not only prevented other accounts from sending and receiving mail but also interfered with the regularly scheduled account refresh, which led to account suspensions.
We should look into better alerting for Gmail account refresh failures.
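One possible shape for such an alert is a periodic check that flags accounts whose last successful refresh is older than one poll interval plus a grace period. This is only a sketch under stated assumptions: the function `accounts_at_risk`, the five-minute grace period, and the in-memory mapping of account IDs to refresh times are all hypothetical and do not describe our existing system.

```python
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(minutes=30)  # refresh poll cadence from the incident timeline


def accounts_at_risk(last_refresh_by_account, now, grace=timedelta(minutes=5)):
    """Return sorted account IDs whose last successful refresh is overdue.

    Hypothetical check: an account is flagged once its last successful
    refresh is older than POLL_INTERVAL + grace, meaning at least one
    poll has been missed for that account.
    """
    threshold = POLL_INTERVAL + grace
    return sorted(
        account
        for account, refreshed_at in last_refresh_by_account.items()
        if now - refreshed_at > threshold
    )


# Example with made-up account IDs: "acct-a" missed the 3:30 PM poll, "acct-b" did not.
now = datetime(2021, 9, 13, 15, 40)
last_refresh = {
    "acct-a": datetime(2021, 9, 13, 15, 0),
    "acct-b": datetime(2021, 9, 13, 15, 30),
}
overdue = accounts_at_risk(last_refresh, now)
print(overdue)
```

With a one-hour TTL, a check like this would have flagged the affected accounts shortly after 3:35 PM ET, roughly 25 minutes before their tokens expired at 4:00 PM ET.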