Issue with Gmail Authentication
Incident Report for Kustomer
Postmortem

Summary

On 13 September 2021 between 3:10 pm and 4:05 pm ET, our Gmail integration saw a burst of traffic from one client. This surge caused the Gmail Service API containers to crash and interfered with two consecutive half-hour account token refresh polls. Because Gmail auth tokens have a TTL of one hour, a number of accounts were temporarily suspended between 4:00 pm and 4:30 pm ET, until the next poll could run at 4:30 pm ET. After the successful automated re-auth at 4:30 pm ET, dropped messages were re-driven.

What happened

Between 3:10 pm and 4:05 pm ET, the Gmail-message-receiver received an influx of events from one client. The downstream Gmail service was overwhelmed and crashed, returning 502s when the receiver attempted to make calls to it. This triggered a vicious cycle that put even more pressure on the Gmail service and exacerbated the problem. By 4:05 pm ET inbound traffic had returned to normal levels and the service reached a stable state. However, the scheduled 3:30 pm and 4:00 pm ET half-hour account refresh polls were interrupted. Because Gmail auth tokens have a TTL of one hour, some accounts' last successful refresh was at 3:00 pm ET, meaning their credentials expired at 4:00 pm ET. At 4:30 pm ET the automated job successfully refreshed all the previously disabled accounts. We then re-drove the messages that had failed to send in the interim.
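The timing above can be checked with a short calculation. The following Python sketch is illustrative only: the one-hour TTL and half-hour poll interval come from this report, while the function name `suspension_duration` is hypothetical and not taken from the actual service.

```python
from datetime import datetime, timedelta

TOKEN_TTL = timedelta(hours=1)          # from the report: tokens expire after one hour
POLL_INTERVAL = timedelta(minutes=30)   # from the report: refresh poll runs every half hour


def suspension_duration(last_good_refresh: datetime, missed_polls: int) -> timedelta:
    """How long a token stays expired if `missed_polls` consecutive polls fail.

    Hypothetical helper for illustration; assumes the first poll after the
    missed ones succeeds.
    """
    expires_at = last_good_refresh + TOKEN_TTL
    next_successful_poll = last_good_refresh + POLL_INTERVAL * (missed_polls + 1)
    # If the next successful poll lands before expiry, there is no gap.
    return max(timedelta(0), next_successful_poll - expires_at)


# An account last refreshed at 3:00 pm that misses the 3:30 pm and 4:00 pm polls
# expires at 4:00 pm and is refreshed at 4:30 pm.
last_refresh = datetime(2021, 9, 13, 15, 0)
print(suspension_duration(last_refresh, missed_polls=2))  # 0:30:00
print(suspension_duration(last_refresh, missed_polls=1))  # 0:00:00
```

Under these assumptions, a single missed poll leaves no gap, but a second consecutive miss produces the 30-minute suspension window described above.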

Impact

Most messages were re-driven automatically. For accounts that were suspended or hit rate-limit errors, however, all messages had to be resent manually. This affected a few hundred messages across a handful of accounts.

Incoming messages were automatically re-fetched for these accounts in the next five-minute or one-hour poll.  The net effect was that some outbound and inbound messages were delayed, and some outbound messages had to be re-sent manually.  Affected clients were notified directly.

Technical details

Our Gmail integration system was not able to handle unexpected surges of traffic. The additional volume of inbound messages placed excess load on the Gmail service and caused it to crash repeatedly. This not only prevented other accounts from sending and receiving mail but also interfered with the regularly scheduled account refresh, which led to account suspensions.

Lessons & Action Items

  • This incident, together with similar later ones, has prompted us to begin re-architecting the Gmail integration (60% complete; work is ongoing)
  • We should look into better alerting for Gmail account refresh failures

    • Add alerting for failed account refresh (Complete)
    • Schedule account refresh jobs more frequently (Complete)
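
To illustrate why a more frequent refresh schedule and refresh-failure alerting help, here is a minimal Python sketch. It is a sketch under stated assumptions, not the actual implementation: the function names, the 10-minute interval, and the 5-minute grace period are all hypothetical; only the one-hour TTL and the original 30-minute interval come from this report.

```python
from datetime import datetime, timedelta


def tolerated_missed_polls(ttl: timedelta, interval: timedelta) -> int:
    """Number of consecutive failed polls a token survives without expiring.

    Hypothetical helper: a token refreshed at time t expires at t + ttl, and
    after n missed polls the next attempt is at t + (n + 1) * interval.
    """
    if interval <= timedelta(0) or interval > ttl:
        raise ValueError("interval must be positive and no longer than ttl")
    return ttl // interval - 1


def refresh_overdue(last_refresh: datetime, now: datetime,
                    interval: timedelta,
                    grace: timedelta = timedelta(minutes=5)) -> bool:
    """True if an account has gone longer than one interval (plus an assumed
    grace period) without a successful refresh, i.e. an alert should fire."""
    return now - last_refresh > interval + grace


ttl = timedelta(hours=1)
print(tolerated_missed_polls(ttl, timedelta(minutes=30)))  # 1 (schedule at the time of the incident)
print(tolerated_missed_polls(ttl, timedelta(minutes=10)))  # 5 (hypothetical shorter interval)
```

With the half-hour schedule, two consecutive failures were enough to let tokens expire; a shorter interval leaves more room for retries, and a staleness check such as `refresh_overdue` could flag the problem after the first missed poll.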
Posted Jan 10, 2022 - 15:40 EST

Resolved
The issue with Gmail authentication has been resolved. This was caused by a spike in email volume that contributed to an over-utilization of resources in a very short period of time.

Gmail email processing should be back to normal now. Messages that did not send will be re-driven by Kustomer. Incoming messages should not have been impacted by this issue. We will be providing a detailed postmortem on the issue.

If you have additional questions, please reach out to our Support team here: https://kustomer.kustomer.help/contact/contact-support-Bk17VI8aU.
Posted Sep 13, 2021 - 19:28 EDT
Investigating
Kustomer is currently experiencing an issue with Gmail authentication. We are working to resolve this issue as quickly as possible. During this time, you may be unable to send or reply to emails within the Kustomer platform.

Please reach out to our Kustomer support team with any additional questions. You can reach us at https://help.kustomer.com/ and click on "Contact Support" at the top of the page.
Posted Sep 13, 2021 - 17:23 EDT
This incident affected: Prod1 (US) (Channel - Email) and Prod2 (EU) (Channel - Email).