Timelines and API not returning data in Prod 1
Incident Report for Kustomer
Postmortem

Summary

On July 27th, 2023 at 11:32am ET, several components of the Kustomer platform became unavailable for organizations hosted in our Prod1 (US) instance.  This was caused by an incorrectly applied automation script that indirectly removed records from a single table in the database cluster that holds customer data.  The Kustomer engineering team was immediately notified and began work on restoring normal operation.  At 2:40pm, the customers database was restored from a snapshot but was still operating with degraded performance.  The platform began operating at full performance at 6:31pm ET, and the team shifted to re-enabling all integrations and restoring data that was unavailable due to the outage.  All integrations were enabled by 9:02pm, ending 9 hours and 18 minutes of system impact and restoring full functionality.  By 9:35am on July 28th, all customer records created during the outage were restored, and by 10:10pm on July 28th all backend data integrations from other systems and automations that had been unable to run during the outage were re-run to restore data.  Over the weekend, the Kustomer team continued to monitor the health of the platform and identified and resolved several smaller data issues impacting a small subset of customers created during the original outage.  These were fully resolved by 12:30pm on July 31st.  Updates to customer records on July 27th between 10:20am and 11:40am may have been impacted.  Data from client-side integrations during the incident, such as Amazon Connect, could not be fully restored.

Root Cause

Our team performed a routine database migration to expand the capacity of our customers database with zero downtime, which completed on July 21st.  During the cleanup process initiated on July 27th to remove the older database table, one step was not completed.  As a result, a subsequent step to clean up the old database table caused deletion transactions to be replicated to the new cluster.  This rendered timelines and customer records inaccessible until the data was restored.  The database backup restoration was delayed by a series of challenges, including issues with our database vendor’s restoration processes.
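
The failure mode described above can be illustrated with a minimal sketch. This is not Kustomer's actual tooling: the `Cluster` class, the `replicating_to` link, and the `cleanup_old_table` function are hypothetical names, and the sketch assumes, purely for illustration, that the uncompleted step was detaching replication between the old table and the new cluster. It shows a guard that refuses to delete while replication is still attached.

```python
class Cluster:
    """Toy stand-in for a database table; hypothetical, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.replicating_to = None  # target that mirrors our writes, if any

    def delete_all(self):
        # Deletes are replicated just like any other write.
        self.rows.clear()
        if self.replicating_to is not None:
            self.replicating_to.rows.clear()


def cleanup_old_table(old, force=False):
    """Remove the old table's rows, refusing if replication is still attached."""
    if old.replicating_to is not None and not force:
        raise RuntimeError(
            f"{old.name} still replicates to {old.replicating_to.name}; "
            "detach replication before cleanup"
        )
    old.delete_all()


old = Cluster("old")
new = Cluster("new")
old.rows = {1: "customer-a", 2: "customer-b"}
new.rows = dict(old.rows)
old.replicating_to = new  # assumed state when the prerequisite step is skipped

try:
    cleanup_old_table(old)
    guard_tripped = False
except RuntimeError:
    guard_tripped = True

# Complete the (assumed) missing step, then clean up safely.
old.replicating_to = None
cleanup_old_table(old)
```

In this toy model the first cleanup attempt is rejected, and the data in the new cluster survives once the prerequisite step has been completed before the delete runs.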

Timeline

07/27 11:29am ET - Customer records became inaccessible, resulting in error messages in the Kustomer platform.

07/27 11:32am - The issue is reported to the Kustomer engineering team and they begin investigating.

07/27 12:08pm - The problem is identified and the team begins working on initiating a database restore.  The initial restore begins 14 minutes later but stalls.

07/27 12:30pm - Kustomer engineers initiate discussions with our database vendor to diagnose the problems with the restore operation.

07/27 2:41pm - The restored data becomes partially available, but the team encounters additional vendor-related challenges during the restore, which result in further delays.

07/27 6:31pm - Database full restore completes and the platform begins operating normally, with the exception of 404 errors when referencing customers created during the outage and prior to the restore.

07/27 9:02pm - Kustomer engineers validate that the platform is operating normally, processing automations and incoming data.  At this point, with the incident resolved, the team begins to focus on monitoring to ensure the system continues to operate properly and start working through data repair.

07/28 9:30am - Customer records that were created during the outage are recreated in the system, and Kustomer engineers continue data repair efforts.

07/28 12:00pm - The Kustomer platform experiences high latency and elevated error rates for a 10-minute period due to high load from data restoration efforts.

07/28 ~5:00pm - Searches experience a period of high latency and occasional errors due to an unrelated incident.  Kustomer will be publishing a separate post-mortem for this event.

07/28 10:10pm - All data records and automations fully restored.

07/29 10:14pm - Kustomer engineers finalize repairs to duplicate customer records created as part of the initial cleanup process.

Lessons/Improvements

  • Database restore functionality and disaster recovery process creation - The database restore took significantly longer than necessary due to a number of issues related to vendor specific configurations and limitations. We are working closely with our database vendor to investigate and implement alternative database restore functionality and disaster recovery processes with a goal of significantly minimizing time to restore.
  • Implement technical controls as additional layers of protection in our data migration process - We are working to automate more of our database migration processes to encode safety checks and minimize the possibility of human error.
  • Close monitoring gaps - It took a few minutes to be notified of issues with the platform. We are addressing some gaps in our monitoring that will allow us to assess impact to systems faster in the case of future incidents.
  • Strengthen Documentation - Although our processes were well documented, there is room to improve documentation further. We are updating our documentation and adding training material for the engineering team on best practices for restoring data after an incident without interrupting service.
  • Resiliency and Data Recovery - Client-side integrations do not have the same level of guarantees as our standard backend channel & application integrations. We are looking at ways to improve our Amazon Connect Integration to allow for greater resiliency and data recovery in the case of service interruptions.
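
As one illustration of the monitoring item above, a check on record counts could flag a sudden loss of rows within a single sampling interval. The following is a minimal sketch under stated assumptions, not a description of Kustomer's monitoring: the function name `detect_sudden_drop`, the 5% threshold, and the sample counts are all hypothetical.

```python
def detect_sudden_drop(samples, max_drop_fraction=0.05):
    """Return indices of samples whose count fell by more than the
    allowed fraction relative to the previous sample."""
    alerts = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        # Skip empty baselines to avoid dividing by zero.
        if prev > 0 and (prev - cur) / prev > max_drop_fraction:
            alerts.append(i)
    return alerts


# Invented row counts sampled at a fixed interval; the fourth sample
# simulates a large, unexpected deletion.
counts = [1000, 1002, 1005, 40, 39]
alerts = detect_sudden_drop(counts)
```

A relative threshold is used here so the same check works for tables of different sizes; a real deployment would need to tune it against normal deletion patterns.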
Posted Aug 02, 2023 - 16:24 EDT

Resolved
While we are still investigating some minor issues, all access to the Kustomer platform has been restored. For any questions related to this incident, please reach out to Kustomer Support via Chat or at support@kustomer.com
Posted Jul 28, 2023 - 09:47 EDT
Update
Our engineers are continuing to ensure efficient cleanup efforts and are working closely to ensure precision in our execution.
Posted Jul 28, 2023 - 08:38 EDT
Update
Data integrity checks and clean-up efforts are still continuing in the background. Please expect further updates and reach out to Support at support@kustomer.com if you have additional questions or concerns.
Posted Jul 28, 2023 - 05:25 EDT
Update
Issues related to timelines, searches, and other areas in the platform are now resolved. Data integrity checks and clean up efforts will continue in the background.
Posted Jul 28, 2023 - 02:45 EDT
Update
The Kustomer team is continuing to resolve errors related to timelines, searches, and other areas of the platform, resulting from an extended event that impacted the platform. Affected data is being identified, grouped, checked against our database, restored and re-driven or reimported where needed.
Posted Jul 27, 2023 - 23:18 EDT
Update
An issue affecting Kustomer's ability to receive and display data has been resolved, and is being monitored. Errors may still be encountered as our team works to reimport data affected during the incident. Please look for the next update within an hour.
Posted Jul 27, 2023 - 22:02 EDT
Monitoring
All channels and systems have been restored to full functionality. As we monitor this resolution, errors may still appear in various areas, including timelines, searches, and workflows. Our team is working to clean up these residual issues as we reimport data affected during the incident. Please look for additional updates within the next hour.
Posted Jul 27, 2023 - 21:02 EDT
Update
As recovery of channels is underway and eventually completes, we do expect some errors to appear in timelines, searches, and workflows. We expect those errors to resolve when the clean-up of data during the period of outage fully completes. That clean up will be prioritized once the backlog of data has been processed and the system maintains stability.
Posted Jul 27, 2023 - 20:03 EDT
Update
Timelines and searches are loading as expected. Our team is starting to turn channels back on. Please expect further updates in 1 hour.
Posted Jul 27, 2023 - 19:06 EDT
Update
Timelines and searches are loading more consistently. However, they are still in recovery. As systems scale we expect improvements. Please expect further updates within 1 hour.
Posted Jul 27, 2023 - 18:07 EDT
Update
Timelines and searches are starting to load more consistently. Please expect further updates within 1 hour.
Posted Jul 27, 2023 - 17:04 EDT
Update
Our team is currently scaling services to meet request traffic within the web app to resolve continued issues connecting to customer timelines and loading searches. Please expect further updates within 1 hour.
Posted Jul 27, 2023 - 16:00 EDT
Update
Our team is currently working on restoring connectivity to timelines and re-indexing searches with the latest updates. Upon completion, we will restore channel connections. Please expect further updates within 1 hour.
Posted Jul 27, 2023 - 15:01 EDT
Update
Engineering is currently taking measures to restore connectivity in prod1. Channels have been temporarily turned off to prioritize the recovery of the affected system. There is currently no ETA for recovery, and we will provide one as soon as possible. Please expect further updates within 1 hour.
Posted Jul 27, 2023 - 14:04 EDT
Update
Engineering is currently taking measures to restore connectivity in prod1. Channels have been temporarily turned off to prioritize the recovery of the affected system. There is currently no ETA for recovery, but we will provide one as soon as possible. Please expect further updates within 1 hour.
Posted Jul 27, 2023 - 13:50 EDT
Update
The engineering team has identified the issue affecting connectivity to the platform and is working towards recovery. Please expect further updates within 30 minutes and reach out to support@kustomer.com if you have any questions.
Posted Jul 27, 2023 - 13:11 EDT
Update
Our team is continuing to work to resolve the issue with timelines not loading and the API not returning data. Please expect further updates within 30 minutes and reach out to support@kustomer.com if you have any questions.
Posted Jul 27, 2023 - 12:40 EDT
Identified
Kustomer is aware of an event affecting Timelines and API that may cause errors when loading customer timelines within the platform. Our team is currently working to implement a resolution. Please expect further updates within 30 minutes and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Jul 27, 2023 - 12:08 EDT
Update
We are continuing to investigate this issue.
Posted Jul 27, 2023 - 11:53 EDT
Investigating
Kustomer is investigating an event affecting Timelines and the API that may cause errors when loading the app in Prod1. Please expect further updates within 30 minutes and reach out to support at support@kustomer.com if you have additional questions or concerns.
Posted Jul 27, 2023 - 11:44 EDT
This incident affected: Prod1 (US) (API, Channel - Chat, Channel - Email, Channel - Facebook, Channel - Instagram, Channel - SMS, Channel - Twitter, Channel - WhatsApp, Web Client).