On July 27th, 2023 at 11:32am ET, several components of the Kustomer platform became unavailable for organizations hosted in our Prod1 (US) instance. An incorrectly applied automation script indirectly removed records from a single table in the database cluster that holds customer data. The Kustomer engineering team was immediately notified and began work on restoring normal operation. At 2:40pm, the customers database was restored from a snapshot but was still operating with degraded performance. The platform began operating at full performance at 6:31pm ET, and the team shifted to re-enabling all integrations and restoring data that was unavailable due to the outage. All integrations were enabled by 9:02pm, restoring full functionality and ending 9 hours and 18 minutes of system impact. By 9:35am on July 28th, all customer records created during the outage were restored, and by 10:10pm on July 28th all backend data integrations from other systems and automations that had been unable to run during the outage were re-run to restore data. Over the weekend, the Kustomer team continued to monitor the health of the platform and identified and resolved several smaller data issues impacting a small subset of customers created during the original outage. These were fully resolved by 12:30pm on July 31st. Updates to customer records made on July 27th between 10:20am and 11:40am may have been impacted. Data from client-side integrations during the incident, such as Amazon Connect, could not be fully restored.
Our team performed a routine zero-downtime database migration to expand the capacity of our customers database, which completed on July 21st. During the cleanup process initiated on July 27th to remove the older database table, one step in the process was not completed. As a result, a subsequent step to clean up the old database table caused deletion transactions to be replicated to the new cluster. This rendered timelines and customer records inaccessible until the data was restored. The database backup restoration was delayed by a series of challenges, including issues with our database vendor’s restoration processes.
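The failure mode described above, where a destructive step runs even though an earlier step was never completed, can be illustrated with a minimal sketch. This is not Kustomer's tooling: the `CleanupRunbook` class and the step names are hypothetical, and the sketch assumes only that each cleanup step can declare which earlier steps must be confirmed before it is allowed to run.

```python
from dataclasses import dataclass, field


class CleanupAborted(Exception):
    """Raised when a step is attempted before its prerequisites are confirmed."""


@dataclass
class CleanupRunbook:
    # Names of steps that have been confirmed complete (hypothetical names).
    completed: set = field(default_factory=set)

    def mark_done(self, step: str) -> None:
        """Record that a step was completed and verified."""
        self.completed.add(step)

    def run(self, step: str, requires: tuple = ()) -> str:
        """Refuse to run `step` unless every prerequisite is already recorded."""
        missing = [name for name in requires if name not in self.completed]
        if missing:
            raise CleanupAborted(f"cannot run {step!r}: missing {missing}")
        # The destructive action itself would go here; this sketch only records it.
        self.completed.add(step)
        return step


runbook = CleanupRunbook()
try:
    # Hypothetical prerequisite: replication to the new cluster must be stopped first.
    runbook.run("delete_old_table", requires=("stop_replication_to_new_cluster",))
except CleanupAborted as exc:
    print(exc)
```

With a guard of this kind, the deletion step fails fast instead of proceeding while its changes can still be replicated.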
07/27 11:29am ET - Customer records became inaccessible, resulting in error messages in the Kustomer platform.
07/27 11:32am - The issue is reported to the Kustomer engineering team and they begin investigating.
07/27 12:08pm - The problem is identified and the team begins working on initiating a database restore. The initial restore begins 14 minutes later but stalls.
07/27 12:30pm - Kustomer engineers initiate discussions with our database vendor to diagnose the problems with the restore operation.
07/27 2:41pm - The restored data becomes partially available, but the team encounters additional vendor-related challenges during the restore, which result in further delays.
07/27 6:31pm - The full database restore completes and the platform begins operating normally, with the exception of 404 errors when referencing customers created during the outage and prior to the restore.
07/27 9:02pm - Kustomer engineers validate that the platform is operating normally, processing automations and incoming data. At this point, with the incident resolved, the team begins to focus on monitoring to ensure the system continues to operate properly and starts working through data repair.
07/28 9:30am - Customer records that were created during the outage are recreated in the system, and Kustomer engineers continue data repair efforts.
07/28 12:00pm - The Kustomer platform experiences high latency and elevated error rates for a 10-minute period due to high load from data restoration efforts.
07/28 ~5:00pm - Searches experience a period of high latency and occasional errors due to an unrelated incident. Kustomer will be publishing a separate post-mortem for this event.
07/28 10:10pm - All data records and automations are fully restored.
07/29 10:14pm - Kustomer engineers finalize repairs to duplicate customer records created as part of the initial cleanup process.