Search is not updating with new information
Incident Report for Kustomer
Postmortem

Summary

On November 25, 2020, at approximately 9:00 AM EST, the Kustomer team started to receive multiple reports about searches not updating. We quickly discovered problems involving Kinesis and CloudWatch, stemming from a larger AWS outage.

After determining the immediate impact, we deployed patches to keep processing data. We were able to restore some functionality by 9:40 AM EST. The platform started seeing recovery at around 10:00 PM EST and fully recovered by November 26, 2020, at 12:30 PM EST.

Impact / Alerts

Services impacted:

  • Search
  • Reporting
  • Data Stream
  • Snowflake
  • Scheduled cleaners
  • Audit logs
  • Service auto-scaling
  • Workflows

Root Cause

The root cause was an outage in the AWS Kinesis and CloudWatch services.

See AWS post mortem: https://aws.amazon.com/message/11201

The incident started on November 25, 2020, at approximately 9:00 AM EST. After assessing the impact on the Kustomer platform, we kept the system operational in a degraded state by deploying various patches that substituted for the functionality lost while Kinesis was down. Additionally, because CloudWatch provides the triggers that help the platform scale up based on load, we manually added more resources to handle any unexpected spike in traffic.

During the outage, we encountered an additional issue with one of the automated patches we deployed, which was responsible for updating search indexes. At 5:06 PM EST, we attempted to import a large batch of data that overwhelmed our database, causing high replication lag and preventing database queries from finishing and returning data. We identified the source of the problem, aborted the operation at 6:05 PM EST, and immediately restarted the search reindex process.
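One common way to avoid this kind of overload is to split a large import into small batches and pause whenever replicas fall behind. The following is a minimal sketch of that idea, not the actual patch: `write_batch` and `replication_lag_seconds` are hypothetical callables standing in for the database write and a lag metric, and the batch size and lag threshold are illustrative assumptions.

```python
import time
from typing import Callable, Iterator, Sequence


def chunked(rows: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def throttled_import(
    rows: Sequence[dict],
    write_batch: Callable[[Sequence[dict]], None],   # hypothetical: writes one batch to the primary
    replication_lag_seconds: Callable[[], float],    # hypothetical: returns current replica lag
    batch_size: int = 500,                           # illustrative value
    max_lag_seconds: float = 5.0,                    # illustrative value
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Import rows in small batches, pausing while replicas are behind.

    Returns the number of batches written.
    """
    batches_written = 0
    for batch in chunked(rows, batch_size):
        # Wait for replicas to catch up before adding more write load.
        while replication_lag_seconds() > max_lag_seconds:
            sleep(backoff_seconds)
        write_batch(batch)
        batches_written += 1
    return batches_written
```

Checking lag before every batch keeps the import from outrunning replication, at the cost of a slower overall import.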

At approximately 10:00 PM EST, we saw that Kinesis started to recover. We kept the patches in place until the outage was confirmed to be resolved by AWS on November 26, 2020 at 4:00 AM EST. We spent the next several hours processing backlogged data in our queues and the system was fully operational by 12:30 PM EST.

Resolution

In order to keep processing data, we deployed 3 automated scripts running on a schedule. These scripts were responsible for updating search and reporting features, keeping data in sync, and triggering scheduled events to keep updating conversations.
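As an illustration only, a scheduled re-index of this kind can be sketched as a job that re-sends recently updated records to the search index on a fixed interval. The 30-second interval and 5-minute lookback mirror the cadence mentioned in the status updates; `fetch_updated_since` and `index_document` are hypothetical stand-ins for the database query and the search upsert, and the actual scripts may have worked differently.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Iterable, Optional

Record = Dict[str, object]


def reindex_recent(
    now: datetime,
    fetch_updated_since: Callable[[datetime], Iterable[Record]],  # hypothetical DB query
    index_document: Callable[[Record], None],                     # hypothetical search upsert
    lookback: timedelta = timedelta(minutes=5),
) -> int:
    """Re-index every record updated within `lookback` of `now`.

    Windows overlap between runs, so `index_document` must be idempotent.
    Returns the number of records sent to the index.
    """
    since = now - lookback
    count = 0
    for record in fetch_updated_since(since):
        index_document(record)
        count += 1
    return count


def run_every(
    job: Callable[[datetime], int],
    interval_seconds: float = 30.0,
    max_runs: Optional[int] = None,   # None means run until interrupted
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `job` with the current UTC time on a fixed interval.

    Returns the total number of records processed across all runs.
    """
    total = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        total += job(datetime.now(timezone.utc))
        runs += 1
        sleep(interval_seconds)
    return total
```

Using a lookback much longer than the interval means a failed or slow run is covered by the next one, which is why the upsert has to tolerate seeing the same record repeatedly.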

We proactively increased the instance counts of our services by hand to compensate for the loss of auto-scaling.
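When load-based scaling triggers are unavailable, a fixed count has to be chosen up front. A rough sizing rule, sketched below under assumed numbers, is to provision for the expected peak plus headroom. The function name, the per-replica capacity, and the 1.5x headroom factor are all hypothetical and not taken from the incident.

```python
import math


def static_replica_count(
    expected_peak_rps: float,
    per_replica_rps: float,
    headroom: float = 1.5,   # assumed safety factor for unexpected spikes
    minimum: int = 2,        # assumed floor for redundancy
) -> int:
    """Return a fixed replica count sized for peak load plus headroom."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    needed = math.ceil(expected_peak_rps * headroom / per_replica_rps)
    return max(minimum, needed)
```

For example, with an assumed peak of 900 requests per second and 100 requests per second per replica, the default headroom gives 14 replicas.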

Lessons/Improvements

  • Given that this outage only affected the AWS us-east-1 region, where most of our services are located, the engineering team will review and pursue a more resilient multi-region strategy to fall back on.
  • The engineering team is actively working on the database issue encountered during the data import by reducing the overall size of the database to improve performance.
Posted Dec 04, 2020 - 16:05 EST

Resolved
The issues with the AWS service have been resolved.

Additionally, we have engineers working over the holiday to ensure full operating capacity of the platform and surrounding services. We recommend reviewing the Amazon status page for updates to their services as well. https://status.aws.amazon.com/

All services should be operational but if you are experiencing issues, please reach out to our Support team here: https://kustomer.kustomer.help/contact/contact-support-Bk17VI8aU.
Posted Nov 26, 2020 - 12:36 EST
Update
As of 7:00 AM EST (4:00 AM PST), Amazon has restored all traffic to Kinesis data streams. Kustomer's systems should feel operational. We are still monitoring all of our systems as we start to receive traffic from AWS.

We have engineers working over the holiday to ensure full recovery and operating capacity of the platform and surrounding services.
Search was the main component impacted, but users may also see degraded performance in other services during this recovery period, such as data streams (i.e., Kinesis Data Streams).

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 26, 2020 - 07:43 EST
Update
As of 7:00 AM EST (4:00 AM PST), Amazon has restored all traffic to Kinesis data streams. Kustomer's systems should still feel operational. We are still monitoring all of our systems as we start to receive traffic from AWS.

Amazon CloudWatch metrics remain delayed until Kinesis throttles are fully lifted. We can confirm that data is flowing normally into Kustomer's systems again, but at a delayed and staggered rate, as reflected in the AWS update. We added functionality to continuously update the search indexes and drive messages while AWS lifts the throttles.

We have engineers working over the holiday to ensure full recovery and operating capacity of the platform and surrounding services.
Search was the main component impacted, but users may also see degraded performance in other services during this recovery period, such as data streams (i.e., Kinesis Data Streams).

Search page indexing was temporarily halted from 5:58pm to 6:40pm EST due to an issue in a database process. The problem was resolved and indexing resumed immediately after. Users may still experience degraded performance in search pages while we’re working towards full recovery.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 26, 2020 - 07:35 EST
Update
Amazon has provided an update that the errors within Kinesis have been fully mitigated, but that it is only incrementally restoring traffic so as not to overwhelm its systems. Amazon CloudWatch metrics remain delayed until Kinesis throttles are fully lifted. We can confirm that data is flowing normally into Kustomer's systems again, but at a delayed and staggered rate, as reflected in the AWS update. We added functionality to continuously update the search indexes and drive messages while AWS lifts the throttles.

We have engineers working over the holiday to ensure full recovery and operating capacity of the platform and surrounding services.
Search is the main component impacted, but users may also see degraded performance in other services, such as data streams (i.e., Kinesis Data Streams).

Search page indexing was temporarily halted from 5:58pm to 6:40pm EST due to an issue in a database process. The problem was resolved and indexing resumed immediately after. Users may still experience degraded performance in search pages while we’re working towards full recovery.

We are monitoring the situation and will continue to share information here as it becomes available.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 25, 2020 - 23:13 EST
Update
Kustomer is observing steady signs of improvement of error rates for Kinesis from the current AWS outage and we are back to normal levels in our Kinesis event processing.

We will continue to work with Amazon Web Services to return our services to full operation. Search is the main component impacted, but users may also see degraded performance in other services, such as data streams (i.e., Kinesis Data Streams).

Search page indexing was temporarily halted from 5:58pm to 6:40pm EST due to an issue in a database process. The problem was resolved and indexing resumed immediately after. Users may still experience degraded performance in search pages while we’re working towards full recovery.

We are monitoring the situation and will continue to share information here as it becomes available.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 25, 2020 - 20:27 EST
Update
Kustomer is now re-driving all Gmail emails that might not have been created in Kustomer while the disconnection occurred due to this AWS outage. We will provide an update once this has been completed.

Kustomer has implemented internal updates to account for AWS outage issues we are seeing. Functionality of the Kustomer platform should feel operational but you may notice some degraded performance.

We have been informed that Amazon has identified the root cause of the Kinesis Data Streams API outage and is working towards a resolution. We are seeing an improvement in error rates for Kinesis, but per Amazon a full recovery for this service may still take up to a few hours.

Kustomer Search should feel operational, but you may notice some degraded performance and up to a 5-minute delay during this AWS outage. We will continue to work with Amazon Web Services to return our services to full operation. Search is the main component impacted, but users may also see degraded performance in other services, such as data streams (i.e., Kinesis Data Streams).

We are monitoring the situation and will continue to share information here as it becomes available.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 25, 2020 - 15:36 EST
Update
Kustomer has implemented internal updates to account for AWS outage issues we are seeing. Functionality of the Kustomer platform should feel operational but you may notice some degraded performance.

We have been informed that Amazon has identified the root cause of the Kinesis Data Streams API outage and is working towards a resolution. We will continue working with Amazon Web Services to return our services to full operation. Search is the main component impacted, but users may also see degraded performance in other services, such as data streams (i.e., Kinesis Data Streams).

Kustomer is now re-indexing the search pages every 30 seconds, with each run covering the last 5 minutes of search activity. Kustomer Search should feel operational, but you may notice some degraded performance during this AWS outage.

Kustomer is working to re-drive Gmail emails that might not have been created in Kustomer while the disconnection occurred due to this AWS outage.

We are monitoring the situation and will continue to share information here as it becomes available.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 25, 2020 - 14:24 EST
Monitoring
The Kustomer team has identified the issue to be an outage with an AWS service. We will continue working with Amazon Web Services to return our services to full operation. Search is the main component impacted, but users may also see degraded performance in other services, such as data streams (i.e., Kinesis Data Streams).

Kustomer was able to re-index the search pages, and Kustomer Search should feel operational, but you may notice some degraded performance during this AWS outage.

Gmail email authentication may have been affected. Kustomer was able to re-authenticate Gmail addresses that were disconnected. Kustomer is working to re-drive Gmail emails that might not have been created in Kustomer while the disconnection occurred.

We are monitoring the situation and will continue to share information here as it becomes available. This is only affecting US servers.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 25, 2020 - 12:27 EST
Update
The Kustomer team has identified the issue to be an outage with an AWS service. We will continue working with Amazon Web Services to return our services to full operation. Search is the main component impacted, but users may also see degraded performance in other services, such as data streams (i.e., Kinesis Data Streams).

Gmail email authentication may have been affected, and affected addresses will need to be re-authenticated. Kustomer was able to re-authenticate Gmail addresses that were disconnected. Kustomer is working to re-drive Gmail emails that might not have been created while the disconnection occurred.

We are monitoring the situation and will continue to share information here as it becomes available. This is only affecting US servers.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 25, 2020 - 11:48 EST
Update
The Kustomer team has identified the issue to be an outage with an AWS service. We will continue working with Amazon Web Services to return our services to full operation. Search is the main component impacted, but users may also see degraded performance in other services, such as data streams (i.e., Kinesis Data Streams). Gmail email authentication may also have been affected, and affected addresses will need to be re-authenticated. We are working internally to address this issue.

We are monitoring the situation and will continue to share information here as it becomes available. This is only affecting US servers.

Please review the Amazon status page for updates to their services as well. https://status.aws.amazon.com/
Posted Nov 25, 2020 - 11:02 EST
Identified
The Kustomer team has identified the issue to be an outage with an AWS service. We are monitoring the situation. We will continue to share information here as it becomes available. This is only affecting US servers.

https://status.aws.amazon.com/
Posted Nov 25, 2020 - 09:37 EST
Investigating
Kustomer is currently experiencing issues with Search. We are working to resolve the issue as quickly as possible. During this time, you may experience degraded performance, with searches not updating.

Please reach out to our Support team with any additional questions. You can reach us by going to https://help.kustomer.com/ and clicking "Contact Support" at the top of the page.
Posted Nov 25, 2020 - 09:23 EST
This incident affected: Prod1 (US) (Analytics, Channel - Email, Search, Workflow).