RDS connectivity issue on a virginia-aurora shared RDS.
Incident Report for Pagely
Postmortem

After more discussion with the database team at Amazon Web Services, we have a better understanding of the failure that occurred. The root cause was determined to be a rare bug within the Aurora database engine. The bug causes the mysql process to be unable to start up properly when an ALTER statement is interrupted by a DB instance reboot. This is why we were unable to launch new instances or new DB clusters from point-in-time recovery targets after the ALTER was issued, which contributed to the extended downtime you experienced. Amazon was able to correct the problem for the affected system and they have confirmed that this bug will be fixed in an upcoming RDS update. Pagely will apply the patch to all of our Aurora Database Clusters as soon as it becomes available.

In the interim, our DevOps team is aware of this issue, and we are confident it will not recur. The conditions for triggering this bug are very specific, and we can account for them and avoid them rather easily. Our team has adopted new Standard Operating Procedures that take this condition into account when interacting with our database clusters.
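
To illustrate the kind of guard such a procedure can include, below is a minimal sketch of a pre-reboot check that refuses to restart a DB instance while an ALTER statement is still running. This is illustrative only, not our actual tooling: the endpoint, credentials, and identifiers are placeholders, and it assumes the Python pymysql and boto3 libraries.

```python
# Hypothetical pre-reboot guard (illustrative only): refuse to reboot a DB
# instance while any ALTER statement is still running on the writer.
# The endpoint, credentials, and instance identifier are placeholders.
import boto3
import pymysql


def reboot_if_no_ddl(writer_endpoint, user, password, instance_id):
    conn = pymysql.connect(host=writer_endpoint, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # Any in-flight ALTER shows up in the server's process list.
            cur.execute(
                "SELECT ID, INFO FROM information_schema.PROCESSLIST "
                "WHERE INFO LIKE 'ALTER%'"
            )
            in_flight = cur.fetchall()
    finally:
        conn.close()

    if in_flight:
        raise RuntimeError(
            f"Refusing to reboot {instance_id}: "
            f"{len(in_flight)} ALTER statement(s) still running"
        )

    # No DDL in flight; proceed with the reboot.
    boto3.client("rds").reboot_db_instance(DBInstanceIdentifier=instance_id)
```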

We appreciate your patience and understanding both during and following this event.

Posted Dec 09, 2020 - 02:54 UTC

Resolved
Summary

On Friday, December 4, 2020, one of our managed Amazon Aurora Database Clusters, vps-virginia-aurora-3, experienced an extended outage lasting approximately three and a half hours. A relatively small portion of the sites hosted in this region resides on this DB cluster, which meant there was plenty of spare capacity elsewhere at the time of the event. Affected sites experienced database connection errors for the duration of the event. The recovery point of your data when services were restored is approximately 15-70 minutes prior to the onset of the service interruption.

- Database services for the affected sites were unavailable between 9:15AM PST and 12:30PM PST.
- By 12:30PM PST, service availability was restored to a backup DB cluster with a 9:00AM PST point-in-time recovery target.
- By approximately 7:00PM PST, all restored sites were fully migrated to a brand new Aurora RDS DB Cluster and away from the problem system.

More Details

Our investigation is ongoing, and we are working closely with the team at Amazon to fully understand the nature of this issue. Although database issues have happened in the past, they are usually resolved within a few minutes, not hours; an event of this nature had not occurred for us before. We need more time to investigate the matter with Amazon before we can say definitively what the cause was. Rest assured, both Pagely and Amazon are committed to finding the root cause so that a similar event cannot occur in the future. We have already had productive discussions on mitigating this type of impact, and we continue to work on determining the root cause.

We can tell you that the behavior we observed from this Aurora cluster was not typical, and it also caught the attention of the database team at AWS, who, independent of Pagely's investigation, noticed the DB cluster was behaving erratically and contacted us to let us know they were applying an emergency fix. Typical actions such as adding a reader instance, performing a failover, or restarting a DB instance were not working for this cluster until AWS took steps to address the problem they were seeing.
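
For reference, the routine operations mentioned above normally look something like the sketch below (resource names are placeholders and boto3 is assumed); during this incident, these actions did not return the cluster to a healthy state until AWS intervened.

```python
# A minimal sketch of the "typical actions" described above, using boto3.
# Cluster and instance identifiers are placeholders, not real resources.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Add a reader instance to an existing Aurora cluster.
rds.create_db_instance(
    DBInstanceIdentifier="example-aurora-reader-1",
    DBClusterIdentifier="example-aurora-cluster",
    DBInstanceClass="db.r5.large",
    Engine="aurora-mysql",
)

# Fail over the cluster so a healthy reader is promoted to writer.
rds.failover_db_cluster(DBClusterIdentifier="example-aurora-cluster")

# Restart a single DB instance within the cluster.
rds.reboot_db_instance(DBInstanceIdentifier="example-aurora-writer-1")
```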

While the issue was ongoing with the original DB cluster, Pagely Engineers were also launching new DB clusters with varying point-in-time recovery targets. This is a proactive step we take when we believe the time a system could take to recover exceeds the time needed to launch a new DB cluster with slightly older data. Our goal during these moments is to get sites running again as quickly as possible and with the last known-good set of data.
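
As a rough illustration of what launching such a replacement cluster involves, the sketch below restores a cluster to a specific point in time and then adds a writer instance. The identifiers and timestamp are placeholders and boto3 is assumed; this is not our exact procedure.

```python
# Sketch: launch a replacement Aurora cluster from a point-in-time recovery
# target. Identifiers and the timestamp below are placeholders.
import datetime

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Restore the cluster's storage to a specific point in time.
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="example-aurora-recovery",
    SourceDBClusterIdentifier="example-aurora-cluster",
    RestoreToTime=datetime.datetime(
        2020, 12, 4, 16, 0, tzinfo=datetime.timezone.utc
    ),
)

# A restored Aurora cluster starts with no instances; add a writer before
# pointing applications at the new endpoint.
rds.create_db_instance(
    DBInstanceIdentifier="example-aurora-recovery-writer",
    DBClusterIdentifier="example-aurora-recovery",
    DBInstanceClass="db.r5.large",
    Engine="aurora-mysql",
)
```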

At a certain point in the incident, because things were taking so long, we told you we would restore from older (less than 24 hours old) SQL backups, but we were actually able to launch a DB cluster with a fairly recent point-in-time recovery target (15-70 minutes old). After this recovery was performed, and with the assistance of AWS, the originally affected system was also brought back to an operational state. That system is now under evaluation and is not powering any of your live sites. Migrations were performed to relocate all affected sites to a completely different, newly built Aurora DB cluster. With that said, if you think you are missing any data, please let us know and we can provide you with a separate SQL dump from the affected system for manual examination.

We want to assure you that every step was taken to restore service availability as soon as possible and with the most current possible data set. Some of these operations take time to complete, even when everything is working correctly; when things are not working correctly, recovery timelines can be impacted further. We have a playbook we follow in these situations, and we always try to think a few steps ahead. This typically results in little or no noticeable impact to your services, but then there are days like today. We always work to keep events of this severity a rarity, if not a faint memory, and we thank you for your understanding as we worked to get things back to normal.
Posted Dec 05, 2020 - 05:29 UTC
Monitoring
At this time your sites are operational again.

Pagely Engineers have restored your databases from a very recent point-in-time recovery target, and we did not need to resort to the older SQL backups. The first signs of problems occurred between 16:15 and 17:00 UTC, and your databases have been restored to a point-in-time backup taken at 16:00 UTC.

Further efforts are underway to migrate your databases to their final placement on one of our newest DB clusters.
We will follow up shortly with additional information, including a root cause analysis and issuance of service credits. Thank you.
Posted Dec 04, 2020 - 20:40 UTC
Update
Pagely Engineers continue to wait for the restoration efforts involving the most recent data sets to finish provisioning. In the meantime, we will begin restoring affected sites from your regular SQL backups. This data is slightly older, but no more than 24 hours old. We are doing this only because of the extended time it is taking to recover sites with more current data; we'd like to reinstate site availability as soon as possible. Our team will happily assist in providing more recent data once it is made available by the system.
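
For reference, reloading a single site's database from one of these SQL backups looks roughly like the sketch below. The host, credentials, database name, and file path are placeholders, and it assumes the standard mysql client is installed.

```python
# Sketch: reload one database from a SQL backup file by streaming it
# through the standard mysql client. All arguments are placeholders.
import subprocess


def restore_sql_backup(host, user, password, database, dump_path):
    with open(dump_path, "rb") as dump:
        subprocess.run(
            ["mysql", "-h", host, "-u", user, f"-p{password}", database],
            stdin=dump,
            check=True,  # raise if the restore fails partway through
        )
```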
Posted Dec 04, 2020 - 20:08 UTC
Update
Pagely Engineers are still working to restore availability of the databases for all affected sites. We sincerely apologize for the extended delay; a full post-mortem will be provided after services are restored.

Our team is currently working through a novel failure case that is not fixable by the typical remediation steps we take, such as adding a new replica instance to the DB cluster, and attempts to launch a new cluster from the latest point in time are also taking an extended period to complete. While we wait for these contingency measures to finish provisioning, the originally affected database cluster is beginning to show signs of self-recovery. Our team will therefore continue to assess the situation and make a decision soon based on whichever resource becomes available first to restore your applications. Depending on that outcome, the data may be very current or slightly (15-30 minutes) behind. We will continue to report on progress as it is made.
Posted Dec 04, 2020 - 19:20 UTC
Update
The vps-virginia-aurora-3 database cluster has experienced a critical failure. Although data integrity is intact, we are having trouble getting the DB cluster to start. Pagely Engineers have already initiated the process of launching a new DB cluster with the latest available data set as a point-in-time recovery target. Once this resource has finished creating, we will update your application to use the new endpoint.
Posted Dec 04, 2020 - 18:03 UTC
Update
We are continuing to work on a fix for this issue.
Posted Dec 04, 2020 - 17:26 UTC
Identified
Internal monitoring has alerted us to an issue on a shared RDS instance that our Operations team is working to resolve.
Some customer sites may be impacted while this is ongoing, particularly for uncached traffic requests.

Apologies for the disruption; a further update will be posted shortly.
Posted Dec 04, 2020 - 17:22 UTC
This incident affected: VPS Hosting Infrastructure.