Summary
On Friday, December 4, 2020, one of our managed Amazon Aurora database clusters, vps-virginia-aurora-3, experienced an extended outage lasting approximately three and a half hours. Only a relatively small portion of the sites hosted in this region reside on this DB cluster, and the cluster had plenty of spare capacity at the time of the event. Affected sites experienced database connection errors for the duration of the event. When services were restored, the recovery point of your data was approximately 15-70 minutes prior to the onset of the service interruption.
- Database services for the affected sites were unavailable between 9:15AM PST and 12:30PM PST.
- By 12:30PM PST, service availability was restored to a backup DB cluster with a 9:00AM PST point-in-time.
- By approximately 7:00PM PST, all restored sites were fully migrated to a brand new Aurora RDS DB Cluster and away from the problem system.
More Details
Our investigation is ongoing, and we are working closely with the team at Amazon to fully understand the nature of this issue. Database issues have happened in the past, but they are usually resolved within a few minutes, not hours; an event of this nature had not occurred for us before. We need more time to investigate the matter with Amazon before we can say definitively what the cause was. Rest assured, both Pagely and Amazon are interested in finding a root cause so that a similar event cannot occur in the future. We have already had some great discussions on mitigating this type of impact, and we continue to work on determining a root cause.
We can tell you that the behavior we observed from this Aurora cluster was not typical. It also got the attention of the database team at AWS, who, independent of Pagely's investigation, noticed the DB cluster was behaving erratically and contacted us to let us know they were applying an emergency fix. Typical actions such as adding a reader instance, performing a failover, or restarting a DB instance did not work for this cluster until AWS took steps to address the problem they were seeing.
While the issue was ongoing with the original DB cluster, Pagely Engineers were also launching new DB clusters with varying point-in-time recovery targets. This is a proactive step we will take if we feel the time it could take for a system to recover exceeds the time it may take to launch a new DB cluster with slightly older data. Our goal during these moments is to get sites running again as quickly as possible and with the last-known good set of data.
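To make the trade-off above concrete, here is a minimal sketch of the decision rule in Python. It is an illustration only, not our actual tooling: the RestoreOption type, the choose_restore function, and every duration shown are hypothetical. The idea is to consider only the replacement clusters expected to come online sooner than the original system would recover, and among those to prefer the one with the most recent data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RestoreOption:
    """A candidate replacement cluster (hypothetical model, for illustration)."""
    name: str
    recovery_point: datetime    # data in the new cluster is current as of this time
    launch_estimate: timedelta  # assumed time needed to bring the new cluster online


def choose_restore(options: List[RestoreOption],
                   in_place_estimate: timedelta) -> Optional[RestoreOption]:
    """Return the freshest option expected to beat in-place recovery, else None."""
    faster = [o for o in options if o.launch_estimate < in_place_estimate]
    return max(faster, key=lambda o: o.recovery_point, default=None)


# Illustrative numbers only; these are not measurements from the incident.
onset = datetime(2020, 12, 4, 9, 15)
options = [
    RestoreOption("pitr-15m", onset - timedelta(minutes=15), timedelta(hours=2)),
    RestoreOption("pitr-70m", onset - timedelta(minutes=70), timedelta(minutes=90)),
]
choice = choose_restore(options, in_place_estimate=timedelta(hours=3))
print(choice.name if choice else "keep waiting on the original cluster")
```

In practice the estimates are judgment calls made during the incident, which is why several clusters with different recovery targets may be launched in parallel rather than committing to a single one.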
At a certain point in the incident, because recovery was taking so long, we told you we would restore from older (less than 24 hours old) SQL backups. In the end, we were instead able to launch a DB cluster with a fairly recent point-in-time recovery target (15-70 minutes old). After this recovery was performed, and with the assistance of AWS, the originally affected system was also brought back to an operational state. That system is under evaluation and is not powering any of your live sites. Migrations were performed to relocate all affected sites to a completely different, newly built Aurora DB cluster. With that said, if you think you are missing any data, please let us know and we can provide you with a separate SQL dump from the affected system for manual examination.
We want to assure you that every step was taken to restore service availability as soon as possible and with the most current possible data set. Some of these operations take time to complete, even when everything is working correctly. When things are not working correctly, recovery timelines can be impacted further. We have a playbook we follow in these situations, and we always try to think a few steps ahead. This typically leads to little or no noticeable impact to your services, but then there are days like today. We work to keep events of this severity a rarity, if not a faint memory, and we thank you for your understanding as we worked to get things back to normal.
Posted Dec 05, 2020 - 05:29 UTC