Lucidchart’s Database Migration to Amazon Aurora

Lucidchart was built on AWS from day one, so I was very excited for the opportunity to go to Seattle and have Lucidchart featured on the AWS video series, “This is My Architecture.” I talked about our database migration to Amazon Aurora—you can watch the interview below, and for those interested in the details of the process, continue reading.

When a customer signs up for Lucidchart, they expect to be able to create new documents and open existing ones. Seems simple enough. Because of that core expectation, we needed a very low-latency, write-optimized database cluster from the very beginning. We were using MySQL on EC2, with EBS as the underlying data store.

Another big requirement was having very little downtime when switching availability zones, so that our users weren't impacted if we ever needed to do database maintenance or something similar. What we ended up with was a master-master MySQL implementation, and that presented some challenges.

Challenges with the current state

The challenges that we ran into were the same challenges that you're going to run into any time you have a complex database implementation. In addition to the master-master implementation that supports our applications, Lucidchart and Lucidpress, we also had to support the rest of our internal business tools. For example, our business analytics and business intelligence tools are very read-heavy, but our application is designed to be write-heavy, because as users make changes to diagrams, we need to persist those changes as quickly as possible.

To support both the application (write-heavy) and internal tools (read-heavy), we ended up with a master-master MySQL cluster and multiple slaves pointing at the masters. This setup introduced a lot of complexity in maintaining the health of the cluster. At peak we had 15 clusters and around 50 database instances.

We also ran into another issue regarding backups: for legal and contractual purposes, we are required to keep backups for an extended period of time. Every time we initiated a snapshot on the underlying EBS volumes, there was a major performance hit, because the disks had to serve the snapshot load on top of the running application.

Identifying a better solution

As a SaaS company, we spend a lot of time and effort making an awesome product. Any time we spend building out infrastructure, especially infrastructure that somebody else has already built or could build for us, is opportunity lost for my team.

On average, maintaining and optimizing our database clusters took 25% of a senior developer's time. That time was spent resolving data collisions, balancing and optimizing read/write capacity, tuning EBS and EC2, and verifying disaster recovery and business continuity. We knew that we wanted a managed database service, but none of the services that we looked into met all of our needs.

We had been excited about Aurora for quite a while. AWS was touting some pretty outstanding claims, like up to a 5x performance increase over MySQL and zero-downtime patches. From the time we started watching it, it seemed like every couple of months Aurora checked off more and more of our requirements.

It wasn't until AWS announced that Aurora could do encryption at rest out of the box, as well as encrypted cross-region replication out of the box (meaning we could have a primary cluster in one region and a backup cluster in a second region with real-time, encrypted replication), that Aurora really became a possibility for us.

Migrating to Aurora

Disaster recovery

Aurora is a MySQL-compatible cluster, which made our transition much easier. Aurora's outstanding claims turned out to be true, including the fact that every Aurora cluster keeps at least six copies of your data spread across three availability zones.

When a database write occurs, the master node in the Aurora cluster requires that a quorum of those copies (four of the six) persist the transaction; otherwise it is not counted as a success and is rolled back. Even if an availability zone is having an issue, or in some cases two availability zones, we do not have to worry about data loss or corruption. It also means that we are able to perform database maintenance, such as security patches, version upgrades, etc., without the risk of impacting replication or our users.

When automated backups are enabled on an Aurora cluster, you can perform a "point in time" restore. The automated backup retention period is configurable from one to 35 days, and when performing a point-in-time restore, you can select any moment within that window with second-level granularity.

[Image: automated backup]
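A point-in-time restore can also be scripted. The following is a minimal sketch using boto3; the cluster names, instance class, and timestamp are placeholders rather than our actual configuration.

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")

# Restore the cluster to a specific second within the retention window.
# This creates a brand-new cluster; the original is left untouched.
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="documents-restored",       # new cluster name (placeholder)
    SourceDBClusterIdentifier="documents-prod",     # existing cluster (placeholder)
    RestoreToTime=datetime(2017, 5, 1, 12, 30, 45, tzinfo=timezone.utc),
)

# The restore creates the cluster but no instances, so add one in order to query it.
rds.create_db_instance(
    DBInstanceIdentifier="documents-restored-1",
    DBClusterIdentifier="documents-restored",
    DBInstanceClass="db.r4.large",
    Engine="aurora",
)
```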

Aurora also supports manual snapshots, which can be kept for as long as you would like. These snapshots can also be copied to different regions for disaster recovery purposes. We use a Lambda function to take these snapshots and to purge them as they age out of our retention periods.
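The Lambda function itself is straightforward. Here is a rough sketch of the idea in Python with boto3; the cluster name, retention period, and identifiers are illustrative placeholders, not our production code.

```python
from datetime import datetime, timedelta, timezone

import boto3

rds = boto3.client("rds")

CLUSTER = "documents-prod"        # placeholder cluster name
RETENTION = timedelta(days=365)   # placeholder retention period


def lambda_handler(event, context):
    now = datetime.now(timezone.utc)

    # Take a manual cluster snapshot; manual snapshots are kept until explicitly deleted.
    rds.create_db_cluster_snapshot(
        DBClusterSnapshotIdentifier=f"{CLUSTER}-{now:%Y-%m-%d-%H-%M}",
        DBClusterIdentifier=CLUSTER,
    )

    # Purge manual snapshots that have aged out of the retention period.
    snapshots = rds.describe_db_cluster_snapshots(
        DBClusterIdentifier=CLUSTER, SnapshotType="manual"
    )["DBClusterSnapshots"]
    for snap in snapshots:
        created = snap.get("SnapshotCreateTime")
        if created and now - created > RETENTION:
            rds.delete_db_cluster_snapshot(
                DBClusterSnapshotIdentifier=snap["DBClusterSnapshotIdentifier"]
            )
```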

Encryption

Encryption is an important part of our infrastructure. We have always maintained encryption at rest ourselves using LUKS. Aurora's ability to use disk-level encryption out of the box simplified our configuration significantly. Encrypting an Aurora cluster is as simple as checking the "encryption" box while creating the cluster; you will be prompted to select an encryption key, which is stored in AWS Key Management Service (KMS).
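The same thing can be done through the API. Here is a minimal sketch with boto3, assuming placeholder names, credentials, and the default RDS KMS key alias:

```python
import boto3

rds = boto3.client("rds")

# Create an Aurora cluster with encryption at rest enabled.
# The identifiers, credentials, and KMS key below are placeholders.
rds.create_db_cluster(
    DBClusterIdentifier="documents-prod",
    Engine="aurora",
    MasterUsername="admin",
    MasterUserPassword="change-me",    # use a real secret in practice
    StorageEncrypted=True,             # the "encryption" checkbox
    KmsKeyId="alias/aws/rds",          # or a customer-managed KMS key
)
```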

Cost savings

Moving to Aurora ended up saving us quite a bit of money in hosting and data costs.

[Diagram: one of our MySQL clusters before migrating to Aurora]

The diagram above shows what one of our MySQL clusters looked like. We used the master instances for all of our services. The snapshot slaves were only used for manual queries and for snapshotting a specific master so that we could restore without rebuilding the entire cluster.

The first cost savings came from decreasing our instance count from five to three per cluster. Aurora instances are slightly more expensive than regular EC2 instances of the same size/type, but we more than made up the difference by dropping two of the MySQL instances from each cluster: one of the masters and one of the slaves.

[Image: cost savings 2]

The second cost savings came from persistent storage. Aurora only charges for what you use, and the disks grow dynamically. With our MySQL implementation, each instance had to have more than enough disk space to handle growth. We were able to expand EBS volumes, but it was not something that we enjoyed managing. We would usually increase our EBS volumes in 500GB steps, and because every instance in a cluster held a full copy of the data, each step increased EBS usage by at least 2.5TB per cluster. Aurora only charges for the data once per cluster, even though it keeps at least six copies of it. This decreased our storage cost by almost 5x.

The last (and greatest) cost savings came from the way snapshots work. We would take at least two snapshots per cluster, and we managed disk-level encryption ourselves via LUKS. As our users made changes to their documents, we would update rows in our database. Even though there was very little actual data change in a customer's dataset, the encrypted bits looked completely different on disk, so even incremental snapshots ended up very large due to the frequent updates. Aurora is very smart about what and how it backs up data. Our snapshot sizes dropped almost 10x.

Without calculating in the engineering and opportunity costs, our overall database costs decreased roughly 25% after switching to Aurora.

Standing up our first instance

As soon as AWS announced that Aurora supported real-time cross-region replication with both regions encrypted, we decided that it was time to give it a real trial and make sure that it would actually meet our needs. In the end, we decided to move to Aurora.

We stood up our first instance as a standalone. We just exported the data from our database and then imported it into the Aurora cluster. First, we noticed that the import was about three times faster than what we could do on our own cluster, even with all of the tuning and optimizations we had put in. That validated the first claim that Aurora can be up to five times faster than MySQL.

Once we imported that data, we tested the multi-availability-zone capabilities of Aurora. We went into the AWS console, right-clicked on our instance, and spun up a second instance in another availability zone. With our MySQL clusters, this process usually took about 24 hours because this was one of our larger databases; in Aurora, it took only two hours to create that slave in the second availability zone.
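Behind that console click is a single API call: adding a reader to an Aurora cluster is just creating another instance in the cluster. A rough boto3 sketch with placeholder names and an assumed instance class:

```python
import boto3

rds = boto3.client("rds")

# Add a read replica to an existing Aurora cluster in a different availability zone.
# Aurora replicas attach to the cluster's shared storage, so no data copy is needed.
rds.create_db_instance(
    DBInstanceIdentifier="documents-prod-replica-1",   # placeholder
    DBClusterIdentifier="documents-prod",              # placeholder
    DBInstanceClass="db.r4.large",
    Engine="aurora",
    AvailabilityZone="us-east-1b",                     # a different AZ from the writer
)
```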

Increased performance

Performance is a big benefit that we've found after moving to Aurora. Our writes are roughly the same speed, but the read capabilities that Aurora provides gave us far better performance. First off, Aurora is heavily optimized for parallel queries. With MySQL, our read throughput could only scale with the number of slave instances we stood up, but thanks to Aurora's underlying disk architecture, you get better throughput for the queries issued against the cluster.

Second, when you stand up an Aurora cluster, it gives you two different endpoints that you can connect to. There’s a read/write endpoint, but there’s also a read-only endpoint.

[Diagram: increased performance with read-write and read-only workloads]

The diagram above shows two different workload types. The one on the left represents workloads that need read-write access; the one on the right represents workloads that only need read access. To send read-write traffic to the master node, you use the read-write endpoint. The workloads that need read-only access can use the read-only endpoint. Many of our services actually use both endpoints at the same time: they use the read-write endpoint when they need write access or guaranteed up-to-date data, but prefer the read-only endpoint when they only need to perform reads.
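In application code, this routing can be as simple as picking a hostname per query. Below is a minimal sketch using PyMySQL with made-up endpoint names and credentials; our services use their own connection handling, so treat this purely as an illustration of the two endpoints.

```python
import pymysql

# Aurora provides both endpoints for every cluster. These hostnames are placeholders.
WRITER = "documents.cluster-abc123.us-east-1.rds.amazonaws.com"
READER = "documents.cluster-ro-abc123.us-east-1.rds.amazonaws.com"


def connect(read_only: bool):
    """Prefer the read-only endpoint unless the caller needs writes
    or guaranteed up-to-date data."""
    return pymysql.connect(
        host=READER if read_only else WRITER,
        user="app",
        password="change-me",
        database="documents",
    )


# A read-heavy analytics query goes to the reader endpoint...
ro = connect(read_only=True)
with ro.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM documents")
ro.close()

# ...while a write goes to the writer endpoint.
rw = connect(read_only=False)
with rw.cursor() as cur:
    cur.execute("UPDATE documents SET title = %s WHERE id = %s", ("New title", 42))
rw.commit()
rw.close()
```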

Increased stability

We calculate database uptime based on a cluster's write availability. Several factors can impact this metric, such as manually putting the cluster in read-only mode, data inconsistencies, performance issues, or outright failures. Before we migrated to Aurora, the cluster with the worst write availability averaged 99.95% in 2016. We started using Aurora for all of our database clusters at the beginning of 2017, and our worst cluster in 2017 had a write availability of 99.999%.

Conclusion

We are very happy with our transition to Aurora for two main reasons.

The first is that Aurora is MySQL compatible and supports being both a master and a slave. This functionality enabled us to move to Aurora with zero downtime. All that we had to do was create the Aurora cluster and make it a slave to our existing MySQL cluster. After replication had caught up, we changed the DNS record to point our application at Aurora instead of our MySQL clusters. We did have to do this for each of our services and MySQL clusters, but the process was very simple.
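Pointing a new Aurora cluster at an existing MySQL master uses the RDS-provided stored procedures for external replication. Here is a rough sketch of the idea, run against the Aurora writer; the hostnames, credentials, and binlog coordinates are placeholders, and the exact procedure arguments should be checked against the RDS documentation.

```python
import pymysql

# Connect to the Aurora writer endpoint (placeholder host and credentials).
conn = pymysql.connect(
    host="documents.cluster-abc123.us-east-1.rds.amazonaws.com",
    user="admin",
    password="change-me",
)

with conn.cursor() as cur:
    # Point Aurora at the existing MySQL master. The binlog file and position
    # come from SHOW MASTER STATUS on the source at the time of the export.
    cur.execute(
        "CALL mysql.rds_set_external_master("
        "'old-mysql-master.example.com', 3306, "
        "'repl_user', 'repl_password', "
        "'mysql-bin-changelog.000042', 154, 0)"
    )
    # Start replicating; once the slave catches up, flip DNS over to Aurora.
    cur.execute("CALL mysql.rds_start_replication")

conn.close()
```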

The second main benefit was that Aurora simplified the process of keeping data in multiple regions. For disaster recovery and business continuity purposes, we have to keep data in multiple regions, not only in different availability zones. Before we moved to Aurora, we would have to copy snapshots periodically between our primary region and our secondary region, so our snapshots were always out of date by some period of time.

With Aurora, we went to the master cluster, right-clicked, and selected “Create a replica in a different region.” That not only copied the data, but it also created a database cluster in that region that is completely up to date, in real time, only delaying by a few milliseconds.
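If you prefer the API to the console, the equivalent is to create a new cluster in the destination region that replicates from the source cluster's ARN. This is a hedged boto3 sketch with placeholder account, region, cluster, and key identifiers; it is not our production setup.

```python
import boto3

# Create the client in the destination (disaster-recovery) region.
rds = boto3.client("rds", region_name="us-west-2")

# Stand up a cross-region replica of the encrypted primary cluster.
# All identifiers and the KMS key below are placeholders.
rds.create_db_cluster(
    DBClusterIdentifier="documents-dr",
    Engine="aurora",
    ReplicationSourceIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:cluster:documents-prod"
    ),
    SourceRegion="us-east-1",      # boto3 generates the required pre-signed URL
    KmsKeyId="alias/aws/rds",      # encryption key in the destination region
)

# As with any Aurora cluster, add an instance so the replica can serve reads.
rds.create_db_instance(
    DBInstanceIdentifier="documents-dr-1",
    DBClusterIdentifier="documents-dr",
    DBInstanceClass="db.r4.large",
    Engine="aurora",
)
```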

Over the years, I have been involved in many vendor evaluations, and often the claims that the vendor makes start to fall over when you get into the details of the offering. You should always do your homework before switching to a new solution, but for us, Aurora surpassed all of our expectations. Our main takeaway during this process is that Aurora delivers.
