Scaling a startup with database sharding

For months now, the business team at Lucidchart has been overhearing discussions of “sharding” from the other end of the office. As a curious tech geek, I wanted to know what was going on. I ended up cornering Brian Pugh, VP of Engineering, and Matthew Barlocker, Chief Architect, to interrogate them about their sharding project. I ended up getting the history of our database structure and the changes it’s undergone in the last 2 years.

When Lucidchart began as Ben Dilts’ brainchild back in 2008, he built the application on CakePHP. He chose a technology that would allow him to get a simple backend service up and running quickly, so he could focus on making a killer application that pushed the bounds of what can be done in a browser. As Lucidchart’s subscription user base grew, the relatively simple, monolithic backend no longer met the performance and scalability needs of our customers. Pugh and Barlocker were tasked with building a scalable and performant service for Lucidchart’s growing user base.

[Interview transcript starts after the image]

database sharding plan

Q: I don’t want to take too much of your time, but can you take me from where you started to where we are now?

Barlocker: We started on MySQL, it was a single monolithic server that got too big. We started seeing a lot of performance issues. The thing with databases is that they do not scale linearly. The more stuff you put in them, it’s going to look linear for a little while and then it’s just gonna take some massive jump in resource consumption and latency, and that’s what happened. So we started to find ways to take stuff out of MySQL and put it somewhere else.

We could have gone with a sharded MySQL approach or with Mongo, which itself is sharded, but doesn’t take any work on our part setting it up. The quickest and most cost effective way to resolve our performance issues was going to Mongo, so that’s what we did. Documents went first, which eased the strain on the MySQL server. We started seeing better performance all around. We currently have 2 Mongo shards for documents. That’s what we started with, and everything works fine.

Then we started breaking out some of the code base to a services approach. Each service has their own set of MySQL databases, all sharded. At some point we decided that Mongo is the way we wanted to go for our analytics data. The 99% reliability Mongo provides is just fine for us and it’s faster to insert. Now, we’re just trying to finish the job–getting all of our data out of the monolithic MySQL database.

Pugh: Right, Mongo has been great for analytics and is holding up for documents, but we are moving to a model in which each functional area of the system is a service with its own MySQL sharded database.

Q: So that’s why you guys are always saying “sharded” over here, because you’re doing this massive project where you’re moving this legacy database to a sharded approach.

Barlocker: That’s right. It’s all about scalability.

Pugh: Every time we build a new service, we build it on some kind of sharded infrastructure. Whether it’s Mongo or MySQL, we’re not building any more non-sharded solutions.

Q: Can you talk about wanting to move from Mongo back to MySQL ?

Barlocker: So, documents were originally stored in MySQL. The reason that we pulled them out of MySQL and put them into Mongo is because we were having gigantic performance problems with the documents and the change records on a document. Once we pulled that out, performance was a lot better. The real bottleneck was the way we composed the SQL queries. MySQL could have performed just fine if we did an extensive audit of all SQL queries in our codebase. We were using Cake and joining it everywhere and having these 32 megabyte blobs copied into 2 to 6 temporary tables before we actually read it out of MySQL.

Pugh: And fundamentally, we didn’t shard our data at all. Because there was no sharding going on, all data operations went to one server. With Mongo, we use multiple servers to store the data and can add new servers when needed. You can shard with MySQL, but we didn’t have the infrastructure in place to shard with MySQL a year and a half ago when we made the decision to go with Mongo for documents. Since then, we’ve put into place sharding with MySQL for other data, but at this point documents themselves are still in Mongo.

Q: I really don’t know what sharding is, but I think that you’re taking columns from a table and splitting them up?

Barlocker: You can take columns and split them up, that’s vertical sharding. Horizontal sharding is what we’re talking about–it scales much better than vertical sharding. It’s horizontal sharding that says, “I’m gonna have the same tables and columns on every server, but I’m gonna have this row on this server, that row on that server.” This is the way Mongo does it, and it’s now the way we do it with MySQL as well.

Q: So it’s basically taking a table, breaking it, then putting different rows into different databases.

Pugh: We wouldn’t do it this way, but assuming the key on a table of users was first name, we could do something like put A-F on one server, and G-L on a second server, and so on. You use something as a key that you hash on, and then based on the hash, send the data to the right server. So different data of the same kind, like users, will go on different servers. We can always move data around to different shards if we need to, though.

Q: So basically, when you have these giant tables, you want to have it organized in smaller pieces?

Pugh: That’s right. It can go out to a new server, so you can scale servers horizontally, not just vertically. That’s the traditional problem you have with databases. The only way to scale them is to add more RAM or CPUs or whatever resource it is that is being exhausted to that single server. With sharding, we can add a new shard, which can be on a new server. I could have 10 servers, and add an 11th, 12th, 13th…I can just keep adding more servers, rather than having one massive server. I still have the option to add more resources to my existing servers, but sharding also gives me the option to just add more servers. When you only scale with a single server, you get to the point where you have to have one massive, super expensive machine, and you just hope it’s massive enough to handle the load.

Pugh: So horizontal scaling only works well if it’s easy to add additional shards–which it is for MongoDB–and in our case, it’s now straightforward in MySQL as well because we built our sharding approach with that in mind.

Q: Was there any debate between you and Brian about whether Mongo was going to be a good long-term solution?

Barlocker: At the time, I agreed that Mongo was the right thing to do. It was a quick, scaleable fix that improved performance immediately. Now, a year later, we’re at a point where we see, oh yeah, it wasn’t MySQL at all, it turns out we were joining everything and Cake did it horribly. We can put all the documents back into MySQL, get the ACID compliance, and we’re not losing anything. We’ve found that Mongo has too many exceptions, timeouts, and SSL issues to be a perfect solution.

Pugh: The only place where we disagree at all is around timing and priority, and even there we don’t disagree much. The question is whether it is worth moving documents to a sharded MySQL solution now, given the resources and other priorities we have. Fundamentally, both of us would like to get to the same endpoint eventually.

Q: How long until everything is on MySQL sharded databases? What work is left to do?

Barlocker: We’re currently doing a big push to get all of our users and payments into their own services, backed by a sharded MySQL database. After we release this new service, and the image service that was written a short time ago, we will only have a few services left. I expect that early next year we will be done. As we pull more things from the monolithic database, it becomes less important to scale it…so we probably have a long tail ahead of us.

Related Articles

Using Scala to handle exponential growth at a startup

How to Debug Disk Full Errors in Redshift

Using Scala to handle exponential growth at a startup (Part 2)

Performant Database Access with Relate

No Comments, Be The First!