Performant Database Access with Relate

At Lucid, we deal with data—a LOT of it. Ever since we migrated our code to the Play! Framework, we’ve been leveraging the Anorm database access library provided by the framework. Being able to write our own SQL queries gave us a lot more control than using an ORM, and that was one of our favorite features. However, when our queries got big enough to involve 100k rows, Anorm’s query parsing started taking minutes to do, and it quickly became clear that Anorm’s focus on syntactic sugar, while nice, was not meeting our needs. The Lucid ecosystem eventually outgrew Anorm, and we needed something with better performance.

Introducing Relate

To fulfill our increased performance needs, we developed Relate, our own data access layer. We set out to create an easy-to-use API that maintained the inherent speed of using JDBC. As such, Relate is a thin layer over JDBC, abstracting away the pain of developing with JDBC, but keeping performance overhead to a minimum. Speed has always been one of our highest priorities and we’re happy to announce that Relate is pretty darn fast.

Our first set of benchmarks dealt with data retrieval and Relate’s parser. A simple select query that selects only one string column from a table of a million records illustrates Relate’s speed. Here’s a graph that shows the effects of running the same query 20 times with different result set sizes:

Selecting 10 bigint columns shows a similar trend:

The second set of benchmarks concerned data insertion. Once again, using Relate for inserting rows into the database produced significantly better times and growth rates than Anorm. Here’s a graph demonstrating inserting rows of 10 integer columns:

Inserting 10 columns worth of bigints and strings resulted in similar graphs:

Finally, we benchmarked updating rows in the database. Here’s the result of updating rows containing 10 int columns:

As you can imagine, we’re rather pleased with these results.

How we use Relate at Lucid

These statistics are nice, but how does that help us here at Lucid? Besides performance gains, Relate also provides nice utilities to make inserting data into a query simpler; Relate makes list insertion (such as for the IN clause or multiple insertion) easy. It uses Scala’s CanBuildFrom trait, so retrieving data as a list, map, or other collection is faster than ever. Best of all, using Relate in our existing Play! projects is a breeze. All we have to do is drop in the Relate dependency in our Build.scala file and begin writing queries using Relate.

Relate is open source, on Github, and accepting pull requests. Contribute to it. Fork it. Use it. We’re confident that when you get involved with Relate, you’ll be just as excited about it as we are!

7 Comments

Julien Tournay • June 25, 2014 at 5:55 am

Hi,

In the Anorm benchmark, your using String concatenation instead of query params:

SQL("SELECT `col14` FROM `sel_50` WHERE `col44` IN ("+strnums+") ")

I suggest you replace that with:

SQL"SELECT `col14` FROM `sel_50` WHERE `col44` IN ($strnums) "
Mark Siebert • June 25, 2014 at 10:58 am

Good catch! The project is open source, so if you’d like, you can totally fork it and make that change 🙂
Erik van Oosten • July 3, 2014 at 4:41 am

Would be interesting to compare this with https://github.com/scalikejdbc/scalikejdbc also.
applicius • July 9, 2014 at 3:16 am

I agree with Julien Tournay. I would be interested to see benchmarks not using the ‘biased’ way (concatenation) but with real Anorm usage such as String interpolation.
Alvaro Carrasco • September 4, 2014 at 11:01 am

Although I understand that the benchmark is open-source, the onus is on you, the creator of the benchmark and the one trying to prove something, to make sure that it is a valid and fair comparison. You can’t just create an unfair comparison an expect other contributors to make it fair for you.

As much as I dislike anorm and relate may be much faster, this benchmark is very narrow in scope and has a number of flaws which make it hard to take it seriously. There is no JVM warm-up performed, it uses currentTimeMillis which is unreliable, the code uses an older version of anorm, it uses anorm’s less-performant technique (parameters instead of sql string interpolation), it focuses in some strange cases (large in clauses, parsing/casting unrelated types, etc) with limited applicability, etc.

I recommend you follow some guidelines like: http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java
matthew • September 4, 2014 at 5:27 pm

Thanks for the comment Alvaro. Let me address your most fundamental of assertions.

When we found that getting the performance we needed out of Anorm was impossible without gutting it, we decided to write our own library (Relate). Before blindly putting the new library into our production environment, we wanted verification that Relate would perform better. So, it’s not so much a matter of proving that Relate is better in the general sense, but proving that Relate would fix our issues. The point of the blog post and the open source benchmarks was to invite the community to test it out for themselves. Not all workloads, queries, environments are the same. We don’t expect Relate to work in all scenarios, but we found that it works very well for ours. This should explain some of my responses to your specific comments.

this benchmark is very narrow in scope

Absolutely. We only wanted to test our workload. It would be presumptuous of us to guess what your workload is, so we leave that to you.

There is no JVM warm-up performed

You are correct in that we have no automated JVM warmup in the benchmarks themselves, but we did warm up the JVM in the production of the results you see in this post.

it uses currentTimeMillis which is unreliable

You’ll notice that the Y-axis, though measured in milliseconds, is in the range of seconds to minutes. No amount of unreliableness in currentTimeMillis will account for Anorm being minutes slower than JDBC.

the code uses an older version of anorm

The benchmarks are using the version of anorm that our production environment was using. Again, we wanted to verify that Relate wouldn’t screw up our production environment. If your environment is different, I recommend that you change the benchmarks appropriately.

it uses anorm’s less-performant technique (parameters instead of sql string interpolation)

We used the less-performant technique in our production code.

it focuses in some strange cases (large in clauses, parsing/casting unrelated types, etc) with limited applicability

This may be limited applicability for your workload, but it is the primary use case for ours.

I hope that it is clear why we wrote the benchmarks the way we did. Our goal was not to prove that Relate is the end-all be-all of database access layers, or to show everyone that they should be using Relate. Rather, it was to provide conclusive evidence that this new library solved a specific problem that Lucidchart and Lucidpress were facing.

Thanks for your feedback, and good luck with your application!
Alvaro Carrasco • September 4, 2014 at 10:50 pm

Fair enough. I have no problem with your project and it might very well perform the way you describe. Just gotta be careful with benchmarks and the way they are presented, if we have some expectation of them being useful to others.

Introducing Relate

How we use Relate at Lucid

Related Articles

Scaling a startup with database sharding

Upgrading From Play Framework 2.3 to Play 2.5

How to Debug Disk Full Errors in Redshift

Using the @repeat helper in Play!

Play! 2.3 Template Improvements

7 Comments