JSON Compression: Alternative Binary Formats and Compression Methods

TL;DR: If you are considering using an alternative binary format in order to reduce the size of your persisted JSON, consider this: the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method. In our testing, Brotli proved to be very effective for long-term persistence.

Motivation

There are a lot of data serialization formats out there. Perhaps none are more pervasive than JSON: the de facto serialization method for web applications. And, while it certainly isn’t perfect, its convenience and simplicity have made it our format of choice at Lucid. However, we recently undertook a project that made us question whether or not we should be using JSON at our persistence layer.

In order to improve the performance and fidelity of our revision history feature, we decided that we should start persisting ‘key-frames’ (snapshots) of our document-state data (rather than just the deltas). Our initial plan was simply to gzip the document-state JSON when persisting the snapshots. However, as we started sampling some data and crunching the numbers, we realized that within a year or two we would have hundreds of terabytes of data, costing thousands of dollars per month in infrastructure. So even a reduction of a few percentage points in the size of our persisted data would translate to real-world savings. Thus, we decided to investigate alternative serialization and compression methods to find the pair that would minimize the cost of persisting the new data.

Serialization alternatives

JSON is human-readable, relatively concise, simple to understand, and universally supported. But its simplicity and human-readability mean it isn’t the most space-efficient format out there.

For example, representing the number 1234.567890123457 takes 17 bytes in UTF-8 stringified JSON, whereas a binary format could represent the same number as an 8-byte double-precision float. Similarly, false takes 5 bytes in JSON but a single byte (or conceivably less) in a binary format. Because our document state includes plenty of booleans and numbers, it seemed like a no-brainer that a binary serialization technique would beat out JSON.
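Byte counts like these are easy to check. The following is a minimal sketch in Python (not the language used for the measurements in this post) using only the standard library. Here `struct` merely stands in for what a binary format might do with a double, and the one-byte boolean is an illustration, not the wire layout of any particular format.

```python
import json
import struct

number = 1234.567890123457

# JSON writes the number as decimal text: one byte per character in UTF-8.
json_number = json.dumps(number).encode("utf-8")

# A binary format can store the same value as an IEEE 754 double (8 bytes).
binary_number = struct.pack(">d", number)

# JSON spells out the literal `false`; one byte is enough in binary.
json_bool = json.dumps(False).encode("utf-8")
binary_bool = struct.pack("?", False)

print(len(json_number), len(binary_number))  # 17 8
print(len(json_bool), len(binary_bool))      # 5 1
```

Real binary formats usually add a type tag in front of each value, so the actual per-value savings are a little smaller than this suggests.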

We decided to test out the following serialization methods [1]:

  • Binary Ion
  • Textual Ion
  • BSON
  • CBOR
  • MessagePack
  • Smile

Compression alternatives

Historically, we have just used gzip to compress our document-state because it is fast, gets effective results, and works natively in the JVM. However, a couple of years ago we started using Brotli to compress our static front-end JavaScript assets, and saw very good results. We thought it might be a good fit for our document-state JSON as well. We also decided to try XZ, Zstandard, and bzip2.

Methodology

As a test bed of documents, we decided to use our system templates (Lucidchart and Lucidpress ‘blueprint’ documents that we provide to our users) as our sample data. We have about 1500 templates totaling 133.8 MB of document-state JSON.

For this set of documents we tried every combination of binary format and compression algorithm (at their various compression levels [2]). From the tests, we wanted to record three primary metrics:

  • The total CPU time to convert from JSON, serialize, and compress
  • The final compressed size
  • The total CPU time to decompress, deserialize, and then convert back to JSON
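To make the three metrics concrete, here is a rough sketch of such a measurement loop in Python. It is not the harness used for this post (that ran on the JVM), and it only covers compressors from the Python standard library; Brotli and Zstandard would need third-party bindings. The `CODECS` table, the `measure` helper, and the `sample` document are all made up for illustration.

```python
import bz2
import gzip
import json
import lzma
import time

# Hypothetical codec table: label -> (compress, decompress).
CODECS = {
    "gzip (6)": (lambda b: gzip.compress(b, compresslevel=6), gzip.decompress),
    "bzip2 (9)": (lambda b: bz2.compress(b, compresslevel=9), bz2.decompress),
    "XZ (6)": (lambda b: lzma.compress(b, preset=6), lzma.decompress),
}


def measure(document, compress, decompress):
    """Return (compressed_bytes, encode_cpu_seconds, decode_cpu_seconds)."""
    # Metric 1: CPU time to serialize and compress.
    start = time.process_time()
    serialized = json.dumps(document, separators=(",", ":")).encode("utf-8")
    compressed = compress(serialized)
    encode_cpu = time.process_time() - start

    # Metric 3: CPU time to decompress and parse back to in-memory JSON.
    start = time.process_time()
    restored = json.loads(decompress(compressed).decode("utf-8"))
    decode_cpu = time.process_time() - start

    if restored != document:
        raise ValueError("round trip changed the document")
    # Metric 2: the final compressed size.
    return len(compressed), encode_cpu, decode_cpu


# Made-up sample standing in for one document-state snapshot.
sample = {"shapes": [{"id": i, "x": i * 1.5, "locked": False} for i in range(2000)]}

for name, (compress, decompress) in CODECS.items():
    size, enc, dec = measure(sample, compress, decompress)
    print(f"{name:10} {size:8d} bytes  {enc * 1000:7.1f} ms  {dec * 1000:7.1f} ms")
```

`time.process_time` is used rather than wall-clock time because the metrics are stated in CPU time. A single small document gives noisy timings, so a real run would loop over the whole corpus and every compression level.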

Results

So, after running the tests and measuring the data, we ended up with something like this:

| Binary Format | Compression | Compressed Size (bytes) | JSON -> Compressed Time (msec) | Compressed -> JSON Time (msec) |
| --- | --- | --- | --- | --- |
| BSON | bzip2 (9) | 17,344,419 | 3,979 | 3,651 |
| CBOR | Uncompressed | 101,739,795 | 492 | 2,310 |
| Smile | Zstandard (0) | 16,476,903 | 501 | 2,312 |
| JSON | XZ (6) | 12,908,440 | 5,850 | 2,195 |
| CBOR | Zstandard (3) | 15,923,174 | 541 | 3,138 |
| CBOR | Zstandard (22) | 13,999,497 | 12,826 | 2,698 |
| Smile | Brotli (9) | 14,704,655 | 3,729 | 2,675 |
| CBOR | Zstandard (9) | 14,625,854 | 704 | 2,877 |
| Textual Ion | gzip | 16,049,903 | 912 | 3,254 |
| MessagePack | Zstandard (-5) | 19,750,507 | 914 | 2,225 |

… 263 More Rows …

It was relatively easy to draw a couple of simple conclusions from these results. For example, just looking at the uncompressed sizes, Binary Ion was by far the most compact for our datasets. And looking at the compressed sizes, Textual Ion using the highest level of Brotli compression was the smallest.

Most compact binary formats

| Binary Format | Uncompressed Size (bytes) |
| --- | --- |
| Binary Ion | 63,672,734 |
| Smile | 72,283,777 |
| MessagePack | 96,113,007 |
| CBOR | 101,739,795 |
| Textual Ion | 117,878,664 |
| BSON | 129,823,535 |
| JSON | 133,284,487 |

10 most compact compressed formats

| Binary Format | Compression | Compressed Size (bytes) |
| --- | --- | --- |
| Textual Ion | Brotli (11) | 11,903,951 |
| JSON | Brotli (11) | 11,999,727 |
| MessagePack | Brotli (11) | 12,194,016 |
| Textual Ion | Brotli (10) | 12,277,394 |
| JSON | Brotli (10) | 12,358,964 |
| MessagePack | Brotli (10) | 12,556,622 |
| MessagePack | XZ (6) | 12,734,272 |
| Textual Ion | XZ (6) | 12,840,804 |
| MessagePack | XZ (5) | 12,843,748 |
| CBOR | Brotli (11) | 12,861,137 |

However, our goal in comparing serialization formats and compression methods was not simply to find the smallest format. Our goal was to minimize our infrastructure costs. So, how do we use all of the measured data to find the optimal solution? Are the space savings from using Brotli 11 worth the extra CPU time it takes to compress?

Analysis

While it is impossible to predict perfectly, we can use the measured data to get a pretty good estimate of how many dollars each method would cost Lucid. Our services are deployed in AWS, where we can choose to pay a ‘fixed cost per CPU second’ by using Lambda, and the S3 costs for storing the data are relatively straightforward. So, to calculate an expected cost, we simply need estimates and assumptions for the following:

  • The cost per GB, per month to store the data in S3
  • The average cost per CPU second
  • The expected lifespan of the data
  • The expected number of times the persisted data will be used (decompressed and deserialized)

With these estimates and measured results, we can now assign an expected cost for every combination of serialization and compression technique and simply choose the one with the lowest expected cost! 

Here are the assumptions that we ended up using:

AWS Pricing: 

  • Lambda Cost per Second (for a full vCPU): $0.00002917
  • S3 Standard Cost (per GB/Month): $0.022
  • S3 IA Cost (per GB/Month): $0.0125

Assumptions about our data

  • Percentage of our snapshots that will end up in S3 Infrequent Access: 95%
  • Expected lifespan of the data: 90 months
  • Number of times (on average) we would need to read (decompress and convert back to JSON) a given snapshot: 2
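One plausible way to combine these inputs is sketched below in Python. The constants come from the figures listed here, but the formula itself is an assumption: a blended S3 rate over the lifespan, plus one compression and the expected number of reads billed at the Lambda rate. The function name `expected_cost` and the choice of 2^30 bytes per GB are also assumptions, so treat this as an illustration of the model's shape rather than the spreadsheet behind the published results.

```python
LAMBDA_COST_PER_CPU_SECOND = 0.00002917  # USD, full vCPU
S3_STANDARD_PER_GB_MONTH = 0.022         # USD
S3_IA_PER_GB_MONTH = 0.0125              # USD
FRACTION_IN_IA = 0.95
LIFESPAN_MONTHS = 90
READS_PER_SNAPSHOT = 2
BYTES_PER_GB = 2 ** 30  # assumption: storage is billed in binary gigabytes


def expected_cost(compressed_bytes, compress_ms, decompress_ms):
    """Estimated lifetime cost in USD under the simplified model above."""
    # Blend the two storage classes by the share of data expected in each.
    blended_rate = (
        FRACTION_IN_IA * S3_IA_PER_GB_MONTH
        + (1 - FRACTION_IN_IA) * S3_STANDARD_PER_GB_MONTH
    )
    storage = compressed_bytes / BYTES_PER_GB * blended_rate * LIFESPAN_MONTHS

    # One compression up front, then a decompression for every expected read.
    cpu_seconds = (compress_ms + READS_PER_SNAPSHOT * decompress_ms) / 1000
    return storage + cpu_seconds * LAMBDA_COST_PER_CPU_SECOND


# JSON with XZ (6), using the row from the sample results table.
print(f"${expected_cost(12_908_440, 5_850, 2_195):.5f}")
```

For that JSON/XZ (6) row this plain model comes out at roughly $0.0143, somewhat below the published $0.01583. The actual calculation therefore presumably weights CPU time more heavily or includes terms that are not described in this post.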

And when you run those numbers, you get the following results:

Final expected pricing results

| Serialization | Compression | Expected Cost |
| --- | --- | --- |
| JSON | Brotli (10) | $0.01572 |
| MessagePack | XZ (2) | $0.01579 |
| JSON | XZ (6) | $0.01583 |
| MessagePack | Brotli (6) | $0.01584 |
| MessagePack | Zstandard (15) | $0.01587 |
| MessagePack | XZ (6) | $0.01588 |
| MessagePack | XZ (1) | $0.01588 |
| MessagePack | Brotli (5) | $0.01589 |
| JSON | gzip | $0.01870 |
| JSON | Uncompressed | $0.14551 |

This shows that, given our measured results and estimated costs, our best bet is using JSON serialization and level 10 Brotli compression (its second highest setting). This represents an expected 16% cost savings over our baseline of JSON and gzip!
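The savings figure follows directly from the expected costs in the table. As a quick check in Python (the variable names are just for illustration):

```python
# Expected costs taken from the final pricing table (USD).
json_gzip = 0.01870          # baseline: JSON + gzip
json_brotli_10 = 0.01572     # best: JSON + Brotli (10)
json_uncompressed = 0.14551  # no compression at all

savings_vs_gzip = 1 - json_brotli_10 / json_gzip
savings_vs_uncompressed = 1 - json_brotli_10 / json_uncompressed

print(f"{savings_vs_gzip:.1%}")          # 15.9%
print(f"{savings_vs_uncompressed:.1%}")  # 89.2%
```

Most of the savings come from compressing at all. Moving from gzip to Brotli (10) accounts for the last 16% or so.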

Conclusions

We actually tried a variety of assumptions to see if these results held true in different circumstances. Interestingly, JSON was always at the top of the list—and if it wasn’t the absolute best, it was still very competitive (within a couple percentage points of expected cost).

Our analysis reveals a few interesting conclusions that are almost certainly applicable to many others’ circumstances as well.

  • Binary formats do result in smaller uncompressed file sizes.
  • Compressing the serialized data seems to level the playing field and ‘negates’ any wins from using a binary format.
  • Thus, the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method.
  • Choosing the best compression algorithm is a balancing game between the cost to store the data and the cost to compress the data, but you can choose the right balance according to your expected lifecycle and read patterns.

Recommendations

While it won’t apply to every scenario, the general takeaways and recommendations from the observed data are this:

  • If the data originated as JSON, and needs to be converted back to JSON in order to be used, then JSON is probably going to be the most cost-effective persistence format as well, so long as you choose the appropriate compression algorithm.
  • Brotli works really, really well with JSON data. At its higher levels (10 & 11), it can be CPU expensive, but that will be cost-effective when the data has a long lifespan. Brotli also has the advantage that it can be served directly to any modern major browser.
  • For data with shorter lifespans, Zstandard (around level 9) offers much better compression than gzip but at roughly the same CPU cost.

We’ve published the full set of data and assumptions here. You’re welcome to take our measured data set and plug in your own assumptions and costs to see what formats might meet your needs. Of course, our document data won’t necessarily be representative of your data, but this might be a helpful starting point in comparing your options.

Footnotes

1. We only considered serialization formats that were compatible with generic JSON, as opposed to serialization methods that require a pre-defined schema – like Protobuf.

2. Generally, compression functions allow the user to specify the compression level. For Brotli, bzip2, and XZ, we tried all of the available compression levels. Zstandard provides an astonishing 27 different compression levels, so we only tested a subset of those. And while gzip does support different compression levels, the GZIPOutputStream class included in the JDK doesn’t support choosing one (or at least not very well). Also, we went into this with gzip as the baseline and were not expecting it to be a contender. So, for gzip we just used the default compression level.

5 Comments

  1. Is the test data available?

  2. Richard Shurtz, December 10, 2019 at 11:45 am

    I have just uploaded it – you can download it using this link.

  3. Very interesting; thanks for the testing.

  4. I noticed in the Google Sheets data that JSON uncompressed has a “JSON -> Compressed” time of 501.4ms and “Compressed -> JSON” time of 1347.9ms. Is that the time of `JSON.stringify` and `JSON.parse`, respectively?

  5. Richard Shurtz, August 9, 2020 at 12:07 pm

    “JSON -> Compressed” is the total time it took to take an in-memory representation of the JSON data structure (using the Jackson library) to its final compressed state. The process was essentially two steps: first, write it to a byte array using the specified serialization technique (BSON, JSON stringification, MessagePack, etc.), and second, run that byte array through the compression algorithm to get the final compressed result.
    “Compressed -> JSON” was the measured time it took to go through those two steps back to in-memory JSON.

    Hope that helps!
