JSON Compression: Alternative Binary Formats and Compression Methods

TL;DR: If you are considering using an alternative binary format in order to reduce the size of your persisted JSON, consider this: the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method. In our testing, Brotli proved to be very effective for long-term persistence.

Motivation

There are a lot of data serialization formats out there. Perhaps none are more pervasive than JSON: the de facto serialization method for web applications. And, while it certainly isn’t perfect, its convenience and simplicity have made it our format of choice at Lucid. However, we recently undertook a project that made us question whether or not we should be using JSON at our persistence layer.

In order to improve the performance and fidelity of our revision history feature, we decided to start persisting ‘key-frames’ (snapshots) of our document-state data, rather than just the deltas. Our initial plan was simply to gzip the document-state JSON when persisting the snapshots. However, as we started sampling some data and crunching the numbers, we realized that within a year or two we would have hundreds of terabytes of data, costing thousands of dollars per month in infrastructure. So even if we could reduce the size of our persisted data by only a few percentage points, it would translate to real-world savings. Thus, we decided to investigate alternative serialization and compression methods to find the pair that would minimize the cost of persisting the new data.

Serialization alternatives

JSON is human-readable, relatively concise, simple to understand, and universally supported. But its simplicity and human-readability mean it isn’t the most space-efficient format out there.

For example, representing the number 1234.567890123457 takes 17 bytes in UTF-8 stringified JSON, whereas a binary format could represent the same number as an 8-byte double-precision float. Similarly, false takes 5 bytes in JSON but a single byte (or conceivably less) in a binary format. Because our document state includes plenty of booleans and numbers, it seemed like a no-brainer that a binary serialization technique would beat out JSON.
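
The size difference is easy to check by hand. The following is a minimal sketch using only the Python standard library; the `struct` layouts stand in for what a binary format might do and are an illustrative assumption, not the encoding of any particular format listed below.

```python
import json
import struct

value = 1234.567890123457

# JSON writes the number as decimal text: one byte per character in UTF-8.
json_number = json.dumps(value).encode("utf-8")

# A binary format can store the same value as an IEEE 754 double (8 bytes).
binary_number = struct.pack(">d", value)

# JSON spells out booleans as text; a binary format needs at most one byte.
json_false = json.dumps(False).encode("utf-8")
binary_false = struct.pack("?", False)

print("number:", len(json_number), "bytes as JSON,", len(binary_number), "bytes as binary")
print("false: ", len(json_false), "bytes as JSON,", len(binary_false), "byte as binary")

# The binary form loses nothing: unpacking returns the identical double.
assert struct.unpack(">d", binary_number)[0] == value
```

Real binary formats also spend a byte or so on a type tag per value, so the savings in practice are somewhat smaller than this raw comparison suggests.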

We decided to test out the following serialization methods [1]:

CBOR
Smile
BSON
MessagePack
Ion (Both Textual and Binary formats)

Compression alternatives

Historically, we have used gzip to compress our document state because it is fast, gets effective results, and works natively in the JVM. However, a couple of years ago we started using Brotli to compress our static front-end JavaScript assets and saw very good results, so we thought it might be a good fit for our document-state JSON as well. We also decided to try XZ, Zstandard, and bzip2.

Methodology

As a test bed of documents, we decided to use our system templates (Lucidchart and Lucidpress ‘blueprint’ documents that we provide to our users) as our sample data. We have about 1500 templates totaling 133.8 MB of document-state JSON.

For this set of documents we tried every combination of binary format and compression algorithm (at their various compression levels [2]). From the tests, we wanted to record three primary metrics:

The total CPU time to convert from JSON, serialize, and compress
The final compressed size
The total CPU time to decompress, deserialize, and then convert back to JSON
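
To make the three metrics concrete, here is a small sketch of a measurement loop. It is not the harness we ran (that was JVM-based and covered many more formats); it assumes a made-up `document` value and uses only the codecs that ship with the Python standard library, so Brotli and Zstandard are left out.

```python
import bz2
import gzip
import json
import lzma
import time

# Hypothetical stand-in for one document-state snapshot.
document = {
    "shapes": [
        {"id": i, "x": i * 1.5, "y": i * 2.25, "visible": i % 2 == 0, "text": f"Shape {i}"}
        for i in range(2000)
    ]
}

# Standard-library codecs only, as (compress, decompress) pairs.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2 (9)": (lambda data: bz2.compress(data, 9), bz2.decompress),
    "XZ (6)": (lambda data: lzma.compress(data, preset=6), lzma.decompress),
}


def measure(doc, compress, decompress):
    """Return (compressed size in bytes, write CPU seconds, read CPU seconds)."""
    start = time.process_time()
    compressed = compress(json.dumps(doc).encode("utf-8"))
    write_seconds = time.process_time() - start

    start = time.process_time()
    restored = json.loads(decompress(compressed).decode("utf-8"))
    read_seconds = time.process_time() - start

    assert restored == doc  # the round trip must be lossless
    return len(compressed), write_seconds, read_seconds


results = {name: measure(document, c, d) for name, (c, d) in CODECS.items()}
for name, (size, write_s, read_s) in results.items():
    print(f"{name}\t{size}\t{write_s * 1000:.1f}\t{read_s * 1000:.1f}")
```

`time.process_time` counts CPU time rather than wall-clock time, which matches the metric we cared about. On a document this small the timings are noisy, so a real comparison should run over a full sample set.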

Results

So, after running the tests and measuring the data, we ended up with something like this:

Binary Format	Compression	Compressed Size (bytes)	JSON -> Compressed Time (msec)	Compressed -> JSON Time (msec)
BSON	bzip2 (9)	17,344,419	3,979	3,651
CBOR	Uncompressed	101,739,795	492	2,310
Smile	Zstandard (0)	16,476,903	501	2,312
JSON	XZ (6)	12,908,440	5,850	2,195
CBOR	Zstandard (3)	15,923,174	541	3,138
CBOR	Zstandard (22)	13,999,497	12,826	2,698
Smile	Brotli (9)	14,704,655	3,729	2,675
CBOR	Zstandard (9)	14,625,854	704	2,877
Textual Ion	gzip	16,049,903	912	3,254
MessagePack	Zstandard (-5)	19,750,507	914	2,225
… 263 More Rows …

It was relatively easy to draw a couple of simple conclusions from these results. For example, looking only at the uncompressed sizes, Binary Ion was by far the most compact for our datasets. Looking at the compressed sizes, Textual Ion with the highest level of Brotli compression was the smallest.

Most compact binary formats

Binary Format	Uncompressed Size (bytes)
Binary Ion	63,672,734
Smile	72,283,777
MessagePack	96,113,007
CBOR	101,739,795
Textual Ion	117,878,664
BSON	129,823,535
JSON	133,284,487

10 most compact compressed formats

Binary Format	Compression	Compressed Size (bytes)
Textual Ion	Brotli (11)	11,903,951
JSON	Brotli (11)	11,999,727
MessagePack	Brotli (11)	12,194,016
Textual Ion	Brotli (10)	12,277,394
JSON	Brotli (10)	12,358,964
MessagePack	Brotli (10)	12,556,622
MessagePack	XZ (6)	12,734,272
Textual Ion	XZ (6)	12,840,804
MessagePack	XZ (5)	12,843,748
CBOR	Brotli (11)	12,861,137

However, our goal in comparing serialization formats and compression methods was not simply to find the smallest format. Our goal was to minimize our infrastructure costs. So, how do we use all of the measured data to find the optimal solution? Are the space savings from Brotli 11 worth the extra CPU time it takes to compress?

Analysis

While it is impossible to predict perfectly, we can use the measured data to estimate fairly well how many dollars each method would actually cost Lucid. Our services are deployed in AWS, where we can choose to pay a ‘fixed cost per CPU second’ by using Lambda, and S3 storage costs are relatively straightforward. So, to calculate an expected cost, we simply need estimates and assumptions for the following:

The cost per GB, per month to store the data in S3
The average cost per CPU second
The expected lifespan of the data
The expected number of times the persisted data will be used (decompressed and deserialized)

With these estimates and measured results, we can now assign an expected cost for every combination of serialization and compression technique and simply choose the one with the lowest expected cost!

Here are the assumptions that we ended up using:

AWS Pricing:

Lambda Cost per Second (for a full vCPU)	$0.00002917
S3 Standard Cost (per GB/Month)	$0.022
S3 IA Cost (per GB/Month)	$0.0125

Assumptions about our data

The percentage of our snapshots that will end up in Infrequent Access in S3	95%
Expected Lifespan of data	90 months
Number of times (on average) we would need to read (decompress and convert back to JSON) a given snapshot	2
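
One plausible way to combine these inputs is sketched below. The exact formula lives in our published spreadsheet, so treat this as an approximation: the blended storage rate, the decimal definition of a gigabyte, and the `expected_cost` helper are assumptions made for illustration, and the output will not necessarily match the published figures to the last digit. Sizes and times are the totals for the whole sample set, taken from the results table above.

```python
# Pricing and data assumptions from the tables above.
LAMBDA_COST_PER_CPU_SECOND = 0.00002917
S3_STANDARD_PER_GB_MONTH = 0.022
S3_IA_PER_GB_MONTH = 0.0125
INFREQUENT_ACCESS_SHARE = 0.95
LIFESPAN_MONTHS = 90
READS_PER_SNAPSHOT = 2

BYTES_PER_GB = 1e9  # assumption: decimal gigabytes


def expected_cost(size_bytes, write_ms, read_ms):
    """Estimated lifetime cost in dollars: storage plus one write and N reads."""
    blended_rate = (
        INFREQUENT_ACCESS_SHARE * S3_IA_PER_GB_MONTH
        + (1 - INFREQUENT_ACCESS_SHARE) * S3_STANDARD_PER_GB_MONTH
    )
    storage = size_bytes / BYTES_PER_GB * blended_rate * LIFESPAN_MONTHS
    cpu_seconds = (write_ms + READS_PER_SNAPSHOT * read_ms) / 1000
    return storage + cpu_seconds * LAMBDA_COST_PER_CPU_SECOND


# Two rows from the results table: (size in bytes, write msec, read msec).
json_xz6 = expected_cost(12_908_440, 5_850, 2_195)
cbor_uncompressed = expected_cost(101_739_795, 492, 2_310)

print(f"JSON + XZ (6):     ${json_xz6:.5f}")
print(f"CBOR uncompressed: ${cbor_uncompressed:.5f}")
```

With a 90-month lifespan the storage term dominates, which is why a slow but effective compressor can still come out ahead of a fast one.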

And when you run those numbers, you get the following results:

Final expected pricing results

Serialization	Compression	Expected Cost
JSON	Brotli (10)	$0.01572
MessagePack	XZ (2)	$0.01579
JSON	XZ (6)	$0.01583
MessagePack	Brotli (6)	$0.01584
MessagePack	Zstandard (15)	$0.01587
MessagePack	XZ (6)	$0.01588
MessagePack	XZ (1)	$0.01588
MessagePack	Brotli (5)	$0.01589
…
JSON	gzip	$0.01870
…
JSON	Uncompressed	$0.14551

This shows that, given our measured results and estimated costs, our best bet is using JSON serialization and level 10 Brotli compression (its second highest setting). This represents an expected 16% cost savings over our baseline of JSON and gzip!

Conclusions

We tried a variety of assumptions to see whether these results held true in different circumstances. Interestingly, JSON was always near the top of the list: even when it wasn’t the absolute best, it was still very competitive (within a couple of percentage points of the lowest expected cost).

Our analysis reveals a few interesting conclusions that are almost certainly applicable to many others’ circumstances as well.

Binary formats do result in smaller uncompressed file sizes.
Compressing the serialized data seems to level the playing field and ‘negates’ any wins from using the binary format.
Thus, the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method.
Choosing the best compression algorithm is a balancing game between the cost to store the data and the cost to compress the data, but you can choose the right balance according to your expected lifecycle and read patterns.

Recommendations

While they won’t apply to every scenario, the general takeaways and recommendations from the observed data are these:

If the data originated as JSON and needs to be converted back to JSON in order to be used, then JSON is probably going to be the most cost-effective persistence format as well, so long as you choose an appropriate compression algorithm.

Brotli works really, really well with JSON data. At its higher levels (10 & 11), it can be CPU expensive, but that will be cost-effective when the data has a long lifespan. Brotli also has the advantage that it can be served directly to any modern major browser.

For data with shorter lifespans, Zstandard (around level 9) offers much better compression than gzip but at roughly the same CPU cost.

We’ve published the full set of data and assumptions here. You’re welcome to take our measured data set and plug in your own assumptions and costs to see what formats might meet your needs. Of course, our document data won’t necessarily be representative of your data, but this might be a helpful starting point in comparing your options.

Footnotes

1. We only considered serialization formats that were compatible with generic JSON, as opposed to serialization methods that require a pre-defined schema – like Protobuf.

2. Generally, compression functions allow the user to specify the compression level. For Brotli, bzip2, and XZ, we tried all of the available compression levels. Zstandard provides an astonishing 27 different compression levels, so we only tested a subset of those. And while gzip does support different compression levels, the GZIPOutputStream class included in the JDK doesn’t support setting them (or at least not very well). Also, because we went into this with gzip as the baseline, we were not expecting it to be a contender. So, for gzip we just used the default compression level.

7 Comments

J.Smith • December 10, 2019 at 12:40 am

Is the test data available?
Richard Shurtz • December 10, 2019 at 11:45 am

I have just uploaded it – you can download it using this link.
Paul Draper • December 11, 2019 at 6:52 pm

Very interesting; thanks for the testing.
boogerlad • August 9, 2020 at 9:42 am

I noticed in the Google Sheets data that JSON uncompressed has a “JSON -> Compressed” time of 501.4ms and “Compressed -> JSON” time of 1347.9ms. Is that the time of `JSON.stringify` and `JSON.parse`, respectively?
Richard Shurtz • August 9, 2020 at 12:07 pm

“JSON -> Compressed” is the total time it took to take an in-memory representation of the JSON data-structure (using the Jackson library) to its final compressed state. The process was essentially two steps: first, write it to a Byte array using the specified serialization technique (BSON, JSON stringification, MessagePack, etc.), and second, run that Byte array through the compression algorithm to get the final compressed result.
“Compressed -> JSON” was the measured time it took to go through those two steps back to in-memory JSON.

Hope that helps!
Raja Nagendra Kumar • November 11, 2021 at 7:54 pm

It would have been nicer if the measurements showed, how much extra time that it would add as part of compression time and decompression times too and also marshaling and unmarshaling the JSON.
Clark • January 18, 2023 at 8:16 am

Hi! Could you check this compression library in your tests?
https://www.npmjs.com/package/@xobj/core