TL;DR: If you are considering using an alternative binary format in order to reduce the size of your persisted JSON, consider this: the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method. In our testing, Brotli proved to be very effective for long-term persistence.
There are a lot of data serialization formats out there. Perhaps none are more pervasive than JSON: the de facto serialization method for web applications. And, while it certainly isn’t perfect, its convenience and simplicity have made it our format of choice at Lucid. However, we recently undertook a project that made us question whether or not we should be using JSON at our persistence layer.
In order to improve the performance and fidelity of our revision history feature, we decided that we should start persisting ‘key-frames’ (snapshots) of our document-state data (rather than just the deltas). Our plan was initially to just gzip the document-state JSON when persisting the snapshots. However, as we started sampling some data and crunching the numbers, we realized that within a year or two we would have hundreds of terabytes of data, costing thousands of dollars per month in infrastructure costs. So, even if we could only reduce the size of our persisted data by a few percentage points, it would translate to real-world savings. Thus, we decided to investigate alternative serialization and compression methods to find the pair that would minimize the costs of persisting the new data.
JSON is human readable, relatively concise, simple to understand, and is universally supported. But, its simplicity and human-readability mean it isn’t the most space-efficient format out there.
For example, representing the number `1234.567890123457` takes 17 bytes in UTF-8 stringified JSON, whereas a binary format could represent the same number as an 8-byte double-precision float. Similarly, `false` takes 5 bytes in JSON but a single byte (or conceivably less) in a binary format. Because our document state includes plenty of booleans and numbers, it seemed like a no-brainer that a binary serialization technique would beat out JSON.
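The size comparison above is easy to check directly. The following Python sketch measures the two example values as JSON text and as raw fixed-width binary. The `struct` layout (a little-endian IEEE 754 double and a one-byte boolean) is only an illustrative stand-in for "a binary format"; real formats such as Ion or MessagePack add their own type tags, so their exact sizes will differ.

```python
import json
import struct

number = 1234.567890123457

# Size of each value as UTF-8 JSON text.
json_number_size = len(json.dumps(number).encode("utf-8"))
json_false_size = len(json.dumps(False).encode("utf-8"))

# Size of the same values in a raw fixed-width binary encoding
# (assumed layout: little-endian IEEE 754 double, one-byte boolean).
binary_number_size = len(struct.pack("<d", number))
binary_false_size = len(struct.pack("<?", False))

print(f"number: {json_number_size} bytes as JSON, {binary_number_size} bytes as a double")
print(f"false:  {json_false_size} bytes as JSON, {binary_false_size} byte as a boolean")
```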
We decided to test out the following serialization methods:
- Ion (Both Textual and Binary formats)
As a test bed of documents, we decided to use our system templates (Lucidchart and Lucidpress ‘blueprint’ documents that we provide to our users) as our sample data. We have about 1500 templates totaling 133.8 MB of document-state JSON.
For this set of documents we tried every combination of binary format and compression algorithm (at their various compression levels). From the tests, we wanted to record three primary metrics:
- The total CPU time to convert from JSON, serialize, and compress
- The final compressed size
- The total CPU time to decompress, deserialize, and then convert back to JSON
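To make the three metrics concrete, here is a minimal measurement harness in Python. It is a sketch under stated assumptions, not the code behind the numbers in this post: the original tests ran on the JVM, while this version uses only Python's standard library, so it covers gzip, bzip2, and XZ but not Brotli or Zstandard (which need third-party packages). The `sample` document and the `measure` helper are hypothetical names introduced for illustration.

```python
import bz2
import gzip
import json
import lzma
import time


def measure(document, compress, decompress):
    """Return (compressed_size, encode_cpu_seconds, decode_cpu_seconds)."""
    # Metric 1: CPU time to serialize and compress.
    start = time.process_time()
    serialized = json.dumps(document, separators=(",", ":")).encode("utf-8")
    compressed = compress(serialized)
    encode_time = time.process_time() - start

    # Metric 3: CPU time to decompress and deserialize.
    start = time.process_time()
    restored = json.loads(decompress(compressed).decode("utf-8"))
    decode_time = time.process_time() - start

    assert restored == document  # the round trip must be lossless
    # Metric 2: the final compressed size.
    return len(compressed), encode_time, decode_time


# Standard-library codecs only; the levels shown are arbitrary picks.
CODECS = {
    "gzip (6)": (lambda b: gzip.compress(b, compresslevel=6), gzip.decompress),
    "bzip2 (9)": (lambda b: bz2.compress(b, compresslevel=9), bz2.decompress),
    "XZ (6)": (lambda b: lzma.compress(b, preset=6), lzma.decompress),
}

# Hypothetical stand-in for a document-state snapshot.
sample = {"shapes": [{"id": i, "x": i * 1.5, "visible": i % 2 == 0} for i in range(2000)]}

results = {name: measure(sample, c, d) for name, (c, d) in CODECS.items()}
for name, (size, enc, dec) in results.items():
    print(f"{name:10} {size:8d} bytes  encode {enc * 1000:7.2f} ms  decode {dec * 1000:7.2f} ms")
```

`time.process_time` is used rather than wall-clock time because the cost model later in the post is priced per CPU second.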
So, after running the tests and measuring the data, we ended up with something like this:
| Binary Format | Compression | Compressed Size (bytes) | JSON -> Compressed Time (msec) | Compressed -> JSON Time (msec) |
|---|---|---|---|---|
| … 263 more rows … |
It was relatively easy to draw a couple of simple conclusions from these results. For example, just looking at the uncompressed sizes, Binary Ion was by far the most compact for our datasets. And looking at the compressed sizes, Textual Ion using the highest level of Brotli compression was the smallest.
Most compact binary formats
| Binary Format | Uncompressed Size (bytes) |
|---|---|
10 most compact compressed formats
| Binary Format | Compression | Compressed Size (bytes) |
|---|---|---|
| Textual Ion | Brotli (11) | 11,903,951 |
| Textual Ion | Brotli (10) | 12,277,394 |
| Textual Ion | XZ (6) | 12,840,804 |
However, our goal in comparing serialization formats and compression methods was not to simply find the smallest format. Our goal was to minimize our infrastructure costs. So, how do we use all of the measured data to find the optimal solution? Are the space savings from Brotli 11 worth the extra CPU time it takes to compress?
While impossible to perfectly predict, we actually can use the measured data to provide a pretty good estimate of how many dollars each method would actually cost Lucid. This is because our services are deployed in AWS, and we can choose to pay a ‘fixed cost per CPU second’ by using Lambda, and S3 costs for storing the data are relatively straightforward. So, if we want to calculate an expected cost, we simply need estimates and assumptions for the following:
- The cost per GB, per month to store the data in S3
- The average cost per CPU second
- The expected lifespan of the data
- The expected number of times the persisted data will be used (decompressed and deserialized)
With these estimates and measured results, we can now assign an expected cost for every combination of serialization and compression technique and simply choose the one with the lowest expected cost!
Here are the assumptions that we ended up using:
| Cost | Value |
|---|---|
| Lambda Cost per Second (for a full vCPU) | $0.00002917 |
| S3 Standard Cost (per GB/Month) | $0.022 |
| S3 IA Cost (per GB/Month) | $0.0125 |
Assumptions about our data
| Assumption | Value |
|---|---|
| The percent of our snapshots that will end up in Infrequent Access in S3 | 95% |
| Expected Lifespan of data | 90 months |
| Number of times (on average) we would need to read (decompress and convert back to JSON) a given snapshot | 2 |
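One plausible way to combine these assumptions with the measured results is sketched below in Python. The post does not spell out the exact formula, so treat this as a reconstruction: it charges storage at a blended S3 rate for the whole lifespan, charges CPU time once for compression and once per read for decompression, and ignores S3 request, retrieval, and transition fees. The function name `expected_cost` and the use of decimal gigabytes are assumptions made for illustration.

```python
BYTES_PER_GB = 1e9  # assumption: decimal gigabytes


def expected_cost(compressed_bytes, encode_cpu_s, decode_cpu_s, *,
                  lambda_per_cpu_s=0.00002917,
                  s3_standard_gb_month=0.022,
                  s3_ia_gb_month=0.0125,
                  ia_fraction=0.95,
                  lifespan_months=90,
                  reads=2):
    """Estimated lifetime cost, in dollars, of persisting one data set."""
    # Blend the two storage tiers by the share of data expected in each.
    blended_rate = (ia_fraction * s3_ia_gb_month
                    + (1 - ia_fraction) * s3_standard_gb_month)
    storage = compressed_bytes / BYTES_PER_GB * blended_rate * lifespan_months
    # Compress once, then decompress once per expected read.
    compute = lambda_per_cpu_s * (encode_cpu_s + reads * decode_cpu_s)
    return storage + compute


# Compressed size taken from the table above (Textual Ion, Brotli 11).
# The CPU times are hypothetical placeholders, not measured values.
cost = expected_cost(11_903_951, encode_cpu_s=300.0, decode_cpu_s=2.0)
print(f"expected cost: ${cost:.6f}")
```

Ranking every format and compression pair by this value, rather than by compressed size alone, is what lets CPU-heavy settings lose to cheaper ones when the storage savings are small.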
And when you run those numbers, you get the following results:
Final expected pricing results
This shows that, given our measured results and estimated costs, our best bet is using JSON serialization and level 10 Brotli compression (its second highest setting). This represents an expected 16% cost savings over our baseline of JSON and gzip!
We actually tried a variety of assumptions to see if these results held true in different circumstances. Interestingly, JSON was always at the top of the list—and if it wasn’t the absolute best, it was still very competitive (within a couple percentage points of expected cost).
Our analysis reveals a few interesting conclusions that are almost certainly applicable to many others’ circumstances as well.
- Binary formats do result in smaller uncompressed file sizes
- Compressing the serialized data seems to level the playing field and negate any wins from using a binary format.
- Thus, the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method.
- Choosing the best compression algorithm is a balancing game between the cost to store the data and the cost to compress the data, but you can choose the right balance according to your expected lifecycle and read patterns.
While they won’t apply to every scenario, the general takeaways and recommendations from the observed data are these:
- If the data originated as JSON, and needs to be converted back to JSON in order to be used, then JSON is probably going to be the most cost-effective persistence format as well, so long as you choose the appropriate compression algorithm.
- Brotli works really, really well with JSON data. At its higher levels (10 & 11), it can be CPU expensive, but that will be cost-effective when the data has a long lifespan. Brotli also has the advantage that it can be served directly to any modern major browser.
- For data with shorter lifespans, Zstandard (around level 9) offers much better compression than gzip but at roughly the same CPU cost.
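The lifespan trade-off in these recommendations can be expressed as a break-even point: the number of months of storage after which a smaller but more CPU-expensive option becomes cheaper overall. The Python sketch below is an illustration under stated assumptions. The default storage rate of $0.012975 per GB-month is the 95%/5% blend of the S3 IA and Standard prices listed earlier, the function name `break_even_months` is hypothetical, and the two options passed in at the end use made-up sizes and timings rather than measured ones.

```python
def break_even_months(small, large, *, cpu_cost_per_s=0.00002917,
                      storage_gb_month=0.012975, reads=2):
    """Months of storage after which the `small` option is the cheaper one.

    Each option is a tuple: (compressed_bytes, encode_cpu_s, decode_cpu_s).
    Returns None if `small` is not actually smaller, and 0.0 if it also
    costs no more CPU time (so it wins from the first month).
    """
    small_bytes, small_enc, small_dec = small
    large_bytes, large_enc, large_dec = large

    # Dollars saved per month by storing fewer bytes (decimal GB assumed).
    monthly_saving = (large_bytes - small_bytes) / 1e9 * storage_gb_month
    if monthly_saving <= 0:
        return None

    # One-off extra CPU spend: compress once, decompress once per read.
    extra_cpu_s = ((small_enc + reads * small_dec)
                   - (large_enc + reads * large_dec))
    extra_compute = cpu_cost_per_s * extra_cpu_s
    if extra_compute <= 0:
        return 0.0
    return extra_compute / monthly_saving


# Hypothetical options: a slow, strong compressor versus a fast, weaker one.
heavy = (12_000_000, 60.0, 1.0)
light = (14_000_000, 10.0, 1.0)
print(break_even_months(heavy, light))
```

If the expected lifespan of the data is longer than the break-even point, the heavier setting is worth its CPU cost; if it is shorter, the lighter one is the better choice.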
We’ve published the full set of data and assumptions here. You’re welcome to take our measured data set and plug in your own assumptions and costs to see what formats might meet your needs. Of course, our document data won’t necessarily be representative of your data, but this might be a helpful starting point in comparing your options.
1. We only considered serialization formats that were compatible with generic JSON, as opposed to serialization methods that require a pre-defined schema – like Protobuf.
2. Generally, compression functions allow the user to specify the compression level. For Brotli, bzip2, and XZ, we tried all of the available compression levels. Zstandard provides an astonishing 27 different compression levels, so we only tested a subset of those. And while gzip does support different compression levels, the `GZIPOutputStream` class included in the JDK doesn’t support setting one (or at least not very well). Also, because we went into this with gzip as the baseline, we were not expecting it to be a contender, so for gzip we just used the default compression level.
Is the test data available?
I have just uploaded it – you can download it using this link.
Very interesting; thanks for the testing.
I noticed in the Google Sheets data that JSON uncompressed has a “JSON -> Compressed” time of 501.4ms and “Compressed -> JSON” time of 1347.9ms. Is that the time of `JSON.stringify` and `JSON.parse`, respectively?
“JSON -> Compressed” is the total time it took to take an in-memory representation of the JSON data structure (using the Jackson library) to its final compressed state. The process was essentially two steps: first, write it to a byte array using the specified serialization technique (BSON, JSON stringification, MessagePack, etc.), and second, run that byte array through the compression algorithm to get the final compressed result.
“Compressed -> JSON” was the measured time it took to go through those two steps back to in-memory JSON.
Hope that helps!
It would have been nicer if the measurements had shown how much of the total time comes from compression and decompression, and how much from marshaling and unmarshaling the JSON.
Hi! Could you check this compression library in your tests?