How TimescaleDB compresses time-series data
by lkanwoqwp | 65 points | 6 comments | 2026-06-15 12:29:17 Central

Comments

gopalv > What does compression do to query performance?

That section is the most relevant whenever compression in a DB
is discussed.

The purpose of a database is to find,
aggregate or update data - storage is where the trade-off
gets expressed. There are no silver bullets here.

Any
method of compression which speeds up either filter
rejection or scan rate is better than something that only
trades off IO for CPU usage.

For example, dictionary
encoding can be slower to read (because you decompress the
whole dictionary rather than only the values left to read
after the filter skips the rest), but not if you can squeeze
out an IN clause by turning string comparisons into O(1)
dictionary lookups followed by a simple integer filter.
Remember, the filter can be arbitrarily complex (Druid is a
great example of this), and bitmaps can then be used because
the dictionary indexes form a dense 0-N range.

Even better if that can feed a deterministic
operation like UPPER(), so that you apply it once per
dictionary hit instead of once per row. You can even
reuse the same hash slot for the result, instead of doing
another dictionary collision check or hash computation.

If anyone
is looking at JSONB compression, go take a long look at
the Variant encoding proposals from Databricks/Snowflake
for Iceberg[1].

Turning a single-column "payload" JSONB
field into chunks which are columnarized and strictly
typed allows you to do all the tricks mentioned here on
loosely typed data, chunk by chunk.

[1] - https://github.com/apache/parquet-format/blob/master/Variant...
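
The dictionary-encoding argument above can be illustrated with a small sketch. This is not how TimescaleDB or Druid implement it; the function names (`dictionary_encode`, `filter_in`, `upper_via_dictionary`) and the sample column are made up for illustration. It only shows the shape of the trick: string comparisons happen once per dictionary entry, rows are then filtered by dense integer code, and a deterministic function such as UPPER() runs once per distinct value.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): the distinct strings and one dense int code per row."""
    dictionary = []   # code -> string
    index = {}        # string -> code
    codes = []
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def filter_in(dictionary, codes, wanted):
    """Evaluate `col IN wanted`: strings are compared once per dictionary entry,
    then each row is checked with a plain integer index."""
    wanted = set(wanted)
    # Codes are a dense 0..N-1 range, so a flat list can act as a bitmap.
    hit = [entry in wanted for entry in dictionary]
    return [row for row, code in enumerate(codes) if hit[code]]


def upper_via_dictionary(dictionary, codes):
    """Apply a deterministic function once per distinct value, not once per row."""
    upper_dict = [entry.upper() for entry in dictionary]
    return [upper_dict[code] for code in codes]


# Hypothetical low-cardinality column.
column = ["eu-west", "us-east", "eu-west", "ap-south", "us-east", "eu-west"]
dictionary, codes = dictionary_encode(column)
matching_rows = filter_in(dictionary, codes, ["eu-west", "ap-south"])
uppercased = upper_via_dictionary(dictionary, codes)
```

With six rows and three distinct values, the IN check does three string lookups instead of six, and the gap grows with the row count.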
| > PaulWaldman There's an issue tracking TimescaleDB JSONB
compression:
https://github.com/timescale/timescaledb/issues/2978
| tudorg I have been working on another PG extension for timeseries
(https://github.com/xataio/deltax) for a few months, and
trying to score as well as possible on ClickBench.

This is a project that is simply a lot of fun to work on. There are
many tricks that can be used to speed up analytics,
besides just type-aware compression:

* For each segment you keep things like max/min/sum, number of
  distinct values, bloom filters, etc. A good number of common
  queries can be answered from this metadata alone, so you
  don't need to decompress the columns at all.
* For text columns, you compress them differently based on
  cardinality. Low cardinality (think labels or similar) gets
  dictionary-based compression. High cardinality gets LZ4.
* Generally, the smaller the data on disk, the better the
  cold-run performance, because you need less IO to load it
  into memory. I have found that on top of the type-aware
  compression, it's worth doing another round of LZ4. There's
  also some research suggesting that it's sometimes worth doing
  multiple passes of LZ4.
* Partition and segment pruning. If you can tell from the
  metadata or bloom filters that the filter doesn't match a
  partition or segment, you skip the whole thing.
* Pushdown of filters into the decompression layer. Depending
  on the compression algorithm, while you decompress you can
  also filter out the values that you don't need. This avoids
  passing data and allocating memory for elements that will be
  discarded later anyway.
* Organization of data on disk is more important than almost
  anything else. Of course, that's the main point of columnar
  storage, but there are levels of detail in how to organize
  the data so that IO is minimized during queries. I tried
  3-4 different layouts before settling on one.
* For top N queries, which are really common in analytics, you
  want to stop reading from disk and decompressing as soon as
  you have enough data to guarantee a correct top N for the
  query.
* Parallelize everything: ClickBench, at least, runs on
  instances with a lot of CPU cores, so you need to parallelize
  every step of the way. This is done differently depending on
  the query type. For example, for top N, each worker can take
  a subset of the segments and get the top N from each of them.
  Then you combine those into a single result.
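
The segment-metadata and pruning points in the list above can be sketched in a few lines. This is a toy model, not the deltax implementation: the `Segment` class, `make_segment`, and `sum_between` are hypothetical names, and a plain Python list stands in for a compressed column chunk. The idea it shows is that min/max/sum kept per segment let a range query skip segments that cannot match and answer from metadata for segments that match entirely, so only partially overlapping segments are decompressed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    values: List[int]   # stand-in for the compressed column data
    lo: int             # per-segment metadata, computed at write time
    hi: int
    total: int


def make_segment(values: List[int]) -> Segment:
    return Segment(values=values, lo=min(values), hi=max(values), total=sum(values))


def sum_between(segments: List[Segment], low: int, high: int) -> Tuple[int, int]:
    """Compute SUM(v) WHERE low <= v <= high.

    Returns (sum, number of segments that had to be scanned).
    """
    result = 0
    scanned = 0
    for seg in segments:
        if seg.hi < low or seg.lo > high:
            continue                  # pruned: no row in this segment can match
        if low <= seg.lo and seg.hi <= high:
            result += seg.total       # every row matches: metadata is enough
            continue
        scanned += 1                  # partial overlap: "decompress" and filter
        result += sum(v for v in seg.values if low <= v <= high)
    return result, scanned


chunks = ([1, 2, 3], [10, 11, 12], [20, 25, 30], [40, 41])
segments = [make_segment(chunk) for chunk in chunks]
total, scanned = sum_between(segments, 10, 25)
```

In this example only one of the four segments is actually scanned; two are pruned and one is answered from its stored sum.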
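
The parallel top N strategy from the last bullet can be sketched as follows. The names `top_n_of_segment` and `parallel_top_n` are invented for this example, and segments are plain lists. A thread pool is used only to show the structure; in CPython threads will not give real CPU parallelism for this kind of work, and a real extension would use its own worker processes.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def top_n_of_segment(segment, n):
    """Top N values of one segment, largest first."""
    return heapq.nlargest(n, segment)


def parallel_top_n(segments, n, workers=4):
    """Each worker computes a per-segment top N; the partial results are then merged."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda seg: top_n_of_segment(seg, n), segments))
    # Any value in the global top N is also in the top N of its own segment,
    # so merging the partial lists cannot miss a result.
    return heapq.nlargest(n, (v for part in partials for v in part))


segments = [[5, 1, 9], [7, 7, 2], [3, 8, 6], [4]]
result = parallel_top_n(segments, 3)
```

The merge step only sees at most N values per segment, which is what keeps the final combine cheap even with many segments.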
| blackoil Gorilla by Facebook had this. Values are stored as an XOR
against the previous value, and time as delta of delta.
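
A rough sketch of delta-of-delta encoding for timestamps, for readers who have not seen it. The function names and sample timestamps are made up, and the variable-length bit packing that Gorilla applies to the resulting small numbers is left out. The point is that regularly spaced timestamps turn into a stream of values that are mostly zero.

```python
def delta_of_delta_encode(timestamps):
    """Return (first timestamp, first delta, list of delta-of-deltas)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    first = timestamps[0]
    first_delta = timestamps[1] - timestamps[0]
    dods = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dods.append(delta - prev_delta)   # change in the interval, usually 0
        prev_delta = delta
    return first, first_delta, dods


def delta_of_delta_decode(first, first_delta, dods):
    """Rebuild the original timestamps from the encoded form."""
    timestamps = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Hypothetical samples taken roughly every 60 seconds, with one late by a second.
timestamps = [1000, 1060, 1120, 1180, 1241, 1301]
first, first_delta, dods = delta_of_delta_encode(timestamps)
decoded = delta_of_delta_decode(first, first_delta, dods)
```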
| > lokar They say they are using "gorilla compression". I'm
still amazed every time I go back and read how the
compression for floating point values works.
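
For the floating point side, Gorilla XORs the bit pattern of each value with the previous one and stores only the non-zero window of the result. The sketch below shows just that first step and measures the window; the helper names (`float_bits`, `xor_stream`, `meaningful_window`) and the sample readings are assumptions for illustration, and the control bits and bit packing of the real format are omitted.

```python
import struct


def float_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """XOR each value's bits with the previous value's bits."""
    out = []
    prev = float_bits(values[0])
    for v in values[1:]:
        cur = float_bits(v)
        out.append(prev ^ cur)
        prev = cur
    return out


def meaningful_window(x):
    """Return (leading zeros, trailing zeros, meaningful bits) of a 64-bit XOR result."""
    if x == 0:
        return 64, 0, 0                        # identical value: nothing to store
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1       # index of the lowest set bit
    return leading, trailing, 64 - leading - trailing


# Hypothetical sensor readings that change slowly.
readings = [12.0, 12.0, 24.0, 15.0, 12.0]
xors = xor_stream(readings)
windows = [meaningful_window(x) for x in xors]
```

Repeated values XOR to zero, and nearby values differ in only a handful of bits, which is why the scheme works well on slowly changing metrics.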
| > f311a It's used in ClickHouse as well. CH supports a wide range of
compression codecs (Gorilla, DoubleDelta, Delta, LZ4, ZSTD and more), and
they are documented pretty well.