AltinityDB

Mar 27, 2019

6 min read

A Magical Mystery Tour of the LowCardinality Data Type

By Alexander Zaitsev

Many ClickHouse features like LowCardinality data type seem mysterious to new users. ClickHouse often deviates from standard SQL and many data types and operations do not even exist in other data warehouses. The key to understanding is that the ClickHouse engineering team values speed more than almost any other property. Mysterious SQL expressions often turn out to be ‘secret weapons’ to achieve unmatched speed.

In fact, the LowCardinality data type is an example of just such a feature. It has been available since Q4 2018 and was marked as production ready in Feb 2019, but still is not documented, magically appearing in some documentation examples. In this article we will fill the gap by explaining how LowCardinality works, and when it should be used.

The Tour Begins

Query 1:

:) select OriginCityName, count() from ontime group by OriginCityName order by count() desc limit 10; ┌─OriginCityName────────┬──count()─┐ 
│ Chicago, IL │ 10872578 │
│ Atlanta, GA │ 9279569 │
│ Dallas/Fort Worth, TX │ 7760200 │
│ Houston, TX │ 5898651 │
│ Los Angeles, CA │ 5804789 │
│ New York, NY │ 5283856 │
│ Denver, CO │ 5199842 │
│ Phoenix, AZ │ 4891313 │
│ Washington, DC │ 4252095 │
│ San Francisco, CA │ 4027466 │
└───────────────────────┴──────────┘
10 rows in set. Elapsed: 2.089 sec. Processed 172.34 million rows, 3.82 GB (82.51 million rows/s., 1.83 GB/s.)`

And then drill down to Chicago for more detail.

Query 2.

:) select OriginCityName, count(), uniq(FlightNum), sum(Distance) from ontime where OriginCityName = 'Chicago, IL' 
group by OriginCityName order by count() desc;
┌─OriginCityName─┬──count()─┬─uniq(FlightNum)─┬─sum(Distance)─┐
│ Chicago, IL │ 10872578 │ 7765 │ 8143093140 │ └────────────────┴──────────┴─────────────────┴───────────────┘
1 rows in set. Elapsed: 2.197 sec. Processed 172.34 million rows, 4.00 GB (78.45 million rows/s., 1.82 GB/s.)

Queries are relatively fast for Amazon t2.medium instances, but we can improve them a lot. Let’s look at how city name columns are stored.

:) SELECT column, any(type),
sum(column_data_compressed_bytes) compressed,
sum(column_data_uncompressed_bytes) uncompressed,
sum(rows)
FROM system.parts_columns
WHERE (table = 'ontime') AND active AND (column LIKE '%CityName')
GROUP BY column
ORDER BY column ASC

Now magic comes. We will change the type of OriginCityName column to LowCardinality. It can be done with simple spell that looks like an ALTER TABLE statement:

:) ALTER TABLE ontime MODIFY COLUMN OriginCityName LowCardinality(String); 0 rows in set. Elapsed: 19.258 sec.

Alter is performed online, and took 20 seconds in our case. Did it change anything? First of all, let’s look at column storage again.

As we can see and compare to DestCityName, the storage (compressed bytes) has been reduced in 2.5 times, which is substantial. More significantly, the uncompressed size has dropped almost in 10 times!

Does it affect query performance? Sure it does!

Query 1 (note, we do not need to change anything in queries, just re-run the same ones):

10 rows in set. Elapsed: 0.595 sec. Processed 172.34 million rows, 281.33 MB (289.75 million rows/s., 472.99 MB/s.)

Query 2:

1 rows in set. Elapsed: 1.475 sec. Processed 172.34 million rows, 460.89 MB (116.87 million rows/s., 312.54 MB/s.)

The performance of Query 1 has been improved 3.5 times! It only takes the OriginCityColumn. We should really expect a 2.5x improvement, but it is actually more. We will explain later why. The second query has been improved as well, but the difference is only 33%. Our modified column is used for filtering, but other columns still have to be read and processed as before. Can we improve it further? Yes, let’s apply the same magic to the FlightNum column as well.

Before:

After:

Run our Query 2.

1 rows in set. Elapsed: 1.064 sec. Processed 172.34 million rows, 549.77 MB (161.98 million rows/s., 516.74 MB/s.)

And observe another 30% improvement!

Let’s summarize results into one table.

So, with simple and quick schema changes we could significantly improve the query performance. As we mentioned above, the ‘ontime’ dataset is not the best for LowCardinality demonstration. City names are relatively short, and flight numbers are even shorter. The effect of LowCardinality could be much more pronounced on longer strings. We have observed x10–20 effect in time series applications, when long path-like structures have been converted to LowCardinality.

Under The Hood

Internally, ClickHouse creates separate file or several files in order to store LowCardinality dictionaries. It can be a single common file per table, if all LowCardinality columns fit into 8192 distinct values, or one file per LowCardinality column if the number of distinct values is higher. The common dictionary is the most efficient, since ClickHouse can cache it, pre-calculate hashes for group by, and do other optimizations.

ClickHouse LowCardinality optimizations go beyond storage. It uses dictionary positions for filtering, grouping, speeding up some functions (e.g. length()) and so on. That’s why we see a bigger improvement in Query 1, than we could expect purely from a storage efficiency prospective. In distributed queries, ClickHouse also tries to operate on dictionary positions for most of query processing, replacing positions with actual strings as late as possible.

If you are interested in more detail, please refer to the following Yandex presentation on this topic: https://github.com/yandex/clickhouse-presentations/raw/master/meetup19/string_optimization.pdf

LowCardinality vs Enum

Conclusion

Originally published at https://www.altinity.com on March 27, 2019.