A Magical Mystery Tour of the LowCardinality Data Type
By Alexander Zaitsev
Many ClickHouse features like LowCardinality data type seem mysterious to new users. ClickHouse often deviates from standard SQL and many data types and operations do not even exist in other data warehouses. The key to understanding is that the ClickHouse engineering team values speed more than almost any other property. Mysterious SQL expressions often turn out to be ‘secret weapons’ to achieve unmatched speed.
In fact, the LowCardinality data type is an example of just such a feature. It has been available since Q4 2018 and was marked as production ready in Feb 2019, but still is not documented, magically appearing in some documentation examples. In this article we will fill the gap by explaining how LowCardinality works, and when it should be used.
The Tour Begins
LowCardinality is a data type, or, to put it differently, data type function. It can be used in order to modify any ClickHouse data type, but it is most often used for strings. The magic can be applied to the existing data. We will take the infamous ‘ontime’ dataset as an example. Being relatively small it is not the best for performance optimizations, but sufficient to demonstrate the main idea. The table contains 172M rows, describing US flights for several years. Suppose we are interested to figure out statistics related to top origin cities.
:) select OriginCityName, count() from ontime group by OriginCityName order by count() desc limit 10; ┌─OriginCityName────────┬──count()─┐
│ Chicago, IL │ 10872578 │
│ Atlanta, GA │ 9279569 │
│ Dallas/Fort Worth, TX │ 7760200 │
│ Houston, TX │ 5898651 │
│ Los Angeles, CA │ 5804789 │
│ New York, NY │ 5283856 │
│ Denver, CO │ 5199842 │
│ Phoenix, AZ │ 4891313 │
│ Washington, DC │ 4252095 │
│ San Francisco, CA │ 4027466 │
└───────────────────────┴──────────┘10 rows in set. Elapsed: 2.089 sec. Processed 172.34 million rows, 3.82 GB (82.51 million rows/s., 1.83 GB/s.)`
And then drill down to Chicago for more detail.
:) select OriginCityName, count(), uniq(FlightNum), sum(Distance) from ontime where OriginCityName = 'Chicago, IL'
group by OriginCityName order by count() desc; ┌─OriginCityName─┬──count()─┬─uniq(FlightNum)─┬─sum(Distance)─┐
│ Chicago, IL │ 10872578 │ 7765 │ 8143093140 │ └────────────────┴──────────┴─────────────────┴───────────────┘ 1 rows in set. Elapsed: 2.197 sec. Processed 172.34 million rows, 4.00 GB (78.45 million rows/s., 1.82 GB/s.)
Queries are relatively fast for Amazon t2.medium instances, but we can improve them a lot. Let’s look at how city name columns are stored.
:) SELECT column, any(type),
WHERE (table = 'ontime') AND active AND (column LIKE '%CityName')
GROUP BY column
ORDER BY column ASC
Now magic comes. We will change the type of OriginCityName column to LowCardinality. It can be done with simple spell that looks like an ALTER TABLE statement:
:) ALTER TABLE ontime MODIFY COLUMN OriginCityName LowCardinality(String); 0 rows in set. Elapsed: 19.258 sec.
Alter is performed online, and took 20 seconds in our case. Did it change anything? First of all, let’s look at column storage again.
As we can see and compare to DestCityName, the storage (compressed bytes) has been reduced in 2.5 times, which is substantial. More significantly, the uncompressed size has dropped almost in 10 times!
Does it affect query performance? Sure it does!
Query 1 (note, we do not need to change anything in queries, just re-run the same ones):
10 rows in set. Elapsed: 0.595 sec. Processed 172.34 million rows, 281.33 MB (289.75 million rows/s., 472.99 MB/s.)
1 rows in set. Elapsed: 1.475 sec. Processed 172.34 million rows, 460.89 MB (116.87 million rows/s., 312.54 MB/s.)
The performance of Query 1 has been improved 3.5 times! It only takes the OriginCityColumn. We should really expect a 2.5x improvement, but it is actually more. We will explain later why. The second query has been improved as well, but the difference is only 33%. Our modified column is used for filtering, but other columns still have to be read and processed as before. Can we improve it further? Yes, let’s apply the same magic to the FlightNum column as well.
Run our Query 2.
1 rows in set. Elapsed: 1.064 sec. Processed 172.34 million rows, 549.77 MB (161.98 million rows/s., 516.74 MB/s.)
And observe another 30% improvement!
Let’s summarize results into one table.
So, with simple and quick schema changes we could significantly improve the query performance. As we mentioned above, the ‘ontime’ dataset is not the best for LowCardinality demonstration. City names are relatively short, and flight numbers are even shorter. The effect of LowCardinality could be much more pronounced on longer strings. We have observed x10–20 effect in time series applications, when long path-like structures have been converted to LowCardinality.
Under The Hood
ClickHouse often impresses with its high performance. Sometimes it looks like magic. However, this is a result of very careful and smart engineering. LowCardinality data type is an example. It is the ClickHouse term for dictionary encoding, where strings are encoded as ‘positions’, referencing dictionary with position-to-string mapping. It works the best when source strings are lengthy and the number of distinct values is not that big. There is no hard limit in ClickHouse, the results are usually good if number of distinct values is below 10M. For big tables with many partitions and parts, the total number of distinct values can even be higher, if 10M limit is preserved at part level.
Internally, ClickHouse creates separate file or several files in order to store LowCardinality dictionaries. It can be a single common file per table, if all LowCardinality columns fit into 8192 distinct values, or one file per LowCardinality column if the number of distinct values is higher. The common dictionary is the most efficient, since ClickHouse can cache it, pre-calculate hashes for group by, and do other optimizations.
ClickHouse LowCardinality optimizations go beyond storage. It uses dictionary positions for filtering, grouping, speeding up some functions (e.g. length()) and so on. That’s why we see a bigger improvement in Query 1, than we could expect purely from a storage efficiency prospective. In distributed queries, ClickHouse also tries to operate on dictionary positions for most of query processing, replacing positions with actual strings as late as possible.
If you are interested in more detail, please refer to the following Yandex presentation on this topic: https://github.com/yandex/clickhouse-presentations/raw/master/meetup19/string_optimization.pdf
LowCardinality vs Enum
It is worth mentioning that there is another possibility to encode strings with a dictionary: Enums. ClickHouse supports Enums perfectly. From the storage prospective it may be even more efficient, since enum values are stored on the table definition rather than in a separate data file. Enum works fine for static dictionaries. However, if a value outside the originally enumeration is inserted, ClickHouse will throw an exception. Every change in the set of Enum values requires ALTER TABLE, which may bring a lot of pain. LowCardinality is much more flexible in this regard.
ClickHouse is a feature rich DBMS. It has a lot of carefully crafted technical decisions targeted to the best performance. LowCardinalty is one of those. When properly used it helps to reduce the storage requirements and significantly improve query performance. LowCardinality was a first and most important part of per-column encoding features. In recent releases column level codecs, including delta codec, have been added. We are going to review them in detail soon. Stay tuned!
Originally published at https://www.altinity.com on March 27, 2019.