Managing ClickHouse Datasets with ad-cli
By Robert Hodges
Large datasets are critical for anyone trying out or testing ClickHouse. ClickHouse is so fast that you typically need at least 100M rows to discern differences when tuning queries. Also, killer features like materialized views are much more interesting with large volumes of diverse data. Despite the importance of such datasets to ClickHouse users, there is little tooling available to help manage them easily.
By contrast Python libraries like scikit-learn have a nice ability to reload data from standard collections like the famous Iris dataset. Inspired by their example we have developed a simple tool that makes it easy to capture and reload ClickHouse datasets that run to billions of rows. Our tool is called an ad-cli. It solves two problems at once:
- Define a portable format for datasets
- Provide a tool to create and load such datasets easily.
The sections that follow introduce ad-cli and describe how it works, along with the dataset format. We’ll conclude with a summary of the roadmap. This is just the first step toward what we plan to develop into powerful bulk data transfer capability as well as a family of large test datasets that anyone can load quickly.
Ad-cli is written in Python and runs on Linux systems. Like ClickHouse it uses an Apache 2.0 license. Ad-cli requires python3, pip3, and clickhouse-client to run. Here’s a typical installation command to install the latest version from pypi.org.
sudo apt install clickhouse-client
sudo pip3 install altinity-datasets
You can also install straight from the source on Github.
pip3 install git+https://github.com/altinity/altinity-datasets.git
Once installed you can check that ad-cli is working by typing ‘ad-cli version’, which should result in something like the following:
$ ad-cli version
ad-cli 0.1.2, Python 3.7.1
Loading a Dataset
Ad-cli has some small built-in datasets. You can find out what’s available using the dataset search command, as shown below. This will print out a library listing of datasets.
$ ad-cli dataset search --full=false
name : iris
title : Iris Flower Data Set
description: Famous iris plant data set published in Ronald Fisher's 1936 paper "The use of multiple measurements in taxonomic problems." Commonly used to illustrate statistical and machine learning procedures.
size : 150 rows
sources : ['https://archive.ics.uci.edu/ml/datasets/iris']
notes : None
repo : built-ins
path : /home/rhodges/git/altinity-datasets/my-env/lib/python3.7/site-packages/altinity_datasets/built-ins/iris
Assuming you have a local clickhouse-client server available, you can load one of the datasets with a command like the following.
$ ad-cli dataset load iris
Creating database if it does not exist: iris
Executing DDL: /home/rhodges/git/altinity-datasets/my-env/lib/python3.7/site-packages/altinity_datasets/built-ins/iris/ddl/iris.sql
Loading data: table=iris, file=iris.csv
Operation summary: succeeded=1, failed=0
This command does three things:
- Creates a database on the server if it does not exist.
- Runs a SQL command to create the database.
- Loads the table data from CSV.
That’s it. ClickHouse now has a database named iris. In the database is a table named iris with 150 rows.
Dumping and Reloading Datasets
Ad-cli can create datasets from existing tables in ClickHouse. Let’s say we want to dump the iris dataset to files so we can reload it somewhere else. You can do that with the following command:
$ ad-cli dataset dump --repo-path=/tmp --database=iris my-iris
Preparing dataset directory: /tmp/my-iris
Computing data set size
Table: iris Rows: 150
Total rows: 150
Writing manifest: /tmp/my-iris/manifest.yaml
Dumping data: table=iris, partition=Iris-setosa
Dumping data: table=iris, partition=Iris-versicolor
Dumping data: table=iris, partition=Iris-virginica
Operation summary: succeeded=3, failed=0
Let’s have a look at what ad-cli created.
$ find /tmp/my-iris -print
/tmp/my-iris/ddl/iris.sql <-- Table definition
/tmp/my-iris/manifest.yaml <-- Description of dataset
/tmp/my-iris/data/iris/data-Iris-versicolor.csv <-- Data in CSV
The ddl directory contains zero or more scripts to run before loading data. The manifest.yaml file is a formatted YAML file with descriptive information about the dataset. Finally, the data directory contains table data. Each table has its own subdirectory, and data files contain data by partition key, which make loading and merging more efficient. You can take the files apart yourself and look at how they work — it’s really simple. Further details are available in the README.md file.
We can turn around and reload the new dataset to another database as follows:
$ ad-cli dataset load --repo-path=/tmp --database=iris25 my-iris
That’s it! You can clone and reload small datasets from live data in a few seconds. We’re ready for something bigger.
Handling Larger Datasets
Ad-cli has two features that enable it to work efficiently with large datasets.
- Parallel operations — Running multiple clickhouse-client processes in parallel
- Compression — ad-cli can write and read CSV files using gzip
Let’s now capture something big. One of our favorite datasets at Altinity is the famous NYC Taxi Dataset. To dump data we’ll run up to 6 clickhouse-client processes in parallel and ask for data to be compressed. Here’s the command:
ad-cli dataset dump nyc_taxi_rides --repo-path=. \
On an Amazon md5.2xlarge with 8 vCPUs this command completes in 34.36 minutes and creates an on-disk dataset that is 38 GB in size. The setting for — parallel can vary according to the capacities of the host. In this case, we are also running the clickhouse-server process locally. 6 processes give reasonable performance and results in 100% CPU utilization with a load average of close to 8, which means that processors are fully used without excessive process switching.
You can now rsync data to another host and load there. Or load directly to the other host as shown in the example below.
ad-cli dataset load nyc_taxi_rides --host=other.host.com \
--repo_path=. --compress --parallel=6
Running on the same host as before this command completes in a little over 11 minutes. One of the main reasons for the big difference between dump and load is that gzip commands have highly asymmetrical performance for compressing and decompressing data. The difference in this example is around 6x, as in ~36s to compress 10 million lines of CSV data versus ~6s to decompress the same amount.
To keep things simple we have shown examples of ad-cli largely using default connection parameters. Ad-cli also supports extra parameters to set the user/password, change the host and port, and use secure connections. Here’s an example:
ad-cli dataset load iris -H a.secure.host.com -P 9440 \
-u special -p secret --secure
For more information check out the README.md or use the — help/-h option on individual commands.
There are a number of features we could add to make ad-cli more robust and useful. The most important is to support the full range of formats that ClickHouse can handle. That said, our long-term priority is to build generalized bulk load with the following important features:
- Provide fast and flexible loading of individual tables including supporting different formats, automatic retry on failure, and trapping erroneous rows in a separate file for correction and resubmission.
- Add support for dumping and loading directly via object storage, such as Amazon S3.
- Create a library of public datasets for ClickHouse that can be easily located and loaded. Ad-cli already introduces the concept of a repo for this reason.
This article introduced ad-cli, a simple tool to dump and load ClickHouse datasets. It provides a basic format for datasets. It also dumps and loads datasets via that format using one-line commands.
We would like to invite you to try ad-cli and see what you think. Please send feedback via Github issues. If you fix a bug or add your own features, send us a pull request. We gladly accept contributions from the community. Thanks and enjoy!
Originally published at https://www.altinity.com on July 1, 2019.