For most of the past year I have been working with Hexvarium. Based in Atherton, California, the company builds and manages fiber optic networks. They currently have several networks in the Bay Area but have plans to expand in the US.
My role is to manage a data platform that contains 30 billion records from approximately 70 information sources. This data is used by engineers who develop optimal fiber network deployment plans using LocalSolver. The following is an example of one such plan.
Our data platform mainly consists of PostGIS, ClickHouse and BigQuery. The data comes in a variety of, but often suboptimal, formats. We then shape and enrich the data, often with PostGIS or ClickHouse, before sending it to BigQuery. Engineers working with LocalSolver will then get their data from the cleaned up and up-to-date version in BigQuery.
Analysis fuels the imagination of engineers, which is why we often visualize data generated with Unfolded. Below is a visualization we created for Ookla’s Speedtest last summer. We grouped the records by zoom level H3 9 and took the highest average download speed from each hexagon and plotted them on the map.
Below is another visualization we created showing the difference in maximum broadband speed between June and November last year. Data to the FCC was provided by broadband providers from the United States.
Interesting features of DuckDB
DuckDB is primarily the work of Mark Raasveldt and Hannes Mühleisen. It consists of a million lines of C++ and runs as a standalone binary. Development has been very active, with the number of commits on the GitHub repository doubling almost every year since its launch in 2018. DuckDB uses the PostgreSQL SQL parser, Google’s RE2 regular expression engine, and the SQLite wrapper.
SQLite supports five data types: NULL, INTEGER, REAL, TEXT, and BLOB. This has always frustrated me as dealing with time required transformations in each SELECT statement, and not being able to describe a field as a boolean means that the analysis software could not automatically recognize and provide certain UI controls and rendering of those fields.
Luckily, DuckDB supports 25 data types out of the box, and more can be added using extensions. Work is underway on an extension that will port Paul Ramsey’s PostGIS geospatial functionality to DuckDB. Isaac Brodsky, chief engineer of Unfolded’s parent company Foursquare, also recently published the H3 extension for DuckDB. Although nothing has been published yet, attempts are being made to embed GDAL through its Arrow-based interface in DuckDB.
Min/Max index is created for each column segment in DuckDB. This type of index is why most OLAP databases respond so quickly to aggregation queries. The Parquet and JSON extensions come in the official build and their use is well documented. Snappy and ZStandard compression is supported for Parquet files.
The DuckDB documentation is well organized and very concise, with examples alongside most of the descriptions.
The official version of DuckDB does not contain the Geospatial and H3 extensions used in this post, so I will compile DuckDB with those extensions.
The commands below were run on an e2-standard-4 instance on Google Cloud running Ubuntu 20 LTS. This virtual machine contains 4 vCPUs and 16 GB of RAM. It was launched in Los Angeles, in the us-west2-b region, and has a balanced 100 GB persistent disk. The launch cost $0.18 an hour.
The software packages used in this post will be installed below.
$ sudo apt update $ sudo apt install \ awscli \ build-essential \ libssl-dev \ pigz \ unzip
Compiling the H3 extension requires CMake version 3.20+, so I’ll build it first.
$ cd ~ $ wget -c https://github.com/Kitware/CMake/releases/download/v3.20.0/cmake-3.20.0.tar.gz $ tar -xzf cmake-3.20.0.tar.gz $ cd cmake-3.20.0 $ ./bootstrap --parallel=$(nproc) $ make -j$(nproc) $ sudo make install
The following commands will build both DuckDB and the H3 extension for it. The geo extension is tied to the DuckDB commit c817201, so I’ll also attach the H3 extension to it.
$ git clone https://github.com/isaacbrodsky/h3-duckdb ~/duckdb_h3 $ cd ~/duckdb_h3 $ git submodule update --init $ cd duckdb $ git checkout c817201 $ cd .. $ CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \ make release
Next, an extension will be created that will transfer parts of the PostGIS functionality to DuckDB.
$ git clone https://github.com/handstuyennn/geo ~/duckdb_geo $ cd ~/duckdb_geo $ git submodule init $ git submodule update --recursive --remote $ mkdir -p build/release $ cmake \ ./duckdb/CMakeLists.txt \ -DEXTERNAL_EXTENSION_DIRECTORIES=../geo \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DEXTENSION_STATIC_BUILD=1 \ -DBUILD_TPCH_EXTENSION=1 \ -DBUILD_PARQUET_EXTENSION=1 \ -B build/release $ CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \ cmake --build build/release
I will create a shell script that will run the DuckDB binary with support for unsigned extensions.
$ echo "$HOME/duckdb_geo/build/release/duckdb -unsigned \$@" \ | sudo tee /usr/sbin/duckdb $ sudo chmod +x /usr/sbin/duckdb
I’ll set up the DuckDB config file to load both extensions by default.
$ vi ~/.duckdbrc
.timer on LOAD '/home/mark/duckdb_h3/build/release/h3.duckdb_extension'; LOAD '/home/mark/duckdb_geo/build/release/extension/geo/geo.duckdb_extension';
I will start DuckDB and check if the extensions work.
SELECT h3_cell_to_parent(CAST(586265647244115967 AS ubigint), 1) test; ┌────────────────────┐ │ test │ │ uint64 │ ├────────────────────┤ │ 581764796395814911 │ └────────────────────┘
SELECT ST_MAKEPOINT(52.347113, 4.869454) test; ┌────────────────────────────────────────────┐ │ test │ │ geography │ ├────────────────────────────────────────────┤ │ 01010000001B82E3326E2C4A406B813D26527A1340 │ └────────────────────────────────────────────┘
I find it useful that by default the data type is shown under each field name.
Ookla Speedtest Dataset
Ookla publishes its Speedtest data to AWS in Parquet format. Below I will download the 2022 mobile dataset.
$ aws s3 sync \ --no-sign-request \ s3://ookla-open-data/parquet/performance/type=mobile/year=2022/ \ ./
The folder path structure is designed to support Apache Hive partitioning. I will only be using four Parquet files, so I will move them into one folder.
$ mv quarter\=*/*.parquet ./
Below are the sizes for each of the files. In total, they contain 15,738,442 lines of data.
$ ls -lht *.parquet 173M 2022-10-01_performance_mobile_tiles.parquet 180M 2022-07-01_performance_mobile_tiles.parquet 179M 2022-04-01_performance_mobile_tiles.parquet 174M 2022-01-01_performance_mobile_tiles.parquet
Below is the schema used in these Parquet files.
$ echo "SELECT name, type, converted_type FROM parquet_schema('2022-10*.parquet');" \ | duckdb ┌────────────┬────────────┬────────────────┐ │ name │ type │ converted_type │ │ varchar │ varchar │ varchar │ ├────────────┼────────────┼────────────────┤ │ schema │ BOOLEAN │ UTF8 │ │ quadkey │ BYTE_ARRAY │ UTF8 │ │ tile │ BYTE_ARRAY │ UTF8 │ │ avg_d_kbps │ INT64 │ INT_64 │ │ avg_u_kbps │ INT64 │ INT_64 │ │ avg_lat_ms │ INT64 │ INT_64 │ │ tests │ INT64 │ INT_64 │ │ devices │ INT64 │ INT_64 │ └────────────┴────────────┴────────────────┘
Ookla used Snappy compression for each column. DuckDB can diagnose how the data is located and provide statistical information for each column. Below are the details of the tile column.
$ echo ".mode line SELECT * EXCLUDE (stats_min_value, stats_max_value) FROM parquet_metadata('2022-10-01_*.parquet') WHERE path_in_schema="tile" AND row_group_id = 0;" \ | duckdb file_name = 2022-10-01_performance_mobile_tiles.parquet row_group_id = 0 row_group_num_rows = 665938 row_group_num_columns = 7 row_group_bytes = 31596007 column_id = 1 file_offset = 24801937 num_values = 665938 path_in_schema = tile type = BYTE_ARRAY stats_min = stats_max = stats_null_count = 0 stats_distinct_count = compression = SNAPPY encodings = PLAIN_DICTIONARY, PLAIN, RLE, PLAIN index_page_offset = 0 dictionary_page_offset = 3332511 data_page_offset = 3549527 total_compressed_size = 21469426 total_uncompressed_size = 124569336
Import Parquet Data
I can now import all four Parquet files into DuckDB with a single SQL statement. The command will add an additional column indicating the name of the file from which each line of data comes. From the original 706 MB, Parquet and the extra column turned into a 1.4 GB DuckDB file.
$ echo "CREATE TABLE mobile_perf AS SELECT * FROM read_parquet('*.parquet', filename=true);" \ | duckdb ~/ookla.duckdb
Alternatively, using the DuckDB HTTPFS extension, I can download Parquet files directly from S3.
$ echo "INSTALL httpfs; CREATE TABLE mobile_perf AS SELECT * FROM parquet_scan('s3://ookla-open-data/parquet/performance/type=mobile/year=2022/*/*.parquet', FILENAME=1);" \ | duckdb ~/ookla.duckdb
If it doesn’t already exist, the HTTPFS extension will be downloaded and installed automatically. Keep in mind that sometimes builds aren’t ready yet and you might see the following:
Error: near line 1: IO Error: Failed to download extension "httpfs" at URL "http://extensions.duckdb.org/7813eea926/linux_amd64/httpfs.duckdb_extension.gz" Extension "httpfs" is an existing extension. Are you using a development build? In this case, extensions might not (yet) be uploaded.
DuckDB table structure
Below is the table schema that DuckDB automatically generates.
$ echo '.schema --indent' \ | duckdb ~/ookla.duckdb
CREATE TABLE mobile_perf( quadkey VARCHAR, tile VARCHAR, avg_d_kbps BIGINT, avg_u_kbps BIGINT, avg_lat_ms BIGINT, tests BIGINT, devices BIGINT, filename VARCHAR );
Below is an example entry.
$ echo '.mode line SELECT * FROM mobile_perf LIMIT 1' \ | duckdb ~/ookla.duckdb
quadkey = 0022133222330023 tile = POLYGON((-160.043334960938 70.6363054807905, -160.037841796875 70.6363054807905, -160.037841796875 70.6344840663086, -160.043334960938 70.6344840663086, -160.043334960938 70.6363054807905)) avg_d_kbps = 15600 avg_u_kbps = 14609 avg_lat_ms = 168 tests = 2 devices = 1 filename = 2022-10-01_performance_mobile_tiles.parquet
The following are the compression schemes used in each of the first field segments in the first row group.
$ echo "SELECT column_name, segment_type, compression, stats FROM pragma_storage_info('mobile_perf') WHERE row_group_id = 0 AND segment_id = 0 AND segment_type != 'VALIDITY';" \ | duckdb ~/ookla.duckdb ┌─────────────┬──────────────┬─────────────┬───────────────────────────────────┐ │ column_name │ segment_type │ compression │ stats │ │ varchar │ varchar │ varchar │ varchar │ ├─────────────┼──────────────┼─────────────┼───────────────────────────────────┤ │ quadkey │ VARCHAR │ FSST │ [Min: 00221332, Max: 02123303, … │ │ tile │ VARCHAR │ FSST │ [Min: POLYGON(, Max: POLYGON(, … │ │ avg_d_kbps │ BIGINT │ BitPacking │ [Min: 1, Max: 3907561][Has Null… │ │ avg_u_kbps │ BIGINT │ BitPacking │ [Min: 1, Max: 917915][Has Null:… │ │ avg_lat_ms │ BIGINT │ BitPacking │ [Min: 0, Max: 2692][Has Null: f… │ │ tests │ BIGINT │ BitPacking │ [Min: 1, Max: 1058][Has Null: f… │ │ devices │ BIGINT │ BitPacking │ [Min: 1, Max: 186][Has Null: fa… │ │ filename │ VARCHAR │ Dictionary │ [Min: 2022-10-, Max: 2022-10-, … │ └─────────────┴──────────────┴─────────────┴───────────────────────────────────┘
I’ll start DuckDB with the database I just created, add three columns for zoom levels 7, 8, and 9, and then cook up the hexagon values from the tile’s “geometry” column.
$ duckdb ~/ookla.duckdb
ALTER TABLE mobile_perf ADD COLUMN h3_7 VARCHAR(15); ALTER TABLE mobile_perf ADD COLUMN h3_8 VARCHAR(15); ALTER TABLE mobile_perf ADD COLUMN h3_9 VARCHAR(15); UPDATE mobile_perf SET h3_7 = printf('%x', h3_latlng_to_cell( ST_Y(ST_CENTROID(ST_GEOMFROMTEXT(tile))), ST_X(ST_CENTROID(ST_GEOMFROMTEXT(tile))), 7)::bigint), h3_8 = printf('%x', h3_latlng_to_cell( ST_Y(ST_CENTROID(ST_GEOMFROMTEXT(tile))), ST_X(ST_CENTROID(ST_GEOMFROMTEXT(tile))), 8)::bigint), h3_9 = printf('%x', h3_latlng_to_cell( ST_Y(ST_CENTROID(ST_GEOMFROMTEXT(tile))), ST_X(ST_CENTROID(ST_GEOMFROMTEXT(tile))), 9)::bigint);
Below is an example of an extended notation.
.mode line SELECT h3_7, h3_8, h3_9, tile FROM mobile_perf LIMIT 1 h3_7 = 8730e0ae9ffffff h3_8 = 8830e0ae9dfffff h3_9 = 8930e0ae9d3ffff tile = POLYGON((126.837158203125 37.5576424267952, 126.842651367188 37.5576424267952, 126.842651367188 37.5532876459577, 126.837158203125 37.5532876459577, 126.837158203125 37.5576424267952))
When I tried the above on a 4-core GCP e2-standard-4 VM, I left the job running overnight and it hadn’t completed yet when I checked it in the morning. Later ran the above on a 16-core GCP e2-standard-16 VM and 15.7 million records were enriched in 2 minutes 21 seconds. RAM usage had an upper bound of ~7 GB. I shared this with Isaac to see if there are any obvious fixes that could be applied.
Some databases can create new tables faster than changing existing ones. Didn’t find that this applies to DuckDB. The next attempt was significantly slower than the previous approach.
CREATE TABLE mobile_perf2 AS SELECT quadkey, tile, avg_d_kbps, avg_u_kbps, avg_lat_ms, tests, devices, filename, printf('%x', h3_latlng_to_cell( ST_Y(ST_CENTROID(ST_GEOMFROMTEXT(tile))), ST_X(ST_CENTROID(ST_GEOMFROMTEXT(tile))), 7)::bigint) h3_7, printf('%x', h3_latlng_to_cell( ST_Y(ST_CENTROID(ST_GEOMFROMTEXT(tile))), ST_X(ST_CENTROID(ST_GEOMFROMTEXT(tile))), 8)::bigint) h3_8, printf('%x', h3_latlng_to_cell( ST_Y(ST_CENTROID(ST_GEOMFROMTEXT(tile))), ST_X(ST_CENTROID(ST_GEOMFROMTEXT(tile))), 9)::bigint) h3_9 FROM mobile_perf;
Next, a GZIP-compressed CSV file is created for the contents of the mobile_perf table. This was done for a version of the table that did not contain the H3 hexagons from the enrichment above.
$ echo "COPY (SELECT * FROM mobile_perf) TO 'mobile_perf.csv.gz' WITH (HEADER 1, COMPRESSION gzip);" \ | duckdb ~/ookla.duckdb
Keep in mind that only one process can use DuckDB in write mode at a time, and the above operation requires write mode. If you see something like the following, you need to first wait for another process using the database.
Error: unable to open database "/home/mark/mobile_perf.duckdb": IO Error: Could not set lock on file "/home/mark/mobile_perf.duckdb": Resource temporarily unavailable
If you’re building a file for mass distribution, it might be worth recompressing the above with pigz. Its compression level is configurable, it can take advantage of multiple cores and produce smaller files than GNU GZIP.
The next command completed 1.12 times faster than the previous one, with a 2.5 MB smaller result file.
$ echo ".mode csv SELECT * FROM mobile_perf;" \ | duckdb ~/ookla.duckdb \ | pigz -9 \ > mobile_perf.csv.gz
If file size is not such an issue, using the -1 option will create a file 1.6 times faster, only 1.25 times larger.
Field order and sort keys greatly affect GZIP’s ability to compress. With 8 fields, there are 40,320 field order permutations, so trying to find the most compressible option would take too long. However, reordering the rows by each column and seeing which ones compress the most can be done fairly quickly.
$ COLUMNS=`echo ".mode list SELECT column_name FROM information_schema.columns WHERE table_name="mobile_perf";" \ | duckdb ~/ookla.duckdb \ | tail -n +2` $ for COL in $COLUMNS; do echo $COL echo "COPY (SELECT * FROM mobile_perf ORDER BY $COL) TO 'mobile_perf.$COL.csv.gz' WITH (HEADER 1, COMPRESSION gzip);" \ | duckdb ~/ookla.duckdb done
The above requires write access to the database, so I’ll be executing the commands sequentially rather than in parallel with the xargs.
DuckDB can use multiple cores for most of the above work, but some tasks must be performed sequentially on a single core. It might be worth duplicating the original DuckDB database and testing each sort key against its own database if you want to make the most of the CPU cores available to you.
Sorting by avg_u_kbps resulted in a GZIP-compressed CSV file of 841 MB, while sorting by quadkey resulted in a 352 MB file.
The above took almost all of the 16GB of RAM I had on this system. Keep this in mind for large datasets and/or environments with limited RAM.
ZStandard is a newer lossless compression compressor that should compress just like GZIP, but about 3-6 times faster. I covered this in detail in my minimalist lossless compression guide. If your data consumers can open ZStandard-compressed files, their creation will be much faster.
The following command completed 3.4 times faster than its GZIP counterpart, while the output was about 6% larger.
$ echo "COPY (SELECT * FROM mobile_perf) TO 'mobile_perf.csv.zstd' WITH (HEADER 1, COMPRESSION zstd);" \ | duckdb ~/ookla.duckdb
Again, if the above was sorted by avg_u_kbps, a 742MB ZStandard compressed CSV file would be created, and sorted by quadkey would yield a 367MB file.
My Faster PostgreSQL To BigQuery Transfers blog post discussed converting Microsoft Buildings, a 130M record GeoJSON dataset, to a Snappy-compressed Parquet file. In that post, ClickHouse completed this task 1.38 times faster than its closest competitor. Now I’m running the same workload through DuckDB.
Using an external SSD on my 2020 Intel MacBook Pro, ClickHouse created the file in 35.243 seconds. After working on optimizations with the DuckDB team, I was able to complete the same workload in 30.826 seconds, and if I was willing to process eight Parquet files instead of one, that time was reduced to 25.146 seconds. It was the SQL that ran the fastest.
$ echo "COPY (SELECT * FROM read_ndjson_objects('California.jsonl')) TO 'cali.duckdb.pq' (FORMAT 'PARQUET', CODEC 'Snappy', PER_THREAD_OUTPUT TRUE);" \ | duckdb
Note the use of read_ndjson_objects instead of read_json_objects, this has greatly improved performance. Also, PER_THREAD_OUTPUT TRUE will create multiple files. Clear this option if you want to create one Parquet file.
But with that said, the same workload running on a Google Cloud e2-highcpu-32 VM with Ubuntu 20 LTS showed that ClickHouse outperformed DuckDB by 1.44x. Mark Raasveldt suggested that there could be further improvements in the DuckDB compression analysis phase, so there’s a chance we’ll see a reduction in the performance gap compared to ClickHouse in the near future.
PS. Well, if you don’t want to spend time setting up compilation and want to easily build extensions for a more recent version of duckdb with new features (such as map_entries, map_values, map_keys from #6522), then you can take this docker file for the base.