Everything you need to know about DuckDB
DuckDB is simply the best tool for interacting with data. It allows you to communicate with different services and file formats through a single interface (SQL).
DuckDB has reached its stable 1.0.0 release. Many small problems have been solved, and we now have a stable product that can be integrated into our projects.
Since DuckDB has reached 1.0.0, I would like to tell you about this tool and how it can be used in the work of a data engineer.
I first learned about this tool from the video "Nikolay Golov, Nikolay Markov, Phillip Uvarov: Big Data is Dead" on the { between brackets } channel.
And I was captivated by it. The more I learned about it, the more enthusiastic I became. In this article, I will try to show what is really good about DuckDB and how it can help you.
This article will not contain any historical background about the company or how the product appeared; only the bare facts.
Let's first get to know our duck a little:
- Columnar database.
- Vectorized query engine.
- Uses PostgreSQL syntax.
- Has no cluster mode.
- Positions itself as an "in-process analytical database".
- Free (MIT License).
- Can be extended (PostgreSQL, S3, HTTP(S), etc.).
I will often refer you to the DuckDB documentation, because it is very well written and contains many examples.
SQL
Let's start with some general information. As described in the introduction, DuckDB supports PostgreSQL syntax, so it uses "classical" SQL. DuckDB of course has its own methods, but most of the code will be clear to you.
Copy
DuckDB has a powerful COPY statement, which allows you to quickly load tables from files and also write tables directly to files.
Prepared Statements
DuckDB has a cool ability to create prepared statements, which allow a more flexible approach to frequent queries and eliminate SQL injections.
Functions
DuckDB has the ability to create your own functions, and it also ships with built-in functions for working with dates, arrays, strings, etc.
Extensions
As I wrote earlier, DuckDB lets you install extensions, and you can also write your own. More details are described in the documentation.
PostgreSQL Extension
Thanks to the PostgreSQL extension, we can read any database that runs on the PostgreSQL engine (PostgreSQL, Greenplum).
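A sketch of how attaching looks; the connection string below (host, database, credentials, table name) is a placeholder you would replace with your own server, so this fragment is not runnable as-is:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
# placeholder credentials; point this at your own PostgreSQL server
con.execute("ATTACH 'dbname=mydb host=localhost user=duck password=secret' AS pg (TYPE postgres)")
con.sql("SELECT * FROM pg.public.some_table LIMIT 10").df()
```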
Parquet
Since DuckDB positions itself as an OLAP database, it supports working with files in the .parquet format.
Parquet Import
DuckDB allows you to natively read any .parquet file. This is described in the documentation.
Parquet Export
DuckDB also has a native and easy way to save any calculation directly to .parquet. This is described in the documentation.
Good Vista
DuckDB positions itself as an "in-process analytical database", but there is an open-source library that allows you to create a "cluster" of DuckDB instances and provide connections via a dedicated host.
Practice
We've covered a bit of theory, so let's do some manual work.
Installation
DuckDB can be installed in different ways, but in this article we will use the Python API and the native .jar driver for DBeaver.
Python Client API
Since DuckDB is an "in-process analytical database", nothing needs to be saved physically when creating a project; but then, if we need our tables and calculations again, we will not be able to access them. Therefore, there are two ways to work with DuckDB.
In-memory
With this approach, all our tables and calculations will not be saved physically and will be deleted when the connection is closed.
import duckdb
cursor = duckdb.connect()
print(cursor.execute('SELECT 42').fetchall())
Physical database
With this approach, we create a physical database, and all our tables and calculations are saved. When the connection is closed, nothing is lost, and if we reconnect to the database, we can read the previously created tables. It is also worth noting that the database file can be passed around: it has no dependencies. After working on your machine, you can send it to another person, just as people are used to passing Excel files around.
import duckdb
# create a connection to a file called 'demo_duckdb.duckdb'
cursor = duckdb.connect('demo_duckdb.duckdb')
print(cursor.execute('SELECT 42').fetchall())
cursor.close()
Converting Results
The result of a query can be obtained in whatever form is convenient for you:
import duckdb
duckdb.sql("SELECT 42").fetchall() # Python objects
duckdb.sql("SELECT 42").df() # Pandas DataFrame
duckdb.sql("SELECT 42").pl() # Polars DataFrame
duckdb.sql("SELECT 42").arrow() # Arrow Table
duckdb.sql("SELECT 42").fetchnumpy() # NumPy Arrays
Data Input
You can run all the code in .py files, but that's not the prettiest way to display data for a demonstration, so for this project I suggest using JupyterLab. To deploy our project in Docker, it is enough to execute the following command in the root of the repository:
docker-compose up -d
And our service will be available at http://localhost:8888/
Once we have deployed our service we can read different data formats.
Installation in JupyterLab
To install DuckDB inside JupyterLab, run this command in the terminal or in your notebook:
pip install duckdb==1.0.0
Creating a local database
Let's set up a few commands that we will need.
Creating a connection to the database and creating the database itself physically:
import duckdb
# create a connection to a file called 'demo_duckdb.duckdb'
con = duckdb.connect('demo_duckdb.duckdb')
Closing connection to the database:
con.close()
.json
Working with .json files is described in the documentation.
We can read a local glossary.json file and immediately get a result you can work with:
con.sql(
"""
SELECT * FROM read_json("glossary.json")
"""
).df()
We can get the values we need by key, in a convenient format, using dot notation:
con.sql(
"""
SELECT glossary.GlossDiv.GlossList.GlossEntry.ID FROM read_json("glossary.json")
"""
).df()
And in this way we can "unfold" the entire .json.
We can also read a .json file by URL (remote server, S3, etc.):
con.sql(
"""
SELECT * FROM read_json('https://support.oneskyapp.com/hc/en-us/article_attachments/202761627')
"""
).df()
.parquet
Working with .parquet files is described in the documentation.
con.sql(
"""
SELECT * FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet')
"""
).df()
.csv
There are no particularly interesting examples to demonstrate, so I recommend that you familiarize yourself with the official documentation on working with .csv files.
Storage
Important: All further examples were executed via DBeaver. If you want to execute this code via the Python API, you need to wrap the SQL code in quotes and run it through your database connection.
Example:
con.sql(
"""
<your SQL-query>
"""
)
We've done a great job with data loaded into RAM. But now let's try working with DuckDB's own storage, because it is very fast.
To begin, let's take the previously used file with taxi trips and create a table from it:
CREATE TABLE yellow_tripdata_2024_01 AS
SELECT * FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet')
It took 5.77 seconds to create this table, and most of that time was spent downloading over the network: if we first download the file locally and run the same script against the local file, loading it into our database takes 1.27 seconds.
Let's count the number of rows in our table:
SELECT COUNT(*) FROM yellow_tripdata_2024_01
And it takes less than a millisecond on average.
If we want to calculate something more complicated, we will get approximately the same results – less than a millisecond per request:
SELECT
tpep_pickup_datetime::date AS date_,
SUM(total_amount) AS sum_,
COUNT(*) AS count_,
SUM(total_amount) / COUNT(*) AS avg_
FROM
yellow_tripdata_2024_01
WHERE
tpep_pickup_datetime BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY 1
ORDER BY 1
With such speeds, DuckDB can be a good tool for analytics and research.
Information_schema
You can read everything about information_schema in the documentation.
DuckDB implements information_schema according to the SQL standard. Thanks to it, you can get a lot of information about our database.
As an example, we can look at the information by columns in a previously created table:
SELECT * FROM information_schema.columns
WHERE table_name = 'yellow_tripdata_2024_01'
As a result, we get 45 columns that provide information about each column in the table.
S3
In the previous article, "Infrastructure for data engineer S3", we considered S3 as a service for data engineers. Now, with the help of DuckDB, we will try to recreate the work of a data engineer with S3.
To do this, let's add the minio service from the previous article to our existing docker-compose.yml to set up the infrastructure needed for the demonstration. After adding the minio service, run docker-compose up -d.
And now both services are in the same project and are ready to communicate with each other.
Setting up S3
For DuckDB to work correctly, you need to:
- Set up an access key and a secret key.
- Create a bucket. I'll create it via the web interface to make it easier. If you want to do it via Python, the article "Infrastructure for data engineer S3" describes this process.
Important: If you work with S3 from a service deployed in Docker, you need to set s3_endpoint to minio:9000. If you work locally, e.g. through DBeaver, you need to set s3_endpoint to localhost:9000.
Setting up a connection to S3 in DuckDB
To create a connection, you need to set the required parameters.
Important: I will now show the method that is considered deprecated, but due to some issues when working with local S3, I cannot use the new method. Both the old and the new method are described in the documentation.
To set up a connection to S3 via DuckDB, you need to run the following code:
INSTALL httpfs;
LOAD httpfs;
SET s3_url_style = 'path';
SET s3_endpoint = 'localhost:9000';
SET s3_access_key_id = 'rxZaCuukKWeuN4oF08mX';
SET s3_secret_access_key = 'yWJYVFkVxGxuIwDyD1hxepSrO86E816nVkONtgAf';
SET s3_use_ssl = FALSE;
Saving data to S3
Let's now save the previously created yellow_tripdata_2024_01 table to S3. To do this, run the command:
COPY
(
SELECT
*
FROM
main.yellow_tripdata_2024_01
) TO 's3://test-duckdb/yellow_tripdata_2024_01.gzip.parquet'
To check that our file has been saved, we can immediately read it in the following way:
SELECT count(*) FROM 's3://test-duckdb/yellow_tripdata_2024_01.gzip.parquet'
And we will see the same number of rows as in yellow_tripdata_2024_01.
DuckDB also allows you to save tables in partitioned form. For this example, we will again use yellow_tripdata_2024_01 and save the table partitioned by day using the following code:
COPY (
SELECT
*,
tpep_pickup_datetime::date AS tpep_pickup_datetime_date
FROM
yellow_tripdata_2024_01
WHERE
tpep_pickup_datetime BETWEEN '2024-01-01' AND '2024-01-31'
)
TO 's3://test-duckdb/yellow_tripdata_partition'
(
FORMAT PARQUET,
COMPRESSION gzip,
PARTITION_BY (tpep_pickup_datetime_date),
OVERWRITE_OR_IGNORE,
FILENAME_PATTERN 'file_{i}'
);
Important: If you use the file_{uuid} mask, each insertion adds a new file with a new unique uuid to your partition. If you choose the file_{i} mask, each insert into the partition overwrites the previously created file.
It is also worth noting a cool DuckDB feature: it works with masks (globs). We previously saved our data in partitioned form and now have the following structure in S3:
test-duckdb/yellow_tripdata_partition/tpep_pickup_datetime_date=2024-01-01/file_0.parquet
test-duckdb/yellow_tripdata_partition/tpep_pickup_datetime_date=2024-01-02/file_0.parquet
...
And each element of the path can be masked with a star (*).
Reading data from S3
To read all files, regardless of partitions, we will execute the following code:
SELECT
count(*)
FROM
's3://test-duckdb/yellow_tripdata_partition/*/*.parquet'
With these stars we masked the entire path to the file; in this case it does not matter to us in which folders the data is located or how it is divided: just knowing the structure, we mask it. If we want to read a specific path, we can simply specify it.
Important: A partial mask also works. For example, you can write a mask like s3://test-duckdb/*-foo/*/*.parquet
It is also worth noting that partition pruning and data filtering work correctly in such queries. If we execute this code:
SELECT
count(*)
FROM
's3://test-duckdb/yellow_tripdata_partition/*/*.parquet'
WHERE
tpep_pickup_datetime_date = '2024-01-10'
Then we will get exactly the required number of rows for 2024-01-10 (95,000).
And if we run Query Profiling on it, we will see that the query scanned the partitions and read only the required file.
Tips
Below are some interesting features and tips about DuckDB that do not warrant a separate section.
UNION
A fairly common problem when working with data is a changing data model, and DuckDB has a cool feature for reading data with differing schemas.
Let's look at this with an example. First, let's create several files with different data models. To do this, execute the following code:
COPY
(
SELECT
1 AS id,
'Anna' AS first_name,
'Petrova' AS last_name
) TO 's3://test-duckdb/users/1-model/0.gzip.parquet';
COPY
(
SELECT
2 AS id,
'Igor' AS first_name,
'Ivanov' AS last_name,
'2001-01-01'::date AS bd
) TO 's3://test-duckdb/users/2-model/0.gzip.parquet';
COPY
(
SELECT
3 AS id,
'Sergey' AS first_name,
'Shilov' AS last_name,
'1999-01-01'::date AS bd,
'2024-01-01'::date AS created_at
) TO 's3://test-duckdb/users/3-model/0.gzip.parquet';
After execution, we have saved three different versions of the same table's data model.
If you run the query:
SELECT * FROM 's3://test-duckdb/users/*/*.gzip.parquet'
Then we get the following result:
| id | first_name | last_name |
|----|------------|-----------|
| 1  | Anna       | Petrova   |
| 2  | Igor       | Ivanov    |
| 3  | Sergey     | Shilov    |
As we can see, it combined only the columns that match across files; if we want to get all the columns, we need to execute the following code:
SELECT
*
FROM
read_parquet(
's3://test-duckdb/users/*/*.gzip.parquet',
union_by_name = TRUE
)
And we get the following result:
| id | first_name | last_name | bd         | created_at |
|----|------------|-----------|------------|------------|
| 1  | Anna       | Petrova   |            |            |
| 2  | Igor       | Ivanov    | 2001-01-01 |            |
| 3  | Sergey     | Shilov    | 1999-01-01 | 2024-01-01 |
In this form, we can already work with the table without depending on the data model.
JOIN
Actually, this is a pretty obvious trick, but maybe it will be useful to someone: DuckDB can JOIN data stored in DuckDB itself with data on a remote resource. In the example below, we will join a local database with S3.
To demonstrate, let's prepare the data. Run the code below:
CREATE TABLE users
(
id int4,
first_name varchar(50),
last_name varchar(50),
staff_project int4
);
INSERT INTO users
SELECT
1 AS id,
'Anna' AS first_name,
'Petrova' AS last_name,
10 AS staff_project;
COPY
(
SELECT
10 AS id_project,
'foo' AS name_project
) TO 's3://test-duckdb/project/9999-12-31/0.gzip.parquet'
Now, if we execute the query below, we will join the table stored locally in our database with the .parquet file located in S3.
SELECT
*
FROM
users AS u
JOIN read_parquet('s3://test-duckdb/project/9999-12-31/0.gzip.parquet') AS p
ON u.staff_project = p.id_project
The result of the JOIN:
| id | first_name | last_name | staff_project | id_project | name_project |
|----|------------|-----------|---------------|------------|--------------|
| 1  | Anna       | Petrova   | 10            | 10         | foo          |
DBeaver
You can work with DuckDB not only through Python, but also through a familiar tool for many – DBeaver.
To create a connection to DuckDB, simply select it in DBeaver's standard connection-creation dialog. Initially, the DuckDB driver is not downloaded, so DBeaver will offer to download it. If you want a different driver version or have problems with the installation, you can download the driver yourself from the Maven site and specify it when connecting.
Through DBeaver, you can connect to previously created DuckDB databases. For example, I can connect to a DuckDB database that I created earlier through Python and continue working with it through DBeaver.
We can also create our own database: click "Create ..." and select a path where the database will be saved.
Important: If you do not create the database file locally, all tables created during the session will be deleted when the connection to the database is closed.
Export/Import DataBase
During use, a DuckDB database file can "swell", and to reduce its size you can use database export.
Export can be done in a format convenient for you.
To export a database to JupyterLab, you need to run the code:
con.sql(
"""
EXPORT DATABASE 'main' (
FORMAT PARQUET,
COMPRESSION ZSTD
);
"""
)
A folder named main containing all the data and the data schema will be created next to your notebook.
To perform a database export via DBeaver, you need to specify the full path for the export; an example is below:
EXPORT DATABASE '<path_to_export>' (
FORMAT PARQUET,
COMPRESSION ZSTD
);
This export will also allow you to transfer the results of your calculations.
More information about export and import can be found in the documentation.
Conclusion
DuckDB is a very simple and convenient tool. It works with all modern data formats and can communicate with a Data Lake, and if you use DuckDB's internal table engine, your OLAP queries will run fast. And don't forget that SQL is easier to learn than Python, Pandas, PySpark, etc.; all analysts are familiar with SQL, so I believe this will be an excellent tool for analytics.
Also, if you need a consultation, mentoring, a mock interview, or have other questions about data engineering, you can contact me; all contacts are listed at the link.