Everything you need to know about DuckDB

DuckDB is simply the best tool for interacting with data. It allows you to communicate with different services and file formats through a single interface (SQL).

DuckDB has reached its stable 1.0.0 release. Many small problems have been solved, and we now have a stable product that can be integrated into our projects.

Since DuckDB has reached 1.0.0, I would like to tell you about this tool and how it can be used in the work of a data engineer.

I initially learned about this tool from the video “Big Data is Dead” with Nikolay Golov, Nikolay Markov, and Phillip Uvarov on the { between brackets } channel.

The tool captivated me, and the more I learned about it, the more I liked it. In this article, I will try to explain what is really good about DuckDB and how it can help you.

This article will not contain any historical background about what kind of company it is, how the product appeared, etc., only bare facts.

Let's first get to know our duck a little:

  • Columnar database.

  • Vectorized execution engine.

  • PostgreSQL-compatible SQL syntax.

  • No cluster mode: it runs as a single process.

  • Positions itself as an “in-process analytical database”.

  • Free (MIT License).

  • Extensible with extensions (PostgreSQL, S3, HTTP(S), etc.).

I will often refer you to the DuckDB documentation, because it is very well written and contains many examples.

SQL

Let's start with some general information. As described in the introduction, DuckDB supports PostgreSQL syntax, so it uses “classic” SQL. DuckDB has some methods of its own, of course, but most of the code will look familiar to you.

Copy

DuckDB has a powerful COPY statement, which allows you to quickly fill tables from files and also write tables directly to files.
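A minimal sketch of both directions, with placeholder table and file names:

-- load a table from a CSV file
CREATE TABLE trips (id INTEGER, amount DOUBLE);
COPY trips FROM 'trips.csv' (FORMAT CSV, HEADER true);

-- write a table or any query result straight to a file
COPY trips TO 'trips.parquet' (FORMAT PARQUET);
COPY (SELECT id, amount FROM trips WHERE amount > 0) TO 'positive_trips.csv' (FORMAT CSV, HEADER true);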

Prepared Statements

DuckDB has a handy ability to create prepared statements, which allow a more flexible approach to frequently repeated queries and help eliminate SQL injection.
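A small sketch using the SQL-level PREPARE/EXECUTE syntax (the table and values here are just an illustration); in the Python API the same effect is achieved by passing parameters to con.execute():

CREATE TABLE users (id INTEGER, name VARCHAR);
INSERT INTO users VALUES (1, 'Anna'), (2, 'Igor');

-- the parameter is bound at execution time instead of being glued into the query text
PREPARE user_by_id AS SELECT * FROM users WHERE id = ?;
EXECUTE user_by_id(2);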

Functions

DuckDB lets you create your own functions, and it also ships with built-in functions for working with dates, arrays, strings, etc.
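A couple of built-in functions plus a simple user-defined macro as an example (the macro name and logic are made up for illustration):

-- built-in date and string functions
SELECT date_trunc('month', DATE '2024-01-15') AS month_start,
       upper('duckdb') AS shouted;

-- a scalar macro acting as a user-defined function
CREATE MACRO total_with_tip(amount, tip) AS amount + tip;
SELECT total_with_tip(10.0, 2.5) AS total;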

Extensions

As I wrote earlier, DuckDB lets you install extensions, and you can also write your own. More details are described in the documentation, and a quick example follows below.
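A quick way to see which extensions exist and to install one (httpfs is used later in this article for S3):

-- list extensions and their status
SELECT extension_name, installed, loaded
FROM duckdb_extensions();

-- install and load an extension
INSTALL httpfs;
LOAD httpfs;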

PostgreSQL Extension

Thanks to the PostgreSQL extension, we can read any database that runs on the PostgreSQL engine (PostgreSQL, Greenplum).
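A rough sketch of attaching a PostgreSQL database; the connection string and table name are placeholders for your own instance:

INSTALL postgres;
LOAD postgres;

-- attach an existing PostgreSQL database in read-only mode
ATTACH 'dbname=mydb user=postgres host=localhost' AS pg (TYPE POSTGRES, READ_ONLY);

-- its tables can now be queried like local ones
SELECT * FROM pg.public.some_table LIMIT 10;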

Parquet

Since DuckDB positions itself as an OLAP database, it supports working with .parquet files.

Parquet Import

DuckDB can natively read any .parquet file. This is described in the Parquet import documentation.

Parquet Export

DuckDB also has a native and easy way to save any calculation directly to .parquet. This is described in the Parquet export documentation.
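Both directions in one short sketch (file names are placeholders):

-- read a Parquet file directly, no import step required
SELECT * FROM read_parquet('input.parquet');

-- save any query result straight to Parquet
COPY (SELECT 42 AS answer) TO 'output.parquet' (FORMAT PARQUET);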

Buena Vista

DuckDB positions itself as an “in-process analytical database”, but there is an open-source library, Buena Vista, that lets you put DuckDB behind a dedicated host and connect to it over the network, effectively giving you a DuckDB “cluster”.

Practice

We've covered a bit of theory, so let's do some manual work.

Installation

DuckDB can be installed in different ways, but in this article we will use the Python API and the native .jar driver for DBeaver.

Python Client API

Since DuckDB is an “in-process analytical database”, nothing has to be saved physically when you create a project; but in that case, if we need our tables and calculations again later, we will not be able to access them. Therefore, there are two options for working with DuckDB.

In-memory

With this approach, all our tables and calculations will not be saved physically and will be deleted when the connection is closed.

import duckdb

cursor = duckdb.connect()
print(cursor.execute('SELECT 42').fetchall())

Physical database

With this approach, we create a physical database file, and all our tables and calculations are saved. When the connection is closed, we lose nothing, and if we reconnect to the database we can read the previously created tables. It is also worth noting that this database file can be passed around: it has no dependencies. After working on your machine, you can send it to another person, just as people are used to passing around Excel files.

import duckdb

# create a connection to a file called 'demo_duckdb.duckdb'
cursor = duckdb.connect('demo_duckdb.duckdb')
print(cursor.execute('SELECT 42').fetchall())
cursor.close()

Conversion Result

The result of the query can be obtained in any way convenient for you:

import duckdb

duckdb.sql("SELECT 42").fetchall()   # Python objects
duckdb.sql("SELECT 42").df()         # Pandas DataFrame
duckdb.sql("SELECT 42").pl()         # Polars DataFrame
duckdb.sql("SELECT 42").arrow()      # Arrow Table
duckdb.sql("SELECT 42").fetchnumpy() # NumPy Arrays

Data Input

You can run all the code in .py files, but that is not the prettiest way to display data for a demonstration, so for this project I suggest using JupyterLab. To deploy the project in Docker, it is enough to run the following command in the root of the repository:

docker-compose up -d

And our service will be available at http://localhost:8888/

Once we have deployed our service we can read different data formats.

Installation in JupyterLab

To install DuckDB inside JupyterLab, run the following command in a terminal or in your notebook:

pip install duckdb==1.0.0

Creating a local database

Let's create and save separately a few commands that we will need.

Creating a connection to the database and creating the database itself physically:

import duckdb

# create a connection to a file called 'demo_duckdb.duckdb'
con = duckdb.connect('demo_duckdb.duckdb')

Closing connection to the database:

con.close()

.json

Documentation: see the DuckDB guide on reading JSON.

We can read a local glossary.json and immediately get a result we can work with:

con.sql(
    """
    SELECT * FROM read_json("glossary.json")
    """
).df()

We can get the values we need by key in a convenient way using dot notation:

con.sql(
    """
    SELECT glossary.GlossDiv.GlossList.GlossEntry.ID FROM read_json("glossary.json")
    """
).df()

And in this way we can “unfold” the entire .json file.

We can also read .json from a specified URL (remote server, S3, etc.):

con.sql(
    """
    SELECT * FROM read_json('https://support.oneskyapp.com/hc/en-us/article_attachments/202761627')
    """
).df()

.parquet

Documentation: see the DuckDB guide on reading Parquet.

con.sql(
    """
    SELECT * FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet')
    """
).df()

.csv

I don't have any particularly interesting examples to demonstrate here, so I recommend the official documentation on working with .csv; a minimal starting point is shown below.
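As a minimal sketch (the file name and options are placeholders; wrap the query in con.sql(...) if you are using the Python API):

-- let DuckDB detect the delimiter, header and column types
SELECT * FROM read_csv_auto('data.csv');

-- or set the options explicitly
SELECT * FROM read_csv('data.csv', delim = ';', header = true);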

Storage

Important: All further examples were executed via DBeaver. If you want to run this code through the Python API instead, wrap the SQL in quotes and execute it on your database connection.
Example:

con.sql(
    """
    <your SQL-query>
    """
)

So far we have worked with data loaded into RAM. Now let's try working with DuckDB's own storage, because it is very fast.

To begin, let's take the previously used file with taxi trips and create a table from it:

CREATE TABLE yellow_tripdata_2024_01 AS
SELECT * FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet')

Creating this table took 5.77 seconds, and most of that time was spent downloading over the network: if we download the file locally first and run the same script against the local file, loading it into our database takes 1.27 seconds.

Let's count the number of rows in our table:

SELECT COUNT(*) FROM yellow_tripdata_2024_01

And it takes less than a millisecond on average.

If we want to calculate something more complicated, we will get approximately the same results – less than a millisecond per query:

SELECT 
    tpep_pickup_datetime::date AS date_, 
    SUM(total_amount) AS sum_,
    COUNT(*) AS count_,
    SUM(total_amount) / COUNT(*) AS avg_ 
FROM 
    yellow_tripdata_2024_01
WHERE
    tpep_pickup_datetime BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY 1
ORDER BY 1

With such speeds, DuckDB can be a good tool for analytics and research.

Information_schema

Everything about information_schema can be read in the documentation.

DuckDB implements information_schema according to the SQL standard. Thanks to it, you can get a lot of information about your database.

As an example, we can look at the column information for the previously created table:

SELECT * FROM information_schema.columns
WHERE table_name = 'yellow_tripdata_2024_01'

As a result, we get 45 attributes that describe each of the columns in the table.
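In the same way, for example, you can list all tables in the current database:

SELECT table_catalog, table_schema, table_name
FROM information_schema.tables;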

S3

In the previous article, “Infrastructure for a data engineer: S3”, we looked at S3 as a service for data engineers. Now, with the help of DuckDB, we will recreate a data engineer's typical workflow with S3.

For this purpose, let's add the minio service from the previous article to our existing docker-compose.yml to set up the infrastructure needed for the demonstration. After adding the minio service, run docker-compose up -d.

And now both services are in the same project and are ready to communicate with each other.

Setting up S3

For DuckDB to work correctly, you need:

  1. Set up an access key and a secret key.

  2. Create a bucket. I'll create it via the web interface to keep things simple; if you want to do it via Python, the article “Infrastructure for a data engineer: S3” describes the process. Important: if you access S3 from a service deployed in Docker, set s3_endpoint to minio:9000; if you work locally, for example through DBeaver, set s3_endpoint to localhost:9000.

Setting up a connection to S3 in DuckDB

To create a connection, you need to set the required parameters.
Important: I will show the method that is now considered legacy; because of some issues with local S3, I cannot use the new method here.
Both the old method and the new method are described in the documentation.

To set up a connection to S3 via DuckDB, you need to run the following code:

INSTALL httpfs;
LOAD httpfs;
SET s3_url_style = 'path';
SET s3_endpoint = 'localhost:9000';
SET s3_access_key_id = 'rxZaCuukKWeuN4oF08mX';
SET s3_secret_access_key = 'yWJYVFkVxGxuIwDyD1hxepSrO86E816nVkONtgAf';
SET s3_use_ssl = FALSE;
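For reference, the newer secrets-based configuration looks roughly like this (a sketch using the same placeholder credentials as above):

INSTALL httpfs;
LOAD httpfs;

CREATE SECRET local_minio (
    TYPE S3,
    KEY_ID 'rxZaCuukKWeuN4oF08mX',
    SECRET 'yWJYVFkVxGxuIwDyD1hxepSrO86E816nVkONtgAf',
    ENDPOINT 'localhost:9000',
    URL_STYLE 'path',
    USE_SSL false
);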

Saving data to S3

Now let's save the previously created table yellow_tripdata_2024_01 to S3. To do this, run the following command:

COPY
(
  SELECT
    *
  FROM
    main.yellow_tripdata_2024_01
) TO 's3://test-duckdb/yellow_tripdata_2024_01.gzip.parquet'

To check that our file has been saved, we can immediately read it in the following way:

SELECT count(*) FROM 's3://test-duckdb/yellow_tripdata_2024_01.gzip.parquet'

And we will see the same number of rows as in yellow_tripdata_2024_01.

DuckDB also allows you to save tables in partitioned form. For this example we will again use yellow_tripdata_2024_01 and save the table partitioned by day using the following code:

COPY (
    SELECT
        *,
        tpep_pickup_datetime::date AS tpep_pickup_datetime_date
    FROM
        yellow_tripdata_2024_01
    WHERE
        tpep_pickup_datetime BETWEEN '2024-01-01' AND '2024-01-31'
)
TO 's3://test-duckdb/yellow_tripdata_partition'
(
    FORMAT PARQUET,
    COMPRESSION gzip,
    PARTITION_BY (tpep_pickup_datetime_date),
    OVERWRITE_OR_IGNORE,
    FILENAME_PATTERN 'file_{i}'
);

Important: If you use the pattern file_{uuid}, each insertion adds a new file with a unique uuid to the partition. If you choose the pattern file_{i}, each insertion into the partition overwrites the previously created file.

It is also worth noting a cool DuckDB feature: it works with glob patterns. We previously saved our data in partitioned form and now have the following structure in S3:

test-duckdb/yellow_tripdata_partition/tpep_pickup_datetime_date=2024-01-01/file_0.parquet
test-duckdb/yellow_tripdata_partition/tpep_pickup_datetime_date=2024-01-02/file_0.parquet
...

And each element of the path can be masked with a wildcard: *.

Reading data from S3

To read all files, regardless of partitions, we will execute the following code:

SELECT
	count(*)
FROM
	's3://test-duckdb/yellow_tripdata_partition/*/*.parquet'

With these wildcards we masked the entire path to the files, so it does not matter which folders the data sits in or how it is split: knowing the structure, we simply mask it. If we want to read a specific path, we can just as easily spell it out.
Important: partial masks also work. For example, you can write a pattern like s3://test-duckdb/*-foo/*/*.parquet

It is also worth noting that partition pruning and filtering work correctly with such queries. If we execute this code:

SELECT
	count(*)
FROM
	's3://test-duckdb/yellow_tripdata_partition/*/*.parquet'
WHERE
	tpep_pickup_datetime_date = '2024-01-10'

Then we will get exactly the number of rows for 2024-01-10 (95,000).

And if we profile the query (Query Profiling), we will see that it scanned the partitions and read only the required file.
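One way to obtain such a profile is a sketch like this (EXPLAIN ANALYZE prints the query plan with timings):

EXPLAIN ANALYZE
SELECT
	count(*)
FROM
	's3://test-duckdb/yellow_tripdata_partition/*/*.parquet'
WHERE
	tpep_pickup_datetime_date = '2024-01-10';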

Tips

Below I will list some interesting features and tips about DuckDB that I cannot highlight as a separate topic.

UNION

A fairly common problem when working with data is a changing data model, and DuckDB has a cool feature for reading data whose schema differs from file to file.

Let's look at this with an example. First, let's create several files with different data models. To do this, execute the following code:

COPY
(
	SELECT
		1 AS id,
		'Anna' AS first_name,
		'Petrova' AS last_name
) TO 's3://test-duckdb/users/1-model/0.gzip.parquet';

COPY
(
	SELECT
		2 AS id,
		'Igor' AS first_name,
		'Ivanov' AS last_name,
		'2001-01-01'::date AS bd 
) TO 's3://test-duckdb/users/2-model/0.gzip.parquet';	

COPY
(
	SELECT
		3 AS id,
		'Sergey' AS first_name,
		'Shilov' AS last_name,
		'1999-01-01'::date AS bd,
		'2024-01-01'::date AS created_at
) TO 's3://test-duckdb/users/3-model/0.gzip.parquet';

After executing this, we have written three different data models of the same table.

If you run the query:

SELECT * FROM 's3://test-duckdb/users/*/*.gzip.parquet'

Then we get the following result:

id    first_name    last_name
1     Anna          Petrova
2     Igor          Ivanov
3     Sergey        Shilov

As we can see, DuckDB combined only the columns that match across the files, but if we want to get all the columns, we need to execute the following code:

SELECT
	*
FROM
	read_parquet(
		's3://test-duckdb/users/*/*.gzip.parquet',
		union_by_name = TRUE
	)

And we get the following result:

id    first_name    last_name    bd            created_at
1     Anna          Petrova      [NULL]        [NULL]
2     Igor          Ivanov       2001-01-01    [NULL]
3     Sergey        Shilov       1999-01-01    2024-01-01

In this form, we can already work with the table without depending on the data model.

JOIN

Actually, this is a pretty obvious trick, but maybe it will be useful to someone: DuckDB can JOIN data stored in DuckDB itself with data on a remote resource. In the example below we will join a local database with S3.

To demonstrate, let's prepare the data. Run the code below:

CREATE TABLE users
(
	id int4,
	first_name varchar(50),
	last_name varchar(50),
	staff_project int4
);

INSERT INTO users
SELECT
	1 AS id,
	'Anna' AS first_name,
	'Petrova' AS last_name,
	10 AS staff_project;

COPY
(	
	SELECT
		10 AS id_project,
		'foo' AS name_project
) TO 's3://test-duckdb/project/9999-12-31/0.gzip.parquet'	

Now, if we execute the query below, we will join the table stored locally in our database with the .parquet file located in S3.

SELECT
	*
FROM
	users AS u
JOIN read_parquet('s3://test-duckdb/project/9999-12-31/0.gzip.parquet') AS p
	ON u.staff_project = p.id_project

The result of the JOIN:

id    first_name    last_name    staff_project    id_project    name_project
1     Anna          Petrova      10               10            foo

DBeaver

You can work with DuckDB not only through Python, but also through a familiar tool for many – DBeaver.

To create a connection to DuckDB, we simply go through the usual connection-creation dialog and select DuckDB. The DuckDB driver is not downloaded initially, so DBeaver will offer to download it. But if you want a different driver version, or you have problems with the installation, you can download the driver yourself from Maven and specify it when creating the connection.

Through DBeaver, you can connect to previously created DuckDB databases. For example, I can connect to a DuckDB database that I created earlier through Python and continue working with it through DBeaver.

We can also create our own database: click “Create …” and select the path where the database should be saved.

Important: If you do not create the database as a local file, all tables created during the session will be lost; tables and data are deleted when the connection to the database is closed.

Export/Import DataBase

During operation, a DuckDB database file can “swell”, and to reduce its size you can use database export.

Export can be done in a format convenient for you.

To export the database in JupyterLab, run the following code:

con.sql(
    """
    EXPORT DATABASE 'main' (
        FORMAT PARQUET,
        COMPRESSION ZSTD
    );	
    """
)

A folder named main will be created next to your notebook, containing all the data and the database schema.

In order to perform a database export via DBeaver, you need to specify the full path for export, an example is below:

EXPORT DATABASE '<path_to_export>' (
    FORMAT PARQUET,
    COMPRESSION ZSTD
);

This export will also allow you to transfer the results of your calculations.

More information about export and import can be found in the documentation.
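The matching import command loads such an export into a new, empty database:

IMPORT DATABASE '<path_to_export>';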

Conclusion

DuckDB is a very simple and convenient tool. It works with all modern data formats and talks directly to a data lake. And if you use DuckDB's internal table engine, your OLAP queries run fast. Don't forget that SQL is easier to learn than Python, Pandas, PySpark, etc.; every analyst knows SQL, which is why I believe DuckDB will be an excellent tool for analytics.

Also, if you need consultation, mentoring, a mock interview, or have other data engineering questions, you can contact me; all contacts are listed at the link.
