DuckDB: an OLAP cube in your pocket

Hi, my name is Anton, and I am a senior engineer in the analytical solutions department at YouMoney. In our company we use MSSQL OLAP cubes (SSAS), a technology that has proven itself well: it is relatively easy to deploy and quite performant. But it has a number of drawbacks: Microsoft effectively stopped developing it about a decade ago, it requires powerful servers, and, of course, dependence on a foreign vendor is an acute issue. So, looking around for an alternative, I decided to try the recently released DuckDB. I did not have high hopes, but I wanted to understand where it stands compared to the tools I am used to.

Brief overview of the technology

DuckDB is an open-source, portable, high-performance analytical DBMS. It implements a SQL dialect that goes beyond basic SQL: arbitrary and nested correlated subqueries, window functions, collations, complex nested types, and more.
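As a quick, illustrative sketch of the dialect (the data below is made up; it shows list and struct literals together with a window function in one query):

-- Illustrative only: LIST and STRUCT literals plus a window function
SELECT
    payer,
    amount,
    SUM(amount) OVER (PARTITION BY payer) AS payer_total,  -- window function
    [amount, amount * 2]                  AS amount_list,  -- LIST (nested type)
    {'payer': payer, 'amount': amount}    AS payment_row   -- STRUCT (nested type)
FROM (VALUES ('alice', 10), ('alice', 20), ('bob', 5)) AS payments(payer, amount);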

DuckDB is complemented by a more enterprise-oriented solution, MotherDuck, which addresses distribution and integration issues but is not open source.

In all further examples I use Python.

Getting started is very simple:

con = duckdb.connect(database = "cube.duckdb", read_only = False)  ''' параметр read_only определяет, мы открываем в режиме чтения или записи. Чтение предполагает множественные обращения, а запись всегда эксклюзивна.  '''

There can be only one connection in write mode, but many in read mode.
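A minimal sketch of the read side (assuming cube.duckdb already exists and no write connection is currently holding it):

import duckdb

# Several read-only connections may be open at the same time,
# e.g. from different processes serving queries
con_a = duckdb.connect(database="cube.duckdb", read_only=True)
con_b = duckdb.connect(database="cube.duckdb", read_only=True)

print(con_a.execute("SELECT 42").fetchone())
print(con_b.execute("SELECT 42").fetchone())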

You can also create an in-memory OLAP DB:

con = duckdb.connect()
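A nice property of the in-memory mode is that DuckDB can query a pandas DataFrame sitting in the surrounding Python scope directly by its variable name (a replacement scan); a minimal sketch:

import duckdb
import pandas as pd

df_src = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

con = duckdb.connect()  # in-memory database
# df_src is resolved from the local Python scope, no loading step needed
print(con.execute("SELECT SUM(y) FROM df_src WHERE x > 1").fetchone())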

It can be quickly filled with data from various sources. Below is an example for CSV:

con.execute("CREATE TABLE ExampleTable AS SELECT * FROM read_csv_auto('C:\MyProject\duckdb\seed\ExampleTable.csv');")

And you can immediately execute queries:

df = con.execute("""SELECT * FROM ExampleTable""").df()
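Here .df() materializes the result as a pandas DataFrame; the same result can be fetched in other shapes too. Continuing the example above (a sketch; .arrow() additionally requires the pyarrow package to be installed):

rows = con.execute("SELECT * FROM ExampleTable").fetchall()  # list of Python tuples
tbl  = con.execute("SELECT * FROM ExampleTable").arrow()     # pyarrow Table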

There is an extension for working with S3, and the database can also be distributed via S3:

-- Install the extension
INSTALL httpfs;
LOAD httpfs;

-- Minimum configuration for reading an S3 dataset from a public bucket
SET s3_region='us-east-1';

CREATE TABLE netflix AS SELECT * FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet');
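If the bucket is private, credentials can be supplied through session settings; a sketch with placeholder values (the endpoint setting is only needed for S3-compatible storage outside AWS):

SET s3_region='us-east-1';
SET s3_access_key_id='YOUR_KEY_ID';        -- placeholder
SET s3_secret_access_key='YOUR_SECRET';    -- placeholder
-- SET s3_endpoint='storage.example.com';  -- only for S3-compatible storage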

There is also an extension for working with Excel workbooks:

INSTALL spatial;
LOAD spatial;
 
SELECT * FROM st_read('test_excel.xlsx', layer = 'Sheet1');
     
COPY tbl TO 'output.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');

A good tutorial and overview of DuckDB's features is "Everything you need to know about DuckDB". In this article, I would like to run a comparative performance experiment.

Preparing the experiment

The most productive way to test a technology is on real working examples, so for this experiment I chose one of our OLAP cubes. In fintech, however, it is important to comply with security requirements, so I generated a surrogate data set that is as close as possible in structure and characteristics to what we have in production.
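For illustration, here is roughly how such a surrogate set can be generated. This is only a sketch: the column names match the queries used below, but the row count, date range, and distributions are placeholder assumptions, not our production characteristics:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000  # scaled-down placeholder; the real fact table has ~110 million rows

# Dimension: transaction types
types = pd.DataFrame({
    "TRANSACTION_TYPE_ID": range(1, 21),
    "TRANSACTION_TYPE_NAME": [f"type_{i}" for i in range(1, 21)],
})

# Fact table: one row per transaction
facts = pd.DataFrame({
    "DATE": pd.to_datetime("2023-01-01")
            + pd.to_timedelta(rng.integers(0, 517, n), unit="D"),
    "TRANSACTION_TYPE": rng.integers(1, 21, n),
    "BONUS_SUM": rng.gamma(2.0, 50.0, n).round(2),
})

types.to_csv("DIM_TRANSACTION_TYPE.csv", index=False)
facts.to_csv("OLAP.csv", index=False)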

Familiarity with the production solution made it easier to choose test scenarios: I took the most typical query that users send to our OLAP cube.

A query for a slice of amounts, grouped by transaction type and date, in its MDX version:

SELECT
    NON EMPTY { [Measures].[Сумма] } ON COLUMNS,
    NON EMPTY { ([Дата операции].[Дни].[День].ALLMEMBERS * [Типы транзакций].[Тип транзакции].[Тип транзакции].ALLMEMBERS ) } ON ROWS
FROM (
        SELECT ( [Дата операции].[Дни].[День].&[44927] : [Дата операции].[Дни].[День].&[45443] ) ON COLUMNS
        FROM [OLAP]
     )

In the SQL variant:

SELECT
    t.TRANSACTION_TYPE_NAME
    ,f.DATE
    ,SUM(f.BONUS_SUM) AS xSUM
FROM dbo.OLAP AS f
INNER JOIN dbo.DIM_TRANSACTION_TYPE AS t
    ON f.TRANSACTION_TYPE = t.TRANSACTION_TYPE_ID
GROUP BY GROUPING SETS ((t.TRANSACTION_TYPE_NAME , f.DATE), (t.TRANSACTION_TYPE_NAME ), (f.DATE), ())
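As a side note, this particular set of grouping sets, both dimensions together, each one alone, and the grand total, is exactly what the CUBE shorthand expands to, so the query could equivalently be written as:

SELECT
    t.TRANSACTION_TYPE_NAME
    ,f.DATE
    ,SUM(f.BONUS_SUM) AS xSUM
FROM dbo.OLAP AS f
INNER JOIN dbo.DIM_TRANSACTION_TYPE AS t
    ON f.TRANSACTION_TYPE = t.TRANSACTION_TYPE_ID
GROUP BY CUBE (t.TRANSACTION_TYPE_NAME, f.DATE)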

To evaluate the technology, I decided to start by testing the speed of these queries. It was also interesting to compare the storage footprint of each option and the data loading speed.

For the experiment I used my laptop. All test runs were done on it, so the ratio between the metrics allows conclusions to be drawn about the technologies themselves.

Hardware configuration:

Processor: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz

RAM: 32.0 GB

Storage: 500 GB SSD

Experiment

DuckDB captivates with how easy it is to get started. First, let's create a virtual environment:

python.exe -m pip install --upgrade pip
pip install virtualenv
virtualenv venv
venv\Scripts\activate.bat
 
pip install pandasgui
 
pip install duckdb --upgrade

I prepared CSV files with the generated data in advance; let's load them. The largest one weighs 15 GB (110 million rows):

from datetime import datetime
import duckdb

print(datetime.now())
con = duckdb.connect(database="cube.duckdb", read_only=False)

con.execute(r"CREATE TABLE OLAP AS SELECT * FROM read_csv_auto('C:\MyProject\duckdb\seed\OLAP.csv');")
con.execute(r"CREATE TABLE DIM_TRANSACTION_TYPE AS SELECT * FROM read_csv_auto('C:\MyProject\duckdb\seed\DIM_TRANSACTION_TYPE.csv');")
print(datetime.now())

Two minutes later our pocket OLAP cube is ready, and we can query the data. The resulting file weighs 5 GB:

from datetime import datetime
import duckdb
import pandas
from pandasgui import show

print(datetime.now())

con = duckdb.connect(database="cube.duckdb", read_only=True)

df = con.execute("""SELECT
     t.TRANSACTION_TYPE_NAME
    ,f.DATE
    ,SUM(f.BONUS_SUM) AS xSUM
FROM "cube".main.OLAP AS f
INNER JOIN "cube".main.DIM_TRANSACTION_TYPE AS t
    ON f.TRANSACTION_TYPE = t.TRANSACTION_TYPE_ID
GROUP BY GROUPING SETS ((t.TRANSACTION_TYPE_NAME, f.DATE), (t.TRANSACTION_TYPE_NAME), (f.DATE), ())""").df()

print(datetime.now())

show(df)

Ready!

Seven seconds and the data is ready to work with.

The speed of getting started is genuinely impressive: a couple dozen lines of code, plus about half an hour to figure things out and load the data, and the OLAP database is ready.

Comparative tests

For comparison, I ran the same tests against our SSAS OLAP cube, an MSSQL DB with a flat page-compressed table, an MSSQL DB with a columnstore index, and a plain parquet file.
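For reference, the parquet variant can be produced and queried straight from DuckDB; a sketch (the file name is mine, and whether the original file was exported exactly this way is my assumption):

import duckdb

con = duckdb.connect(database="cube.duckdb")
# Export the fact table to a compressed parquet file
con.execute("COPY OLAP TO 'OLAP.parquet' (FORMAT PARQUET)")
# ...and query the file directly, with no database attached at all
print(duckdb.query("SELECT COUNT(*) FROM read_parquet('OLAP.parquet')").fetchall())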

Variant                     Query execution speed (sec)   Space occupied (GB)
original csv                –                             15
parquet                     360                           1.5
OLAP                        1                             7
mssql (page compression)    93                            2.5
mssql (columnstore)         2                             0.37
DuckDB (persistent)         7                             5
DuckDB (in-memory)          6                             –

Conclusions

Of course, DuckDB's metrics are inferior to both the MSSQL OLAP cube and MSSQL's columnstore technology. However, both of those require expensive licenses and powerful server hardware.

For a solution that can be deployed from a Python library in tens of minutes, I would call these metrics excellent. The question of distribution remains open. MotherDuck solves it, of course, but it immediately brings vendor lock-in and license costs. Most likely, I will run my next experiment with distribution via S3.

The second open question is the barrier to entry for users. SSAS lets you use Excel as a client, which makes analytics available to a wide range of users. DuckDB, like many other OLAP solutions on the market, does not allow this and demands more of its users. A project like Mondrian looks interesting here; most likely, it will become the subject of one of my future studies.
