Why Apache Druid worked effectively for loading table files

Hi all! My name is Amir, and I am a Data Engineer at DUK Technologies. I'll tell you how we designed and implemented a storage system for disparate tabular data using Apache Druid.

Thousands of the customer's employees create hundreds of tables every day: reports, lists, forecasts, statistics. The customer came to us with a request: to create a temporary data storage for building analytical data marts.

In this article I will describe why we chose Druid for the project, what implementation details we ran into, and how we compared approaches to organizing datasources.

Task

The task was to create a temporary data store (TDS) that stores the uploaded JSON/CSV/XLS/XLSX data in databases, and then to build analytical data marts on top of that data.

Required system characteristics:

  • Online data storage with the ability to upload JSON/CSV/XLS/XLSX files.

  • Supports a large number of simultaneous sessions.

  • Horizontal scaling support.

  • Support for creating data marts.

  • Ability to access the data marts from BI tools and Dremio.

  • Fault tolerance.

Why Apache Druid?

The tool we chose for storing data from files was Apache Druid: a real-time columnar database well suited for quickly processing large, rarely changing data sets and providing immediate access to them.

Apache Druid has a distributed, elastic architecture that pre-aggregates data from a common data layer across a virtually unlimited cluster of data servers. For analytical workloads, this architecture gives it better performance and scalability than general-purpose databases such as PostgreSQL and MySQL.

Apache Druid is well suited as a backend for analytics applications with graphical user interfaces (GUIs), and it fits well behind highly concurrent APIs that require fast data aggregation.

Apache Druid supports many methods and sources for ingesting and storing data, and provides a wide range of built-in APIs for working with the data and with the database itself. This greatly simplifies interacting with the database from external tools and building your own UI and back end on top of it.

Apache Druid is used by Airbnb, British Telecom, PayPal, Salesforce, Netflix, and Cisco.

Every company uses Apache Druid in a unique way:

  • Netflix, for example, uses the tool to aggregate multiple data streams, ingesting up to 2 terabytes of data every hour.

  • Cisco uses Druid in its real-time analytics platform to collect network flow data.

  • Twitch uses Druid to help its employees drill down into high-level metrics rather than read generated reports.

Now a few words about how Druid stores data internally. This will help in understanding the difficulties we ran into later.

At the macro level, all data stored by Druid lives in databases, which in turn are divided into datasources. In meaning and structure a datasource is closest to a "table" in a relational database, but the way it is written to disk is different because of the focus on big data and columnar storage. Each datasource can have its own schema, loading rules, retention rules, and so on.

Each record in a datasource must have a timestamp. Based on it, the data is divided into time periods called time chunks; this is how partitioning is built into the system design. In addition to time, the data can be further partitioned by other attributes, but this is optional.

Time chunks, in turn, are divided into segments, which are written to permanent storage.

Scheme 1

A segment is an archive containing several special files that store some part of the datasource. The data inside is written column by column.

Moreover, each segment has its own data schema. It may differ from segment to segment, but any of them will always be a subset of the overall datasource schema.
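To see this layout in practice, you can list a datasource's segments through Druid's SQL API and the built-in sys.segments system table. Below is a minimal sketch; the router address and datasource name are assumptions for illustration:

```python
import requests

# Assumed Druid router/broker address; adjust for your cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# sys.segments is a built-in system table listing every segment,
# its time interval, size and row count.
sql = """
SELECT "segment_id", "start", "end", "size", "num_rows"
FROM sys.segments
WHERE "datasource" = 'example_datasource'
ORDER BY "start"
"""

response = requests.post(DRUID_SQL_URL, json={"query": sql}, timeout=30)
response.raise_for_status()

for segment in response.json():
    print(segment["segment_id"], segment["start"], segment["end"], segment["num_rows"])
```

Each row of the result corresponds to one segment of one time chunk, which makes the partitioning scheme described above easy to inspect.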

Implementation of the solution

We worked with Apache Druid version 24.0.1.

To achieve our goals, we created a web interface that connected to Apache Druid and sent commands to it via its API.

To load user data into Apache Druid, we used the built-in batch loading method (native batch ingestion). With it, you submit a so-called ingestion task spec (hereinafter ITS) to Druid, which describes metadata about the uploaded file: its location, format, the columns to load, granularity, and so on.

The specification lets you load data into Druid quickly and efficiently using the specified parameters and settings. The large number of configurable parameters helps you fine-tune data indexing and get optimal query performance in Druid.

There are simple cases when data is always loaded from a single source and the format, number, and names of the columns never change. In our case, however, the data in the files could be completely different, which created some inconvenience: an individual ITS had to be generated for each file.

Below are the main parameters that can be used in an ingestion spec (a worked example follows the list):

  • `type`: the type of the data source (e.g. a file, Kafka, JDBC).

  • `dimensions`: the list of dimensions (fields) that will be indexed.

  • `metrics`: the list of metrics that will be calculated and stored in the index.

  • `timestamp`: the timestamp field in the data.

  • `granularity`: defines the size of the time window and how often the data is indexed.

  • `inputFormat`: the format of the data in the data source.

  • `parser`: options used to convert the data into a format Druid understands.

  • `tuningConfig`: indexing settings such as segment block size, garbage collection settings, and others.

  • `ioConfig`: the actual settings for reading and writing data.

  • `firehose`: defines how the data stream is received.
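For illustration, here is a minimal sketch of what such a spec might look like for a CSV file and how it can be submitted to Druid's task API. The datasource name, file path, columns, and router address are assumptions, not the project's real values:

```python
import requests

# Assumed router address; it proxies the Overlord's task API.
DRUID_TASK_URL = "http://localhost:8888/druid/indexer/v1/task"

# A minimal native batch ingestion task spec for a CSV file.
# Datasource name, paths and columns are illustrative only.
ingestion_task_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "local",
                "baseDir": "/data/uploads",
                "filter": "sales_report.csv",
            },
            "inputFormat": {"type": "csv", "findColumnsFromHeader": True},
        },
        "dataSchema": {
            "dataSource": "amir.sales_report",
            "timestampSpec": {"column": "upload_ts", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["region", "product", "amount"]},
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
                "rollup": False,
            },
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

response = requests.post(DRUID_TASK_URL, json=ingestion_task_spec, timeout=30)
response.raise_for_status()
print("Task submitted:", response.json()["task"])  # Druid returns the task ID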

Tool interaction

We chose Python and PostgreSQL as the backend technologies. Below I'll explain why and for what purposes.

General scheme of interaction between tools:

Apache Druid Metadata in PostgreSQL

Apache Druid stores the metadata needed for the cluster to operate in external storage, keeping internally only the actual data uploaded by users. Derby, MySQL, and PostgreSQL are supported as metadata stores. We chose PostgreSQL because it is a stable and flexible open-source solution that is widely used in many large projects.

And since we planned to have our own web interface with separate authentication, a display of upload history, and other user actions, PostgreSQL also became the storage for our web interface and its user data.
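For reference, pointing Druid at PostgreSQL for metadata comes down to a few properties in the common runtime configuration; the host, database name, and credentials below are placeholders:

```
# common.runtime.properties (values are placeholders)
# add postgresql-metadata-storage alongside whatever extensions you already load
druid.extensions.loadList=["postgresql-metadata-storage"]

druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db-host:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=...
```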

Python is the bridge between Apache Druid and our web interface

We used Python to write the back end of our web interface; in particular, we used the FastAPI library to write the microservices called by our front end.

Python was a good fit for our tasks, since it has ready-made libraries both for working with PostgreSQL and for calling and processing third-party APIs.

We wrote several microservices through which the web interface could interact with PostgreSQL and Druid, and built functionality for uploading files to the server and processing them before loading into Apache Druid.
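As a sketch of what one of these microservices might look like, here is a minimal FastAPI endpoint that accepts an uploaded file, saves it to the server, and hands it off to a helper that builds and submits the ITS. The directory, the endpoint path, and the build_and_submit_its helper are hypothetical names used only for this illustration:

```python
from pathlib import Path

from fastapi import FastAPI, UploadFile

app = FastAPI()

# Assumed directory where uploaded files are kept until ingestion.
UPLOAD_DIR = Path("/data/uploads")


def build_and_submit_its(file_path: Path, user_login: str) -> str:
    """Hypothetical helper: fills the ITS template for this file and
    submits it to Druid's task API, returning the Druid task ID."""
    raise NotImplementedError


@app.post("/upload")
async def upload_file(file: UploadFile, user_login: str) -> dict:
    # Save the uploaded file on the server before ingestion.
    target = UPLOAD_DIR / f"{user_login}.{file.filename}"
    target.write_bytes(await file.read())

    # Kick off ingestion of the file into its own datasource.
    task_id = build_and_submit_its(target, user_login)
    return {"saved_as": target.name, "task_id": task_id}
```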

Dremio as an alternative to Apache Druid's built-in web console

Dremio has a data virtualization engine that connects to various data sources, including databases, data lakes, and cloud storage platforms. The engine then creates a unified view of the data that can be accessed and queried using SQL and other standard tools.

Implementation problems and their solutions

The first big difficulty we had to solve was generating an individual ITS for each file: different files could have different numbers of columns, with different names and granularities.

We needed to create a flexible ITS template that could be adjusted at the back-end level at the moment files were uploaded to the server.

To do this, we created a table in PostgreSQL where we stored ITS templates with placeholders instead of parameter values, and wrote a function that received the required parameter values from Python, read the data from the file, and substituted the necessary metadata into the template.
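Here is a minimal sketch of that substitution step, assuming the template is stored as JSON text with named placeholders; the placeholder names and default values are invented for the example:

```python
import json
from string import Template

# A shortened ITS template as it might be stored in PostgreSQL:
# placeholders instead of concrete values.
ITS_TEMPLATE = Template("""
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {"type": "local", "baseDir": "$base_dir", "filter": "$file_name"},
      "inputFormat": {"type": "$input_format", "findColumnsFromHeader": true}
    },
    "dataSchema": {
      "dataSource": "$datasource",
      "timestampSpec": {"column": "$timestamp_column", "format": "auto"},
      "dimensionsSpec": {"dimensions": $dimensions}
    },
    "tuningConfig": {"type": "index_parallel"}
  }
}
""")


def render_its(file_name: str, datasource: str, columns: list[str]) -> dict:
    """Fill the template with metadata read from the uploaded file."""
    rendered = ITS_TEMPLATE.substitute(
        base_dir="/data/uploads",
        file_name=file_name,
        input_format="csv",
        datasource=datasource,
        timestamp_column="upload_ts",
        dimensions=json.dumps(columns),
    )
    return json.loads(rendered)
```

The resulting dictionary can then be POSTed to Druid's task API as shown earlier.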

The next difficult step was deciding how to store the data in Druid: create a separate datasource for each new file, or create a single datasource per user into which all of that user's files would be loaded. We studied the pros and cons of both options: one file – one datasource (OFOD) and one user – one datasource (OPOD).

Each of the two options had its own advantages and drawbacks. Based on the combination of pros and cons of OPOD and OFOD, we decided to use the OFOD method for the initial implementation.

To clearly separate datasources between users, we decided that each user's datasource name would follow the pattern "login.filename" (the user's login, a dot, and the name of the uploaded file), and that each user would have access only to their own datasources.

We also needed to resolve how to store files uploaded via the web interface and then delete them after their data had been loaded into Apache Druid.

We decided to create a directory on the server where these files are stored and then deleted on a daily schedule.
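A sketch of such a cleanup job, assuming it runs once a day (for example, from cron) and deletes uploaded files older than a day; the directory path is a placeholder, and a real implementation would also check that ingestion of each file has completed:

```python
import time
from pathlib import Path

UPLOAD_DIR = Path("/data/uploads")  # placeholder path
MAX_AGE_SECONDS = 24 * 60 * 60      # keep files for one day


def clean_upload_dir() -> None:
    """Delete uploaded files whose data has already been ingested into Druid."""
    now = time.time()
    for file_path in UPLOAD_DIR.iterdir():
        if file_path.is_file() and now - file_path.stat().st_mtime > MAX_AGE_SECONDS:
            file_path.unlink()


if __name__ == "__main__":
    clean_upload_dir()
```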

Standard Apache Druid workflow:

Data marts without join support

The version of Apache Druid we used did not support DDL commands such as CREATE TABLE/VIEW; instead, it was possible to create modified copies of existing datasources using INSERT-SELECT or REPLACE-SELECT constructs.

But this could only be done through the built-in web console or the API; these constructs did not work over JDBC. Future versions promised to lift this limitation, but we could not wait and came up with our own solution.
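For reference, this is roughly how a REPLACE-SELECT statement can be submitted through the API in this generation of Druid, using the SQL task endpoint that came with the multi-stage query engine; the datasource and column names are made up:

```python
import requests

# The router proxies the SQL task API of the multi-stage query engine.
DRUID_SQL_TASK_URL = "http://localhost:8888/druid/v2/sql/task"

# Build a "mart" as a modified copy of an existing datasource.
# Datasource and column names are illustrative.
query = """
REPLACE INTO "amir.sales_mart" OVERWRITE ALL
SELECT __time, region, amount
FROM "amir.sales_report"
WHERE region = 'EU'
PARTITIONED BY DAY
"""

response = requests.post(DRUID_SQL_TASK_URL, json={"query": query}, timeout=30)
response.raise_for_status()
print("Ingestion task:", response.json()["taskId"])
```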

Implementation results

As a result of implementation, we received a working product:

  • All tables are stored in a single storage.

  • Web interface with JWT authentication.

  • Authentication through State Services.

  • A Dremio ARP connector we developed, which translates queries from Dremio over JDBC and lets us bypass the built-in Apache Druid web console when working with datasources.

  • A history of uploaded files, which lets users load their file data into Apache Druid for further work with third-party BI applications.

What are the product development plans?

  • Apache Druid has its own web interface. The plan is to develop a minimalistic custom console for BI analysts.

  • User management and logging interfaces.

  • Implementation of CI/CD to increase development speed.

  • Monitoring based on Grafana.
