Incredible parallel Apache Parquet I/O performance in Python

In this article, I will talk about how Parquet compresses large datasets into a small-footprint file, and how we can achieve bandwidth far exceeding that of the I/O stream using concurrency (multithreading).

Apache Parquet: Best at Low-Entropy Data

As you can see from the Apache Parquet format specification, it contains several levels of encoding that make it possible to achieve a significant reduction in file size, including:

  • Dictionary encoding (compression), similar to the way pandas.Categorical represents data, though the concepts themselves are different;
  • Compression of data pages (Snappy, Gzip, LZO or Brotli);
  • Run-length encoding (for nulls and dictionary indices) and integer bit packing.

To show you how this works, let’s look at a dataset:

['banana', 'banana', 'banana', 'banana', 'banana', 'banana',
 'banana', 'banana', 'apple', 'apple', 'apple']

Almost all Parquet implementations use dictionary compression by default. Thus, the encoded data looks like this:

dictionary: ['banana', 'apple']
indices: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
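
For comparison, the pandas.Categorical representation mentioned above stores the same information as a list of categories plus integer codes. A minimal sketch (note that pandas sorts the categories, so the codes come out in a different order than the Parquet dictionary above):

import pandas as pd

# The same column as a pandas.Categorical: unique values are stored once,
# and each element becomes an integer code into that list.
values = ['banana'] * 8 + ['apple'] * 3
cat = pd.Categorical(values)

print(list(cat.categories))  # ['apple', 'banana'] -- pandas sorts the categories
print(list(cat.codes))       # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]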

The dictionary indices are further compressed with a run-length encoding algorithm:

dictionary: ['banana', 'apple']
indices (RLE): [(8, 0), (3, 1)]

By reversing these steps, you can easily restore the original array of strings.
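
As an illustration only (Parquet's real RLE/bit-packing hybrid is more involved than this), here is a toy version of the dictionary plus run-length round trip in plain Python:

from itertools import groupby

def encode(values):
    # Toy dictionary + run-length encoding, not the actual Parquet codec.
    dictionary = list(dict.fromkeys(values))         # unique values in order of appearance
    indices = [dictionary.index(v) for v in values]  # dictionary indices
    rle = [(len(list(run)), idx) for idx, run in groupby(indices)]
    return dictionary, rle

def decode(dictionary, rle):
    # Reverse the steps to recover the original array of strings.
    return [dictionary[idx] for count, idx in rle for _ in range(count)]

data = ['banana'] * 8 + ['apple'] * 3
dictionary, rle = encode(data)
print(dictionary)  # ['banana', 'apple']
print(rle)         # [(8, 0), (3, 1)]
assert decode(dictionary, rle) == data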

In my previous article, I created a dataset that compresses very well in this way. With pyarrow, we can enable or disable dictionary encoding (it is enabled by default) to see how this affects file size:

import pyarrow.parquet as pq

pq.write_table(dataset, out_path, use_dictionary=True,
               compression='snappy')

A dataset that occupies 1 GB (1024 MB) as a pandas.DataFrame takes only 1.436 MB with Snappy compression and dictionary encoding, small enough to fit on a floppy disk. Without dictionary encoding it occupies 44.4 MB.
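
A minimal sketch of how such a comparison could be made, assuming dataset is the pyarrow.Table from the snippet above (the file names here are made up for the example):

import os
import pyarrow.parquet as pq

# Write the same table twice: with and without dictionary encoding.
pq.write_table(dataset, 'with_dict.parquet',
               use_dictionary=True, compression='snappy')
pq.write_table(dataset, 'no_dict.parquet',
               use_dictionary=False, compression='snappy')

# Compare the resulting file sizes in megabytes.
for path in ('with_dict.parquet', 'no_dict.parquet'):
    print(path, round(os.path.getsize(path) / 2**20, 3), 'MB')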

Parallel reading in parquet-cpp using PyArrow

The ability to read columns in parallel has been added to parquet-cpp, the C++ implementation of Apache Parquet that we have made available to Python through PyArrow.

To try this feature, install PyArrow from conda-forge:

conda install pyarrow -c conda-forge

Now, when reading a Parquet file, you can use the nthreads argument:

import pyarrow.parquet as pq

table = pq.read_table(file_path, nthreads=4)

For low-entropy data, decompression and decoding are strongly CPU-bound. Since the C++ code does all the work for us, there are no problems with GIL concurrency, and we can achieve a significant speedup. Here is what I was able to achieve by reading a 1 GB dataset into a pandas DataFrame on a quad-core laptop (Xeon E3-1505M, NVMe SSD):

You can see the full benchmark script here.
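
For reference, a minimal sketch of how such a read could be timed; this is not the author's benchmark script, and the file name and thread counts here are only assumptions:

import time
import pyarrow.parquet as pq

# Time a single-threaded read versus a 4-thread read into pandas.
for n in (1, 4):
    start = time.time()
    df = pq.read_table('dataset.parquet', nthreads=n).to_pandas()
    print(f'{n} thread(s): {time.time() - start:.2f} s')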

I have included performance numbers here both for the dictionary-encoded case and for the case without dictionary encoding. For low-entropy data, even though all the files are small (~1.5 MB with dictionaries and ~45 MB without), dictionary encoding significantly affects performance. With 4 threads, pandas read performance climbs to 4 GB/s. This is much faster than the Feather format or any other format I know of.

Conclusion

With the 1.0 release of parquet-cpp (Apache Parquet in C++), you can see for yourself this increased I/O performance now available to Python users.

Since all of the core machinery is implemented in C++, other languages (for example, R) can build interfaces to Apache Arrow (columnar data structures) and parquet-cpp. The Python binding is a lightweight wrapper around the underlying C++ libraries libarrow and libparquet.

