Incredible parallel Apache Parquet I/O performance in Python
In this article, I will discuss how Parquet compresses large datasets into a small file footprint, and how we can achieve bandwidth far exceeding that of the underlying I/O stream using concurrency (multithreading).
Apache Parquet: Best at Low-Entropy Data
As you can see from the Apache Parquet format specification, it contains several levels of encoding that achieve a significant reduction in file size, including:
- Dictionary encoding (similar to how pandas.Categorical represents data, though the underlying concepts differ);
- Compression of data pages (Snappy, Gzip, LZO, or Brotli);
- Run-length encoding (of null indicators and dictionary indices) and integer bit packing.
To show you how this works, let’s look at a dataset:
['banana', 'banana', 'banana', 'banana', 'banana', 'banana',
'banana', 'banana', 'apple', 'apple', 'apple']
Almost all Parquet implementations use dictionary encoding by default. The encoded data then looks like this:
dictionary: ['banana', 'apple']
indices: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
The dictionary indices are further compressed with run-length encoding:
dictionary: ['banana', 'apple']
indices (RLE): [(8, 0), (3, 1)]
Reversing these steps, you can easily reconstruct the original array of strings.
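The two steps above can be sketched in plain Python. This is a simplified illustration, not Parquet's actual implementation (which also bit-packs the indices):

```python
def dictionary_encode(values):
    """Map each value to an index into a dictionary of unique values."""
    dictionary, seen, indices = [], {}, []
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

def rle_encode(indices):
    """Collapse runs of equal indices into (run_length, index) pairs."""
    runs = []
    for idx in indices:
        if runs and runs[-1][1] == idx:
            runs[-1][0] += 1
        else:
            runs.append([1, idx])
    return [tuple(r) for r in runs]

def decode(dictionary, runs):
    """Reverse both steps to recover the original values."""
    return [dictionary[idx] for length, idx in runs for _ in range(length)]

data = ['banana'] * 8 + ['apple'] * 3
dictionary, indices = dictionary_encode(data)
runs = rle_encode(indices)
print(dictionary)  # ['banana', 'apple']
print(runs)        # [(8, 0), (3, 1)]
assert decode(dictionary, runs) == data
```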
In my previous article I created a dataset that compresses very well in this way. When working with pyarrow, we can enable and disable dictionary encoding (which is enabled by default) to see how it affects file size:
import pyarrow.parquet as pq
pq.write_table(dataset, out_path, use_dictionary=True,
               compression='snappy')
A dataset that occupies 1 GB (1024 MB) as a pandas.DataFrame takes only 1.436 MB with Snappy compression and dictionary encoding, small enough to fit on a floppy disk. Without dictionary encoding, it occupies 44.4 MB.
Parallel reading in parquet-cpp using PyArrow
parquet-cpp, the C++ implementation of Apache Parquet that we made available to Python through PyArrow, has gained the ability to read columns in parallel.
To try this feature, install PyArrow from conda-forge:
conda install pyarrow -c conda-forge
Now, when reading a Parquet file, you can use the nthreads argument:
import pyarrow.parquet as pq
table = pq.read_table(file_path, nthreads=4)
For low-entropy data, decompression and decoding are strongly CPU-bound. Since C++ does all the work for us, the GIL poses no concurrency problems, and we can achieve a significant speedup. See what I was able to achieve by reading a 1 GB dataset into a pandas DataFrame on a quad-core laptop (Xeon E3-1505M, NVMe SSD):
You can see the full benchmark script here.
I have included results for both cases: with and without dictionary encoding. Even though all the files are small (~1.5 MB with dictionaries and ~45 MB without), dictionary encoding significantly affects read performance for low-entropy data. With 4 threads, read performance into pandas rises to 4 GB/s. This is much faster than the Feather format or any other I know of.
Conclusion
With the 1.0 release of parquet-cpp (Apache Parquet in C++), you can see for yourself the increased I/O performance now available to Python users.
Since all the basic mechanisms are implemented in C++, other languages (for example, R) can build interfaces to Apache Arrow (columnar data structures) and parquet-cpp. The Python binding is a lightweight wrapper around the underlying C++ libraries libarrow and libparquet.