A Beginner’s Guide to Columnar File Formats in Spark and Hadoop


This article is a continuation of the Spark and Hadoop file format guide, which is a good starting point if you don’t already know anything about big data file formats.

What is a “columnar file format”?

This term is often used, but I’m not sure that everyone is completely clear what it means in practice.

The textbook definition says that columnar (column-oriented) file formats store data in columns rather than rows. CSV, TSV, JSON, and Avro are traditional row-based file formats. Parquet and ORC are columnar file formats.

Let’s illustrate the difference between the two using some example data and a simple, human-readable columnar file format I just came up with.

Here is the data schema for our people dataset. It’s pretty simple:

{
  id: Integer,
  first_name: String,
  last_name: String,
  age: Integer,
  cool: Boolean,
  favorite_fruit: Array[String]
}

Data in row-based formats

We could represent this dataset in several formats; for example, here it is as newline-delimited JSON:

{"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19, "cool": true, "favorite_fruit": ["bananas", "apples"]}
{"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102, "cool": true, "favorite_fruit": null}

Here is the same data in everyone’s favorite CSV format:

1, Matthew, Rathbone, 19, True, ['bananas', 'apples']
2, Joe, Bloggs, 102, True,

In both of these formats, each line of the file contains a complete row of data. This is how people tend to think about data: neat rows lined up one after another, easy to scan, easy to edit in Excel, easy to read in a terminal or text editor.

Columnar data

For demonstration purposes, I am going to introduce a new columnar (pseudo) file format. Let’s call it CCSV (Columnar CSV).

Each line of our CCSV file has the following content:

Field Name/Field Type/Number of Characters:[data in csv format]

Here is the same data presented in CCSV:

ID/INT/3:1,2
FIRST_NAME/STRING/11:Matthew,Joe
LAST_NAME/STRING/15:Rathbone,Bloggs
AGE/INT/6:19,102
COOL/BOOL/3:1,1
FAVORITE_FRUIT/ARRAY[STRING]/19:[bananas,apples],[]

Notice that all the names, ages, and favorite fruits are stored next to each other, rather than alongside the other data from the same record. To keep each column section from getting too large, CCSV repeats this pattern every 1,000 records. So a file of 10,000 records would contain 10 sections of grouped column data.
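
To make the transposition concrete, here is a minimal Python sketch of a CCSV writer. CCSV is a made-up format, so everything below (the field list, the value formatting, the write_ccsv helper) is purely illustrative: it turns a list of row-style records into the column-per-line layout shown above, starting a new group of column lines every 1,000 records.

# Sketch of a writer for the hypothetical CCSV format described above.
GROUP_SIZE = 1000  # start a new section of column lines every 1,000 records

FIELDS = [
    ("ID", "INT"),
    ("FIRST_NAME", "STRING"),
    ("LAST_NAME", "STRING"),
    ("AGE", "INT"),
    ("COOL", "BOOL"),
    ("FAVORITE_FRUIT", "ARRAY[STRING]"),
]

def format_value(name, value):
    # Render one field value the way the CCSV example above does.
    if value is None:
        return "[]" if name == "FAVORITE_FRUIT" else ""
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, list):
        return "[" + ",".join(value) + "]"
    return str(value)

def write_ccsv(records, path):
    with open(path, "w") as out:
        # Work through the dataset one group of records at a time.
        for start in range(0, len(records), GROUP_SIZE):
            group = records[start:start + GROUP_SIZE]
            for name, ftype in FIELDS:
                # Collect this column's values from every record in the group.
                data = ",".join(format_value(name, r[name.lower()]) for r in group)
                out.write(f"{name}/{ftype}/{len(data)}:{data}\n")

people = [
    {"id": 1, "first_name": "Matthew", "last_name": "Rathbone", "age": 19,
     "cool": True, "favorite_fruit": ["bananas", "apples"]},
    {"id": 2, "first_name": "Joe", "last_name": "Bloggs", "age": 102,
     "cool": True, "favorite_fruit": None},
]
write_ccsv(people, "people.ccsv")

Running this on the two example records reproduces the six CCSV lines shown above exactly.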

Can a human read CCSV files? In principle, yes, but it would be hard to get a holistic view of the data if you opened one in Excel. However, CCSV has several useful properties that make it friendlier for computers, and I’ll talk about them now.

Benefits of columnar formats

Optimizing reading

Let’s imagine that I want to execute an SQL query on this data, for example:

SELECT COUNT(1) FROM people WHERE last_name = 'Rathbone'

With a regular CSV file, the SQL engine would have to scan every row, parse every column, extract the last_name value, and then count all the Rathbone values it sees.

In CCSV, the SQL engine can safely skip the first two lines and scan only the third, which contains all of the last name values.

Why is that good? The SQL engine now only processes about 1/6 of the data, so CCSV just delivered a (theoretical and totally unreasonable) 600% performance boost over a regular CSV file.
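
Here is a rough sketch of what that looks like in code for the hypothetical CCSV file written above. The count_matches helper is an assumption, not a real query engine: it walks the file line by line, but only splits and counts the one column it actually needs, skipping the other column sections after a glance at the line header.

# Count records where last_name == "Rathbone" without parsing the other columns.
def count_matches(path, column, wanted):
    count = 0
    with open(path) as f:
        for line in f:
            header, _, data = line.rstrip("\n").partition(":")
            name = header.split("/")[0]
            if name != column:
                continue  # skip every column section we don't care about
            count += sum(1 for value in data.split(",") if value == wanted)
    return count

print(count_matches("people.ccsv", "LAST_NAME", "Rathbone"))  # -> 1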

Imagine the same gain on a petabyte-scale dataset. It’s not hard to see how a columnar optimization like this could save a ton of processing power (and money) compared to plain JSON datasets. This is the core value of columnar file formats.

Of course, CCSV would still need a lot of work to become a viable file format, but that takes us a little off topic, so I won’t dwell on it here.

Compression improvements

Storing data of the same type together also has benefits for compression codecs. Many compression codecs (including GZIP and Snappy) achieve a higher compression ratio when compressing sequences of similar data. By storing records column by column, each section of column data will, in many cases, contain similar values, which makes it very well suited to compression. In fact, each column can be compressed independently of the others for further optimization.
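
You can see the effect with a tiny experiment: compress the same made-up values laid out row by row versus column by column and compare the sizes. This is only an illustrative sketch; the exact numbers will vary with the codec and the data.

import gzip
import random

random.seed(42)

# Made-up dataset: 10,000 records with a few repeating last names and random ages.
names = [random.choice(["Rathbone", "Bloggs", "Smith", "Jones"]) for _ in range(10_000)]
ages = [str(random.randint(1, 100)) for _ in range(10_000)]

# Row-oriented layout: each record's values sit next to each other.
row_oriented = "\n".join(f"{n},{a}" for n, a in zip(names, ages)).encode()

# Column-oriented layout: all the names together, then all the ages together.
column_oriented = ("\n".join(names) + "\n" + "\n".join(ages)).encode()

print("row-oriented gzip size:   ", len(gzip.compress(row_oriented)))
print("column-oriented gzip size:", len(gzip.compress(column_oriented)))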

Disadvantages of columnar formats

The biggest disadvantage of columnar formats is that reconstructing a complete record is slower: it requires reading a segment from every column, one after another. It is for this reason that columnar file formats are best suited to analytics workflows rather than MapReduce-style workflows, which by default operate on whole rows of data at a time.
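
To see why, here is a sketch of rebuilding a single record from the hypothetical CCSV file above. Even to recover one row, the reader has to visit every column line and pull out the value at the same position (the naive comma split below also mishandles commas inside array values, which hints at why real formats need more careful encoding).

# Reconstruct record number `index` from a CCSV file: every column must be read.
def read_record(path, index):
    record = {}
    with open(path) as f:
        for line in f:
            header, _, data = line.rstrip("\n").partition(":")
            name = header.split("/")[0]
            record[name] = data.split(",")[index]
    return record

print(read_record("people.ccsv", 0))
# {'ID': '1', 'FIRST_NAME': 'Matthew', 'LAST_NAME': 'Rathbone', 'AGE': '19', ...}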

For real columnar file formats (e.g. Parquet), this disadvantage is minimized by some clever tricks, such as splitting the file into “row groups” and building rich metadata, although for especially wide datasets (e.g. 200+ columns) the speed impact can still be significant.

Another drawback is that they require more CPU and RAM to write, since the file writer has to collect a whole bunch of metadata and reorganize the rows into columns before it can write the file.

As a side note, I almost always recommend using a columnar file format if possible so that you can quickly peek into the file and collect some simple metrics.

Real columnar file formats

Now that you know what to expect, let’s take a quick look at a real file format: Parquet. There are much more detailed tutorials on Parquet out there, and I recommend reading the official Parquet documentation in particular to understand how it all works.

The Parquet documentation includes a diagram showing the structure of a Parquet file. It’s a little overwhelming, but I think the key takeaway is how much care goes into organizing the data and the metadata. I’ll only give a general outline here, because the documentation is both exhaustive and fairly approachable, as long as you have some idea of what’s going on (and by this point you probably do!).
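
To bring it back to Spark, here is a minimal PySpark sketch of the workflow this article has been describing. The output path and app name are just placeholders; the point is that after writing the people dataset to Parquet, a filtered count like the one from the CCSV example should only need to read the last_name column chunks (plus the file metadata), not the whole file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

people = spark.createDataFrame(
    [
        (1, "Matthew", "Rathbone", 19, True, ["bananas", "apples"]),
        (2, "Joe", "Bloggs", 102, True, None),
    ],
    ["id", "first_name", "last_name", "age", "cool", "favorite_fruit"],
)

# Each column is stored (and compressed) separately inside the Parquet file.
people.write.mode("overwrite").parquet("/tmp/people.parquet")

# Thanks to column pruning, Spark only reads the last_name column to answer this.
count = (
    spark.read.parquet("/tmp/people.parquet")
    .where("last_name = 'Rathbone'")
    .count()
)
print(count)  # -> 1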

Conclusions

For Spark, Spark SQL, Hive, Impala, and other similar technologies, columnar storage can provide up to 100x, and sometimes 1000x, performance improvements, especially for sparse queries on very wide datasets. Despite their complexity, these files are optimized for computers, and while that makes them hard for a human to read, they do represent a big improvement over formats like JSON or Avro.

Wow, have you read this far?

Thanks for reading to the very end! You must be deeply passionate about big data to have made it this far. If you ever want to share ideas or get advice on what you’re working on, feel free to email me – matthew (at) rathbonelabs (dot com). I love talking about big data with smart people.

