Sorting cheat sheet for Data Science

image

Sorting data is a basic concern for data scientists and engineers. Python users can choose among a number of libraries with built-in, optimized sorting options, some of which even run in parallel on the GPU. Surprisingly, some sorting methods do not use the algorithm their name suggests, while others do not perform as expected.

Choosing a library and a sorting algorithm is not always easy, and the implementations change at a fast pace. As of this writing, the Pandas documentation does not even match the code (although my PR updating the sorting options was recently accepted).

In this article I'll explain what's what, give you a few tips to help you keep the methods straight, and share the results of a speed test.

UPDATE July 17, 2019: The speed test results now include the PyTorch and TensorFlow GPU implementations. TensorFlow also appears in two CPU variants: tensorflow==2.0.0-beta1 and tensorflow-gpu==2.0.0-beta1. Interesting observations: the PyTorch GPU literally flies, while the TensorFlow GPU is slower than the TensorFlow CPU.

Context

There are many basic sorting algorithms. Some are fast or use little memory, others work well with large amounts of data. For some algorithms, the relative order of equal elements is preserved (such algorithms are called stable). The diagram at the beginning of the article shows the time and space complexity of the most common algorithms.

In fact, you don't need to be an expert in the underlying implementations to solve most sorting problems; premature optimization is famously considered the root of all evil. However, if you need to sort large amounts of data repeatedly, it can be very useful to know in advance which library and which keywords to use. I present to you my cheat sheet.

image

Sorting algorithms have changed in most libraries over the years. The following software versions were used for the analysis in this article.

python 3.6.8
numpy 1.16.4
pandas 0.24.2
tensorflow==2.0.0-beta1 #tensorflow-gpu==2.0.0-beta1 slows sorting
pytorch 1.1

Let’s start with the basics.

Python (vanilla)

Python has two built-in sorting methods.

  • my_list.sort() sorts the list in place, replacing the original list with the sorted one. sort() returns None.
  • sorted(my_list) returns a sorted copy of any iterable as a list. sorted() does not change the original data.

In theory, sort() should be faster since it sorts in place. Surprisingly, there is not much difference between them. Also, sort() should be used carefully, as it overwrites the original data.

All the Vanilla Python implementations we cover in this article sort in ascending order by default, from smallest to largest. Most of the other sorting methods discussed below do the same, but the keyword for reversing the order varies from library to library, which is not kind to your memory.

To change the sort order to descending in Vanilla Python, pass reverse=True.

The key keyword lets you sort by your own criterion. For instance, sort(key=len) sorts the list items by their length.
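A minimal sketch pulling these pieces together (the list contents are made up for illustration):

```python
words = ["pear", "fig", "banana", "kiwi"]

# sorted() returns a new list; the original is untouched.
assert sorted(words) == ["banana", "fig", "kiwi", "pear"]
assert words == ["pear", "fig", "banana", "kiwi"]

# reverse=True flips the order to descending.
assert sorted(words, reverse=True) == ["pear", "kiwi", "fig", "banana"]

# key=len sorts by length; ties keep their original order (Timsort is stable).
assert sorted(words, key=len) == ["fig", "pear", "kiwi", "banana"]

# list.sort() sorts in place and returns None.
assert words.sort() is None
assert words == ["banana", "fig", "kiwi", "pear"]
```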

The only sorting algorithm used in Vanilla Python is Timsort. Timsort chooses its approach based on the characteristics of the data being sorted; for example, if you are sorting a short list, insertion sort is used. For more information on Timsort, see Brandon Skerritt's great article.

Timsort, and hence Vanilla Python sorting, is stable. That is, elements with equal keys keep their original relative order in the output.
To remember the difference between sort() and sorted(), note that sorted() is the longer word, so it should take objectively longer to run, since both the original data and a copy must be kept. Even if the benchmark results are ambiguous, mnemonics are everything.

Now I propose to consider the use of Numpy.

Numpy

Numpy is the fundamental Python library for scientific computing. Like Vanilla Python, it offers two implementations: one copies the source array, the other changes it in place:

  • my_array.sort() sorts the array in place and returns None;
  • np.sort(my_array) returns a sorted copy of the array, without changing the original data.

Arguments used optionally:

  • axis: int, optional – the axis along which to sort. The default is -1, sorting along the last axis.
  • kind: {'quicksort', 'mergesort', 'heapsort', 'stable'} – the sorting algorithm. The default is 'quicksort'. More on this below.
  • order: str or list of str – when a is an array with fields defined, this argument specifies which fields to compare first, second, and so on. A single field can be specified as a string, and not all fields need be specified; unspecified fields will still be used, in the order they appear in the dtype, to break ties.
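As a sketch of these arguments in action (array values chosen arbitrarily):

```python
import numpy as np

arr = np.array([3, 1, 2])

# np.sort returns a sorted copy; the original array is unchanged.
out = np.sort(arr, kind="stable")
assert out.tolist() == [1, 2, 3]
assert arr.tolist() == [3, 1, 2]

# ndarray.sort sorts in place and returns None.
arr.sort()
assert arr.tolist() == [1, 2, 3]

# axis selects which axis to sort along; -1 (the default) is the last one.
m = np.array([[3, 1], [2, 4]])
assert np.sort(m, axis=0).tolist() == [[2, 1], [3, 4]]   # sort each column
assert np.sort(m, axis=-1).tolist() == [[1, 3], [2, 4]]  # sort each row
```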

Frankly, at the moment the sorting algorithms do not really match their names. Passing kind=quicksort actually means that sorting begins with introsort. More details here.

If introsort is not making sufficient progress, it switches to heapsort. This combination keeps the worst case at O(n * log(n)).

'stable' automatically chooses the best stable sorting algorithm for the data type being sorted. It, along with 'mergesort', is currently mapped to timsort or radix sort, depending on the data type. API forward compatibility currently limits the ability to select the implementation, and it is hardwired for the different data types.

Timsort is added for better performance on already or nearly sorted data. On random data, timsort is almost identical to mergesort. It is now used for stable sorting, while quicksort is still the default; if none is chosen explicitly, 'mergesort' and 'stable' are mapped to radix sort for integer data types.
– from the Numpy documentation (after a couple of my edits)

One of the most important takeaways: Numpy lets you control sorting more freely than Vanilla Python. The second, no less important: the kind keyword does not necessarily correspond to the sorting algorithm actually used. And the final point, so to speak, is that mergesort and stable are stable sorts, while quicksort and heapsort are not.

Numpy is the only library on our list without a keyword for changing the sort order. Fortunately, descending order can be achieved by reversing the sorted array: my_arr[::-1].
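A quick sketch of the flip trick:

```python
import numpy as np

arr = np.array([3, 1, 2])

# Numpy has no descending keyword: sort ascending, then reverse the view.
descending = np.sort(arr)[::-1]
assert descending.tolist() == [3, 2, 1]
```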

All Numpy features are also available in much more user-friendly Pandas.

Pandas

You can sort a Pandas DataFrame with df.sort_values(by=my_column). There are several optional keywords:

  • by: str or list of str – the name or list of names to sort by. If axis is 0 or 'index', by may contain index levels and/or column labels. If axis is 1 or 'columns', by may contain column levels and/or index labels.
  • axis: {0 or 'index', 1 or 'columns'}, default 0 – the axis to sort along.
  • ascending: bool or list of bool, default True – sort in ascending or descending order. Specify a list for multiple sort orders; if it is a list of bools, it must match the length of the by argument.
  • inplace: bool, default False – if True, the operation is performed in place.
  • kind: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort' – the choice of sorting algorithm. For more information, see ndarray.np.sort. For DataFrames, this option applies only when sorting by a single column or label.
  • na_position: {'first', 'last'}, default 'last' – 'first' puts NaNs at the beginning, 'last' puts them at the end.
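A small sketch of these keywords with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [3.0, np.nan, 1.0], "name": ["b", "a", "c"]})

# Sort descending, keeping NaNs at the end (the default na_position).
out = df.sort_values(by="price", ascending=False, na_position="last")
assert out["price"].tolist()[:2] == [3.0, 1.0]
assert np.isnan(out["price"].tolist()[2])

# inplace=True modifies df itself and returns None.
assert df.sort_values(by="name", inplace=True) is None
assert df["name"].tolist() == ["a", "b", "c"]
```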

A Pandas Series is sorted with the same syntax. The by keyword is not needed for a Series, because it has only one column of data.

Since Numpy is under the hood of Pandas, you have the same optimized sorting options at your fingertips. However, they take a little longer to run in Pandas.

When sorting a single column, the default in Pandas, as in Numpy, is quicksort. As you remember, quicksort is now actually introsort, which switches to heapsort if the sorting process is slow. Pandas says multi-column sorting uses Numpy's mergesort, and Numpy's mergesort actually uses the Timsort or radix sort algorithms. These are stable sorting algorithms, which is necessary when sorting by multiple columns.
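To illustrate why stability matters here, a tiny multi-column sort (data invented):

```python
import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"],
                   "value": [2, 1, 1, 2]})

# With multiple columns, Pandas uses a stable sort under the hood,
# so rows with equal keys keep their relative order.
out = df.sort_values(by=["group", "value"])
assert out["group"].tolist() == ["a", "a", "b", "b"]
assert out["value"].tolist() == [1, 2, 1, 2]
```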

There are several key points for Pandas to remember:

  • The function name: sort_values().
  • You must declare by=column_name or a list of column names.
  • ascending is the keyword for reversing the order.
  • For a stable sort, use mergesort.

When exploring data, I often find myself counting and sorting values in a Pandas DataFrame with Series.value_counts(). Here is a code snippet that counts and sorts the most frequent values for each column.

for c in df.columns:
    print(f"---- {c} ---")
    print(df[c].value_counts().head())

Dask, the Pandas-like library for big data, does not yet have a parallel sorting implementation, although the issue is under discussion.

Pandas sorting is a good choice for preliminary exploration of small amounts of data. If you have a large amount of data and want to sort in parallel on the GPU, you should look at TensorFlow or PyTorch.

Tensorflow

TensorFlow is the most popular framework for deep learning. You can learn more about deep learning in my article here. The following information is relevant for the TensorFlow 2.0 GPU build.

tf.sort(my_tensor) returns a sorted copy of the tensor. Optional arguments:

  • axis: {int} The axis on which to sort. The default is -1 and sorts the last axis.
  • direction: {ascending or descending} – direction of sorting values.
  • name: {str} – name for the operation.

Under the hood, tf.sort uses top_k(). top_k() in turn uses the CUB library for CUDA GPUs, which makes parallel implementation easier. The documentation says that "CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model." So TensorFlow does its basic GPU sorting via CUB (discussion).

Information on TensorFlow GPU support can be found here. To enable GPU capabilities with TensorFlow 2.0, run !pip3 install tensorflow-gpu==2.0.0-beta1. As you will see below, you are better off with tensorflow==2.0.0-beta1 if you are only sorting data (which is unlikely).

To log whether operations run on the CPU or GPU, use the following line:

tf.debugging.set_log_device_placement(True)

To indicate that you want to use the GPU, you need to use this block:

with tf.device('/GPU:0'):
    %time tf.sort(my_tf_tensor)

To use the CPU: with tf.device('/CPU:0').

tf.sort() is quite an intuitive method to remember when working in TensorFlow. Just remember that direction=descending is what changes the sort order.

Pytorch

torch.sort(my_tensor) returns a sorted copy of the tensor. Optional arguments:

  • dim: {int} – the dimension along which to sort.
  • descending: {bool} – controls the sort order (ascending or descending).
  • out: {tuple} – the output tuple of (Tensor, LongTensor), which can be used as output buffers.

If you want to sort on the GPU, append .cuda() to your tensor.

gpu_tensor=my_pytorch_tensor.cuda()
%time torch.sort(gpu_tensor)

Research showed that PyTorch uses segmented parallel sorting via Thrust if you sort a dataset larger than 1 million rows by 100,000 columns.

Unfortunately, I ran out of memory when I tried to create 1.1 million by 100 thousand random data points with Numpy in Google Colab. I then tried GCP with 416 MB of RAM and again ran out of memory.

Segmented sort and locality sort are high-performance merge-sort variants that operate on non-uniform random data. Segmented sort allows many arrays of variable length to be sorted in parallel. – https://moderngpu.github.io/segsort.html

Thrust is a parallel algorithms library that provides performance portability between GPUs and multicore CPUs. It offers a sorting primitive that automatically selects the most efficient implementation. The CUB library used by TensorFlow plays a similar role. So PyTorch and TensorFlow use similar algorithms for GPU sorting, whichever route you take.

Like TensorFlow's, PyTorch's sorting method is pretty simple to remember: torch.sort(). The only catch is the keyword for the direction of the sorted values: TensorFlow uses direction, while PyTorch uses descending. And don't forget to use .cuda() to maximize speed when working with large amounts of data.

While GPU sorting may be a good option for very large datasets, it also makes sense to sort the data directly in SQL.

SQL

Sorting in SQL is usually very fast, especially if the sorting takes place directly in memory.

SQL is a specification that does not oblige implementations to use a specific algorithm. Postgres uses disk merge sort, heapsort, or quicksort depending on the circumstances. If you have enough memory, sorting can be done there, much faster. You can increase the memory available for sorting via the work_mem setting.

Other SQL implementations use different sorting algorithms. For example, Google BigQuery uses introsort with some tricks, as discussed in this Stack Overflow answer.

Sorting in SQL is done with the ORDER BY command. This syntax differs from Python, where some form of the word sort is preferred. Personally, I remember ORDER BY precisely because it is the odd one out.

To sort data in descending order, use the DESC keyword. Thus, a request to return data in alphabetical order from last to first will look like this:

SELECT Names FROM Customers
ORDER BY Names DESC;

Comparison

For each of the Python libraries above, I benchmarked sorting 1,000,000 data points in a single column, array, or list. I used a Google Colab Jupyter notebook with a K80 GPU and a 2.80 GHz Intel Xeon CPU.

image

Observations

  • PyTorch on the GPU is the fastest by far.
  • For both Numpy and Pandas, in-place sorting is generally faster than copying the data.
  • Pandas' default quicksort is quite fast.
  • Most Pandas functions are comparatively slower than their Numpy counterparts.
  • The TensorFlow CPU is quite fast. Installing the GPU build slows TensorFlow down even when sorting on the CPU, and GPU sorting itself is quite slow. This setup is not very effective.
  • In Vanilla Python, in-place sorting is surprisingly slow, almost 100 times slower than sorting with PyTorch GPU support. I repeated the experiment several times (with different data) to double-check that this is not an anomaly.

Again, this is just one small test, definitely not the final word.

To summarize

As a rule, you don't need to implement your own sorting algorithm. The ready-made options are good, and they often rely on more than one sorting method: they first evaluate the data and then apply a proven algorithm, and some even switch algorithms if sorting slows down.

In this article, you saw how to sort data in all the major corners of Python and in SQL. I hope this helps you in the future. If you have the opportunity, please share the article on your favorite social networks to help other people find it.

You just need to remember which option to choose and call it. Use my cheat sheet to save time. My general recommendations are as follows:

  • Use Pandas sort_values() by default for exploring relatively small datasets.
  • For large datasets, or when speed matters, try Numpy's built-in in-place sort, the parallel GPU implementations in PyTorch or TensorFlow, or SQL.

I haven't said much about how GPU sorting works internally. This area is ripe for further research and study guides. Here is a 2017 research article to give you an idea of the state of the art. More information on GPU sorting algorithms can be found here.
