When not to use lists in Python

Translation of the article prepared in advance of the start of the basic course “Python Developer”.


In Python, the list (list) It is so flexible that it can be used almost everywhere in projects and store various types of data in it: integers, strings, and instances of custom classes. In addition, the list is mutable, which allows us to add or remove items as needed. For the above reasons, some programmers tend to use lists too often and not even consider viable alternatives.

In this article, I would like to highlight five use cases in which you can find a better alternative to lists.

1. Data immunity and hashing – tuples

We do not always need the ability to change the data container. Sometimes, on the contrary, we do not want this. In this case, we need an immutable container for storing our data. Suppose a user selects 4 digits as a password in our application. Theoretically, we can use an object of type list to store these numbers. However, sometimes unforeseen circumstances occur and the password is changed in any way. Consider the following code snippet.

>>> code_list_saved = [5, 9, 0, 3]
>>> code_list_entered = [5, 9, 0, 3]
>>> code_list_saved == code_list_entered
True
>>> code_list_saved[2] = 7
>>> code_list_saved == code_list_entered
False

Initially, these two lists were regarded as identical, and the user could use this combination to unlock protected information. However, now we change one digit in the saved password, and the user loses access. Thus, it is logical that we want the saved password to be unchangeable.

In this case, it is worth considering the use of tuples. As you know, they are immutable objects in Python, which means that the values ​​in them cannot be changed after creation. An alternative implementation using tuples is shown below.

>>> code_tuple_saved = (5, 9, 0, 3)
>>> code_tuple_entered = (5, 9, 0, 3)
>>> code_tuple_saved == code_tuple_entered
True
>>> code_tuple_saved[2] = 7
Traceback (most recent call last):
  File "", line 1, in 
TypeError: 'tuple' object does not support item assignment

As you can see from the code above, the saved password is now stored in a tuple. Attempting to change one of the numbers causes an error TypeError, which will prevent any inadvertent data modification. In addition, as a simple security measure, a tuple can be hashed directly in Python. By storing passwords in an object like tuple, you can get a hash value to represent it, which will make your application a little more difficult to crack. Look at the simplified implementation:

>>> code_picked = (5, 9, 0, 3)
>>> stored_code = hash(code_picked)
>>> -8016007988208603090
>>>
>>> code_attempt1 = (9, 5, 0, 3)
>>> hashed_attempt1 = hash(code_attempted)
>>> code_attempt2 = (5, 9, 0, 3)
>>> hashed_attempt2 = hash(code_attempt2)
>>> stored_code == hashed_attempt1
False
>>> stored_code == hashed_attempt2
True
>>> code_picked_list = [5, 9, 0, 3]
>>> hash(code_picked_list)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: unhashable type: 'list'

2. Verification of ownership – sets

Depending on the business tasks of your application, it often happens that you need to check if there is any specific search element in the container. Of course, lists have all the necessary methods for checking ownership. As shown in the code below, we can simply use the keyword in to perform a check that returns a boolean:

>>> integers = [1, 2, 3, 4, 5]
>>> 5 in integers
True
>>> 9 in integers
False

However, if the mechanics of membership verification is common enough in your application, then you should consider using sets instead of lists. Sets are another important type of container in Python. A distinctive feature of the set is that all the elements in it are unique and hashed. The hash requirement is met since Python implements hashes in the form of hash tables under the hood. One of the most significant advantages of using hash tables is the implementation of a constant-time mechanism for searching for specific elements.

Let’s make a simple comparison and see how many sets will bypass lists in performance when it comes to millions of items in containers:

>>> # Import needed modules
>>> from random import randint
>>> from timeit import timeit
>>> 
>>> # Declare a function to measure the time for membership testing
>>> def time_membership_testing(n):
...      integers_list = list(range(n))
...      integers_set = set(range(n))
...      t_list = timeit(lambda : randint(0, 2*n) in integers_list, number=10000)
...      t_set = timeit(lambda : randint(0, 2*n) in integers_set, number=10000)
...      return f"{n: <9} list: {t_list:.4} | set: {t_set:.4}"
... 
>>> numbers = (100, 1000, 10000, 100000)
>>> for number in numbers:
...     print(time_membership_testing(number))
...
100       list: 0.02304 | set: 0.01333
1000      list: 0.1042 | set: 0.01309
10000     list: 0.9028 | set: 0.01713
100000    list: 8.867 | set: 0.01932

As you can see, with an increase in the number of elements in the list, the time for checking for ownership increases linearly. However, this cannot be said about the set where the time for checking ownership remains constant, and more importantly, the necessary time is much less than in the case of lists.

3. Search for values ​​- dictionaries

By analogy with the sets we talked about above, another built-in data type is dict, requires its keys to be hashed.

Key hashing means that the data stored in dictionaries is implemented under the hood as hash tables. Like other popular languages ​​(for example, Java, Swift, Kotlin), Python works with dictionaries using hash tables. However, unlike sets, dictionaries store key-value pairs, and the requirement for hashability of keys is the foundation for constructing hash tables.

When using the hash mechanism, the time required to extract a particular key-value pair remains constant with time complexity O (1) – in notation notation Big o. Time complexity O (1) means that no matter how many elements are in the dictionary, the extraction time for a particular element will always remain the same value. The comparison you see below:

>>> # Import needed modules
>>> from random import randint
>>> from timeit import timeit
>>> 
>>> # Declare a function to measure the time for value retrieval
>>> def time_value_retrieval_testing(n):
...      id_list = list(range(n))
...      score_list = list(range(n))
...      scores_dict = {x: x for x in range(n)}
...      t_list = timeit(lambda : score_list[id_list.index(randint(0, n-1))], number=10000)
...      t_dict = timeit(lambda : scores_dict[randint(0, n-1)], number=10000)
...      return f"{n: <9} list: {t_list:.4} | dict: {t_dict:.4}"
... 
>>> numbers = (100, 1000, 10000, 100000)
>>> for number in numbers:
...     print(time_value_retrieval_testing(number))
... 
100       list: 0.02423 | dict: 0.01309
1000      list: 0.07968 | dict: 0.01322
10000     list: 0.625 | dict: 0.01565
100000    list: 6.223 | dict: 0.01583

Suppose we need to keep the grades of a group of students. If you use a type list, then we will have one list that will store student card numbers, and the second – grades, and the items in these lists will be arranged in accordance with each other. To find out the grade of a particular student, you will need to find out the student’s index from the list of student ID numbers, and then get the grade from the second list by this index. In the dictionary approach, we’ll just keep pairs student_id-scorewhere student card number will be the key. As we saw above, in this case, the priority is to use a dictionary rather than a list, especially when the number of entries is large.

However, it should be noted that since dictionaries store key-value pairs, then your data model should contain meaningful information that could help identify the values. In addition, the keys in the dictionary must be unique, but the values ​​can be the same.

4. First-in-first-out – two-way queue

There are situations when you often need to add and remove elements from the end of the sequence. That is, the processing order of the FIFO elements (First-in-first-out) is implied. In other words, the first item added will be processed first. On lists, we can implement this construction using the function pop (0). However, its implementation requires a lot of time, since the elements must move, which in complexity will correspond O (n).

Data type dequeon the contrary, it is a two-way queue that is designed to quickly add and remove elements from both ends of a sequence. To perform a FIFO operation, Python can directly remove an element at the beginning of a queue without having to shift all elements. So the FIFO operation will be completed quickly. Refer to the comparison below:

>>> # Import needed modules
>>> from collections import deque
>>> from timeit import timeit
>>> 
>>> # Declare a function to measure the time for FIFO
>>> def time_FIFO_testing(n):
...     integer_list = list(range(n))
...     integer_deque = deque(range(n))
...     t_list = timeit(lambda : integer_list.pop(0), number=n)
...     t_deque = timeit(lambda : integer_deque.popleft(), number=n)
...     return f"{n: <9} list: {t_list:.4} | deque: {t_deque:.4}"
... 
>>> numbers = (100, 1000, 10000, 100000)
>>> for number in numbers:
...     print(time_FIFO_testing(number))
... 
100       list: 3.41e-05 | deque: 1.645e-05
1000      list: 0.0004852 | deque: 0.0003466
10000     list: 0.01762 | deque: 0.002618
100000    list: 2.059 | deque: 0.02067

5. Large volumes of tabular data – arrays

Python over time has become widely used in the field of Data Science for processing, analysis and modeling of data. One reason for the rapidly growing popularity is the development of various open source packages. It is important to note that there are implemented custom classes for large data sets. Therefore, instead of lists, we should consider alternatives specifically designed for computing tasks.

For example, if you need to process a large amount of numerical data, you might consider using arrays Numpywhich are the main data type implemented in the package Numpy. If you need to work with structured data where data types are mixed (for example, strings, dates and numbers), you can pay attention to DataFrame from Pandas, which are also the main data type for the package Pandas. If you are in machine learning, you should definitely go deeper into tensorswhich are considered the most important data types in popular machine learning frameworks such as Tensorflow and Pytorch.

On the Internet you can find many tutorials and detailed descriptions of these data structures, since talking about them is somehow beyond the scope of the current article. My opinion is that if you are working with a lot of data, you should definitely pay attention to these alternatives. They have implementations at a lower level, which will help optimize a large number of operations.

Conclusion

Today we looked at five alternatives to lists in Python.
We must remember that we should not be limited to the use of lists, since Python and its packages provide us with different data structures that can offer more optimal solutions for specific business problems.

My advice is to be open to new things and make sure you know your capabilities.


Learn more about the basic course “Python Developer”.


Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *