10 Python Libraries for Machine Learning – A Beginner's Guide

We have compiled a list of the most important Python libraries for machine learning and explain which tasks they are useful for to beginner ML engineers and Data Science specialists. The selection was put together by Kirill Simonov, an ML developer at IRLIX specializing in computer vision.

NumPy

NumPy is a library for mathematical computation, linear algebra, and statistical methods. It is used well beyond ML, and most machine learning libraries rely on its capabilities.

NumPy provides many functions for working with numeric data as N-dimensional arrays. This format is more convenient and faster to process than Python's built-in data structures such as lists and tuples, so using arrays improves computational performance.

How is it used in ML

For convenience, tabular data, texts, images, and sounds are represented in numerical form, more precisely, as special structures: matrices, vectors, and tensors. NumPy is needed to work with these structures: to modify and transform them and to perform complex mathematical operations on them.

Most often, NumPy is used together with other libraries, such as TensorFlow, PyTorch, Pandas. It also underlies other ML tools. In projects, NumPy is often used as a link between different stages of data processing.

Example of a solution to a problem

A simple example of working with NumPy is mathematical operations on two-dimensional matrices. The code below computes the product of two matrices, one of the basic operations used in ML and in building neural networks.

NumPy's dot operation performs the matrix product of two matrices.
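The code screenshot from the original article is not reproduced here, so below is a minimal sketch of the operation; the matrix values are illustrative.

```python
import numpy as np

# Two 2x2 matrices (illustrative values)
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# On two 2-D arrays, np.dot performs matrix multiplication
product = np.dot(a, b)
print(product)
```

The same result can be written as `a @ b`, the matrix-multiplication operator added in Python 3.5.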

Pandas

Pandas is primarily a data analysis library, but it is often used to pre-process information before loading it into an ML model. With Pandas, you can read data, filter it, bring it to a uniform form, collect it into consistent structures, and prepare it for loading into a model.

How is it used in ML

Pandas is used for preliminary preparation and processing of information. The library allows you to:

  • quickly read data from various sources, from Excel documents to relational databases, where information is stored as linked tables;

  • index, join, and combine data;

  • examine, clean and transform information for loading into a model;

  • manipulate complex data structures using a few lines of code;

  • work with time series, etc.

Example of a solution to a problem

Loading and processing data with the library is relatively simple. In the code below, Pandas imports data from a CSV file and then filters and displays information only about employees with a salary above 50,000.

Only two Pandas commands are needed: one to load the CSV and one to filter the data.
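The original screenshot is unavailable, so here is a minimal sketch of that workflow; the column names and an in-memory CSV sample are illustrative stand-ins for the file used in the article.

```python
import io

import pandas as pd

# A small in-memory CSV stands in for the file from the original example
csv_data = io.StringIO(
    "name,salary\n"
    "Alice,62000\n"
    "Bob,48000\n"
    "Carol,75000\n"
)

df = pd.read_csv(csv_data)            # one command to load the CSV
high_paid = df[df["salary"] > 50000]  # one command to filter
print(high_paid)
```

With a real file you would pass a path to `pd.read_csv("employees.csv")` instead of the `StringIO` object.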

In addition to loading from a file, data can be taken from Python structures themselves – for example, lists, dictionaries, or multidimensional NumPy arrays.

Creating a Pandas DataFrame from a list of Python dictionaries.
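A sketch of that second approach, with made-up records in place of the figure's data:

```python
import pandas as pd

# Each dictionary becomes one row; keys become column names
employees = [
    {"name": "Alice", "department": "ML", "salary": 62000},
    {"name": "Bob", "department": "QA", "salary": 48000},
]

df = pd.DataFrame(employees)
print(df)
```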

Scikit-learn

The library is built on NumPy and SciPy and is used in data analysis and in traditional machine learning that does not involve neural networks. Scikit-learn contains algorithms and tools for building models, processing and classifying data, and evaluating results.

How is it used in ML

Scikit-learn is rarely used in large projects because its computations are hard to optimize for big data. But it is well suited to quickly testing hypotheses about data or about ways to solve a problem: the library contains a huge number of ML algorithms, and you can try out an idea in a few lines.

The library is also often recommended to beginners: it is quite easy to use and has clear, detailed documentation. It is a good option for those who want to quickly combine the theoretical basis of ML with practice.

Example of a solution to a problem

Scikit-learn has many ready-made functions for the main ML algorithms. For example, you can build a linear regression model, load data into it, and test a hypothesis in three lines of code, since each of these actions has its own command.

A simple implementation of linear regression using Scikit-learn. The code returns the predicted value of a new point on the line fitted to the coordinates from the dataset.
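Since the original code image is missing, here is a minimal sketch with made-up points lying on the line y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: points on the line y = 2x + 1
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression().fit(X, y)  # build and fit the model
print(model.predict([[5]]))           # predicted y for x = 5
```

The fitted slope and intercept are available as `model.coef_` and `model.intercept_`.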

XGBoost / LightGBM / CatBoost

We have combined three machine learning libraries into a single entry because they solve the same problems with the same tool: gradient boosting. This is a machine learning technique that builds a sequence of ML models, each correcting the errors of the previous ones, to achieve a result. The process is resource-intensive, so it is important that the algorithms work quickly and efficiently.

The three libraries implement gradient boosting slightly differently. Each has its own features:

  • XGBoost is considered one of the fastest and most performant ML libraries. It appeared earlier than the others, is still considered the default choice, and is used in high-load systems, among other things;

  • LightGBM is known for its high speed and for its distinctive way of growing the decision trees that underlie boosting. On some tasks it gives more accurate results on large volumes of data;

  • CatBoost provides high accuracy when working with categorical data, that is, data that can take on a limited number of values.

How are they used in ML

Before the rise of neural network technologies, these libraries were considered state of the art (SOTA), that is, the best tools in their field (classification and regression). They continue to be used in production, i.e., commercial development, where they still successfully compete with neural networks.

Example of a solution to a problem

This is what a simple model for a classification problem looks like when created with XGBoost. The example uses the standard Pima Indians Diabetes test dataset, which contains information about different groups of patients and their risk of diabetes. The model determines how likely a new sample is to belong to a particular class: whether a person has diabetes or not.

This is an implementation of the XGBClassifier model from XGBoost. The code uses NumPy and Scikit-learn to prepare the data.

PyTorch

PyTorch is a library for artificial intelligence and neural networks. It can build classic neural network architectures from ready-made blocks and, if necessary, solve lower-level problems such as optimizing calculations on the GPU.

How is it used in ML

The library is considered the foundation for an entire ecosystem of tasks: computer vision, natural language processing, and training robots and agents in virtual and real environments.

Example of a solution to a problem

PyTorch makes it easy to create simple neural network models. For example, the code below describes a neural network with two linear layers, with an activation function for each layer. This network classifies input data into one category or another.

Creating the model takes just five lines of code.
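The original figure is unavailable, so here is a minimal sketch of such a network; the layer sizes are illustrative.

```python
import torch
from torch import nn

# Two linear layers, each followed by an activation function
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
    nn.Sigmoid(),
)

x = torch.randn(1, 4)  # one sample with 4 features
print(model(x))        # two class scores, each in [0, 1]
```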

More complex networks are modeled using classes. Within a class, you can describe the main structural parts of a neural network, and then create a model – an object of this class.

This network also consists of two layers with a sigmoid activation function.
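A sketch of the class-based style, again with illustrative layer sizes in place of the missing figure:

```python
import torch
from torch import nn

class SimpleNet(nn.Module):
    """A two-layer network with a sigmoid activation after each layer."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 1)
        self.act = nn.Sigmoid()

    def forward(self, x):
        # Describe how data flows through the structural parts
        x = self.act(self.fc1(x))
        return self.act(self.fc2(x))

# The model is an object of the class
model = SimpleNet()
print(model(torch.randn(1, 4)))
```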

TensorFlow

This is one of the most famous Python libraries for neural networks. It represents data as multidimensional arrays and operations on them as computation graphs; historically these graphs were built before the program runs, although recent versions execute operations eagerly by default. TensorFlow implements many methods for creating, deploying, training, and running neural networks and ML models.

How is it used in ML

TensorFlow can be used in various ML areas, but it is most often used to build neural networks and deep learning models. The library is used to classify images, texts, and sounds, as well as for NLP — natural language processing. The latest versions of TensorFlow include the Keras library by default, which allows you to create the same models with less code.

In addition, the TensorFlow library is adapted to different types of computing platforms, including mobile ones, so it is used in cross-platform development of ML solutions.

The library is often compared to PyTorch. Some experts believe the latter is better suited to academic problems and TensorFlow to production, although both libraries are used in both areas. PyTorch is considered slightly easier to learn than TensorFlow, but experts usually recommend trying both and choosing the one you like more.

Example of a solution to a problem

TensorFlow is valued for its high level of abstraction: you can write one clear command instead of describing the technical details of the implementation. This lets the ML engineer focus on the logic of the model rather than on small details. For example, the code below creates a neural network that recognizes images from MNIST, a classic dataset of handwritten digit samples.

It is enough to describe each layer of the network and set its activation function; there is no need to write out all the details from scratch.
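The original code image is missing; a minimal sketch of such a model might look like this, with illustrative layer sizes (MNIST images are 28x28 grayscale digits in 10 classes).

```python
import tensorflow as tf

# Each layer is one line: describe it and set its activation function
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),                   # one MNIST image
    tf.keras.layers.Flatten(),                        # 28x28 -> 784 values
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one score per digit
])

model.summary()
```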

Then all that remains is to compile and train the model on the prepared data. This also takes just a couple of commands with the necessary parameters substituted in. You need to specify the number of epochs, that is, complete passes of the dataset through the model, as well as the batch size, the size of the groups into which the dataset is divided.

In this example, the dataset is divided into groups of 128 elements and passed through the neural network 15 times. Depending on the task, the values may differ.
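As the original code is not shown, here is a sketch of the compile-and-train step. Random arrays stand in for MNIST to keep the example self-contained; with the real dataset you would pass the loaded training images and labels instead.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# One command to compile: optimizer, loss, and metrics
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random stand-in data; replace with the real MNIST images and labels
x_train = np.random.rand(256, 28, 28)
y_train = np.random.randint(0, 10, size=256)

# One command to train: groups of 128 elements, 15 full passes
history = model.fit(x_train, y_train, batch_size=128, epochs=15, verbose=0)
```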

NLTK

The NLTK library for Python specializes in a specific area of ML: natural language processing. You cannot build your own models with NLTK, but it ships with many pre-implemented ones, including neural networks. In addition, the library contains a large number of functions for text processing.

How is it used in ML

NLTK is used for fast and flexible work with text: in natural language processing, computational linguistics and related fields.

The library is used for primary analysis of texts, for example, to classify them by topic or determine their sentiment. It also filters data and converts it into the required format. In addition, the library provides access to huge text corpora for training models.

Example of a solution to a problem

Using NLTK, you can remove stop words from a text in a few lines. The code below loads a list of stop words for the Russian language, then splits the input text into words and filters out those that appear on the stop list.

Splitting the text into words takes a single command, and another command retrieves the stop words for the Russian language. A loop then filters the words of the given text.

OpenCV

Let's finish the selection with OpenCV, the most famous Python library for computer vision. It includes functions for building models, image processing tools, object detection and extraction, and much more.

How is it used in ML

OpenCV is used in computer vision – this library is suitable for all areas of this field. For example, it can be used to process photos and videos, create face recognition systems and segment vessels in medical scans. The library also underlies many algorithms for robotics – its capabilities help robots “see” the world around them.

With OpenCV, you can transform images, filter their elements and cut off unnecessary ones, detect and extract objects with specified properties. In addition to working with classic “flat” two-dimensional images, there is a separate set of functions for calibrating the camera and working with 3D objects.

Example of a solution to a problem

In the code below, OpenCV splits an image into its H, S, and V color channels in one line. The rest of the code is visualization using the Matplotlib library.

Splitting an image into channels is sometimes necessary for processing and subsequent analysis.

This is what the result will look like. You can work with each channel separately, for example, modify the color intensity, change shades, etc. And this is only a small part of the capabilities of OpenCV.

If needed, each of these channels can be used to analyze or modify the image.

There are many more tools for ML, from libraries for building models to visualization. But this selection will help you take your first steps in mastering machine learning and Python.

Expert Advice: How to Learn Applied Machine Learning

  1. Start by representing your data using NumPy and doing some initial analysis in Pandas.

  2. Then test hypotheses from a diverse pool of algorithms implemented in Sklearn.

  3. If you need classic ML approaches close to SOTA on industrial data (or want to win a Kaggle competition), try one of three libraries: XGBoost, LightGBM, or CatBoost. Or move on to neural network architectures in PyTorch or TensorFlow.

  4. After that, you can choose a specific specialization, such as image analysis with OpenCV or text analysis in NLTK. And continue to study various algorithms, approaches, theories, and libraries that implement them.


Skillfactory and NRNU MEPhI have created a Master's program for those who want to master Data Science and ML at an advanced level. Students learn to create intelligent models for various fields, from IT and finance to science and medicine, to train them, and to put them into production. They master the fundamentals of mathematics and Python programming and can also work on real ML cases at the program's partner IT companies.
