15 Open Source Libraries to Improve Data Quality

The author of this article, a programmer and ML engineer, has compiled a list of open-source Python libraries that help you improve data quality, avoid wasted time, and simplify data analysis.


Profiling and evaluation

Exploratory Data Analysis

1. Pandas Profiling

Pandas Profiling generates a profiling report for a Pandas DataFrame.

Main functions:

  • Data profiling: missing and unique values, etc.

  • Data distributions and histograms.

  • Quantile and descriptive statistics: mean, standard deviation, Q1, etc.

  • Data type inference.

  • Data interactions and correlations.

  • Creating a report in HTML.
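A minimal sketch of the statistics such a report aggregates, written in plain pandas (column names here are hypothetical); Pandas Profiling itself builds the full HTML report from a single ProfileReport(df) call:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 32, None, 41], "city": ["NY", "NY", "LA", None]})

# The kind of per-column statistics a profiling report collects:
missing = df.isna().sum()      # missing values per column
unique = df.nunique()          # unique values per column
stats = df["age"].describe()   # mean, std, quartiles (Q1 = 25th percentile), etc.
```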

2. Great Expectations

Great Expectations is built around data assertions called “expectations”. It is an open standard for data quality that helps Data Science teams debug data pipelines by testing, documenting, and profiling data.

Main functions:

Declarative data tests on:

  • the expected number of rows in the table is from x to y;

  • the expected share of missing values will not exceed 20%;

  • expected date format in columns is MM-DD-YYYY;

  • additional out-of-the-box constructs: uniqueness, outliers, and other data characteristics;

  • custom, user-defined expectations.
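The declarative checks above can be approximated with plain pandas assertions (column names here are hypothetical); Great Expectations packages them as named expectations such as expect_table_row_count_to_be_between:

```python
import pandas as pd

df = pd.DataFrame({"signup_date": ["01-15-2022", "03-02-2022", None,
                                   "12-01-2021", "06-30-2022"]})

# Expected number of rows in the table is from x to y:
assert 1 <= len(df) <= 1000

# Expected share of missing values does not exceed 20%:
assert df["signup_date"].isna().mean() <= 0.20

# Expected date format in the column is MM-DD-YYYY:
parsed = pd.to_datetime(df["signup_date"].dropna(), format="%m-%d-%Y")
```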

Other features:

  • Automatic data profiling.

  • Visualization of tests in human-friendly forms and documents.

  • Integration with many tools and systems: Pandas, Jupyter Notebook, Spark, MySQL, Databricks, etc.

3. SodaSQL

SodaSQL is a command-line tool that runs SQL queries against your data based on user-defined input. Here is what it does:

  • Runs tests on different datasets in different data sources: Snowflake, PostgreSQL, Athena, etc., looking for invalid or missing data.

  • Collects metrics: minimum, maximum and average values, standard deviation and many other metrics.

Main functions:

  • Custom tests written in SQL.

  • Test definitions for each table in yml format.

  • Integration with data orchestration tools.

  • Connecting to and scanning datasets.

  • Column format definitions: email, date, phone numbers, etc.

  • Saving scan results in JSON.
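A scan definition in yml might look like this (the table and column names are hypothetical; treat the exact keys as an illustration of the Soda SQL scan format, not authoritative documentation):

```yaml
# tables/orders.yml — hypothetical scan definition for an "orders" table
table_name: orders
metrics:
  - row_count
  - missing_count
  - min
  - max
  - avg
  - stddev
tests:
  - row_count > 0
columns:
  email:
    valid_format: email
    tests:
      - invalid_percentage == 0
```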

Predictive Analytics

4. Ydata

Ydata evaluates data quality at different stages of a data pipeline's development. It helps build a holistic view of the data by examining it from several angles:

  • missing values;

  • data duplication;

  • outliers and data drift;

  • data relationships and data correlation.

The library integrates with Great Expectations, which runs data assertions that let you check and profile data and automatically generate reports.

5. DeepChecks

DeepChecks is a Python package that makes it easy to validate ML models and the data behind them across various tasks, such as checking model performance. DeepChecks also detects:

  • null values;

  • data duplication;

  • frequency changes;

  • special characters, etc.

The library also compares strings and detects inconsistencies between them. Its data integrity and bias detection checks come in handy when testing data.

When working with training data, test data, and production data, you can use the SingleDatasetIntegrity test suite or individual tests from other suites.

In DeepChecks, you can write your own tests and test suites and display the results neatly in a table or on a Plotly chart.

6. Evidently AI

Evidently AI is a tool for analyzing and monitoring ML models.

The library tracks:

  • Data distribution.

  • Data drift.

  • Model performance.

Evidently AI integrates with Grafana and Prometheus, so you can build a custom monitoring dashboard.
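Evidently computes rigorous statistical tests for drift; as a toy illustration only, the underlying idea boils down to comparing a reference sample against current production data (the values and threshold below are invented for the example):

```python
# Toy illustration of data drift: compare a reference window with current data.
# Real tools use proper statistical tests, not a fixed mean-difference threshold.
reference = [10.1, 9.8, 10.0, 10.2, 9.9]   # feature values at training time
current = [12.0, 12.3, 11.8, 12.1, 12.2]   # feature values in production

def mean(xs):
    return sum(xs) / len(xs)

shift = abs(mean(current) - mean(reference))
drift_detected = shift > 1.0  # hypothetical threshold
```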

7. Alibi Detect

Alibi Detect is a specialized ML library for detecting outliers, adversarial instances, and data drift.

Main functions:

  • Detect drift and outliers in tabular data, text, images, and time series.

  • Detection with pretrained or custom-trained detectors.

  • Support for TensorFlow and PyTorch backends for drift detection.

Cleaning and formatting data

1. Scrubadub

Scrubadub detects and removes personal information from any text: names, phone numbers, addresses, credit card numbers, etc. You can also implement your own detectors:

import scrubadub

text = "My cat can be contacted on example@example.com, or 1800 555-5555"
scrubadub.clean(text)
>> 'My cat can be contacted on {{EMAIL}}, or {{PHONE}}'

2. Arrow

Arrow implements a sensible, human-friendly approach to creating, manipulating, formatting, and converting dates, times, and timestamps:

import arrow

utc = arrow.utcnow()
local = utc.to('US/Pacific')
past = local.dehumanize("2 days ago")
print(past)
>> 2022-01-09T10:11:11.939887+00:00
print(past.humanize(locale="ar"))  
>> 'منذ يومين'

3. Beautifier

Beautifier is a simple library for cleaning up URLs and email addresses. It allows you to:

  • Check whether an email address is valid.

  • Parse email addresses into domain and username.

  • Parse URLs into domains and parameters.

  • Clean URLs of Unicode characters, special characters, and unnecessary redirect patterns.
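The URL side of this can be sketched with the standard library's urllib.parse (the URL below is an invented example); Beautifier layers email handling and cleanup on top of this kind of parsing:

```python
from urllib.parse import urlparse, parse_qs

url = "https://example.com/path?utm_source=news&id=42"
parsed = urlparse(url)

domain = parsed.netloc           # the URL's domain
params = parse_qs(parsed.query)  # query parameters as a dict of lists
```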

4. Ftfy

ftfy stands for “fixes text for you”. Here is what it does:

  • Fixing Unicode text that was mangled by encoding mix-ups (mojibake).

  • Removing line breaks.

  • Convert HTML entities to plain text.

  • Flagging text that was likely distorted by an incorrect encoding.

  • Explanations of what happened to the text.

import ftfy

ftfy.fix_text('The Mona Lisa doesn’t have eyebrows.')
>> "The Mona Lisa doesn't have eyebrows."

5. Dora

Dora is an exploratory data analysis toolkit for Python.

Main functions:

  • Cleaning data: removing null values, converting categorical data to ordinal, transforming columns, and deleting columns.

  • Selection and extraction of features.

  • Plotting features.

  • Splitting data for model validation.

  • Data transformations with their versioning.

Many features, including graphs, require data to be numeric.

6. DataCleaner

DataCleaner automatically cleans datasets and prepares them for analysis.

Main functions:

  • Removing rows with missing values.

  • Replacing missing values.

  • Encoding non-numeric variables.

  • Working with Pandas DataFrames.

  • Working both in scripts and from the command line.
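A rough pandas sketch of the kind of preparation DataCleaner automates (column names are hypothetical; the library itself wraps this into a single autoclean call):

```python
import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 41.0], "city": ["NY", "LA", None]})

df["age"] = df["age"].fillna(df["age"].median())      # replace missing numeric values
df["city"] = df["city"].fillna(df["city"].mode()[0])  # replace missing categorical values
df["city"] = df["city"].astype("category").cat.codes  # encode non-numeric variables
```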

Table preview

1. Tabulate

A single function call to Tabulate prints small, nicely formatted tables.

2. PrettyPandas

PrettyPandas is a tool with a simple API for producing presentable tabular reports from Pandas DataFrames.

That’s all for today.