Four functions for fast work with Big Data
I often use helper functions when working with big data: they make the work easier and faster. Some I found on the Internet, others I wrote myself. Today I want to share four of them; maybe they will be useful to someone.
Fast reading of large files
One of the first tools an analyst or Data Scientist comes across is Pandas, a Python library for data processing and analysis. With its help we import and sort data, make selections, and look for relationships. For example, to read a file with Pandas in Python, we write:
import pandas as pd
data = pd.read_csv('data_file.csv')
This approach is simple and understandable; every Data Scientist or analyst knows it. But what if there is a lot of data? Say 100,000,000 rows that are constantly changing, deadlines are looming, and there are 100 more hypotheses to test before lunch?
Let’s take the diabetes research dataset from Kaggle and repeat each of its rows 100,000 times to create a test dataset. The result is 76,800,000 rows.
The original dataset looks like this:
|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| … | … | … | … | … | … | … | … | … | … |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
We transform it for our experimental tasks:
# Read the dataset from the file
df = pd.read_csv('diabetes.csv')
# Create the test dataset by repeating every row 100,000 times
df = df.loc[df.index.repeat(100000)]
# Save it to a file for the experiments
df.to_csv('benchmark.csv')
print(len(df), 'rows')
76800000 rows
The transformation itself takes a while, but let’s hope it is time well spent. How long will it take to read such a file?
df = pd.read_csv('benchmark.csv')
The read takes several minutes. If we reload the modified dataset for every hypothesis, our 100 hypotheses will take about 13 hours. That cannot be done before lunch.
We want to load data faster without losing all the benefits that Pandas provides. A simple function using datatable will help us do this:
import datatable as dt
import pandas as pd

def read_fast_csv(f):
    # Read the CSV with datatable's fast multi-threaded fread
    frame = dt.fread(f)
    # Convert the datatable Frame into a regular Pandas DataFrame
    ds = frame.to_pandas()
    return ds
Now let’s try to read our big dataset:
ds = read_fast_csv('benchmark.csv')
The read is about six times faster, so we’ll make it to lunch!
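To reproduce the comparison on your own machine, a minimal timing sketch like the one below is enough (it assumes the benchmark.csv created above and the read_fast_csv function defined earlier); the exact numbers will depend on your hardware and library versions.

import time
import pandas as pd

# Time the plain Pandas read
start = time.perf_counter()
df_pandas = pd.read_csv('benchmark.csv')
print('pd.read_csv:   ', round(time.perf_counter() - start, 1), 's')

# Time the datatable-based helper
start = time.perf_counter()
df_fast = read_fast_csv('benchmark.csv')
print('read_fast_csv: ', round(time.perf_counter() - start, 1), 's')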
Quick introduction to the dataset
The next function displays several characteristics of a dataset so you can quickly get acquainted with its contents, in particular:
- Dataset size
- Information about duplicates
- Percentage of missing values
- Number of unique values
- The first five rows of the dataset
def brief_df(df):
    # Count missing and non-missing values per column
    rows_na = df.isna().sum().reset_index().rename(columns={0: "valuesNa"})
    rows_notna = df.notna().sum().reset_index().rename(columns={0: "valuesNotNa"})
    rows_analysis = pd.merge(rows_na, rows_notna, on="index", how="outer")
    rows_analysis["completeRatio"] = round((rows_analysis["valuesNotNa"]) / (rows_analysis["valuesNotNa"] + rows_analysis["valuesNa"]) * 100, 2)
    cardinality = df.nunique().reset_index().rename(columns={0: "cardinality"})
    rows_analysis = pd.merge(rows_analysis, cardinality)
    # Dataset size and number of duplicates
    print("Size:", df.shape)
    dup_raw = df.duplicated().sum()
    dup_per = round((dup_raw * 100) / df.shape[0], 2)
    print("Duplicates:", dup_raw, "->", dup_per, "%")
    # Missing values statistics
    print("Missing values check")
    display(rows_analysis)
    # The first five rows
    print("First five rows")
    display(df.head())
Let’s try to analyze our large dataset that we created in the first section:
brief_df(df)
Size: (76800000, 10)
Duplicates: 76799232 -> 100.0%
Missing values check
|   | index | valuesNa | valuesNotNa | completeRatio | cardinality |
|---|---|---|---|---|---|
| 0 | C0 | 0 | 76800000 | 100.0 | 768 |
| 1 | Pregnancies | 0 | 76800000 | 100.0 | 17 |
| 2 | Glucose | 0 | 76800000 | 100.0 | 136 |
| 3 | BloodPressure | 0 | 76800000 | 100.0 | 47 |
| 4 | SkinThickness | 0 | 76800000 | 100.0 | 51 |
| 5 | Insulin | 0 | 76800000 | 100.0 | 186 |
| 6 | BMI | 0 | 76800000 | 100.0 | 248 |
| 7 | DiabetesPedigreeFunction | 0 | 76800000 | 100.0 | 517 |
| 8 | Age | 0 | 76800000 | 100.0 | 52 |
| 9 | Outcome | 0 | 76800000 | 100.0 | 2 |
First five rows
|   | C0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True |
| 1 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True |
| 2 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True |
| 3 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True |
| 4 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | True |
Yes, we really do have 100% duplicates, because we repeated every row 100,000 times. Everything checks out.
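If a real dataset showed a duplicate share like this unintentionally, the cleanup is a one-liner in standard Pandas; a quick sanity check on our synthetic file might look like this:

# Keep only unique rows; for this synthetic file we get back the original 768
df_unique = df.drop_duplicates()
print(len(df_unique), 'unique rows')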
Columns with dates in different formats
To analyze and visualize data, you often need dates in several formats even though the source provides only one. It is even worse if the dates are written haphazardly. You have to bring all the dates to a standard form and then programmatically extract different pieces of information from each one. I wrote a function that greatly simplifies this process.
First, let’s simulate the situation and add a column of random dates between 2020 and 2022 to the original dataset:
import numpy as np

# Read the dataset from the file
df = pd.read_csv('diabetes.csv')
# Add a column of random dates
df['date'] = np.random.choice(pd.date_range('2020-01-01', '2022-12-31'), 768)
df.to_csv('diabetes_dates.csv')
Below is a function that adds columns with the most commonly needed formats to an existing dataset, so that you can make selections by year, quarter, month, or week, or take a fully formatted date for visualization.
import datetime
from datetime import timedelta

def granular_dates(df, col):
    # Normalized date (time part stripped)
    df['ts_date'] = pd.to_datetime(df[col]).dt.normalize()
    # Full date with the month spelled out
    df['ts_date_str'] = df["ts_date"].dt.strftime("%d %B %Y")
    # Short date with the abbreviated month
    df['ts_date_str_short'] = df["ts_date"].dt.strftime("%d %b %Y")
    # Year only
    df['ts_year'] = df.ts_date.dt.year
    # Month number only
    df['ts_month'] = df.ts_date.dt.month
    # Day of month only
    df['ts_day'] = df.ts_date.dt.day
    # Year and quarter
    df['ts_quarter'] = pd.PeriodIndex(df.ts_date, freq="Q").astype(str)
    # Day-of-week number
    df['ts_dayweek'] = df.ts_date.dt.dayofweek
    # Day-of-week name
    df['ts_dayweektext'] = df.ts_date.dt.strftime("%a")
    # Week start date (Monday)
    df['ts_week_start'] = df.ts_date.apply(lambda x: x - timedelta(days=x.weekday())).dt.strftime("%b-%d")
    # Week end date (Sunday)
    df['ts_week_end'] = df.ts_date.apply(lambda x: x - timedelta(days=x.weekday()) + timedelta(days=6)).dt.strftime("%b-%d")
Now it takes just one line of code (not counting reading the file):
# Read the dataset
df = pd.read_csv('diabetes_dates.csv')
# Add the date columns in different formats
granular_dates(df, 'date')
As a result, the following columns are added to the dataset:
| ts_date | ts_date_str | ts_date_str_short | ts_year | ts_month | ts_day | ts_quarter | ts_dayweek | ts_dayweektext | ts_week_start | ts_week_end |
|---|---|---|---|---|---|---|---|---|---|---|
| 2022-06-13 | 13 June 2022 | 13 Jun 2022 | 2022 | 6 | 13 | 2022Q2 | 0 | Mon | Jun-13 | Jun-19 |
| 2022-03-10 | 10 March 2022 | 10 Mar 2022 | 2022 | 3 | 10 | 2022Q1 | 3 | Thu | Mar-07 | Mar-13 |
| 2022-05-23 | 23 May 2022 | 23 May 2022 | 2022 | 5 | 23 | 2022Q2 | 0 | Mon | May-23 | May-29 |
| 2020-11-22 | 22 November 2020 | 22 Nov 2020 | 2020 | 11 | 22 | 2020Q4 | 6 | Sun | Nov-16 | Nov-22 |
| 2021-10-27 | 27 October 2021 | 27 Oct 2021 | 2021 | 10 | 27 | 2021Q4 | 2 | Wed | Oct-25 | Oct-31 |
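With these columns in place, the selections mentioned above become one-liners. A small sketch of what this looks like, assuming the df produced by granular_dates above:

# All records from the second quarter of 2022
q2_2022 = df[df['ts_quarter'] == '2022Q2']

# Record counts by year and month
monthly_counts = df.groupby(['ts_year', 'ts_month']).size()

# Average glucose per calendar week, labelled by the week start date
weekly_glucose = df.groupby('ts_week_start')['Glucose'].mean()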
Export to Google Sheets
So now we can quickly read huge datasets, get acquainted with their key parameters, and make selections for any combination of dates. Now it is time to share the results of the analysis with other team members. Sometimes it is more convenient to do this through a cloud service, and that often means Google Sheets.
Python has a wonderful gspread package that lets you export your selections directly to a Google spreadsheet, from where they can be shared with others for collaboration from anywhere in the world.
After you have installed the package and set up Google API credentials (a service account) for your account, use the function below to create a table from your data:
def writedf(df, sh, wsheetname, colstart, rowstart, colend, boolheader):
    # Worksheet inside the already opened spreadsheet
    wsheet = sh.worksheet(wsheetname)
    # Last row of the target range (one extra row for the header)
    end_update = str(df.shape[0] + 1)
    # Target range in A1 notation, e.g. 'A1:Z769'
    range_update = colstart + rowstart + ':' + colend + end_update
    # Optionally prepend the column names as the first row
    cols_list = df.columns.to_list()
    header = pd.DataFrame(cols_list).T
    header.columns = cols_list
    if boolheader:
        df = pd.concat([header, df])
    data = df.values.tolist()
    # Write everything in a single batch request
    wsheet.update(range_update, data)
One of the benefits of this function is that the data is written in a single request instead of cell by cell, which reduces the number of calls to the Google Sheets API and therefore the chance of running into rate limits.
import gspread

# Establish the connection using service-account credentials
gc = gspread.service_account()
# Open the target spreadsheet (the name here is just an example) and fill in the worksheet
sh = gc.open('my-spreadsheet')
writedf(df, sh, 'mySheet', 'A', '1', 'Z', True)  # end column 'Z' is an example; choose one wide enough for your DataFrame
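For contrast with the single-request update above, a naive cell-by-cell version (shown purely as an illustration of what writedf avoids) would issue one API request per cell and quickly run into Google’s quota:

# One API request per cell: fine for a handful of values, far too slow for large selections
def writedf_slow(df, wsheet):
    for i, row in enumerate(df.values.tolist(), start=2):  # start=2 leaves row 1 for the header
        for j, value in enumerate(row, start=1):
            wsheet.update_cell(i, j, value)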
In a similar way, you can also connect our dataset to Notion, a modern replacement for Evernote, but that is material for a whole separate article.
It would be interesting to hear what Python helper functions you have developed and which ones you use most often.