Four functions for fast work with Big Data

I often rely on helper functions when working with big data: they make the job easier and faster. Some of them I found on the Internet, others I wrote myself. Today I want to share four of them; perhaps they will be useful to someone.

Fast reading of large files

One of the first tools that an analyst or Data Scientist comes across is Pandas, a Python library for data processing and analysis. With its help, we import and sort data, make selections and find dependencies. For example, to read a file using Pandas in Python, we write:

import pandas as pd
data = pd.read_csv('data_file.csv')

This approach is simple and clear; every Data Scientist or analyst knows it. But what if there is a lot of data? Say, 100,000,000 rows that are constantly changing, deadlines are looming, and there are 100 more hypotheses to test before lunch?

Let's take the diabetes research dataset from the Kaggle website and repeat each of its rows 100,000 times to create a test dataset. The result is 76,800,000 rows.

The original dataset looks like this:

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
...          ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

We transform it for our experimental tasks:

# Read the dataset from a file
df = pd.read_csv('diabetes.csv')

# Build the test dataset by repeating every row 100,000 times
df = df.loc[df.index.repeat(100000)]

# Save to a file for the experiments
df.to_csv('benchmark.csv')

print(len(df), 'rows')
76800000 rows
[Execution time screenshot]

The transformation took a while, but let's hope it was time well spent. How long will it take to read such a file?

df = pd.read_csv('benchmark.csv')
[Execution time screenshot]

If we reload the updated dataset every time, testing our 100 hypotheses will take about 13 hours. That can't be done before lunch.
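
To see where those hours go, you can time the read on your own machine. Below is a minimal timing sketch (assuming the benchmark.csv file created above); the exact numbers will of course depend on your hardware:

import time
import pandas as pd

start = time.perf_counter()
df = pd.read_csv('benchmark.csv')
print(f'pd.read_csv took {time.perf_counter() - start:.1f} s for {len(df)} rows')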

We want to load the data faster without losing the benefits that Pandas provides. A simple function based on the datatable library will help us do this:

import datatable as dt
import pandas as pd

def read_fast_csv(f):
    # fread() parses the CSV in parallel, much faster than pd.read_csv()
    frame = dt.fread(f)
    # Convert the datatable Frame back to a regular Pandas DataFrame
    ds = frame.to_pandas()
    return ds

Now let’s try to read our big dataset:

ds = read_fast_csv('benchmark.csv')
[Execution time screenshot]

It's six times faster! So we'll make it before lunch after all.

Quick introduction to the dataset

The following function displays several characteristics of a dataset to help you get acquainted with its contents, in particular:

  • Dataset size

  • Information about duplicates

  • Percentage of missing values

  • Number of unique values

  • The first five rows of the dataset

def brief_df(df):
    # Count missing and non-missing values per column
    rows_na = df.isna().sum().reset_index().rename(columns={0: "valuesNa"})
    rows_notna = df.notna().sum().reset_index().rename(columns={0: "valuesNotNa"})
    rows_analysis = pd.merge(rows_na, rows_notna, on="index", how="outer")
    rows_analysis["completeRatio"] = round(
        rows_analysis["valuesNotNa"] / (rows_analysis["valuesNotNa"] + rows_analysis["valuesNa"]) * 100, 2)

    # Number of unique values per column
    cardinality = df.nunique().reset_index().rename(columns={0: "cardinality"})
    rows_analysis = pd.merge(rows_analysis, cardinality)

    # Dataset size and number of duplicates
    print("Size:", df.shape)
    dup_raw = df.duplicated().sum()
    dup_per = round((dup_raw * 100) / df.shape[0], 2)
    print("Duplicates:", dup_raw, "->", dup_per, "%")

    # Missing-value statistics
    print("Missing-value overview")
    display(rows_analysis)

    # First five rows
    print("First five rows")
    display(df.head())

Let’s try to analyze our large dataset that we created in the first section:

brief_df(df)

Size: (76800000, 10)
Duplicates: 76799232 -> 100.0 %
Missing-value overview

   index                     valuesNa  valuesNotNa  completeRatio  cardinality
0  C0                               0     76800000          100.0          768
1  Pregnancies                      0     76800000          100.0           17
2  Glucose                          0     76800000          100.0          136
3  BloodPressure                    0     76800000          100.0           47
4  SkinThickness                    0     76800000          100.0           51
5  Insulin                          0     76800000          100.0          186
6  BMI                              0     76800000          100.0          248
7  DiabetesPedigreeFunction         0     76800000          100.0          517
8  Age                              0     76800000          100.0           52
9  Outcome                          0     76800000          100.0            2

First five rows

   C0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0   0            6      148             72             35        0  33.6                     0.627   50     True
1   0            6      148             72             35        0  33.6                     0.627   50     True
2   0            6      148             72             35        0  33.6                     0.627   50     True
3   0            6      148             72             35        0  33.6                     0.627   50     True
4   0            6      148             72             35        0  33.6                     0.627   50     True

Yes, we really do have 100% duplicates, because we repeated each row 100,000 times. So everything checks out.
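
If at this point you wanted to shrink the dataset back down to its unique rows, a one-line sketch would do it (this step is not part of the workflow above, just an illustration):

# Collapse the repeated rows back to the 768 unique records
df_unique = df.drop_duplicates()
print(len(df_unique), 'unique rows')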

Columns with dates in different formats

To analyze and visualize data, you often need dates in several different formats even though only one date column is given. It is even worse when the dates are written haphazardly. You have to bring all the dates to a standard form and then programmatically extract different pieces of information from each date. I wrote a function that greatly simplifies this process.

First, let's simulate the situation and add a column of random dates from the last two years to the original dataset:

import numpy as np

# Read the dataset from a file
df = pd.read_csv('diabetes.csv')

# Add a column of 768 random dates between 2020-01-01 and 2022-12-31
df['date'] = np.random.choice(pd.date_range('2020-01-01', '2022-12-31'), 768)

df.to_csv('diabetes_dates.csv')

Below is a function that adds columns in the most frequently needed date formats to an existing dataset, so that you can filter by year, quarter, month or week, or take a fully formatted date for visualization purposes.

from datetime import timedelta

def granular_dates(df, col):
    # Normalized date (time component stripped)
    df['ts_date'] = pd.to_datetime(df[col]).dt.normalize()
    # Full date with the month name
    df['ts_date_str'] = df["ts_date"].dt.strftime("%d %B %Y")
    # Short date with the abbreviated month name
    df['ts_date_str_short'] = df["ts_date"].dt.strftime("%d %b %Y")
    # Year only
    df['ts_year'] = df.ts_date.dt.year
    # Month number only
    df['ts_month'] = df.ts_date.dt.month
    # Day of the month only
    df['ts_day'] = df.ts_date.dt.day
    # Year and quarter
    df['ts_quarter'] = pd.PeriodIndex(df.ts_date, freq="Q").astype(str)
    # Day-of-week number (Monday = 0)
    df['ts_dayweek'] = df.ts_date.dt.dayofweek
    # Abbreviated weekday name
    df['ts_dayweektext'] = df.ts_date.dt.strftime("%a")
    # Date of the start of the week (Monday)
    df['ts_week_start'] = df.ts_date.apply(lambda x: x - timedelta(days=x.weekday())).dt.strftime("%b-%d")
    # Date of the end of the week (Sunday)
    df['ts_week_end'] = df.ts_date.apply(lambda x: x - timedelta(days=x.weekday()) + timedelta(days=6)).dt.strftime("%b-%d")

Now it takes just one line of code (not counting reading the file):

# Read the dataset
df = pd.read_csv('diabetes_dates.csv')

# Add date columns in various formats
granular_dates(df, 'date')

As a result, the following columns are added to the dataset:

ts_date     ts_date_str       ts_date_str_short  ts_year  ts_month  ts_day  ts_quarter  ts_dayweek  ts_dayweektext  ts_week_start  ts_week_end
2022-06-13  13 June 2022      13 Jun 2022        2022     6         13      2022Q2      0           Mon             Jun-13         Jun-19
2022-03-10  10 March 2022     10 Mar 2022        2022     3         10      2022Q1      3           Thu             Mar-07         Mar-13
2022-05-23  23 May 2022       23 May 2022        2022     5         23      2022Q2      0           Mon             May-23         May-29
2020-11-22  22 November 2020  22 Nov 2020        2020     11        22      2020Q4      6           Sun             Nov-16         Nov-22
2021-10-27  27 October 2021   27 Oct 2021        2021     10        27      2021Q4      2           Wed             Oct-25         Oct-31
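
With these columns in place, date-based selections become simple boolean filters. A small sketch (using the column names produced by granular_dates above; the specific filters are just examples):

# All records from the second quarter of 2022
q2_2022 = df[df['ts_quarter'] == '2022Q2']

# Number of records per year and month
per_month = df.groupby(['ts_year', 'ts_month']).size()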

Export to Google Sheets

So now we can quickly read huge datasets, get acquainted with their key characteristics, and make selections by any combination of dates. It's time to share the results of the analysis with other team members. Sometimes it is more convenient to do this through a cloud service, and Google Sheets is often the tool of choice.

Python has a wonderful gspread package that lets you export your selections directly to a Google spreadsheet, from where they can be shared with others for collaboration from anywhere in the world.

After you have installed the package and set up Google API credentials (a service-account key) for your account, use the function below to fill a sheet with your data:

def writedf(df, sh, wsheetname, colstart, rowstart, colend, boolheader):
    # sh is an opened gspread spreadsheet; select the worksheet by name
    wsheet = sh.worksheet(wsheetname)

    # Last row of the target range: number of data rows plus one for the header
    row_end = str(df.shape[0] + 1)
    range_update = colstart + rowstart + ':' + colend + row_end

    # Build a one-row DataFrame with the column names to prepend as a header
    cols_list = df.columns.to_list()
    header = pd.DataFrame(cols_list).T
    header.columns = cols_list

    if boolheader:
        df = pd.concat([header, df])

    # A single update() call writes the whole range in one API request
    data = df.values.tolist()
    wsheet.update(range_update, data)

One of the benefits of this function is that the data is written in a single request instead of cell by cell, which saves calls to the Google Sheets API and makes it less likely that you will hit its rate limits.

import gspread

# Establish the connection (gspread reads the service-account JSON credentials)
gc = gspread.service_account()

# Open the target spreadsheet; 'mySpreadsheet' is a placeholder name
sh = gc.open('mySpreadsheet')

# Fill the worksheet 'mySheet': columns A through K starting at row 1, with a header row
# (adjust 'K' to the last column of your data)
writedf(df, sh, 'mySheet', 'A', '1', 'K', True)

In a similar way, you can also connect our dataset to Notion, a modern replacement for Evernote, but that would be material for a whole separate article.

It would be interesting to hear what Python helper functions you have built up yourselves. Which ones do you use most often?
