Method chaining in pandas. Let’s upload our code. Miniguide for those interested


Lucky panda who started using call chain.

Lucky panda who started using call chain.

Method chain (or call chain, method chaining) is a style of writing code that allows you to perform several operations at once, ultimately saving time and energy.

For those who are not familiar with this concept, it is essentially a way of applying multiple methods or functions to data in one line of code. The traditional approach to using pandas involves using individual functions and commands one at a time. This can get quite tedious and hard to remember. Also, if something goes wrong, it can be difficult to troubleshoot since multiple operations have been used. I still had a habit of jumping from one cell to another, then you can’t remember exactly what you did and it’s easier to rewrite everything again.

The chain method will help eliminate this problem. By chaining methods together, you can easily perform multiple operations at the same time. This reduces the time spent writing code and makes it easier to read and debug.

The chain method has several advantages:

  • allows data frames to be manipulated in multiple ways at the same time, making it a great tool for solving complex problems.

  • greatly simplifies debugging and reading code in the future. Your future self will thank you!

  • reduces the number of created variables to a minimum. No more headaches about variable names.

  • avoids the SettingWithCopyWarning problem that can occur when trying to change a copy of a data frame instead of the original.

  • method chaining is considered to be more efficient than single steps since operations can be performed sequentially without creating intermediate frames that increase pandas memory usage.

This method is actively promoting Matt Harrison (google Idiomatic Pandas, Effective Pandas: Patterns for Data Manipulation, there are interesting talks on PyData). The chain of methods is also mentioned in the official pandas cheetsheet, but for me it is too casual and in an inconspicuous left corner, given how this method improves the readability of the code:

Mentioning Method Chaining in CheetSheet on the pandas site.

Mentioning Method Chaining in CheetSheet on the pandas site.

It is important to understand the potential downsides and issues before moving on to this approach:

  • perhaps this style may not be suitable for beginners – although the process of combining operations may seem simple, but each operation changes the data, and these changes may not be completely clear to the user if he has little practice.

  • may be computationally inefficient. If the analysis requires many operations, the cost of performing them within a single operation can be quite high.

Prologue: Method Call Chaining in OOP

Method call chain, is a common syntax for calling multiple methods in object-oriented programming languages. Each method returns an object, which allows you to combine calls into a single structure without requiring variables to store intermediate results.

  • this can reduce the length of the entire code since there is no need to create countless variables.

  • this can improve the readability of the code since the methods are called sequentially.

An example on a simple Cat class:

class Cat:
    def __init__(self, name=None, breed=None):
        self.name = name
        self.breed = breed

    def set_name(self, name):
        self.name = name

    def set_breed(self, breed):
        self.breed = breed

    def __repr__(self, ):
        class_name = type(self).__name__
        attr=", ".join([f'{i}={v!r}' for i, v in self.__dict__.items()])
        return f"{class_name}({attr})"

Because methods do not return an instance of the class (self), then we successively apply them to set the nickname and breed:

djon_the_cat = Cat() 
djon_the_cat.set_name('Djon')
djon_the_cat.set_breed('Siamse')
>>> Cat(name="Djon", breed='siamse')

But if we add return self to each method, then we can already combine our cat into a chain:

code with adding self
class Cat:
    def __init__(self,name=None,breed=None):
        self.name=name
        self.breed=breed
    
    def set_name(self,name):
        self.name=name
        #добавляем return self
        return self
        
    def set_breed(self,breed):
        self.breed=breed  
        #добавляем return self
        return self
    
    def __repr__(self,):
        class_name=type(self).__name__
        attr=", ".join([f'{i}={v!r}' for i,v in self.__dict__.items()])
        return f"{class_name}({attr})"
djon_the_cat = Cat().set_name('Djon').set_breed('Siamse')
>>> Cat(name="Djon", breed='siamse')

Likewise in pandas, our goal is to get away from this:

df = pd.read_excel('./materials/бурение_2022.xlsm', header=None)

dict_replace_tpp = {'СП "предприятие_1"':"п_1",
                    'СП "предприятие_2"':"п_2",
                    'СП "предприятие_3"':"п_3",
                    'СП "предприятие_3"':"п_4"}

replace_type = {'Из них горизонтальных':'Г',
'Из них вертикальных и наклонно-направленных':'ННС',
'Из разведочного бурения':"Р"}  


df = df[dict_columns.keys()].rename(columns = dict_columns)
df['итого добыча'] = pd.to_numeric(df['итого добыча'], errors="coerce")
df = df.dropna(subset=['итого добыча','объект']).reset_index(drop=True)
df['дата ввода факт'] = pd.to_datetime(df['дата ввода факт'], dayfirst=True)
df['дата ввода план'] = pd.to_datetime(df['дата ввода план'], dayfirst=True)
df['ТПП'] = df['ТПП'].replace(dict_replace_tpp)

df['тип'] = df['тип'].map(replace_type)
df['тип'] = df['тип'].mask(df['объект'].str.contains("СМД|смд"), 'СМД')
df = df.sort_values(by=['ТПП','месторождение','дата ввода факт'], ascending=False).reset_index(drop=True)

df['дата ввода план'] = pd.to_datetime(df['дата ввода план'].dt.date)
df['дата ввода факт'] = pd.to_datetime(df['дата ввода факт'].dt.date)

df = df[df['ТПП'] != "п_3"].reset_index(drop=True)

To that:

df = pd.read_csv('./materials/Обводненность ЛБ.csv')
(df
 .query(...)
 .pipe(lambda _df:_df ...)
 .assign(**{'дата отбора':...})
 ...
 .pipe(lambda _df:_df ...)
)

Such a record will help to significantly increase the readability of the code.

The three main tools are pipe, assign and query.

pipe method:

pandas.DataFrame.pipe(func, *args, **kwargs)

pandas pipe method is a tool for tying together multiple functions with a single call. This allows you to perform multiple sequential operations on the same frame without having to store intermediate results in separate variables. The pipe method is especially useful when working with large datasets and complex operations such as transformations, data aggregations, and allows you to do everything in one step.

The code for using the pipe method is relatively simple. All you have to do is pass the function to the pipe() method. The functions will be executed in order, and the output of each successive function will be the input of the next function.

Once the pipe() method has finished executing, the result will be the output of the last function in the list.

Consider a simple example of how the pipe method can be used:

>>> d = {'col1': [1,2],'col2': [3,4]}
>>> df = pd.DataFrame(data=d)
>>> df

   col1  col2
0     1     3
1     2     4

>>> df.pipe(lambda _df:_df + 1).pipe(lambda _df:_df + 1)

   col1  col2
0     3     5
1     4     6

▍The use of lambda has a significant advantage, since we get the current data frame.

Instead of writing multiple lines of code to perform multiple operations, the same operations can be performed on a single line. Also, since the pipe() method returns a frame, it will help you create complex pipelines of functions by tying them together.

assign method:

pandas.DataFrame.assign(**kwargs)

assign method in pandas used to assign new columns to the data frame. This method allows you to create new columns or assign new values ​​to existing columns for each frame row. With assign, you can include calculations, constants, and even other frames as the source of values ​​in new columns.

Examples:

>>> d = {'col1': [1,2],'col2': [3,4]}
>>> df = pd.DataFrame(data = d)
>>> df

   col1  col2
0     1     3
1     2     4

>>> df.assign(col3 = lambda _df:_df['col2']+1)

   col1  col2  col3
0     1     3     4
1     2     4     5

# можно передовать и просто значение
>>> df.assign(col3 = 'pandas')

   col1  col2    col3
0     1     3  pandas
1     2     4  pandas

# или с помощью словаря, если в названии колонки есть пробелы
>>> df.assign(**{"col 3":lambda _df:'pandas'})

  col1  col2   col 3
0     1     3  pandas
1     2     4  pandas

As you can see, the assign+lambda method provides a lot of flexibility when it comes to adding new columns and data to the frame.

query method:

pandas.DataFrame.query(expr, *, inplace=False, **kwargs)

query method in pandas allows you to extract certain rows from a frame based on certain conditions. The method accepts a string containing a valid expression that can refer to columns within the frame. This provides an intuitive way to select and filter rows from a dataframe based on certain criteria. In fact, this is the creation of complex queries to select specific data.

>>> d = {'col1': [1,2,3,4,5,6],
         'col2': [7,8,9,10,11,12],
         'col3': ['a','b','a','b','a','g'] }
>>> df = pd.DataFrame(data = d)
>>> df


   col1  col2 col3
0     1     7    a
1     2     8    b
2     3     9    a
3     4    10    b
4     5    11    a
5     6    12    g

# простое выражение с больше/меньше
>>> df.query('col1 < 5')

   col1  col2 col3
0     1     7    a
1     2     8    b
2     3     9    a
3     4    10    b

# выражение с использованием методов pandas к столбцу
>>> df.query("col3.isin(['a'])")

   col1  col2 col3
0     1     7    a
2     3     9    a
4     5    11    a

# можно с помощью @ обращаться к переменным
>>> filter_b = 'b'
>>> df.query("col3.isin([@filter_b])")

   col1  col2 col3
1     2     8    b
3     4    10    b

You can use logical operators such as “&”, “|” and “~”, to create more specific filters. In addition, the method can be used to compare numeric values ​​between different records, string operations, logical operations, and so on. What makes it especially useful is the ability to easily combine multiple conditions together.

Parsing with an example. We count oil at enterprises.

▍Since everyone has already counted the survivors of the Titanic, Iris petals, classified Palmer penguins, we will use real data – we will find the top five oil fields at four production enterprises.

We will use a real reporting plate with the plans of the oil society for 2023, only slightly encrypted:

The most common and ordinary tablet that everyone encounters.

The most common and ordinary tablet that everyone encounters.

Our task will be to identify the top five fields with the maximum oil production (we will define the remaining fields as “the rest”) and plot the average daily production (by months) for each enterprise using call chain:

An example of what we are going to do.

An example of what we are going to do.

Required libraries:

import pandas as pd # version 1.5.3
import plotly.express as px # version 5.13.0

We read our excel file and immediately look at it:

df = pd.concat(pd.read_excel('./materials/oil_prod.xlsx',
              sheet_name=None,
              header=None), ignore_index=True)

df.head(10)

In order not to use “\” for wrapping and go to chaining we have to wrap df in parentheses:

# слеш используется для переноса, чтобы избавится от него, нужно df обернуть в скобки
df \
.loc[:,:]
(df.loc[:,:]
)
# или как кому нравится
(
    df.loc[:,:]
)

let’s leave only the columns we need (we will display them in a separate dictionary variable), getting rid of quarters and an unfilled fact, let’s start creating our chain of methods:

columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}

(
    df
    .loc[:,columns_rename.keys()]
    .rename(columns=columns_rename)
)

Next, we need to correlate oil production by enterprises, fields, facilities.

columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}

(
    df
    .loc[:,columns_rename.keys()]
    .rename(columns=columns_rename)
    # выделяем предприятия из колонки показатель
    # с помощью _df['показатель'].str.contains('ТПП|СП')
    # возвращаем True если строка содержит ТПП или СП
    # df['показатель'].where вытаскиваем название предприятия
    # и fillna заполняет название вниз вниз
    .assign(предприятие=lambda _df: (_df['показатель']
                                     .where(_df['показатель'].str.contains('ТПП|СП')
                                     .fillna(False)).fillna(method='pad')))
    .set_index('предприятие')
    .reset_index()
)

to avoid code repetition (we still need to pull out deposits and objects), let’s create a function and call it using the pipe method:

columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}

def _pipe_column_filter(_df, column_name, text_filter="ТПП|СП"):
    _df = _df.assign(**{column_name:(_df['показатель']
                                     .where(_df['показатель'].str.contains(text_filter)
                                            .fillna(False))
                                     .fillna(method='pad'))})
    _df = _df.set_index(column_name).reset_index()
    return _df


(
    df
    .loc[:,columns_rename.keys()]
    .rename(columns=columns_rename)
    # блок с применением метода pipe
    .pipe(_pipe_column_filter, column_name="объект", text_filter="ТПП|СП|месторождение|[Оо]бъект")      
    .pipe(_pipe_column_filter, column_name="месторождение", text_filter="ТПП|СП|месторождение")    
    .pipe(_pipe_column_filter, column_name="предприятие", text_filter="ТПП|СП")

)

everything went as expected – we received three new columns with enterprises, deposits and objects:

we leave only the fields, delete the objects column and filter the indicator for oil production:

columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}

indicator_filter = ['добыча нефти за период']

def _pipe_column_filter(_df, column_name, text_filter="ТПП|СП"):
    _df = _df.assign(**{column_name:(_df['показатель']
                                     .where(_df['показатель'].str.contains(text_filter)
                                            .fillna(False))
                                     .fillna(method='pad'))})
    _df = _df.set_index(column_name).reset_index()
    return _df

(
    df
    .loc[:,columns_rename.keys()]
    .rename(columns=columns_rename)
    # блок с применением метода pipe 
    .pipe(_pipe_column_filter, column_name="объект", text_filter="ТПП|СП|месторождение|[Оо]бъект")      
    .pipe(_pipe_column_filter, column_name="месторождение", text_filter="ТПП|СП|месторождение")    
    .pipe(_pipe_column_filter, column_name="предприятие", text_filter="ТПП|СП")
    # блок с применением метода query
    .query("объект.fillna('-').str.contains('месторождение')")
    .query('показатель.isin(@indicator_filter)')
    .drop(columns = ['объект','показатель'])
)

▍In new versions, the functionality has been extended and now you can call pd.Series methods from query. Not sure how long ago this was introduced.

Let’s turn the form of the table into a long one and delete the word deposit from the column of the same name. Using the _group_top_n_fields function, we select the top five fields, leaving ourselves room for maneuver (suddenly then we want to see not five deposits, but ten):

columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}

indicator_filter = ['добыча нефти за период']

def _pipe_column_filter(_df, column_name, text_filter="ТПП|СП"):
    _df = _df.assign(**{column_name:(_df['показатель']
                                     .where(_df['показатель'].str.contains(text_filter)
                                            .fillna(False))
                                     .fillna(method='pad'))})
    _df = _df.set_index(column_name).reset_index()
    return _df


def _group_top_n_fields(_df, n=5):
    _df_top_n = _df.groupby(['предприятие', 'месторождение'],
                               as_index=False)['добыча нефти, тыс. тонн'].sum()
    _df_top_n = (_df_top_n
                 .groupby(['предприятие'],
                          as_index=True, group_keys=False)
                 .apply(lambda df_:df_.nlargest(n=n,columns = ['добыча нефти, тыс. тонн'])))
    _df = (_df
           .assign(месторождение=lambda df_:df_['месторождение'].where(df_['месторождение'].isin(_df_top_n['месторождение']))
                   .fillna('остальные')))
    return _df.groupby(['предприятие',"месторождение","месяц"], as_index=False).sum()

(
    df
    .loc[:,columns_rename.keys()]
    .rename(columns=columns_rename)
    # блок с применением метода pipe 
    .pipe(_pipe_column_filter, column_name="объект", text_filter="ТПП|СП|месторождение|[Оо]бъект")      
    .pipe(_pipe_column_filter, column_name="месторождение", text_filter="ТПП|СП|месторождение")    
    .pipe(_pipe_column_filter, column_name="предприятие", text_filter="ТПП|СП")
    # блок с применением метода query
    .query("объект.fillna('-').str.contains('месторождение')")
    .query('показатель.isin(@indicator_filter)')
    .drop(columns = ['объект','показатель'])
    # сворачивание формы таблицы в длинную и выделяем топ 5 месторождений
    .pipe(lambda _df: pd.melt(frame=_df,
                              id_vars=['предприятие','месторождение'],
                              var_name="месяц",
                              value_name="добыча нефти, тыс. тонн"))
    .assign(месторождение=lambda _df:_df['месторождение'].replace(' месторождение', "", regex=True))
    .pipe(_group_top_n_fields, n=5)
)

Now let’s change the month column so that it can be sorted from January to December by using the function pd.date_range and categorical type property in Pandas ordered=True (since there is a lot of code, we will remove it under the spoiler for convenience):

common code
columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}


indicator_filter = ['добыча нефти за период']

def _pipe_column_filter(_df, column_name, text_filter="ТПП|СП"):
    _df = _df.assign(**{column_name:(_df['показатель']
                                     .where(_df['показатель'].str.contains(text_filter)
                                            .fillna(False))
                                     .fillna(method='pad'))})
    _df = _df.set_index(column_name).reset_index()
    return _df

def _group_top_n_fields(_df, n=5):
    _df_top_n = _df.groupby(['предприятие', 'месторождение'],
                               as_index=False)['добыча нефти, тыс. тонн'].sum()
    _df_top_n = (_df_top_n
                 .groupby(['предприятие'],
                          as_index=True, group_keys=False)
                 .apply(lambda df_:df_.nlargest(n=n,columns = ['добыча нефти, тыс. тонн'])))
    _df = (_df
           .assign(месторождение=lambda df_:df_['месторождение'].where(df_['месторождение'].isin(_df_top_n['месторождение']))
                   .fillna('остальные')))
    return _df.groupby(['предприятие',"месторождение","месяц"], as_index=False).sum()

(
    df
    .loc[:,columns_rename.keys()]
    .rename(columns=columns_rename)
    # блок с применением метода pipe 
    .pipe(_pipe_column_filter, column_name="объект", text_filter="ТПП|СП|месторождение|[Оо]бъект")      
    .pipe(_pipe_column_filter, column_name="месторождение", text_filter="ТПП|СП|месторождение")    
    .pipe(_pipe_column_filter, column_name="предприятие", text_filter="ТПП|СП")
    # блок с применением метода query
    .query("объект.fillna('-').str.contains('месторождение')")
    .query('показатель.isin(@indicator_filter)')
    .drop(columns = ['объект','показатель'])
    # сворачивание формы таблицы в длинную и выделяем топ 5 месторож.
    .pipe(lambda _df: pd.melt(frame=_df,
                              id_vars=['предприятие','месторождение'],
                              var_name="месяц",
                              value_name="добыча нефти, тыс. тонн"))
    .assign(месторождение=lambda _df:_df['месторождение'].replace(' месторождение', "", regex=True))
    .pipe(_group_top_n_fields, n=5)
    # создаем категориальный столбец с возможностью сортировки от января до декабря
    .assign(месяц=lambda _df: pd.Categorical(values=_df['месяц'].str.lower(),
                                             categories=pd.date_range(start="01.01.2023",
                                                                  end='12.01.2023',
                                                                  freq='MS').month_name(locale="Ru").str.lower(),
                                             ordered=True))
    .sort_values(['предприятие','месторождение','месяц'])
    .reset_index(drop=True)
)

To level the number of days in a month (January – 31, February – 28, April – 30), we find the average daily oil production (divide oil production by the number of days in a month). Number of days will be added in pd.Series.map({}) and in order not to complicate the code, we will put it into a dictionary in a separate variable:

map_days_in_month = {i.month_name(locale="Ru").lower():i.daysinmonth
                     for i in pd.date_range(start="01.01.2023",
                                            end='12.01.2023',freq='MS')}

>>> {'январь': 31,
 'февраль': 28,
 'март': 31,
 'апрель': 30,
 'май': 31,
 'июнь': 30,
 'июль': 31,
 'август': 31,
 'сентябрь': 30,
 'октябрь': 31,
 'ноябрь': 30,
 'декабрь': 31}

▍It is interesting that pandas already contains the names of the months in Russian and simply adding to month_name(locale=”Ru”) you can take them from there.

Together:

common code
columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}


map_days_in_month = {i.month_name(locale="Ru").lower():i.daysinmonth for i in pd.date_range(start="01.01.2023",
                                                                                            end='12.01.2023',freq='MS')}

indicator_filter = ['добыча нефти за период']

def _pipe_column_filter(_df, column_name, text_filter="ТПП|СП"):
    _df = _df.assign(**{column_name:(_df['показатель']
                                     .where(_df['показатель'].str.contains(text_filter)
                                            .fillna(False))
                                     .fillna(method='pad'))})
    _df = _df.set_index(column_name).reset_index()
    return _df




def _group_top_n_fields(_df, n=5):
    _df_top_n = _df.groupby(['предприятие', 'месторождение'],
                               as_index=False)['добыча нефти, тыс. тонн'].sum()
    _df_top_n = (_df_top_n
                 .groupby(['предприятие'],
                          as_index=True, group_keys=False)
                 .apply(lambda df_:df_.nlargest(n=n,columns = ['добыча нефти, тыс. тонн'])))
    _df = (_df
           .assign(месторождение=lambda df_:df_['месторождение'].where(df_['месторождение'].isin(_df_top_n['месторождение']))
                   .fillna('остальные')))
    return _df.groupby(['предприятие',"месторождение","месяц"], as_index=False).sum()

(
    df
    .loc[:,columns_rename.keys()]
    .rename(columns=columns_rename)
    # блок с применением метода pipe
    .pipe(_pipe_column_filter, column_name="объект", text_filter="ТПП|СП|месторождение|[Оо]бъект")      
    .pipe(_pipe_column_filter, column_name="месторождение", text_filter="ТПП|СП|месторождение")    
    .pipe(_pipe_column_filter, column_name="предприятие", text_filter="ТПП|СП")
    # блок с применением метода query
    .query("объект.fillna('-').str.contains('месторождение')")
    .query('показатель.isin(@indicator_filter)')
    .drop(columns = ['объект','показатель'])
    # сворачивание формы таблицы в длинную и выделяем топ 5 месторож.
    .pipe(lambda _df: pd.melt(frame=_df,
                              id_vars=['предприятие','месторождение'],
                              var_name="месяц",
                              value_name="добыча нефти, тыс. тонн"))
    .assign(месторождение=lambda _df:_df['месторождение'].replace(' месторождение', "", regex=True))
    .pipe(_group_top_n_fields, n=5)
    # создаем категориальный столбец с возможностью сортировки от января до декабря
    .assign(месяц=lambda _df: pd.Categorical(values=_df['месяц'].str.lower(),
                                             categories=pd.date_range(start="01.01.2023",
                                                                  end='12.01.2023',
                                                                  freq='MS').month_name(locale="Ru").str.lower(),
                                             ordered=True))
    .sort_values(['предприятие','месторождение','месяц'])
    .reset_index(drop=True)
    # добавляем столбцы `кол-во дней в месяце` и `добыча нефти тыс.тонн/сут`
    .assign(**{'кол-во дней в месяце':lambda _df:_df['месяц'].map(map_days_in_month)})
    .assign(**{"добыча нефти тыс.тонн/сут":lambda _df:(_df['добыча нефти, тыс. тонн']/_df['кол-во дней в месяце']).round(1)})
)

Everything, our plate is ready! Now, you can define all actions either in a function or in a separate class, as you like.

common code with function definition
def processing_oil_table(path):
    
    df_ = pd.concat(pd.read_excel(path,
              sheet_name=None,
              header=None), ignore_index=True)
    
    columns_rename = {1:"показатель",5:'январь',7:'февраль',9:'март',
                  13:"апрель",15:"май",17:"июнь",21:"июль",
                  23:"август",25:"сентябрь",29:"октябрь",
                  31:"ноябрь",33:"декабрь"}
    map_days_in_month = {i.month_name(locale="Ru").lower():i.daysinmonth for i in pd.date_range(start="01.01.2023",
                                                                                            end='12.01.2023',freq='MS')}
    indicator_filter = ['добыча нефти за период']

    def _pipe_column_filter(_df, column_name, text_filter="ТПП|СП"):
        _df = _df.assign(**{column_name:(_df['показатель']
                                         .where(_df['показатель'].str.contains(text_filter)
                                                .fillna(False))
                                         .fillna(method='pad'))})
        _df = _df.set_index(column_name).reset_index()
        return _df
    
    def _group_top_n_fields(_df, n=5):
        _df_top_n = _df.groupby(['предприятие', 'месторождение'],
                                   as_index=False)['добыча нефти, тыс. тонн'].sum()
        _df_top_n = (_df_top_n
                     .groupby(['предприятие'],
                              as_index=True, group_keys=False)
                     .apply(lambda df_:df_.nlargest(n=n,columns = ['добыча нефти, тыс. тонн'])))
        _df = (_df
               .assign(месторождение=lambda df_:df_['месторождение'].where(df_['месторождение'].isin(_df_top_n['месторождение']))
                       .fillna('остальные')))
        return _df.groupby(['предприятие',"месторождение","месяц"], as_index=False).sum()
    return (df_
            .loc[:,columns_rename.keys()]
            .rename(columns=columns_rename)
            # блок с применением метода pipe
            .pipe(_pipe_column_filter, column_name="объект", text_filter="ТПП|СП|месторождение|[Оо]бъект")      
            .pipe(_pipe_column_filter, column_name="месторождение", text_filter="ТПП|СП|месторождение")    
            .pipe(_pipe_column_filter, column_name="предприятие", text_filter="ТПП|СП")
            # блок с применением метода query
            .query("объект.fillna('-').str.contains('месторождение')")
            .query('показатель.isin(@indicator_filter)')
            .drop(columns = ['объект','показатель'])
            # сворачивание формы таблицы в длинную и выделяем топ 5 месторож.
            .pipe(lambda _df: pd.melt(frame=_df,
                                      id_vars=['предприятие','месторождение'],
                                      var_name="месяц",
                                      value_name="добыча нефти, тыс. тонн"))
            .assign(месторождение=lambda _df:_df['месторождение'].replace(' месторождение', "", regex=True))
            .pipe(_group_top_n_fields, n=5)
            # создаем категориальный столбец с возможностью сортировки от января до декабря
            .assign(месяц=lambda _df: pd.Categorical(values=_df['месяц'].str.lower(),
                                                     categories=pd.date_range(start="01.01.2023",
                                                                          end='12.01.2023',
                                                                          freq='MS').month_name(locale="Ru").str.lower(),
                                                     ordered=True))
            .sort_values(['предприятие','месторождение','месяц'])
            .reset_index(drop=True)
            # добавляем столбцы `кол-во дней в месяце` и `добыча нефти тыс.тонн/сут`
            .assign(**{'кол-во дней в месяце':lambda _df:_df['месяц'].map(map_days_in_month)})
            .assign(**{"добыча нефти тыс.тонн/сут":lambda _df:(_df['добыча нефти, тыс. тонн']/_df['кол-во дней в месяце']).round(1)})
           )

>>> processing_oil_table(path="./materials/oil_prod.xlsx")

             предприятие месторождение    месяц  добыча нефти, тыс. тонн  \
0  СП "Смоленскнефтегаз"    LTMJQUJERT   январь                    0.455   
1  СП "Смоленскнефтегаз"    LTMJQUJERT  февраль                    0.401   
2  СП "Смоленскнефтегаз"    LTMJQUJERT     март                    0.433   
3  СП "Смоленскнефтегаз"    LTMJQUJERT   апрель                    0.408   
4  СП "Смоленскнефтегаз"    LTMJQUJERT      май                    0.411   

   кол-во дней в месяце  добыча нефти тыс.тонн/сут  
0                    31                        0.0  
1                    28                        0.0  
2                    31                        0.0  
3                    30                        0.0  
4                    31                        0.0  

We draw graphics.

For deposits, we will use Area plot (or a chart with areas), we will combine together through the pipe function, still adhering to method chaining:

df_oil = processing_oil_table(path="./materials/oil_prod.xlsx")
(
    df_oil
    .pipe(lambda _df:px.area(_df, x="месяц",
                             y="добыча нефти тыс.тонн/сут",
                             color="месторождение",
                             facet_col="предприятие",                             
                             facet_col_wrap=2,
                             facet_col_spacing=0.05))
)

The first part of our task has been completed. Now let’s add the total production for each enterprise (let’s make it an optional indicator) and remove “enterprise=” from the graph titles. You can do it all with lambda, but I will separate a separate function for this so as not to complicate the reading of the code in the future:

def _plot_area(_df, add_sum_venture=False, path_save=None):
    # переменные для построения графиков
    x_month = "месяц"
    y_oil_prod = "добыча нефти тыс.тонн/сут"
    facet_col = "предприятие"
    facet_col_wrap = 2
    color = "месторождение"
    # график по месторождениям
    fig = px.area(_df,
                  x=x_month,
                  y=y_oil_prod,
                  color=color,
                  facet_col_wrap=facet_col_wrap,
                  facet_col=facet_col,
                  facet_col_spacing=0.05)
    # суммарный график по предприятиям
    if add_sum_venture:
        fig.add_traces(px.line(_df.groupby([facet_col, x_month],as_index=False).sum(numeric_only=True).round(2),
                               x=x_month,
                               y=y_oil_prod,
                               facet_col=facet_col,
                               facet_col_wrap=2,
                               text=y_oil_prod).data)

        (fig
         .update_traces(textposition='top center')
         .update_yaxes(range = [0,6]))
    # удаляем "предприятие=" 
    fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
    # сохраняем график, если включен показатель
    if path_save:
        fig.write_image(file=path_save,width=1_000,height=650)
    
    return fig

(
    df_oil
    .pipe(_plot_area, add_sum_venture=True)
)

All is ready:

Final schedule

Final schedule

Then you can still shamanize, recolor, add comments, etc. and so on.

Thanks to everyone who read to the end. The above shows how you can take “traditional code” and refactor it with chaining and functions. Can anyone use this method? Or are there other insights?

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *