Complex Aggregation in Pandas with MultiIndex

import pandas as pd

# пример данных с несколькими уровнями
data = {
    'country': ['USA', 'USA', 'Canada', 'Canada'],
    'city': ['New York', 'Los Angeles', 'Toronto', 'Vancouver'],
    'year': [2020, 2020, 2020, 2020],
    'population': [8.4, 4.0, 2.9, 2.5]
}

df = pd.DataFrame(data)

# Создание MultiIndex
df_multi = df.set_index(['country', 'city'])
print(df_multi)

At the output you will have a DataFrame with a two-level index. country And city. Now you can perform operations on these levels as on regular indexes, but with additional capabilities.

Managing Index Levels

Okay, we've created a MultiIndex. But what if we want to add a new level or swap existing ones? Pandas provides several tools for this.

Adding an index level: method set_index() can also be used to add indexes on top of existing ones.

# Добавление индекса 'year' как еще одного уровня
df_multi = df_multi.set_index('year', append=True)
print(df_multi)

Reordering levels: for this we use the method swaplevel(). For example, if you need to swap country and city:

df_reordered = df_multi.swaplevel('country', 'city')
print(df_reordered)

Removing index levels: this is done using the method reset_index()which “resets” one or more index levels back into columns.

df_reset = df_multi.reset_index('city')
print(df_reset)

stack() and unstack(): working with nested data

Methods stack() And unstack() allow you to convert index levels to columns and vice versa:

df_stacked = df_multi.stack()
print(df_stacked)
df_unstacked = df_multi.unstack()
print(df_unstacked)

Data Aggregation with MultiIndex

Now let's move on to the most important thing – how MultiIndex helps to perform complex data aggregations. Complex multidimensional groupings and aggregations can be implemented based on groupby() And agg(). Let's say we have a DataFrame with data by year and month, and we want to aggregate data by several levels at once.

Example of grouping at several levels:

# Допустим, у нас есть данные о продажах за несколько лет по странам и городам
data = {
    'country': ['USA', 'USA', 'Canada', 'Canada'],
    'city': ['New York', 'Los Angeles', 'Toronto', 'Vancouver'],
    'year': [2020, 2020, 2020, 2021],
    'sales': [100, 200, 150, 130]
}

df = pd.DataFrame(data)
df_multi = df.set_index(['country', 'city', 'year'])

# Группируем по странам и считаем общие продажи
df_grouped = df_multi.groupby(level="country").sum()
print(df_grouped)

This code will show us the total sales by country for all years. But we can go even further and aggregate the data by levels within levels. For example, using agg()you can combine different types of aggregation:

df_aggregated = df_multi.groupby(level="country").agg({'sales': ['sum', 'mean']})
print(df_aggregated)

Here we calculate both the total and the average sales for each level at the same time.

Working with data slices by MultiIndex levels

When there is a MultiIndex, the standard data access methods of the type .loc[] become even more powerful. For example, you can extract data by a single level or by a combination of levels.

# Доступ к данным для определенного города
print(df_multi.loc[('USA', 'New York')])
# Получение данных по конкретному году
print(df_multi.xs(2020, level="year"))

Other useful features

Let's not stop there. Here are some useful features that can be useful when working with MultiIndex:

  1. Checking for MultiIndex: Sometimes you need to quickly check if there is a MultiIndex in a DataFrame.

print(df_multi.index.is_multi)
  1. Getting level names:

print(df_multi.index.names)
  1. Unpacking index levels into columns to make things easier:

df_flat = df_multi.reset_index()
print(df_flat)
  1. Combining multiple indexes into one level: when you need to combine several indexes into one, you can use pd.MultiIndex.from_tuples().

new_index = pd.MultiIndex.from_tuples([('USA', 2020), ('Canada', 2021)], names=['country', 'year'])

This is how Pandas MultiIndex turns complex multidimensional data into easily manageable structures, where aggregation and manipulation of levels is done in a couple of lines. If your data is more complex than regular tables, MultiIndex is your best friend.

Master powerful data analysis skills – requirements analysis + statistics + BI and get a sought-after profession on the course “Data Analyst”.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *