what is it and how to calculate it?


In this post, we will talk about the Churn Rate metric, we will have to deal with such questions as:

  • What is Churn Rate?

  • Why is this metric needed?

  • How to calculate it using SQL?

  • How to calculate it using Python?

Our task is to give answers to the questions as quickly and clearly as possible, why the metric is needed and how to calculate it correctly using various technical tools. The article provides code for generating data so that everyone can try to calculate the metric on their own.

What is Churn Rate

Metrics Churn Rate allows you to answer the question, what proportion of users stopped using the service, in other words, fell off. A dropped user is one who, from the day of his previous activity, has not logged into the service for some fixed period of time (10 days, 2 weeks, 1 month, etc.). Why is a time window used? The answer is simple – we cannot wait all our lives to say for sure whether this or that user will return or not. For this reason, we use the absence metric within a certain period, and:

  • The larger the time window, the more accurate the metric.

  • The smaller the time window, the sooner we will know the result.

Example: On December 1, 100 users entered our application, we are interested in what proportion of them won’t come in service again within 28 days. Let’s say there were 17 such users → Churn Rate = 17%. (we can calculate the metric after 28 days from the date of interest to us).

Calculation formula

Churn Rate = [кол-во отвалившихся пользователей] / [ кол-во пользователей, которые были активны в выбранный день]

Going back to the previous example, we have:

  • Day X: December 1

  • Time window: 28 days

  • Date X + time window: current date + 28 days (until December 29)

  • Users who were active on the selected day: 100 people who logged into the application on December 1

  • Fallen Users: 17 people who logged into the app on December 1st but didn’t log in again for 28 days, i.e. until December 29

It is important to understand that Churn Rate in most cases will not be equal 1 – Retention Rate (percentage of users who were active on date X and returned to the service after a certain period of time).

Why Churn Rate1 – Retention Rate ? Because within the period, users fall off more than one fell swoop. Here is the graph Retention Rate for different periods on date X:

We see that some of the users were active during the Churn time window → if a user was inactive on day 28, this does not mean that he was also absent throughout the previous period, so Churn Rate1 – Retention Rate (but if Retention Rate would be cumulative, then the equality holds).

Metric calculation in practice

Technical tools:

To calculate the metrics, we will use synthetic data. We generate them like this:

1. Import libraries

import names
import random
import datetime as dt
import pandas   as pd

2. Generate an array of values ​​for the future dataframe (may take several minutes)

# здесь будут храниться все имена пользователей, которые мы будем использовать
# таких имен у нас будет не больше 10000 (они не обязательно окажутся уникальными)
names_list = []
for i in range(10000):
    names_list.append(names.get_full_name())
    
# значения, которые будут подставляться случайным образом
platforms = ['ios', 'android']
columns   = ['event_date', 'platform', 'name', 'event_type']
events    = ['purchase', 'productPreview', 'addToCart', 'productClick', 'search', 'feedbackSent']

# формируем соотвествие имени и платформы пользователя
# (для простоты предполагаем, 
# что каждый пользователь использует только одну из двух возможных платформ)
names_platform_dict = {}
for name in names_list:
    names_platform_dict[name] = random.choice(platforms)
    
# формируем массивы юзеров по степени уникальности
regular_users     = names_list[:2000]
not_regular_users = names_list[2000:]

# здесь будут храниться строки для будущего датафрейма
df_values = []

# основной цикл
# проходимся по всем датам с 1 декабря 2021 по 29 мая 2023
for date_index in range(180):
    date = dt.date(2022, 12, 1) + dt.timedelta(days=date_index)
    
    # сколько строк будут соответствовать одному дню
    rows_number  = random.randrange(2000, 3000)
    
    for name_index in range(rows_number):
        # здесь мы задаем недельную сезонность
        # в будни доля регулярных юзеров будет выше
        if (date_index + 2) % 7 == 0 or (date_index + 3) % 7 == 0:
            list_of_names = random.choices([regular_users, not_regular_users], weights = [4, 1])[0]
        else:
            list_of_names = random.choices([regular_users, not_regular_users], weights = [7, 1])[0]
        
        name       = random.choice(list_of_names)
        event_type = random.choice(events)
        platform   = names_platform_dict[name]
        
        # добавляем строку в массив всех строк для будущего датафрейма
        df_values.append([date, platform, name, event_type])

3. Received dataframe

# создаем датафрейм
events            = pd.DataFrame(data=df_values, columns=columns)
events.event_date = pd.to_datetime(events.event_date)
events

In our case, the original dataframe looks like this:

You can also find basic information

4. Data upload

# В исходной директории, где лежит файл с кодом, можно будет найти csv-файл
events.to_csv('./example_events.csv')

Context

We work for a company that sells goods through an application on a smartphone. The application exists on two platforms: Android and iOS. The company considers the churn metric with a window of 28 days.

How to Calculate Churn Rate Using SQL

We uploaded the data to Google BigQuery (the environment can be any).

Let’s move on to calculating the metrics.

  • Since it is customary to calculate the metric with a window of 28 days (in the described case), it is incorrect to calculate the metric for those dates from the moment of which the 28th day has not come. For this reason, we will cut these lines at the last stage of the request.

  • Since a user can have multiple events on the same day, it’s a good idea to leave only the unique matches for platform, user, and date beforehand.

  • For each user on the date in question, we find the date of his next event, then calculate the number of days before this date. The resulting number of days will be compared with the time window of the metric. If a user has been absent for ≥ 28 days from the date in question, they are churned.

The result is the following code:

-- находим уникальные соответствия имени, платформы и даты активности

-- нас волнует, был ли пользователь активен в тот или иной день
-- нет разницы, что именно он делал
WITH names_days AS (
    SELECT DISTINCT
        platform,
        event_date,
        name,
        -- оконной функцией находим дату следующей активности пользователя
        LEAD(
            event_date
        ) OVER(PARTITION BY name ORDER BY event_date) AS next_event_date
    FROM example_events
),

main AS (
    SELECT
        *,
        -- находим, через сколько дней случилось следующее событие пользователя
        DATE_DIFF(next_event_date, event_date, DAY) AS days_till_next_event
    FROM names_days
)

SELECT
    event_date,
    -- находим долю пользователей:
      -- которые отсутствовали не меньше 28 дней с этого дня от числа всех пользователей
      -- и которые были активны в этот день

    -- также сюда можно добавить срез по платформе

    -- `distinct` ниже в данном случае не необходим,  
    -- но при ином формате исходных данных он может понадобиться
    COUNT(DISTINCT
        IF(
            days_till_next_event < 28,
            NULL,
            name
        )
    ) / COUNT(DISTINCT name) AS churn_rate
FROM main
WHERE 1 = 1
    -- срезаем строки, где не прошло 28 дней с текущей даты 
      AND event_date <= (SELECT DATE_SUB(MAX(event_date), INTERVAL 28 DAY) FROM example_events)
GROUP BY 1
ORDER BY 1

If there are no window functions in the version of SQL you are using, you can:

  • Return to sorted by name, event_date source table through a subquery and apply the function LEAD / LAG without a window (providing in SELECT name matching condition. EX: IF(LAG(name) = name, LAG(event_date), NULL)

  • Number lines in sorted by name, event_date original table and join it to itself through table.index = table.index + 1 AND table.name = table.name

At the output, we get a table in which the Churn Rate indicator is calculated for each date with a window of 28 days:

The result of the received query is not very convenient to analyze, looking at the result of the SQL query. Therefore, we visualize the result using a BI tool:

We may also be interested in how the metric behaves depending on the chosen platform. In order to display such a graph, you need to add an additional grouping field to the original SQL query platform (also don’t forget to include it in SELECT)

In both cases, seasonality is observed: on weekends, the outflow of users is higher.

How to calculate metric using Python

Importing libraries.

import numpy             as np
import datetime          as dt
import pandas            as pd
import seaborn           as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

Load data from csv file.

events = pd.read_csv('./example_events.csv', parse_dates = ['event_date'])

Let’s perform similar transformations using Python. To begin with, we will leave only unique combinations. name, platform, event_date:

events_grouped = events.groupby(['event_date', 'platform', 'name'])\
                       .agg({'event_type' : 'count'})\
                       .reset_index()

events_grouped.columns = ['event_date', 'platform', 'name', 'events_number']

Now for each user we will find the date of his next event. We also find how many days later his next event took place:

events_grouped['next_event_date']      = events_grouped.groupby(['name', 'platform'])['event_date'].shift(-1)
events_grouped['days_till_next_event'] = (events_grouped['next_event_date'] - events_grouped['event_date']) / np.timedelta64(1, 'D')

For simplicity, let’s add a marker whether we consider the user to be “blackened” on the date in question

events_grouped['is_churned_int'] = events_grouped.days_till_next_event.apply(lambda x: 0 if x < 28 else 1)

After all the transformations, the data looks like this:

We have everything to calculate the Churn Rate metric. Let’s find the maximum date from which 28 days have passed, cut off unnecessary lines. After that, for each date, we calculate the number of unique users (DAU) and the number of those who fell into the outflow. Then we calculate the share of those who fell into the outflow of the total.

# ищем максимальную дату, для которой подсчет метрики можно считать корректным
max_relevant_date = events_grouped.event_date.max() - dt.timedelta(days=28)

churn_rate = events_grouped.query('event_date <= @max_relevant_date')\
                           .groupby(['event_date'])\
                           .agg({'is_churned_int' : 'sum', 'name' : 'nunique'})

churn_rate.columns = ['churned', 'dau']

# расчитываем метрику
churn_rate['churn_rate']      = churn_rate['churned'] / churn_rate['dau']
# переводим ее в проценты
churn_rate['churn_rate_perc'] = churn_rate['churn_rate'] * 100

We get the following table with the Churn Rate metric:

We visualize the results obtained using Python:

palette = sns.color_palette(['#ff6063', '#32b6aa', '#ffcc53'])

sns.set_palette(palette)

plt.figure(figsize=(20,10))
plt.grid()

ax = sns.lineplot(data=churn_rate, y='churn_rate_perc', x='event_date')

ax.set_title ('Churn Rate', fontsize=15)
ax.set_xlabel('Дата', fontsize=15)
ax.set_ylabel('Процент оттока', fontsize=15)

ax.set(ylim=(0, 50))
ax.yaxis.set_major_formatter(mtick.PercentFormatter())

plt.show()

If we want to calculate the metric separately for the platform, then the field platform must be added to the grouping fields, as well as placed in the legend when plotting graphs

churn_rate_platform = events_grouped.query('event_date <= @max_relevant_date')\
                                    .groupby(['event_date', 'platform'])\
                                    .agg({'is_churned_int' : 'sum', 'name' : 'nunique'})

churn_rate_platform.columns = ['churned_names', 'dau']

churn_rate_platform['churn_rate']      = churn_rate_platform['churned_names'] / churn_rate_platform['dau']
churn_rate_platform['churn_rate_perc'] = churn_rate_platform['churn_rate'] * 100
plt.figure(figsize=(20,10))
plt.grid()

ax = sns.lineplot(data=churn_rate_platform, y='churn_rate_perc', x='event_date', hue="platform")
ax.set(ylim=(0, 50))
ax.yaxis.set_major_formatter(mtick.PercentFormatter())

ax.set_title ('Churn Rate', fontsize=15)
ax.set_xlabel('Дата', fontsize=15)
ax.set_ylabel('Процент оттока', fontsize=15)

plt.show()

Post Scriptum

The analyzed method of counting is classical. In some cases, it is more expedient to calculate the metric in a different way, it all depends on the business context and the task being solved.

For example, we are often not interested in looking at an indicator built on all events. If we are talking about search engines, which are often set in the browser as the “default page”, then taking into account something from the category “User opened the page” when calculating Churn does not seem to be very useful, because this can hardly be considered as user activity.

If we are talking about a web platform, it is a well-known fact that among new users who visit a resource, an impressive proportion leaves it almost instantly. It is better to monitor the share of such users separately (Bounce Rate), and also build the churn metric without them.

Possible questions

1. There are dates for which we still do not know whether the user logged in in the subsequent period or not (we are talking about the metric calculation window). What happens if you do not cut such lines?

In this case, after the first date for which there is no date in the analytics that exceeds it by the time window for calculating the metric, the metric will float up. We will get such a result because we will not know about the subsequent activities of the user → in the calculation it will be considered that he was in the outflow. The graph in this case would look something like this:

2. But the metric does not tell us who the service has lost forever!

Yes, it is, but what is “actual churn” anyway? We can never know for sure if a user will log in. to our service again (even if he was absent from it for several years). The metric is needed in order to track the trend in dynamics, it will not tell us how many users we have lost forever.

3. An anomalous change in the metric is observed, what are the causes?

There can be infinitely many reasons for changes in the Churn Rate metrics, and for each sphere they will be different. Let’s try to cover the main ones:

  1. Advertising

  • redistributed budgets;

  • advertising campaigns have been changed;

  • new channels added.

  1. Product changes

  • prices have been changed;

  • changed the type of product (application, website, widget, etc.);

  • functions have been changed (the application now has the ability to make calls or now it is impossible to pay for goods with some kind of electronic wallets).

  1. Other

It is more important to be able to answer the questions:

  1. What audience is included in the indicator?

  2. What could have caused the negative experience?

  3. What can disturb the audience now?

4. How to choose a window for calculating the metric?

The size of the window used depends on the business area and our expectations. We need to understand that using a larger window for counting, we will wait longer, but we will get more accurate results.

Thank you for your attention!

Thank you for reading to the end! We hope the material will be useful.

If you have any questions, we will be happy to answer them in the comments. Also, if you think it would be cool to consider other topics in this way, write about them. We will try to touch on them in one of the following posts.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *