Notion planner: analysis and visualization

Faced with a barrage of tasks of varying importance, three years ago I decided to start writing them down in a planner: a to-do list that would be convenient to manage from both a phone and a laptop. The choice fell on Notion as a popular cross-platform tool. Over the time I have used the planner, many different tasks have passed through it, and it became interesting to analyze how and where the time was spent.

1. How the Notion planner is organized

The Notion planner is one big Database page with a By Status view, filled with tasks implemented as pages. Status is the key property that describes the state of each task.

For convenience, my statuses carry different meanings. Of the six main ones, two are always visible:

  • Top Priority: a short task requiring immediate attention

  • Goal: a long-term task/goal that requires a consistent approach but leads to a substantial result

and the other four are hidden:

In order not to multiply unnecessary entities, the hidden statuses cover secondary tasks, personal affairs, possible project ideas and plans for shared leisure with my girlfriend, right there in the planner. A column for recurring tasks was also added recently: these need to keep reminding me about themselves without cluttering the Top Priority flow.

General view of the planner. As you can guess from the name, the project started as a GTD implementation.

At first I also filled in properties such as Date and Tag to filter tasks by their profile, but over time the need for the latter almost disappeared, so Tag is now purely decorative, letting tasks be visually separated from one another. Date, on the other hand, proves especially convenient when many tasks with close deadlines have to be solved in the shortest possible time.

The most convenient way to use this is to add a Timeline view to the database, where each task is represented by a rectangle spanning its time interval. From the way tasks overlap, you can build an optimal route through them, reducing possible costs and maintaining a fragile balance (in my case, between study, work and science).

Some tasks are censored due to business ethics

2. Getting the data

There are different ways to export data from Notion. The first, and probably the easiest, is to download it directly from the website: this gives you a .csv file with all the main properties. The second way is to use the official Notion API. I prefer the second method because, in addition to the basic properties of the pages in the planner, it lets you pull two equally important fields: created_time and last_edited_time.

Python + requests were used for querying the API; pandas was responsible for exporting to .csv.

To interact with the official Notion API you need to create a special token. How to do this is described in detail in the spoiler:

Generate a Notion API Secret Token

To obtain a unique key for the API, you need to create an integration on the Notion integrations page

Here you need to Create new integration

Having created the integration and received the code:

It needs to be connected to the page with the planner. Before doing so, I recommend setting a recognizable icon for your integration so as not to confuse it with someone else's when adding it. To do this, on the planner page click the three dots, then Connect to, and find your integration in the drop-down list.

This is what the integration looks like after adding

Having received the token, you can fill in the headers for your requests:

headers = {
    "Authorization": "Bearer " + "YOUR TOKEN",
    "Content-Type": "application/json",
    "Notion-Version": "2022-02-22"
}

It is worth noting that Notion-Version must be specified exactly: if you get it wrong, the API returns an error listing the possible correct versions.

Querying is done with requests.request('POST', url, json=payload, headers=headers), where url is the address of the database and payload specifies which page to start reading from. By default, Notion returns at most 100 pages per request, which is handled by a simple loop.

Full function of the db parser
def readEntireDatabase(databaseID, headers):
    url = f"https://api.notion.com/v1/databases/{databaseID}/query"
    page_size = 100  # Notion returns at most 100 pages per request
    payload = {"page_size": page_size}

    response = requests.request('POST', url, json=payload, headers=headers)
    data = response.json()
    results = data["results"]

    # keep requesting the next batch until the cursor is exhausted
    while data["has_more"]:
        payload = {"page_size": page_size, "start_cursor": data["next_cursor"]}
        response = requests.request("POST", url, json=payload, headers=headers)
        data = response.json()
        results.extend(data["results"])
    return results
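A minimal usage sketch (the database ID below is a placeholder; for a real database it is the 32-character identifier from the database URL):

# hypothetical database ID; substitute the ID of your own planner database
databaseID = "YOUR DATABASE ID"
pages = readEntireDatabase(databaseID, headers)
print(len(pages))  # total number of pages fetched across all batches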

The result is a list of page JSON objects, each of which has the following keys:

['object', 'id', 'created_time', 'last_edited_time', 'created_by', 'last_edited_by', 'cover', 'icon', 'parent', 'archived', 'in_trash', 'properties', 'url', 'public_url']

Of these, the ones of particular interest to me are created_time, last_edited_time and in_trash, as well as the main dictionary stored under the properties key. It contains all the basic properties of the pages, including the name and status.

The main properties are extracted from the JSON as follows (the try-except handles the case when the corresponding property was not filled in Notion):

# inside a loop over the pages returned by readEntireDatabase;
# names, statuses, tags and dates are pre-initialized empty lists
if not page['in_trash']:
    try:
        names.append(page['properties']['Name']['title'][0]['plain_text'])
    except (KeyError, IndexError, TypeError):
        names.append('')

    try:
        statuses.append(page['properties']['Status']['select']['name'])
    except (KeyError, IndexError, TypeError):
        statuses.append('')

    try:
        tags.append(page['properties']['Tags']['multi_select'])
    except (KeyError, IndexError, TypeError):
        tags.append('')

    try:
        dates.append(page['properties']['Date']['date'])
    except (KeyError, IndexError, TypeError):
        dates.append('')

Having assembled the collected lists into a pd.DataFrame and saved it to .csv, we can move on to the analysis.
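A minimal sketch of this step, assuming the creation and edit times were collected in the same loop (e.g. via page['created_time'] and page['last_edited_time']); the NAME, STATUS, CREATION_TIME and EDITION_TIME column names match the ones used later in the analysis, while TAGS, DATE and the output filename are illustrative:

import pandas as pd

data = pd.DataFrame({
    'NAME': names,
    'STATUS': statuses,
    'TAGS': tags,
    'DATE': dates,
    'CREATION_TIME': creation_times,   # from page['created_time']
    'EDITION_TIME': edition_times,     # from page['last_edited_time']
})
data.to_csv('notion_tasks.csv', index=False)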

3. Data analysis and visualization

To analyze the obtained data we will use datetime, nltk and pymorphy2; for visualization and plotting, matplotlib and wordcloud.

3.1 Time analysis

First, let's look at the dates obtained from created_time and last_edited_time. They come in an awkward format: 2024-07-06T18:18:00.000Z. We will trim the milliseconds and the trailing Z from the right end and replace the T with a hyphen for consistency:

def format_datetime(string : str):
    # '2024-07-06T18:18:00.000Z' -> '2024-07-06-18:18:00'
    return string[:-5].replace('T', '-')

This date format can already be read using datetime.

datetime.strptime(date_str, '%Y-%m-%d-%H:%M:%S')

Having rewritten the pd.DataFrame columns CREATION_TIME and EDITION_TIME in this readable format and obtained their datetime representation, we can play with the data a little and look at what the distribution of solved/failed tasks looks like on the scale of days, months and years.
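A sketch of that conversion, assuming the columns are named as in the analysis code below:

from datetime import datetime

data['CREATION_TIME'] = data['CREATION_TIME'].apply(format_datetime)
data['EDITION_TIME'] = data['EDITION_TIME'].apply(format_datetime)

# datetime representation of a single value, e.g. for the first task
first_created = datetime.strptime(data['CREATION_TIME'].iloc[0], '%Y-%m-%d-%H:%M:%S')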

Plotting the distribution of tasks by month, one can see a natural dip in the summer and a fairly expected rise in April and November (these two peaks coincide with the maximum academic workload at the institute). The average time to completion and to failure (a status change to Unaccomplished) turned out to be 13 and 57 days respectively.
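A sketch of how these averages might be computed, assuming the last edit time marks the status change:

def avg_lifetime_days(df, status):
    # days between task creation and its last edit (taken as the status change)
    subset = df[df['STATUS'] == status]
    created = pd.to_datetime(subset['CREATION_TIME'], format='%Y-%m-%d-%H:%M:%S')
    edited = pd.to_datetime(subset['EDITION_TIME'], format='%Y-%m-%d-%H:%M:%S')
    return (edited - created).dt.days.mean()

avg_completed = avg_lifetime_days(data, 'Succesfully accomplished')  # ~13 days
avg_failed = avg_lifetime_days(data, 'Unaccomplished')               # ~57 days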

Distribution of the number of tasks created and completed (status Succesfully accomplished) by month of the year

Plotting code
creation_months = np.zeros(12)
completion_months = np.zeros(12)

for date in data[data['STATUS'] == 'Succesfully accomplished'].CREATION_TIME:
    creation_months[datetime.strptime(date, '%Y-%m-%d-%H:%M:%S').month - 1] += 1

for date in data[data['STATUS'] == 'Succesfully accomplished'].EDITION_TIME:
    completion_months[datetime.strptime(date, '%Y-%m-%d-%H:%M:%S').month - 1] += 1

plt.figure(figsize=(7, 5))
plt.style.use('dark_background')
plt.bar(x=np.arange(1, 13), height=creation_months, color="white", label="created")
plt.bar(x=np.arange(1, 13), height=completion_months, color="lightgray", label="completed", alpha=0.5)
plt.xlabel('Month', fontsize=24)
plt.xticks(np.arange(1, 13)[::2], fontsize=14)
plt.ylabel('Number of tasks', fontsize=24)
plt.yticks(np.arange(0, 151, 30), fontsize=14)
plt.legend(fontsize=14)
plt.tight_layout()
plt.savefig('months_tasks.jpg')

On a four-year scale, there is a clear growth trend in the number of completed tasks (and projects) from year to year. At the beginning of 2024, 2.9 times more tasks and 3.33 times more projects were completed than in the corresponding period of 2023.

Number of completed tasks per year

Plotting code
# data_years_i is obtained via value_counts().to_dict()
fig, ax = plt.subplots(1, 2, figsize=(20, 7))
plt.rcParams.update({'font.size': 20})

ax[0].barh(width=list(data_years_1.values()), y=list(data_years_1.keys()), color="white")
ax[0].set_yticks([key for key in sorted(data_years_1.keys())])
ax[0].set_xticks(range(0, 451, 90))
ax[0].set_ylabel('Year', fontsize=24)
ax[0].set_xlabel('Number of tasks', fontsize=24)

ax[1].barh(width=list(data_years_2.values()), y=list(data_years_2.keys()), color="white")
ax[1].set_yticks([key for key in sorted(data_years_2.keys())])
ax[1].set_xticks(range(0, 11, 2))
ax[1].set_ylabel('Year', fontsize=24)
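data_years_1 and data_years_2 are year-to-count dictionaries; a sketch of how the first one (tasks per year) might be built, with the projects dictionary assumed to come from an analogous subset:

completed = data[data['STATUS'] == 'Succesfully accomplished']
years = pd.to_datetime(completed['EDITION_TIME'], format='%Y-%m-%d-%H:%M:%S').dt.year
data_years_1 = years.value_counts().to_dict()  # {year: number of completed tasks}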

The most pleasing to the eye is, of course, the activity map for the entire year, plotted on a daily scale (inspired by GitHub).

Plotting code
data_for_map = sc_data[sc_data['year'] == 2024]

activity = np.zeros((7, 53))

for day in np.arange(7):
    for week in np.arange(53):
        # number of tasks completed on this weekday of this week
        activity[day][week] += len(data_for_map[(data_for_map['week'] == week) & (data_for_map['weekday'] == day + 1)])

fig, ax = plt.subplots(figsize=(14, 11))
ax.set_aspect("equal")

plt.style.use('dark_background')
orig_map = plt.get_cmap('Greys')  # reversed below so that the maximum maps to white
reversed_map = orig_map.reversed()

plt.title('Activity map 2024', fontsize=24)
plt.pcolormesh(activity, cmap=reversed_map, edgecolor="w")
plt.xticks(np.arange(0, 53, 5), fontsize=16)
plt.yticks(np.arange(0, 7, 2), fontsize=16)
plt.ylabel('Day number', fontsize=20)
plt.xlabel('Week number', fontsize=20)
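The year, week and weekday columns of sc_data (the completed-tasks subset defined in the text-analysis section below) are not shown earlier; a minimal sketch of how they might be derived from the completion dates. Note that ISO weeks run 1..53, so the indexing should be matched to the 0..52 loop above:

edited = pd.to_datetime(sc_data['EDITION_TIME'], format='%Y-%m-%d-%H:%M:%S')
sc_data = sc_data.assign(year=edited.dt.year,
                         weekday=edited.dt.isocalendar().day,   # 1 = Monday .. 7 = Sunday
                         week=edited.dt.isocalendar().week)     # ISO week number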

3.2 Text analysis

Let's conduct a first-pass analysis of the task texts and calculate: 1) the average number of words per task, 2) the average number of characters per task, and 3) the average word length.

sc_data = data[data['STATUS'] == 'Succesfully accomplished']

average_number_words = sc_data.NAME.str.split().str.len().mean()
average_number_symbols = sc_data.NAME.str.replace(' ', '').str.len().mean()  # spaces are not counted

def avg_word_length(sentence : str):
    words = sentence.split()
    # guard against empty task names
    return sum(len(str(word)) for word in words) / len(words) if words else 0

average_word_length = sc_data.NAME.apply(lambda x: avg_word_length(str(x))).mean()

Solved tasks contain 4.2 words and 24 characters on average; the average word length is 6.7 letters.

To analyze the task texts, we will carry out the following preprocessing:

  1. Replace all punctuation marks with spaces

  2. Convert all words to lower case

  3. Split each task name into a list of words and combine all the lists into one

  4. Remove all words shorter than 4 characters

Preprocessing code. Part 1
def clear_punctuation(string : str):
    for punct in ['.', ':', ',', '!', '?', ';', '\\', '/', '-', ')', '(']:
        string = string.replace(punct, ' ')
    return string

def filter_short_words(text : list):
    # drop words shorter than 4 characters
    return [word for word in text if len(word) > 3]

text = sc_data.NAME.astype(str)
text = text.apply(lambda x: clear_punctuation(x))
text = text.str.lower()
text = " ".join(text.to_list()).split()
text = filter_short_words(text)

The resulting list of words can be placed into a pd.Series, and value_counts() then gives a frequency dictionary. The same word list can also be nicely rendered as a word cloud (shown after the sketch below).
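A sketch of the frequency-dictionary step:

import pandas as pd

# word -> frequency dictionary built from the preprocessed word list
word_freq = pd.Series(text).value_counts().to_dict()
print(list(word_freq.items())[:5])  # the five most frequent words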

Word cloud in the shape of a coffee cup, for getting things done

Wordcloud code
import random
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    # not shown in the original: a simple random-grey palette to match the dark theme
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

mask = np.array(Image.open('coffee.jpg'))
wordcloud = WordCloud(mask=mask).generate(' '.join(text))

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
           interpolation="bilinear")
plt.axis("off")
plt.savefig('completed.png')

Both the picture and the frequency dictionary show that the most popular words are verbs calling for immediate action. Instead of deleting the most common words by hand, we can determine the part of speech of each word (using pymorphy2) and remove everything unneeded.

So, let's add two more stages to the preprocessing:

  1. Removing stop words

  2. Reduction to the initial form and removal of verbs

Preprocessing code. Part 2
from nltk.corpus import stopwords
import nltk
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# Create a set of stop words (may require nltk.download('stopwords') on first run)
stop_words = set(stopwords.words('russian'))

def remove_stop_words(sentence : list, filter_array=stop_words):
    filtered_words = [word for word in sentence if word not in filter_array]
    return filtered_words

def filter_verbs(text: list):
    # drop infinitives and finite verb forms
    filtered = [word for word in text if morph.parse(word)[0].tag.POS not in ('INFN', 'VERB')]
    return filtered
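The reduction to the initial form mentioned in step 2 is not shown above; a minimal sketch of that lemmatization step with pymorphy2, combined with the two filters, might look like this:

def normalize(text: list):
    # reduce each word to its dictionary (initial) form via pymorphy2
    return [morph.parse(word)[0].normal_form for word in text]

text = remove_stop_words(text)
text = normalize(text)
text = filter_verbs(text)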

With this preprocessing in place, we get a nice linguistic cloud of solved tasks:

Word cloud in the shape of a brain, made of nouns

The most popular words: week, ticket, lecture, article.

4. Plans for the future

Continuing the themes raised in this mini-project, we can identify two main possible directions for the development of the idea.

The first is automating data collection and presentation, possibly as a Streamlit app. This would be a nice addition to the Notion experience, especially if we can figure out how to add elements of this analysis as widgets directly to the database page.

The second is using language models for semantic classification of completed tasks into categories, for example science (my subcategories would be physics and ML), art (cinema and literature), health, and so on. A lightly fine-tuned BERT could handle this, but I didn't find any datasets of to-do-style tasks, and of my 1500 tasks only 60 were labeled (and even then, ~20 are Reading and ~40 Styling), so let's leave it for the future.
