From mail server to Greenplum

The idea for this project arose from a real need: I wanted better control over my spending in large grocery chains, especially in stores like VkusVill. The main task was to systematize my purchases by product category, analyze this data, and see how my expenses change over time. Of course, there are applications that can already build such graphs, and even VkusVill itself provides this option in its personal account. But my goal went deeper: I wanted to build my own system where data from different stores could be aggregated in one place. This would allow me not only to analyze spending information, but also to use it for more complex calculations and visualizations.

In addition, I always wanted this data in a convenient, customizable format that could easily be adjusted to my needs. Off-the-shelf applications often do not offer the flexibility to change product categories or drill down to the required level of detail. For this project, I therefore decided to use Python to automate data extraction and processing, and Greenplum as the database for storage and subsequent analysis.

In this article I will show how, using Python and a Greenplum database, I automated the process of extracting data from VkusVill's emails, structuring it, and loading it into the database for further processing. At the transformation stage, I look at how to pull the key information out of the emails (for example, product names, quantities, prices, and the total order value) and then convert this data into a format convenient for further analytics.

The project covers the entire data lifecycle: extracting it from an external source (email), processing and converting it into a usable format, and loading it into a powerful analytical database where complex queries can be run and the results visualized.

So let's begin.


Introduction

Tools used:

  • Python for data extraction and processing (PyCharm IDE)

  • Greenplum for data storage (DBeaver client -> Greenplum database)

  • imaplib to connect to the mail server

  • pandas for working with data

  • psycopg2 to upload data to Greenplum

Preparatory steps:

  1. Setting up a mail server to retrieve emails (Extract).

  2. Writing a parser to process data from the emails (Transform).

  3. Converting the received data into CSV format (Transform).

  4. Loading data into a temporary Greenplum database table (Load).
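
Before going through the stages, here is a minimal sketch of how they might be chained together. transform_data_vkusvill and load_data_to_temp_table mirror the functions defined below, while extract_emails is a hypothetical wrapper for the Extract steps (the article runs them inline):

# A minimal ETL driver; extract_emails is a hypothetical wrapper around
# the Extract steps shown below, the other two functions are defined
# later in the article.
def run_pipeline():
    extract_emails()               # EXTRACT: fetch the email and save its body
    transform_data_vkusvill()      # TRANSFORM: parse the HTML into a dict and CSV
    load_data_to_temp_table()      # LOAD: copy the CSV into Greenplum

if __name__ == '__main__':
    run_pipeline()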


EXTRACT

This is where data is retrieved from the mailbox. We connect to the mail server, retrieve unread messages and save them for further processing.

1. Connect to the mail server

import imaplib
import email
from email.header import decode_header
import base64
import mail_config

# Load connection credentials from the configuration file
mail_pass = mail_config.mail_pass  # Mailbox password
username = mail_config.username  # Username (login)
imap_server = "imap.mail.ru"  # Mail server address

# Connect to the server over SSL (secure connection)
imap = imaplib.IMAP4_SSL(imap_server)

# Log in to the mail server with the login and password
imap.login(username, mail_pass)

# Select the inbox folder to work with
imap.select("INBOX")

2. Retrieving unread emails

# Search for all unread emails in the mailbox
unseen_mails = imap.search(None, 'UNSEEN')

# Convert the search result to a string for further processing
unseen_mails_str = str(unseen_mails[1])

# Print the list of unread email IDs
print('Unread emails: ', parsing_list_unseen_email(unseen_mails_str))

A couple of comments on the block above:

  • I use the imap.search command with the 'UNSEEN' key to find all unread emails.

  • The result is a string with email IDs, which is then processed by the parsing_list_unseen_email function (defined in the next step).
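
For context, imap.search returns a status plus a one-element list of bytes, so the raw value looks roughly like the comment below (the IDs are invented). A more direct way to get the ID list is simply to split that bytes object:

# Illustrative shape of the imap.search result (IDs are invented):
# ('OK', [b'3120 3121 3123'])

# Equivalent, more direct extraction of the ID list:
status, data = imap.search(None, 'UNSEEN')
ids = [num.decode() for num in data[0].split()]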

3. Parsing unread emails

def parsing_list_unseen_email(unseen_emails: str):
    """
    Returns the list of IDs of unread emails.
    :param unseen_emails: String containing the IDs of unread emails.
    :return: List of email IDs.
    """
    lst_id_unseen_emails = []  # Empty list for the IDs
    digit_char = ""  # Temporary buffer for consecutive digits
    for char in unseen_emails:
        # If the current character is a digit, append it to the buffer
        if char.isdigit():
            digit_char += char
        else:
            # When the digits end, store them as an ID and reset the buffer
            if digit_char:
                lst_id_unseen_emails.append(digit_char)
                digit_char = ""
    if digit_char:  # Flush a trailing ID, in case the string ends with a digit
        lst_id_unseen_emails.append(digit_char)
    return lst_id_unseen_emails  # Return the resulting list of IDs

Comments again:

  • The function processes the string of email IDs, breaking it into individual numeric tokens, and returns a list of IDs for further work.

  • We walk through each character of the string, extracting only the digit runs that represent the email IDs.

  • This is necessary so that we know which emails have not yet been read.

  • Important: fetching an email programmatically marks it as read, so it will no longer match the UNSEEN search; a way to avoid this is sketched below.
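
If you want to inspect a message without clearing its unread status, IMAP has a "peek" variant of fetch that does not set the \Seen flag. A minimal sketch (3123 is an example ID):

# Fetch the full message without marking it as read: BODY.PEEK[] does
# not set the \Seen flag, so the email still matches UNSEEN searches.
res, msg = imap.fetch(b'3123', '(BODY.PEEK[])')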

4. Extract a specific letter

# Fetch one of the emails by its ID
res, msg = imap.fetch(b'3123', '(RFC822)')  # Here 3123 is the email ID
msg = email.message_from_bytes(msg[0][1])  # Convert the bytes into a message object

# Read and print the email header (for example, the subject)
print('\nEmail subject:\n', decode_header(msg["Subject"])[0][0].decode())

  • imap.fetch retrieves the contents of an email by its ID, so you can explicitly specify which email to read.

  • We use the email library to decode the bytes into a readable message and then extract the header (such as the subject line). We need the header to tell us that the email came from VkusVill.
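
Note that the one-liner above only handles a subject consisting of a single encoded fragment: decode_header can return several fragments, some of them already decoded into str. A slightly more defensive variant (the helper name is mine):

from email.header import decode_header

def decode_subject(msg):
    """Join all fragments of the Subject header, decoding bytes fragments."""
    parts = []
    for fragment, charset in decode_header(msg["Subject"]):
        if isinstance(fragment, bytes):
            # Decode bytes fragments with their declared charset, if any
            parts.append(fragment.decode(charset or 'utf-8', errors='replace'))
        else:
            parts.append(fragment)
    return ''.join(parts)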

5. Extracting the body of the message

def extract_multipart(msg):
    """
    Extracts the email contents, handling nested message parts.
    :param msg: Message object.
    """
    with open('D:/Mail_read_files/email_body.txt', 'w', encoding='utf-8') as f:
        # If the email is multipart (contains nested parts)
        if msg.is_multipart():
            for part in msg.walk():  # Walk through every part of the email
                # If this part of the email is text in HTML format
                if part.get_content_type() == 'text/html':
                    # Decode the content and write it to the file
                    f.write(base64.b64decode(part.get_payload()).decode())
        else:
            # If the email has no nested parts, just write its contents
            f.write(base64.b64decode(msg.get_payload()).decode())
    print('[INFO] File email_body.txt created')

  • is_multipart() checks whether the message contains nested parts. This matters for processing emails with multiple parts. It was a genuinely tricky moment for me, since such emails are like a nesting doll (you have to descend several levels into the structure to reach the body).

  • If the email contains HTML, the content is saved to a file for further analysis.

  • We use base64.b64decode, since the contents of emails are usually base64-encoded rather than stored as plain text.
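
As an alternative to calling base64.b64decode by hand, the email library can undo the Content-Transfer-Encoding itself: part.get_payload(decode=True) returns the decoded bytes whether the part is base64 or quoted-printable. A sketch of how the relevant lines of extract_multipart could look:

# Let the email library apply the Content-Transfer-Encoding instead of
# decoding base64 manually (also handles quoted-printable parts).
if part.get_content_type() == 'text/html':
    f.write(part.get_payload(decode=True).decode('utf-8', errors='replace'))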


TRANSFORM

After extracting the contents of the email, we move on to parsing it. In this example project, I work with emails from the VkusVill store that contain purchase information.

1. Parsing HTML content

import pandas as pd

# Initialize a dictionary of empty lists to collect the data
st = {'product_name': [], 'count': [], 'price': [], 'total_price': [], 'order_date': [], 'shop_name': []}

def transform_data_vkusvill():
    """
    Parses an email from the VkusVill store to extract purchase data.
    """
    with open(r'D:/Mail_read_files/email_body.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()  # Read all lines of the file
        for i, line in enumerate(lines):
            # Look for the line with the store name (the marker matches the
            # Russian text in the email)
            if 'АО "Вкусвилл"' in line:
                shop_name = 'АО "Вкусвилл"'  # Store the shop name
                # The order date is 14 lines below the store name
                order_date = lines[i + 14].split('<')[0].strip()
            # Look for a line with product information
            if 'width="40%"' in line:
                st['order_date'].append(order_date)  # Store the order date
                st['shop_name'].append(shop_name)  # Store the shop name

                # Parse the product name
                product_line = lines[i + 2]
                if ',кг' in product_line or ',шт' in product_line:
                    product_name = product_line.split(',')[0].strip()  # Product name
                    st['product_name'].append(product_name)

                # Parse the price, quantity and total amount
                st['price'].append(float(lines[i + 4].replace(',', '.')))
                st['count'].append(float(lines[i + 6].replace(',', '.')))
                total_price = float(lines[i + 10].split('<')[0].replace(',', '.'))
                st['total_price'].append(total_price)

transform_data_vkusvill()  # Call the parsing function

A few words about the block:

  • I extract the store name, order date, product name, price, and quantity using specific markers in the HTML (for example, 'АО "Вкусвилл"' and 'width="40%"'). I think these fields will be enough for further analysis and visualization.

  • This data is accumulated in the dictionary st for further processing (see the sanity check sketched below).
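
Since the upcoming pd.DataFrame(st) call requires every list in the dictionary to have the same length, and product_name is only appended when the ',кг' / ',шт' marker is found, a quick check before conversion can save debugging time. A minimal sketch:

# Sanity check: every column collected by the parser must have the same
# length, otherwise pd.DataFrame(st) will raise a ValueError.
lengths = {key: len(values) for key, values in st.items()}
assert len(set(lengths.values())) == 1, f'Column lengths differ: {lengths}'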

2. Convert to DataFrame and save to CSV

# Convert the data to a DataFrame
df = pd.DataFrame(st)

# Save the DataFrame to CSV for subsequent loading
df.to_csv('D:/Mail_read_files/email_body.csv', sep=';', encoding='utf-8', index=False)

LOAD

At the final stage, I load the data from the CSV file into a temporary table in the Greenplum database.

1. Loading data into the database

import psycopg2
import mail_config

def load_data_to_temp_table():
    """
    Loads data from the CSV file into a temporary Greenplum table.
    """
    answer = input('Really load a new batch of data? Type "yes" or "no": ')

    # Check that the user confirmed the action
    if answer.lower() == 'yes':
        # Connect to the Greenplum database
        with psycopg2.connect(
            database=mail_config.db_name,
            user=mail_config.user,
            password=mail_config.password,
            host=mail_config.host,
            port=mail_config.port
        ) as conn:
            # Open a cursor to execute SQL queries
            with conn.cursor() as cur, open('D:/Mail_read_files/email_body.csv', 'r', encoding='utf-8') as file:
                next(file)  # Skip the CSV header row, otherwise copy_from loads it as data
                # Load the data from the CSV into the database table
                cur.copy_from(file, 'email_body_temp', sep=';')
                conn.commit()  # Commit the transaction
                print("Data loaded successfully")
    else:
        print("Load cancelled")

load_data_to_temp_table()  # Call the load function
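
The script assumes that the email_body_temp table already exists; its definition is not shown here, so below is a hypothetical DDL that matches the CSV columns from the TRANSFORM step (the column types are my assumption):

import psycopg2
import mail_config

# Hypothetical staging-table definition; column order follows the CSV
# produced in the TRANSFORM step, the types are an assumption.
DDL = """
CREATE TABLE IF NOT EXISTS email_body_temp (
    product_name TEXT,
    count        NUMERIC,
    price        NUMERIC,
    total_price  NUMERIC,
    order_date   TEXT,
    shop_name    TEXT
);
"""

with psycopg2.connect(
    database=mail_config.db_name,
    user=mail_config.user,
    password=mail_config.password,
    host=mail_config.host,
    port=mail_config.port
) as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        conn.commit()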

Conclusion

This was my first pet project, and it immediately required several of the key tools a Data Engineer works with. I understand that this article may seem complicated, but don't be intimidated. Try breaking the code down line by line and analyzing each step; it's a great way to understand how everything works, and after this approach many points usually become much clearer.

I have tried to annotate the code in detail to help you better understand each step of the process. I hope this helps you in mastering similar tasks.

This project is just the first step: the basic stage of extracting data from mail. In the next article, I'll show how to automate the process on a schedule, and how to connect BI tools to build visualizations so you can see your data in a convenient and attractive form.

I run a blog on Telegram about data engineering, where I write about the field, real work cases, and a bit of lifestyle. It will mostly be of interest to newcomers, but I'm happy to keep up a conversation with middle+ engineers too. I'll be glad if you find something useful in my channel.

There is also a link to the repository where you can browse my materials.
