Reading mail.ru mail from Python using IMAP

We take a detailed look at the imaplib and email libraries: we open a mailbox and read letters (extracting everything they contain), using mail.ru as an example, although in general the same approach should work with other providers.

Work tasks forced me to turn to a classic, e-mail. There is quite a lot of material on the net, but I could not find a sufficiently detailed walkthrough, so I am sharing the results of my research. I hope it will be useful to those who have not yet faced this task.

“Robot reads emails”, generated by Kandinsky Sber AI, Sber Devices

Getting Started:

We need libraries:

import imaplib
import email
from email.header import decode_header
import base64
from bs4 import BeautifulSoup
import re

Before starting, you need to create an app password for your mail.ru account to access the mailbox. To do this, go to the settings, select “All security settings”, then under “Login methods” select “Passwords for external applications” and create a password.

With the preparation done, we can start writing code.

Connection and authentication:

The IMAP mail server is located at imap.mail.ru. The login is the address of your mailbox, and the password is the one we just created.

mail_pass = "mailbox password for external applications"
username = "your_mailbox_address@mail.ru"
imap_server = "imap.mail.ru"
imap = imaplib.IMAP4_SSL(imap_server)
imap.login(username, mail_pass)

If everything went well, the response will contain the message [b'Authentication successful'].
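The connection step can be wrapped in a small helper. This is only a minimal sketch, not the code from the article: the name open_mailbox and the factory parameter are assumptions introduced here so that the function can be tried without a real server. It relies on the fact that imaplib raises imaplib.IMAP4.error when the server rejects the login.

```python
import imaplib


def open_mailbox(username, password, server="imap.mail.ru",
                 factory=imaplib.IMAP4_SSL):
    """Connect and log in; return the connection object.

    `factory` is a hypothetical hook: by default it is imaplib.IMAP4_SSL,
    but a stand-in class can be passed for offline experiments.
    """
    imap = factory(server)
    try:
        status, data = imap.login(username, password)
    except imaplib.IMAP4.error as exc:
        # imaplib raises this when the server refuses the LOGIN command
        raise RuntimeError(f"login to {server} failed: {exc}") from exc
    print(status, data)
    return imap


class FakeIMAP:
    """Offline stand-in that imitates a successful login (illustration only)."""

    def __init__(self, host):
        self.host = host

    def login(self, user, password):
        return "OK", [b"Authentication successful"]


conn = open_mailbox("user@mail.ru", "app-password", factory=FakeIMAP)
```

With a real account, the call would be open_mailbox(username, mail_pass) without the factory argument.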

We are now inside the mailbox. To get to the letters, you need to go to the folder that holds them; by default, incoming mail is in the INBOX folder.

You can view a list of all folders with the imap.list() command.

Receiving letters

To get to a letter, you literally need to open the folder and go into it. It is done like this:

imap.select("INBOX")

This returns a tuple like ('OK', [b'19']): the first element is the status of the operation, the second holds the number of letters in the folder.

Now you need to find out the number of the letter, and there are at least two of them: the sequence number in the folder and the UID, which also follows the order of letters in the folder but does not change (there is also the Message-ID).

The search method of the imaplib library returns the sequence numbers of the letters in the mailbox, from first to last.

Letters are arranged in the mailbox in numerical order. If we search with the ALL criterion, we get a list of all letter numbers.

imap.search(None, 'ALL')
>>('OK', [b'1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19'])

You can verify that there really are 19 letters.

The search can also be more specific: the “UNSEEN” criterion returns the numbers of all unread letters:

imap.search(None, "UNSEEN")

In response, we receive the sequence numbers of the unread letters, like this: ('OK', [b'12 16 19']). The first element is the status of the operation, followed by a list with a bytes object that holds the letter numbers.
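Since the numbers arrive as one bytes object, they usually have to be split before use. A minimal sketch follows; the helper name parse_message_numbers is made up for this example, and the sample responses are written by hand in the shape shown above.

```python
def parse_message_numbers(response):
    """Turn a (status, data) search response into a list of str numbers."""
    status, data = response
    if status != "OK":
        raise RuntimeError(f"search failed: {data!r}")
    # data[0] is a single bytes object with space-separated numbers;
    # it is empty (or None) when nothing matched, which gives [].
    raw = data[0] or b""
    return [num.decode() for num in raw.split()]


unseen = parse_message_numbers(("OK", [b"12 16 19"]))
nothing = parse_message_numbers(("OK", [b""]))
print(unseen)
```

In real code the argument would be the result of imap.search(None, "UNSEEN"); each returned string can be passed to fetch directly.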

Keep in mind that if you delete a letter, all the numbers after it shift. For the task of reading new letters this is not critical, but if you need to come back to messages later, you can get their UID, a number that does not change. To do this, use IMAP4.uid(command, arg[, …]):

imap.uid('search', "UNSEEN", "ALL")

For the same letters we get different numbers: ('OK', [b'14 24 28']), i.e. the letter with sequence number 12 has UID 14, 16 has UID 24, and 19 has UID 28. With UIDs you can safely carry out more complex operations such as store: they will be applied to exactly the letters the UIDs were obtained for.

We receive a letter and extract some information about it

Knowing the number of the letter, now you can finally get it.

res, msg = imap.fetch(b'19', '(RFC822)')  # for the search method, by the letter's sequence number
res, msg = imap.uid('fetch', b'28', '(RFC822)')  # for the uid method

The number must be passed as a string, str(num), or as bytes; an int will not work.

After this operation, the letter will be marked as read in the mailbox.

In response, we receive a status and a list. The first element of the list is a tuple of two bytes objects; the first of them is a service line with the sequence number, the name of the requested item (RFC822) and one more number, the size of the letter in bytes.

The second element of that tuple is our future email object. Parse it with the message_from_bytes function of the email library:

msg = email.message_from_bytes(msg[0][1])

The msg object is of type email.message.Message. Directly from it, without looking inside, you can extract almost everything except the text of the letter and attachments (sometimes the text can be extracted too).
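If the letter should stay unread, the BODY.PEEK[] item of the IMAP protocol can be requested instead of RFC822: it returns the same full message without setting the Seen flag. The sketch below combines that with the parsing step. The function name fetch_message and the FakeIMAP stand-in are assumptions made for this example so that it can run offline.

```python
import email


def fetch_message(imap, num, peek=True):
    """Fetch one letter and parse it into an email.message.Message.

    With peek=True the BODY.PEEK[] item is requested, which does not
    mark the letter as read; with peek=False plain RFC822 is used.
    """
    query = "(BODY.PEEK[])" if peek else "(RFC822)"
    status, data = imap.fetch(str(num), query)
    if status != "OK":
        raise RuntimeError(f"fetch failed: {data!r}")
    # The raw letter is the second item of the first tuple in the data;
    # plain bytes items such as b')' are only closing markers.
    for item in data:
        if isinstance(item, tuple):
            return email.message_from_bytes(item[1])
    raise LookupError(f"no message body in the response for {num}")


class FakeIMAP:
    """Offline stand-in that answers fetch with a tiny hand-written letter."""

    def __init__(self):
        self.queries = []

    def fetch(self, num, query):
        self.queries.append((num, query))
        raw = b"Subject: hello\r\n\r\nbody text\r\n"
        envelope = f"{num} (BODY[] {{{len(raw)}}}".encode()
        return "OK", [(envelope, raw), b")"]


fake = FakeIMAP()
demo = fetch_message(fake, 19)
print(demo["Subject"])
```

The helper accepts an int because it converts the number with str() before calling fetch.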

letter_date = email.utils.parsedate_tz(msg["Date"])  # date of the letter; the header is a string, parsedate_tz turns it into a tuple that still has to be converted to datetime
letter_id = msg["Message-ID"]  # letter ID
letter_from = msg["Return-path"]  # sender's e-mail

print(type(letter_date), type(letter_id), letter_id, type(letter_from))

<class 'tuple'> <class 'str'> <1662997113.166751447@f221.i.mail.ru> <class 'str'>

Everything is simple here.
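To go straight to a datetime and to a bare address, the standard library offers email.utils.parsedate_to_datetime and email.utils.parseaddr. Below is a small sketch on a hand-written letter; the date and the sender address are invented for the example, only the Message-ID is taken from the output above.

```python
import email
from email.utils import parseaddr, parsedate_to_datetime

# Hand-written sample letter (the date and address are made up)
raw = (
    b"Date: Mon, 12 Sep 2022 18:38:33 +0300\r\n"
    b"Message-ID: <1662997113.166751447@f221.i.mail.ru>\r\n"
    b"Return-path: <sender@example.com>\r\n"
    b"\r\n"
    b"body\r\n"
)
sample = email.message_from_bytes(raw)

# Timezone-aware datetime built from the Date header
sent_at = parsedate_to_datetime(sample["Date"])
# parseaddr returns (display name, address); the angle brackets are dropped
sender = parseaddr(sample["Return-path"])[1]
print(sent_at.isoformat(), sender)
```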

Getting From and Subject, the first difficulties

From and Subject are requested in the same way, e.g. msg["From"], but the answer is harder to handle. If they are written entirely in Latin characters, they are extracted just like the previous headers. But things are different if the letter came without a subject or if the “From” and “Subject” fields are written in Cyrillic.

msg["Subject"]  # the subject is written in Cyrillic and encoded in base64
'=?UTF-8?B?RndkOiDQn9GA0LjQs9C70LDRiNC10L3QuNC1INCyINC90L7QstGL0Lkg0KI=?=\r\n =?UTF-8?B?0LXRhdC90L7Qv9Cw0YDQuiDQsiDRgdGE0LXRgNC1INCy0YvRgdC+0LrQuNGF?=\r\n =?UTF-8?B?INGC0LXRhdC90L7Qu9C+0LPQuNC5IMKr0JjQoi3Qv9Cw0YDQusK7INC40Lw=?=\r\n =?UTF-8?B?0LXQvdC4INCRLtCg0LDQvNC10LXQstCwINC4INCc0LXQttC00YPQvdCw0YA=?=\r\n =?UTF-8?B?0L7QtNC90YvQuSBTdGFydHVwIEh1Yg==?='

This is a MIME encoded-word with Base64 encoding. It can be decoded manually: the desired text sits between =?UTF-8?B? and ?=, and is decoded with base64.b64decode().decode(). Or you can use the decode_header function, which we imported from email.header:

decode_header(msg["Subject"])
[(b'Fwd: \xd0\x9f\xd1\x80\xd0\xb8\xd0\xb3\xd0\xbb\xd0\xb0\xd1\x88\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb2 \xd0\xbd\xd0\xbe\xd0\xb2\xd1\x8b\xd0\xb9 \xd0\xa2\xd0\xb5\xd1\x85\xd0\xbd\xd0\xbe\xd0\xbf\xd0\xb0\xd1\x80\xd0\xba \xd0\xb2 \xd1\x81\xd1\x84\xd0\xb5\xd1\x80\xd0\xb5 \xd0\xb2\xd1\x8b\xd1\x81\xd0\xbe\xd0\xba\xd0\xb8\xd1\x85 \xd1\x82\xd0\xb5\xd1\x85\xd0\xbd\xd0\xbe\xd0\xbb\xd0\xbe\xd0\xb3\xd0\xb8\xd0\xb9 \xc2\xab\xd0\x98\xd0\xa2-\xd0\xbf\xd0\xb0\xd1\x80\xd0\xba\xc2\xbb \xd0\xb8\xd0\xbc\xd0\xb5\xd0\xbd\xd0\xb8 \xd0\x91.\xd0\xa0\xd0\xb0\xd0\xbc\xd0\xb5\xd0\xb5\xd0\xb2\xd0\xb0 \xd0\xb8 \xd0\x9c\xd0\xb5\xd0\xb6\xd0\xb4\xd1\x83\xd0\xbd\xd0\xb0\xd1\x80\xd0\xbe\xd0\xb4\xd0\xbd\xd1\x8b\xd0\xb9 Startup Hub',
  'utf-8')]

It returns a list of tuples (value, charset). We need the zeroth element of the first tuple, the actual “Subject” of the letter. It has already been decoded from Base64 into bytes; all that remains is to decode the bytes into readable text:

decode_header(msg["Subject"])[0][0].decode()
'Fwd: Приглашение в новый Технопарк в сфере высоких технологий «ИТ-парк» имени Б.Рамеева и Международный Startup Hub'

If there is no subject line, msg["Subject"] will return None.
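The three cases (encoded header, plain Latin header, missing header) can be handled by one helper. This is a sketch under the assumption that unencoded fragments are ASCII-compatible; the name decode_mime_header is invented for the example, and the encoded sample is a short Base64 string chosen for illustration.

```python
from email.header import decode_header


def decode_mime_header(raw):
    """Decode a possibly MIME-encoded header value into a plain str."""
    if raw is None:  # the header is absent, e.g. a letter without a subject
        return ""
    chunks = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            # charset is None for unencoded fragments; UTF-8 is assumed then
            value = value.decode(charset or "utf-8", errors="replace")
        chunks.append(value)
    return "".join(chunks)


name = decode_mime_header("=?UTF-8?B?0L/QtdGC0LDQvdC6LnBkZg==?=")
plain = decode_mime_header("Hello")
missing = decode_mime_header(None)
print(name, plain, repr(missing))
```

Unlike taking only element [0][0], the loop also copes with headers that decode_header splits into several fragments.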

All this can be done in several other ways. For example, you can request just the subject of the letter like this:

imap.fetch(b'19', "(BODY[HEADER.FIELDS (Subject)])")

But, as for me, the previous option is easier to understand.

We have extracted everything that can be obtained without diving deeper; now let's move on to the text and attachments.

Finally, the text of the letter! Oh, no, you still have to tinker

To continue, we need to get the payload from the email.message.Message object using the msg.get_payload() method.

As the documentation for the email library kindly tells us, the result can be:

  1. simple text message

  2. binary object

  3. a structured sequence of submessages, each of which has its own set of headers and its own payload.

To settle this question right away, use the is_multipart() method, which tells you how to proceed with the letter. That is, it immediately detects the third option, which is a real nesting doll.

Let’s go in order. If we received a simple text message… No, of course it is not simple at all, but encoded, in my examples in base64 (the Content-Transfer-Encoding header names the encoding). Still, everything seems simple here: we decode it and… may get, for example, HTML code that is almost readable but is better cleaned up (that is why BeautifulSoup is among the imports).

A binary object is converted to text, and then handled the same way as in the first case.

is_multipart() == True or a structured sequence of submessages

If the resulting object consists of a group of other objects, we start iterating. Passing through parts can be done with a simple loop:

payload=msg.get_payload()
for part in payload:
    print(part.get_content_type())  

multipart/alternative
application/pdf

But there is a catch: the resulting parts can themselves be composite, so the loops would have to get more complicated. The walk method greatly simplifies this.

for part in msg.walk():
    print(part.get_content_type())

multipart/alternative
text/plain
text/html
application/pdf

There are letters in which some of the parts are themselves composite. The code above from the documentation illustrates just such a case.

The plain loop returned two objects, exactly what .get_payload() returned, while .walk() gives four, because it also unpacks the nested parts. If you do the same by hand, you get roughly the following code:

payload=msg.get_payload()
for part in payload:
    print(part.get_content_type())
    if part.is_multipart():
        level=part.get_payload()
        for l_part in level:
            print(l_part.get_content_type())
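The manual version above handles only two levels of nesting. For arbitrary depth, the same idea can be written recursively. This is an illustrative sketch: the function list_parts and the hand-built sample letter are assumptions made for the example, built to mirror the layout of the letter discussed above.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def list_parts(message, depth=0):
    """Return (depth, content_type) for the message and all nested parts."""
    found = [(depth, message.get_content_type())]
    if message.is_multipart():
        for part in message.get_payload():
            found.extend(list_parts(part, depth + 1))
    return found


# Hand-built letter: text in two flavors plus a PDF attachment
body = MIMEMultipart("alternative")
body.attach(MIMEText("hello", "plain", "utf-8"))
body.attach(MIMEText("<p>hello</p>", "html", "utf-8"))
letter = MIMEMultipart("mixed")
letter.attach(body)
letter.attach(MIMEApplication(b"%PDF-1.6", _subtype="pdf"))

tree = list_parts(letter)
for depth, content_type in tree:
    print("  " * depth + content_type)
```

The recursion visits the parts in the same order as walk(), including the top-level container itself; the depth value only adds indentation to the printout.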

Back to reading letters. To make sense of what follows, we need RFC 2045.

Each email object comes with headers that describe the rich inner world of the object and, accordingly, suggest how to extract it. Usually the text of the letter is stored in the first part of the payload and the attachments in the rest. But this is not mandatory: a letter can be multipart without any attachments.

So the methods are:

  • get_content_type()

  • get_content_maintype()

  • get_content_subtype()

Types are divided into discrete (discrete-type) and composite (composite-type). Discrete types, which are our final destination, are “text” / “image” / “audio” / “video” / “application” / extension-token; composite types are “message” / “multipart” / extension-token (RFC 2045), and these have to be unpacked.

We are interested in the text, so we walk through the payload with the condition part.get_content_maintype() == 'text'.

for part in msg.walk():
    if part.get_content_maintype() == 'text' and part.get_content_subtype() == 'plain':
        print(base64.b64decode(part.get_payload()).decode())

The text subtype comes in several flavors, plain and html (I have not seen others yet). A message can contain either one of them or both, so we pick the desired option using conditions.

If the subtype is html, we go back to bs4 or regular expressions (after decoding from base64).
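Not every letter uses base64 for its text: quoted-printable and unencoded bodies also occur. A more tolerant variant is to let get_payload(decode=True) undo whatever Content-Transfer-Encoding the part declares and to take the charset from the part itself. The sketch below does that; the helper names are invented, and a small standard-library HTMLParser subclass is used here as a simple stand-in for BeautifulSoup so that the example has no third-party dependencies.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes only (a rough stand-in for bs4's get_text())."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def part_text(part):
    """Decode one text/* part according to its own headers."""
    raw = part.get_payload(decode=True)  # undoes base64 / quoted-printable
    return raw.decode(part.get_content_charset() or "utf-8", errors="replace")


def get_text_body(message):
    """Prefer text/plain; fall back to text/html with the tags stripped."""
    html = None
    for part in message.walk():
        if part.get_content_disposition() == "attachment":
            continue
        if part.get_content_type() == "text/plain":
            return part_text(part)
        if part.get_content_type() == "text/html" and html is None:
            html = part_text(part)
    return html_to_text(html) if html is not None else ""


# Hand-built sample letters
both = MIMEMultipart("alternative")
both.attach(MIMEText("Привет", "plain", "utf-8"))
both.attach(MIMEText("<p>Привет</p>", "html", "utf-8"))
html_only = MIMEText("<p>Привет</p><p>мир</p>", "html", "utf-8")

plain_result = get_text_body(both)
html_result = get_text_body(html_only)
print(plain_result, html_result, sep="\n---\n")
```

The stand-in parser keeps the contents of script and style tags too, so for real-world HTML letters BeautifulSoup remains the more comfortable choice.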

The text is received, you can proceed to the attachments.

part.get_content_disposition() == 'attachment'

Attachments are picked out of the parts of the letter in the same way as text, using the condition get_content_disposition() == 'attachment'.

get_content_type() tells us the type of the attachment (“image” / “audio” / “video” / “application”) and its more specific flavor, such as application/pdf. The headers also contain the name of the file. If the name is not in Latin characters, then welcome back to MIME + Base64.

for part in msg.walk():
    print(part.get_content_disposition() == 'attachment')
    
False
False
False
False
True

This is the result for the parts of the letter above; walk() also yields the top-level message itself, hence the extra line. You will also have to tinker with the names of the attachment files:

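for part in msg.walk():
    if part.get_content_disposition() == 'attachment':
        print(part.get_filename())  # raw, MIME-encoded file name
        # manual decoding: only the Base64 text between =?UTF-8?B? and ?=
        print(base64.b64decode('0L/QtdGC0LDQvdC6LnBkZg==').decode())
        # the same via decode_header
        print(decode_header(part.get_filename())[0][0].decode())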

=?UTF-8?B?0L/QtdGC0LDQvdC6LnBkZg==?=
петанк.pdf
петанк.pdf

The attachment files themselves are here as well: they are encoded in Base64 and sit there untouched.

for part in msg.walk():
    if part.get_content_disposition() == 'attachment':
        print(part)

Content-Type: application/pdf; name="=?UTF-8?B?0L/QtdGC0LDQvdC6LnBkZg==?="
Content-Disposition: attachment; filename="=?UTF-8?B?0L/QtdGC0LDQvdC6LnBkZg==?="
Content-Transfer-Encoding: base64
Content-ID: <18336a9fbb54fb78aa51>
X-Attachment-Id: 18336a9fbb54fb78aa51

JVBERi0xLjYNJeLjz9MNCjEgMCBvYmoNPDwvTWV0YWRhdGEgMiAwIFIvT0NQcm9wZXJ0aWVzPDwv
RDw8L09OWzkgMCBSXS9PcmRlciAxMCAwIFIvUkJHcm91cHNbXT4+L09DR3NbOSAwIFJdPj4vUGFn
ZXMgMyAwIFIvVHlwZS9DYXRhbG9nPj4NZW5kb2JqDTIgMCBvYmoNPDwvTGVuZ3RoIDMzNTYzL1N1
YnR5cGUvWE1ML1R5cGUvTWV0YWRhdGE+PnN0cmVhbQ0KPD94cGFja2V0IGJlZ2luPSLvu78iIGlk
PSJXNU0wTXBDZWhpSHpyZVN6TlRjemtjOWQiPz4KPHg6eG1wbWV0YSB4bWxuczp4PSJhZG9iZTpu
czptZXRhLyIgeDp4bXB0az0iQWRvYmUgWE1QIENvcmUgNy4yLWMwMDAgNzkuMWI2NWE3OSwgMjAy
...

But parsing this part is, perhaps, a completely different story.
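For the simplest case, saving the attachment to disk as it is, a short sketch is enough. The names decode_name and save_attachments are invented for this example, the sample letter is built by hand, and a temporary directory is used as the target; get_payload(decode=True) returns the file contents as bytes.

```python
import os
import tempfile
from email.header import decode_header
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def decode_name(raw):
    """Decode a MIME-encoded file name such as =?UTF-8?B?...?= into str."""
    pieces = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            value = value.decode(charset or "utf-8", errors="replace")
        pieces.append(value)
    return "".join(pieces)


def save_attachments(message, target_dir):
    """Write every attachment of the letter to target_dir; return the names."""
    saved = []
    for part in message.walk():
        if part.get_content_disposition() != "attachment":
            continue
        decoded = decode_name(part.get_filename() or "unnamed")
        # basename() drops directory components a sender may have put in
        name = os.path.basename(decoded) or "unnamed"
        data = part.get_payload(decode=True)  # raw bytes of the file
        with open(os.path.join(target_dir, name), "wb") as handle:
            handle.write(data)
        saved.append(name)
    return saved


# Hand-built letter with one fake PDF attachment
letter = MIMEMultipart()
letter.attach(MIMEText("see the attached file", "plain", "utf-8"))
pdf = MIMEApplication(b"%PDF-1.6 fake content", _subtype="pdf")
pdf.add_header("Content-Disposition", "attachment", filename="report.pdf")
letter.attach(pdf)

out_dir = tempfile.mkdtemp()
saved = save_attachments(letter, out_dir)
print(saved)
```

Two attachments with the same name would overwrite each other in this sketch; a real bot would need to make the names unique.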

Instead of a conclusion

Why is all this needed, given how many details and questions had to be covered? Very simply: all of this can be wrapped in a simple bot, after which you can, with peace of mind, remove the mail application from your phone and read new letters directly in a messenger.

If someone needs it, feel free to use it: https://github.com/Sstoryteller2/mail_reader

As for more practical applications, there is, for example, automating read confirmations, starting deadline counters for replies, and much more.

There is also a step-by-step notebook with all the examples. I hope it will be useful and will save time on reading the documentation and analyzing cases.
