Spam is OK! Mass mailings as a stimulus for the development of civilization

I remember a time when there was practically no spam yet – every advertising letter in my inbox seemed like something outlandish, and many of them were read out of pure curiosity. However, the volume of such garbage soon began to grow exponentially, Trojans appeared that sent copies of themselves to the contact list on an infected machine, then crooks mastered phishing… Millions of people around the world have suffered and continue to suffer from spam. But if spam didn't exist, it probably would have had to be invented. And here's why.

Those who remember the saga of the notorious “Center for American English”, which flooded Russians' mailboxes with its annoying advertising in the early 2000s, will probably also remember that such mass mailings did not begin suddenly. In the nineties, email advertising looked rather primitive: it mainly promoted miracle remedies for hair growth and the enlargement of other important body parts, healing dietary supplements, and financial pyramids. In the headers of some letters you could even see the full list of recipients. Some mailings were quite targeted: advertisements for goods and services were sent to a potentially interested audience selected by field of activity – this was made easier by numerous thematic Internet catalogs of companies and their websites. One of the turning points in the evolution of the spam industry was the emergence of crawlers, programs that automatically collected email addresses from web pages and compiled them into databases. At the same time, bulk-mailing programs appeared, and address databases became a hot commodity.

Remember this enchanting spam that eventually became a household name?

It was enough to “expose” your e-mail address just once on some forum, and the consequences were not long in coming: within a couple of days you had to clear tons of advertising junk out of your mailbox. The fight against this phenomenon relied on very primitive methods: if an e-mail address had to be left somewhere in public, it was posted as a picture, the digits were written out as words, and the “@” symbol was replaced with the word “dog” (Russian slang for the “@” sign) in order to confuse the automatic address harvesters. Often, people simply created a separate “spam” mailbox and shared their “personal” address only as a last resort. It helped, but not for long – spam somehow miraculously leaked even into those email accounts whose addresses you kept in the strictest confidence.

The next evolutionary step on the difficult path of combating mass mailings was the emergence of customizable mail filters. At first, filtering rules were compiled manually: it was necessary to select characteristic headings and text fragments of advertising messages, then enter them into the appropriate program window, indicating what the email client should do with such a letter, and finally activate each rule separately. I remember very well how I set up Outlook Express to automatically delete letters from all sorts of “promotion specialists” – a very meditative and tedious task. A little later, ready-made sets of antispam filters began to spread, which could be enabled or disabled with a simple mouse click. But the effectiveness of such protection still left much to be desired: firstly, spammers found more and more new methods of bypassing pattern filtering algorithms – by replacing Cyrillic characters in the text with Latin characters, adding random spaces to words, or converting message text into a picture. And secondly, due to incorrect operation of the filters, important and necessary correspondence often ended up in spam, so the lists of messages sent to the corresponding folder still had to be viewed – which is what the organizers of mass mailings wanted.
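To make the weakness of such rules concrete, here is a minimal Python sketch of a phrase-based filter. The rule phrases, function names and the tiny look-alike table are all made up for illustration; this is not how any particular mail client implemented its rules. It shows a rule catching an ordinary message, missing the same message once a few letters are swapped for look-alikes from another alphabet, and catching it again after a crude normalization step.

```python
# Hypothetical rule list: phrases a user might have typed into a filter dialog.
RULES = ["promotion specialist", "miracle remedy"]

def matches_rule(subject: str, body: str, rules=RULES) -> bool:
    """Return True if any rule phrase occurs in the subject or body (case-insensitive)."""
    text = f"{subject}\n{body}".lower()
    return any(phrase in text for phrase in rules)

# Demo-only table: a few Cyrillic letters that look like Latin ones.
# A real table of confusable characters would be far larger.
LOOKALIKES = str.maketrans({
    "\u043e": "o",  # Cyrillic о
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u0441": "c",  # Cyrillic с
    "\u0440": "p",  # Cyrillic р
})

def normalize(text: str) -> str:
    """Fold the look-alike characters listed above back to their Latin twins."""
    return text.translate(LOOKALIKES)

plain = "Our promotion specialist will call you"
# Same text, but every Latin 'o' in "promotion" is replaced by Cyrillic 'о'.
disguised = "Our pr\u043em\u043eti\u043en specialist will call you"

print(matches_rule("", plain))                  # rule fires on the plain text
print(matches_rule("", disguised))              # rule misses the disguised text
print(matches_rule("", normalize(disguised)))   # rule fires again after folding
```

The normalization step only covers substitutions someone has already thought of, which is exactly why this approach kept losing to new tricks.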

“Blacklists,” used in corporate email systems at the height of the 2000s, are a separate song with an obscene chorus. I remember one large St. Petersburg publishing house with which I collaborated in those days: their servers were configured to accept letters only from their own domain and a single Russian public e-mail service, everything else was mercilessly burned out with napalm. No spam, but also no useful messages from authors, clients and partners if they managed to register their mailbox somewhere else – convenience comes first! And “public blacklists,” entries in which were sometimes made on the basis that the robot randomly substituted someone’s address from the spam database as the sender during the mailing process, brought more problems than practical benefit. In general, the effectiveness of such anti-advertising measures was questionable from the very beginning. It’s good that such lists eventually disappeared into oblivion.

With the spread of phishing, email filters have learned to check the links contained in letters against databases of potentially dangerous and malicious sites. Here, developers are faced with the same difficulty as antivirus manufacturers that use signature analysis. Until a malicious or phishing link is included in the database, the filter does not consider it dangerous, and a certain amount of time passes from the moment such a link appears until it is added to the lists. Attackers quickly mastered both wholesale domain registration and automatic generation of malicious URLs. In other words, in this “war of armor and projectile” the advantage usually remained on the side of the latter.
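The database check and its blind spot can be sketched in a few lines of Python. The host names and the `BLOCKED_HOSTS` set below are invented for the example; real filters query large, regularly updated reputation databases rather than a hard-coded set.

```python
import re
from urllib.parse import urlparse

# Hypothetical snapshot of a link database (assumption for this sketch).
BLOCKED_HOSTS = {"phish.example", "malware.example"}

# Rough pattern for http(s) links in a message body.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract_hosts(body: str) -> set:
    """Collect the host names of all http(s) links found in the body."""
    hosts = set()
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname  # already lower-cased by urlparse
        if host:
            hosts.add(host)
    return hosts

def has_blocked_link(body: str, blocked=BLOCKED_HOSTS) -> bool:
    """Return True if at least one linked host is in the blocklist."""
    return bool(extract_hosts(body) & blocked)

known = "Your mailbox is full, log in at http://phish.example/login"
fresh = "Your mailbox is full, log in at http://phish-2.example/login"

print(has_blocked_link(known))  # listed host is caught
print(has_blocked_link(fresh))  # freshly registered host slips through
```

The second message is the same scam on a domain registered a minute ago, and until someone adds it to the list the filter sees nothing wrong with it.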

The appearance of heuristic algorithms and “self-learning filters” changed the situation a little for the better – this was the next evolutionary step in the fight against spam. Heuristic algorithms analyze the text of a letter for the presence of certain words and expressions characteristic of advertising letters, and take into account not only the very fact of the presence of these words, but also evaluate their relative position and context. Many spam messages use specific HTML markup or hidden tags. Heuristic analysis identifies markup anomalies, such as invisible links, hidden blocks of text, images, and other tricks spammers use to bypass filters. The presence of links in the body of the letter is also checked. Feedback from the user is also important: if the filter considers a useful message to be spam, or vice versa, skips advertising, the user can mark it manually to adjust the algorithm and increase the accuracy of filtering.
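A heavily simplified Python sketch of such scoring might look like this. The word list, weights and threshold are assumptions chosen for the example, and the sketch ignores word order and context, which real heuristic engines do take into account.

```python
import re

# Illustrative signals; none of these values come from a real filter.
SPAM_WORDS = {"free", "winner", "discount", "urgent"}
HIDDEN_RE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
LINK_RE = re.compile(r"<a\s[^>]*href=", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def heuristic_score(html_body: str) -> float:
    """Sum weighted evidence: spammy vocabulary, hidden markup, number of links."""
    visible_text = TAG_RE.sub(" ", html_body).lower()
    words = re.findall(r"[a-z']+", visible_text)
    score = 0.0
    score += 1.0 * sum(1 for w in words if w in SPAM_WORDS)  # each spammy word
    score += 3.0 * len(HIDDEN_RE.findall(html_body))         # hidden blocks
    score += 0.5 * len(LINK_RE.findall(html_body))           # each hyperlink
    return score

def looks_like_spam(html_body: str, threshold: float = 4.0) -> bool:
    """Flag the message when the accumulated score reaches the threshold."""
    return heuristic_score(html_body) >= threshold

spam = ('<p>URGENT: you are a winner!</p>'
        '<div style="display:none">weather report</div>'
        '<a href="http://x.example">claim free prize</a>')
ham = "<p>Hi, the meeting moved to Friday. Agenda attached.</p>"

print(heuristic_score(spam), looks_like_spam(spam))
print(heuristic_score(ham), looks_like_spam(ham))
```

User feedback would fit into this picture as small adjustments to the weights or the threshold whenever a message is marked by hand.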

Need I say that the organizers of mass mailings carefully study the principles of operation of such filters and are actively looking for methods to bypass them? Heuristics are undoubtedly many times more effective than template-based filters, but they still don’t work perfectly. I regularly notice that algorithms send useful mailings from various services to spam, while advertising and phishing messages, on the contrary, are missed.

Finally, relatively recently, artificial intelligence, supported by machine learning technologies, has entered the field of combating mass mailings – both advertising and malicious. Unlike a set of static rules and traditional heuristic algorithms, ML approaches are able to learn continuously, independently detect new spam patterns, and even predict what is advertising and what is not. These algorithms are initially trained on huge data sets containing examples of both spam messages and normal mail. The model analyzes which features are typical for each type of email and independently builds rules that allow it to identify spam even in new forms. For this purpose, various classification methods are used, such as logistic regression, decision trees, and support vector machines (SVM).
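As a toy illustration of the idea, here is a bag-of-words logistic regression written in plain Python with no libraries. The six training messages, the learning rate and the number of epochs are all assumptions for the sketch; a production filter would be trained on millions of messages with far richer features.

```python
import math
from collections import Counter

# Tiny made-up training set: (text, label), where 1 = spam and 0 = normal mail.
TRAINING = [
    ("win free prize now", 1),
    ("free discount offer click now", 1),
    ("cheap pills free shipping", 1),
    ("meeting agenda for monday", 0),
    ("please review the attached report", 0),
    ("lunch on friday with the team", 0),
]

def tokenize(text):
    return text.lower().split()

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(text, weights, bias):
    """Linear score: bias plus the weight of every word, counted per occurrence."""
    counts = Counter(tokenize(text))
    return bias + sum(weights.get(w, 0.0) * c for w, c in counts.items())

def train(samples, epochs=200, lr=0.1):
    """Fit word weights with plain stochastic gradient ascent on the log-likelihood."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in samples:
            error = label - sigmoid(logit(text, weights, bias))
            bias += lr * error
            for w, c in Counter(tokenize(text)).items():
                weights[w] = weights.get(w, 0.0) + lr * error * c
    return weights, bias

def spam_probability(text, weights, bias):
    return sigmoid(logit(text, weights, bias))

weights, bias = train(TRAINING)
print(spam_probability("free prize offer", weights, bias))     # unseen combination of spammy words
print(spam_probability("team meeting report", weights, bias))  # unseen combination of ordinary words
```

Nobody wrote a rule saying that “free” is suspicious; the weight for it emerged from the examples, which is the whole point of the approach.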

More complex and advanced systems may use recurrent neural networks (RNNs) or transformers. These architectures are able to analyze the text of messages and take their context into account, which is especially useful for identifying spam with unusual patterns. Content analysis of emails is usually performed using computational linguistics and natural language processing (NLP) techniques, such as the Word2Vec model. AI systems are also capable of identifying anomalies in typical communication patterns: they can already build profiles of “normal” user behavior in order to then highlight suspicious messages that deviate from this “norm.” And combining several algorithms increases the overall accuracy and quality of mail traffic filtering. For example, one algorithm evaluates the text of a letter, another evaluates its structure, a third evaluates the sender's behavior, and the final decision is made based on the result of their combined work. Spammers don't have much of a chance against such a “Terminator” with artificial intelligence. That said, the organizers of mass mailings willingly take advantage of the capabilities of neural networks themselves, so sooner or later some cunning trick will be found against this “crowbar”.
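The “combined work” of several algorithms can be as simple as a weighted vote. In the Python sketch below the three scores are assumed to come from separate models that each return a value between 0 and 1; the class name, the weights and the threshold are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    text: float       # e.g. output of a language model over the message body
    structure: float  # e.g. output of markup heuristics
    sender: float     # e.g. output of sender-behavior anomaly detection

# Assumed weights for the sketch; they sum to 1 so the result stays in [0, 1].
WEIGHTS = {"text": 0.5, "structure": 0.2, "sender": 0.3}

def combined_score(s: Scores) -> float:
    """Weighted average of the individual model scores."""
    return (WEIGHTS["text"] * s.text
            + WEIGHTS["structure"] * s.structure
            + WEIGHTS["sender"] * s.sender)

def ensemble_says_spam(s: Scores, threshold: float = 0.5) -> bool:
    return combined_score(s) >= threshold

# All three models are suspicious: the ensemble agrees.
obvious = Scores(text=0.9, structure=0.2, sender=0.8)
# The text model alone would flag this one (0.6), but the clean markup and
# the well-behaved sender outvote it.
borderline = Scores(text=0.6, structure=0.1, sender=0.1)

print(combined_score(obvious), ensemble_says_spam(obvious))
print(combined_score(borderline), ensemble_says_spam(borderline))
```

The second case is where the combination earns its keep: one overeager model no longer sends a useful newsletter to the spam folder on its own.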

If, with some degree of convention, we place all these anti-spam methods on a timeline, we will see that the progress of technologies, which have evolved from simple manually configured filters to the use of neural networks, has noticeably accelerated in recent years.

The evolution of anti-spam technologies

And guess what? Thanks to this, the Internet as a whole has become much safer. For example, the cybersecurity film “Digital Shadow: How do they steal our money?”, made with the participation of Yandex 360, cited some interesting statistics: 99% of Russians aged 20 to 55 use email, and this is 10% more than the audience of popular instant messengers. Earlier, the company shared the following data: since the beginning of 2024, it has blocked over 16 billion potentially dangerous letters using Spam Defense, that is, about 25% of all mail traffic is rejected as spam. What is surprising about these numbers? Firstly, despite the advent of mobile devices and applications like Telegram and WhatsApp, email still remains the main communication tool on the Internet. And secondly, a quarter of the total volume of letters sent online is a very, very large amount.

This data is indirectly confirmed by independent research published on the IEEE Xplore platform: the number of email messages sent daily is constantly growing. In 2024 it reached 361.6 billion, and by 2026 it should exceed 392 billion.

Number of emails sent and received daily, 2017-2026, according to IEEE Xplore

At the same time, according to the agency 99firms.com, Russia leads by a significant margin among the world's sources of spam.

Distribution of countries by volume of spam sent, 99firms.com

And if middle-aged people, who literally grew up with a computer in their arms, are still able to be skeptical about letters from the lawyers of a late Nigerian prince and to look with caution at “photos” attached to messages with the extensions .lnk and .js, their elderly parents find themselves in a risk group simply because of their age. Sometimes it's very difficult to explain to a seventy-year-old mother why she shouldn't trustingly click on a link in a letter supposedly from the “Administrator of your mailbox” saying that it's time to change her password, especially if this administrator is her own son, who runs his own mail server and whom she can simply call if necessary. The same applies to children who are greedy for promises of freebies and recklessly subscribe to any dubious services and mailing lists in search of cheats for games and ready-made solutions for math homework. In such cases, mail filtering mechanisms and AI-based “smart assistants” come in very handy. Where natural intelligence begins to slip, artificial intelligence must do the work.

There is another important aspect. Until recently, methods of combating advertising and malicious mailings were reactive: spammers came up with a new method of bypassing filters or began to use a previously unheard-of social engineering technique, and application developers responded with another filtering rule and a database update. AI and ML tools, by contrast, do not just react to threats that appeared roughly the day before yesterday, but adapt to the evolution of spam. Thanks to their ability to self-learn, such filters maintain high efficiency even when mailing organizers change tactics. New types of advertising and fraudulent emails will be filtered out faster, and the number of false positives should, on the contrary, decrease. It is unlikely that this will help defeat the phenomenon itself completely, but it will definitely make life miserable for spammers.

So, according to research published on IEEE Xplore, the use of ML algorithms has made it possible to increase the accuracy of spam detection to 95-99%. Machine learning-based systems adapt to new types of spam much faster than template-based tools or pre-programmed filters. Other research published on the same platform shows that algorithms such as Bayesian filtering and deep learning methods can significantly reduce the number of false positives (for example, when using natural language processing approaches). AI-based antispam systems can predict the behavior of spammers, which makes such filters less vulnerable to the circumvention methods often used by mass mailers.

But no matter how the methods of protection against spam, phishing and the spread of malware are improved, the “armor-versus-projectile debate” does not stop. With the increasing sophistication of spam attacks, especially those aimed at fooling probabilistic and Bayesian algorithms, increasingly flexible and intelligent filtering technologies are required. This process resembles a kind of arms race: each new improvement in antispam technology stimulates spammers to develop more advanced bypass methods, which, in turn, accelerates the development and improvement of protective systems. The result is increased resiliency of the entire email infrastructure and reduced risks for end users. New tools based on artificial intelligence and machine learning help protect them not only from existing threats, but also from yet unknown threats. This is why I personally think spam is the engine of progress: it's certainly harmful, but it ultimately makes the digital world a little safer.