Re:filter: an alternative list of resources blocked in the Russian Federation

In this article I want to describe a problem I ran into as a user of the existing lists of blocked resources, the steps I took towards a solution, and the results of that work. The article is more theoretical than practical, so if you are after practical benefit you can skip straight to section 3, Result.

Work on the solution began as part of the Demhack 8 project, and I would like to thank the organizers for their support.

  1. Problem

  2. Path to solution

  3. Result

1. Problem

I’ll start with the basis of everything: the RKN list, which at the time of this publication contains 621,750 domains (793,386 URLs according to Roskomsvoboda). This list has been maintained since 2012 and contains, to put it mildly, not only resources that are valuable to us, but also material that a body like the RKN really should be filtering. The sheer volume of the list makes it hard to generate lists for bypassing blocking. Antifilter.download publishes lists of IP addresses containing about 10,000 subnets of /24 or larger (or about 155,000 prefixes from resolution), and Antifilter.network publishes 26,000 subnets from /32 to /23 (or 140,000 prefixes from resolution).

My main interest is routers, which can use summarized lists of IP addresses. This leads to the following problem: sites such as mosoblgaz.ru (91.215.42.46) and speedtest.net (151.101.66.219) are located on shared public hosting, so when addresses are summarized by mask and routed selectively, these sites start working via VPN, as do many other resources that fall into the same range. All of this happens because of one or two addresses from the RKN list, which is completely unnecessary.
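The over-blocking effect is easy to reproduce with Python's standard ipaddress module. In this minimal sketch the blocked address 151.101.66.10 is hypothetical (picked only for illustration and not taken from the RKN list); the speedtest.net address is the one quoted above.

```python
import ipaddress

# Hypothetical blocked address; NOT taken from the RKN list.
blocked = "151.101.66.10"
# Address of speedtest.net quoted in the article.
speedtest = ipaddress.ip_address("151.101.66.219")

# Summarizing the single blocked host to a /24 ...
summary = ipaddress.ip_network(f"{blocked}/24", strict=False)

# ... also covers the unrelated site that shares the hosting range.
collateral = speedtest in summary
print(summary, collateral)
```

One blocked host is enough to pull the whole /24, and every site hosted in it, into the VPN route.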

2. Path to solution

Once I started studying the RKN list, I wondered which resources on it caused the collateral damage to mosoblgaz.ru and speedtest.net in the example above. Here are some examples:

Mosoblgaz

Speedtest.net

This raises the question: do we really need all these blocked addresses and domains?

My answer is no, and here's why:

As @Furriest wrote in his article (special thanks to him for his help with BGP):

At the same time, 99% of the resources on the RKN blacklist are in fact useless trash, like the umpteenth mirror of yet another online casino.

While preparing this material, I arrived at similar figures and became convinced that, to bypass blocking effectively, one cannot rely on the RKN registry alone and should use its contents only selectively.

I decided to clean up the list of RKN domains according to the following scheme:

Stage 1 – Filtering domain names by dictionary

At this stage I took the list of domains from Antifilter.download and filtered it by keywords. The source code of the algorithm is available here (step 1); if you wish, you can repeat the whole process yourself (if you can make sense of my messy code).

At the time the project was run, the domains.lst file contained 577829 domain names. At this stage we filter out 101990 of them (with the right keywords, I think more is possible), which leaves 475839 domain names in the list for now.
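A minimal sketch of this kind of dictionary filter is shown below. The keywords and the function name are hypothetical placeholders; the actual dictionary lives in the project's step 1 code and is not reproduced here.

```python
# Hypothetical keyword dictionary; the real one is in the project's step 1 code.
KEYWORDS = ("casino", "bet", "poker")


def filter_by_keywords(domains, keywords=KEYWORDS):
    """Return the domains whose names contain none of the keywords."""
    kept = []
    for raw in domains:
        name = raw.strip().lower()
        if not name:
            continue  # skip empty lines
        if any(word in name for word in keywords):
            continue  # dictionary match: drop the domain
        kept.append(name)
    return kept


sample = ["casino-x777.com", "example.org", "BestBet.net"]
print(filter_by_keywords(sample))
```

Plain substring matching is cheap enough for hundreds of thousands of names, but it can produce false positives (a short keyword may occur inside an unrelated word), so the dictionary needs care.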

Let's move on to Stage 2

Stage 2 – Checking resource availability

At this stage (step 2 in the source code) two main criteria were checked:

  • Whether the website is reachable by at least one of the methods: ping, http, https

  • Whether the site is a mirror, that is, whether it returns an HTTP 301 response. If it does, only redirects to itself are ignored (for example, http://www.facebook.com redirects to https://www.facebook.com, which is not a mirror)

Based on the results of this check, 281307 domains were filtered out, leaving 194532 domains.
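The mirror criterion can be sketched as a pure function over the response status and the Location header, leaving the network request itself out. The function names are hypothetical, and treating "www." as the same host is my assumption, not something the original code is known to do.

```python
from urllib.parse import urlsplit


def normalize_host(url):
    """Lower-cased host name without a leading 'www.' (an assumption)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_mirror(requested_url, status, location):
    """True if a 301 response sends the client to a different host."""
    if status != 301 or not location:
        return False
    if not urlsplit(location).netloc:
        return False  # relative redirect stays on the same host
    return normalize_host(location) != normalize_host(requested_url)


# The example from the text: a redirect to itself is not a mirror.
print(is_mirror("http://www.facebook.com", 301, "https://www.facebook.com/"))
# A made-up mirror redirecting to a made-up main site.
print(is_mirror("http://mirror-17.example", 301, "https://main.example/"))
```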

Stage 3 – Checking the content of the main page by tags

The 3rd stage (step 3 in the source code) was the most resource-intensive. It is essentially similar to the first, but here the goal is to filter out parked domains and sites that have normal-looking domain names but useless content.

At the end of this stage, 41598 domains remain.
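As an illustration of a tag-based check, the sketch below extracts the title and the description/keywords meta tags with the standard html.parser module and looks for marker phrases. The marker list and all names are hypothetical; the tags and phrases actually used in step 3 may differ.

```python
from html.parser import HTMLParser

# Hypothetical marker phrases; the real step 3 dictionary is not shown here.
MARKERS = ("domain is for sale", "buy this domain", "parked")


class TitleAndMeta(HTMLParser):
    """Collects the <title> text and description/keywords meta content."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() in ("description", "keywords"):
                self.text.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.text.append(data)


def looks_useless(html, markers=MARKERS):
    """True if the collected tag text contains any marker phrase."""
    parser = TitleAndMeta()
    parser.feed(html)
    haystack = " ".join(parser.text).lower()
    return any(marker in haystack for marker in markers)


print(looks_useless("<title>This domain is for sale</title>"))
```

Downloading and parsing the main page of every remaining domain is what makes this stage so much more expensive than the name-only filter of Stage 1.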

Because the three-stage procedure consumes a lot of resources (mainly stages 2 and 3), it is difficult to run it on a regular basis, yet the lists still need to be replenished with up-to-date information. This is where data from the Open Observatory of Network Interference (OONI), which regularly publishes data on blocking around the world, comes in.

At Stage 4 (step 4) a list of blocked domains is generated from OONI data: the code takes a one-week sample and processes it according to the scheme below.

Scheme for generating ooni_domains.lst

Next we move on to the final 5th stage:

Stage 5 – Formation of the final list of domains and resolution

At the final stage (folder step 5), you need to build a combined list of domains from the filtered RKN list, the OONI list and the community list (which stores domains that are not blocked by RKN but that you would like to open through a VPN, such as netflix.com), with duplicates removed. The output is a list of 42662 domains (as of 10/13/2024), including Discord domains. We could stop there, but for practical use IP addresses are also needed, because routing is often done at Layer 3 of the OSI model, while most blocking happens at higher layers.
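The merge with duplicate removal can be sketched as follows. The function name and the sample domains are hypothetical; in the project the inputs would be the filtered RKN list, ooni_domains.lst and community.lst.

```python
def merge_domain_lists(*lists):
    """Merge domain lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for domains in lists:
        for raw in domains:
            # Normalize case, whitespace and a trailing dot before comparing.
            name = raw.strip().lower().rstrip(".")
            if name and name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


rkn = ["example.org", "Example.org "]        # stand-in for the filtered RKN list
ooni = ["ooni.example"]                      # stand-in for ooni_domains.lst
community = ["netflix.com", "example.org"]   # stand-in for community.lst
print(merge_domain_lists(rkn, ooni, community))
```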

To determine the IP addresses of domains correctly and summarize them correctly, it is not enough to use the DNS records of a specific domain. Some domains, such as Facebook.com, use thousands of IP addresses that are not reflected in the DNS records of their main domains, so you need to draw on data about the subnets of their autonomous systems (identified by ASN). However, summarizing addresses by ASN is not possible for all domains, otherwise the situations described at the beginning of the article can occur. I took the following points into account and defined them as rules:

  • Lists of subnets from an ASN can only be used for ASNs that belong to the domain itself (for example, facebook.com and its AS32934: Facebook is not a hosting provider, so the chances of unrelated resources ending up there are minimal)

  • For domains located in the ASN of a hosting provider, only DNS records may be used. An example is the same Speedtest.net, which belongs to AS54113 of Fastly: Fastly is a hosting provider, which means that summarizing over wide masks such as /24 can have dire consequences
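The two rules above can be expressed as a small decision function. The two tables contain only the examples named in the text; the names DEDICATED_ASNS, HOSTING_ASNS and address_source are hypothetical, and a real directory would be maintained by hand.

```python
# Domains that own their autonomous system (example from the text).
DEDICATED_ASNS = {"facebook.com": 32934}
# Autonomous systems of hosting providers (example from the text: Fastly).
HOSTING_ASNS = {54113}


def address_source(domain, origin_asn):
    """Return 'asn' if the whole AS may be used, otherwise 'dns'."""
    if origin_asn in HOSTING_ASNS:
        return "dns"  # shared hosting: only the resolved addresses
    if DEDICATED_ASNS.get(domain) == origin_asn:
        return "asn"  # the AS belongs to the domain owner
    return "dns"      # unknown AS: stay on the safe side


print(address_source("facebook.com", 32934))
print(address_source("speedtest.net", 54113))
```

Defaulting to DNS records for any AS that is not explicitly listed keeps a missing directory entry from widening the routes.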

The lists were resolved via DNS records separately, after which a single IP list was created with duplicates removed. At the last stage, subnets from the ASN directory were added and the list was summarized by mask, but not beyond /28. As of 10/13/2024, the final list contains 23434 IP addresses, including the IP addresses of Discord servers, which is more than suitable for use on routers.
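One possible reading of the /28 limit is sketched below: resolved IPv4 addresses are merged into their covering /28 block only when enough of them fall into it, and are kept as /32 routes otherwise. Both this interpretation and the min_hosts threshold are my assumptions; the article does not state the exact merge rule, and the addresses are documentation-range examples.

```python
import ipaddress
from collections import defaultdict


def summarize(addresses, cap_prefix=28, min_hosts=3):
    """Merge dense groups of IPv4 hosts into blocks no wider than cap_prefix."""
    hosts = {ipaddress.IPv4Address(a) for a in addresses}  # removes duplicates
    blocks = defaultdict(set)
    for host in hosts:
        block = ipaddress.ip_network(f"{host}/{cap_prefix}", strict=False)
        blocks[block].add(host)

    result = []
    for block, members in blocks.items():
        if len(members) >= min_hosts:
            result.append(block)  # dense enough: one summarized route
        else:
            result.extend(ipaddress.ip_network(f"{h}/32") for h in members)
    return sorted(result)


resolved = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.3",
            "192.0.2.200", "198.51.100.7"]
print([str(net) for net in summarize(resolved)])
```

Because a /28 holds only 16 addresses, a wrongly merged block can drag at most a handful of unrelated hosts into the VPN route, unlike the /24 case from the beginning of the article.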

3. Result

What's the end result?

The resulting lists (available in repositories):

  • list of currently blocked domains – 42662 domains

  • list of IP addresses obtained from resolution – 23434 IP addresses

These lists are also used to build the V2Fly and Xray files geoip.dat and geosite.dat, and the Sing-box files geoip.db and geosite.db.

A BGP server is running in test mode at 165.22.127.207 (AS number 65412) and distributes the ipsum.lst list. Setup instructions are available for OpenWRT (Linux) and Mikrotik.

The result is a list of domains that retains 6.2% of the original records in the RKN lists (and more can be filtered out). In addition, there is now an up-to-date and quick way to replenish it, based not only on RKN data but also on an independent source (YouTube is still not on the RKN lists, yet it is being blocked).

To answer the question “If there is so much garbage in the RKN list, why not abandon it altogether?”: the RKN list does contain a lot of garbage, but it also contains quite valuable resources that many of us are not aware of and that are unlikely to make it into lists created from scratch.

In the near future, we plan to set up automatic builds based on GitHub Actions and to catch errors: the resolution had to be written from scratch, so errors are possible.

Thanks for reading. Go to GitHub, star the repository, leave feedback and suggest domains for community.lst.

I hope that some of my colleagues, in particular: Antifilter.download, Antifilter.network, @ValdikSS, may find these developments useful in their projects.
