Google Sheets in OSINT: Data Scraping

Hello everyone! We are talking again about the use of spreadsheets in the work of OSINT analysts. In the first part of our series we discussed automating the generation of dorks (advanced search query operators); in the second we touched on integrating APIs for data research. Today we cover scraping: we will go over the basics, build a service for checking convictions and reading RSS, and integrate search results from websites into our spreadsheets.

What is scraping?

Web scraping (often just "scraping") is a technique for collecting web data by extracting it from the pages of web resources. Scraping is widely used for tracking prices and goods, competitive intelligence, monitoring news and content, and extracting contact details and other significant information.

Let's lay some groundwork first. The following script lets us request the status of a given page on the web:

function getPageStatus(url) {
  try {
    // Fetch the page; muteHttpExceptions keeps error status codes from throwing
    var response = UrlFetchApp.fetch(url, {
      'muteHttpExceptions': true
    });
    var statusCode = response.getResponseCode();
    // Report the code plus a human-readable verdict
    return " " + statusCode + " " + (statusCode == 200 ? "FOUND" : "NOT FOUND");
  } catch (e) {
    // Network-level failures (e.g. bad DNS records) end up here
    return "Error: " + e.message;
  }
}
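Once the script is saved in the spreadsheet's script editor (Extensions → Apps Script), it becomes a custom function that can be called straight from a cell, for example:

=getPageStatus(A2)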

At the output we get information about the page's availability: code 200 (found), 404 (not found), 429 (too many requests), or an error message for failures such as bad DNS records. We will need this for a range of tasks, from checking nicknames on social networks to scanning web resources.

Now let's get to scraping itself. In Google Sheets, the =IMPORTXML function does the job. First, let's try to copy data from the description of our Telegram channel at https://t.me/tomhunter.

Let's create a new spreadsheet. By the way, you can do this simply by typing sheet.new into the Chrome address bar. In cell A2 we put the address https://t.me/tomhunter, and in cell B2 we place our first formula.
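A minimal version queries the //title element of the page:

=IMPORTXML(A2, "//title")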

This will allow us to copy the web page title into the table.

Let's try a different XPath query.
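The //* path matches every element on the page, so a sketch along these lines pulls in everything at once:

=IMPORTXML(A2, "//*")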

This will dump the entire contents of the web page into the table. This is usually where problems arise, since the contents end up spread across many cells. To gather all the information into one cell, we can wrap the import in a join.
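For example, with TEXTJOIN (the space delimiter here is an arbitrary choice):

=TEXTJOIN(" ", TRUE, IMPORTXML(A2, "//*"))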

Let's clean up the data we've received.
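One simple approach, assuming the joined text sits in C2 (the cell reference is illustrative), cuts the text off at the first occurrence of a marker word:

=IFERROR(LEFT(C2, FIND("var", C2) - 1), C2)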

It deletes the text in the cell starting from a given character or word. In our case that is "var", which is where the web page's technical information begins.

Now let's extract the contacts listed in the channel description from the data we received. For this we use the =REGEXEXTRACT formula together with regular expressions. A regular expression is a pattern that a regex engine tries to match against input text. For example, a program can recognize an email address in text with a pattern like the one below.
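One widely used variant:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}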

And here is the formula we insert into the cell.
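Again assuming the cleaned text is in C2:

=REGEXEXTRACT(C2, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

Note that REGEXEXTRACT returns only the first match it finds.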

You can use such regular expressions to search the cell text for phone numbers, user nicknames, hyperlinks, hashtags, and other relevant information.

Scraping court records on individuals

Now let's apply what we practiced on the Telegram channel to exchange data with a court decisions website.

This will allow us to load information about an individual's criminal, administrative, and civil cases into the table. Recalling our first article, let's generate a ready-made hyperlink combining the search page https://sudact.ru/regular/doc/?regular-txt= and a query of the form "Last name I.O." I will illustrate this with Mikhail Aleksandrovich Babel, who is already familiar to us and is on the federal wanted list.
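The link can be assembled by plain concatenation. Assuming the query string sits in A2, something like this works, with ENCODEURL taking care of the Cyrillic characters:

=HYPERLINK("https://sudact.ru/regular/doc/?regular-txt=" & ENCODEURL(A2), A2)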

Now let's build an import formula for this web page.
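A sketch, assuming the generated URL sits in B2; joining the imported blocks with a rare character such as ~ gives us a delimiter to split on later:

=TEXTJOIN("~", TRUE, IMPORTXML(B2, "//*"))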

We have a pile of unstructured data. To break it apart, we use the =SPLIT formula.
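With the joined text in cell A3, it is simply:

=SPLIT(A3, "~")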

This spreads the data from cell A3 across neighboring cells, splitting wherever the ~ character appears between blocks of text.

Let's try to clean up the search results here too. Suppose we are only interested in the case numbers. In that case, we can write a formula like the one below.
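Case numbers on the site follow a pattern such as № 2-1234/2019, so a first attempt (the exact expression may need tuning against the live page, and the B3 reference is illustrative) could be:

=REGEXEXTRACT(B3, "№\s?[0-9А-Яа-я/-]+")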

As you can see in the screenshot, the case numbers we need now appear in a separate cell. So, by combining Google Sheets with scraping of the court decisions website, we get a convenient service that pulls up all the information we need on an individual in a couple of clicks.

Scraping the website code

Let's move on to more complex forms of scraping. So, we have search results from the page https://sudact.ru/regular/doc/?regular-txt=Бабель М.А. Go to this page, right-click, and choose "Inspect". The web page's code will open at the side of the screen. In the Elements tab, click the element picker ("Select an element in the page to inspect it"), then click the element that displays all the court decisions found. Copy the address of that element in XPath format (right-click the node, then Copy → Copy XPath). You should end up with an element path of this kind.
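The exact value depends on the page's current markup; a hypothetical example:

//*[@id="searchResults"]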

Now we write a new formula using that XPath.
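One detail: the double quotes inside the copied XPath must be changed to single quotes so they don't break the formula string (the searchResults id is the hypothetical one from above):

=IMPORTXML(B2, "//*[@id='searchResults']")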

This loads into the table only the information contained in the selected element and spreads it across adjacent cells. Let's try to reproduce this functionality for the description page of our Telegram channel. The formula could look like the one below.
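Here the tgme_page_description class is an assumption about the current markup of Telegram's channel preview pages; check it in DevTools the same way as above:

=IMPORTXML("https://t.me/tomhunter", "//div[@class='tgme_page_description']")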

The formula loads the channel description into our table line by line. Now let's extend it to study a whole group of channels: just drag the formula down across the rows of the table.

I would also like to note that we can import RSS feeds into the table; the =IMPORTFEED function is responsible for this. Let's say we generated an RSS feed using Google Alerts. We can pull it into the table with the following formula.
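With the feed URL in A2, the default call returns the feed items with all of their columns:

=IMPORTFEED(A2)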

Or with a more selective call.
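The optional arguments narrow the output; here, item titles only, no header row, at most ten entries:

=IMPORTFEED(A2, "items title", FALSE, 10)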

Pretty cool, right? Best of all, everything here is done with an easily accessible service and doesn't cost millions of dollars like the analytical software built for intelligence agencies. Using the examples from this article, you can easily pull information even from sites that offer no convenient API for such requests.

We will continue this series: in the next article we will cover other spreadsheet capabilities for analyzing information, in particular visualizing data on a map or as a timeline. Good luck with your OSINT work, and subscribe to our blog!
