Using Python to collect and preprocess digital footprint data

The digital footprint is usually discussed only in general terms, and programming techniques for working with it are mentioned only in passing. This article explores a set of Python libraries and techniques for collecting and preprocessing digital footprint data.

The concept of “Digital Footprint”

The concept of a “digital footprint” has no fixed legal definition; in the literature it is described as data about a particular person, and sometimes as data about an organization or event. Most often, however, it refers to information about people.

In 2022, a professional standard for specialists in modeling, collecting, and analyzing digital footprint data was adopted in Russia (Profstandart 06.046, “Modeling specialist, digital footprint”). This document likewise refers to data about a person:

General information: Conducting a comprehensive analysis of the digital footprint of a person (groups of people) and information and communication systems (hereinafter referred to as ICS).

Thus, a digital footprint is data on the Internet related to a specific object, which is most often a person. When working with a digital footprint, it is important to keep personal data and intellectual property laws in mind; this article discusses only lawful collection of data from the Internet.

Stages of collecting and preprocessing a digital footprint

The logical way to collect a digital footprint with software tools is to program human-like logic for working with public data: a person searches for the object of interest in a search engine, then studies the resulting pages and selects the information relevant to that object.

Preprocessing depends on the purpose of further work and on the type of data collected. Text preprocessing may include cleaning out unnecessary characters and tokenization, i.e. splitting the text into words, subwords, or characters; numeric preprocessing, handling gaps (missing values) and normalization; image preprocessing, simple formatting.

Thus, the main stages of collecting and preprocessing digital footprint data are:

  1. Sending an HTTP request to the web server of a search engine, with the query mentioning the object of interest;

    [Figure: URL when performing a search]

  2. Obtaining a link to a page on the Internet about the object of interest from the web server's response, then sending an HTTP request to obtain the code of that page;

  3. Selecting the information of interest from the received page code: either manually configured data collection from certain page segments, or checking the text for mentions of the object and collecting the sentences that contain them;

  4. Digital footprint data preprocessing:

  • Text: cleaning, tokenization;

  • Numbers: gap handling, normalization;

  • Images: simple formatting.
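
The first stage above can be sketched in a few lines: the object of interest is encoded into a search engine's URL. The endpoint below (DuckDuckGo's HTML interface) and the example name are illustrative assumptions, not part of the original article; any search engine with a GET-based query works similarly.

```python
from urllib.parse import urlencode

def build_search_url(object_of_interest: str) -> str:
    """Build a search URL whose query mentions the object of interest."""
    # Example endpoint (an assumption); swap in any GET-based search engine.
    base = "https://html.duckduckgo.com/html/"
    return base + "?" + urlencode({"q": object_of_interest})

url = build_search_url("Ivan Ivanov digital footprint")
print(url)
```

The resulting URL is what you would then request with an HTTP client in stage 1.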

Python libraries for digital footprint collection and preprocessing

Collecting data from the Internet with programs is called parsing (web scraping), and there are various Python libraries for it; “Requests” and “Beautiful Soup” are especially beginner-friendly. When working with a digital footprint, the actions are similar; you just additionally need to search for and select information about one specific object of interest.

  1. Sending an HTTP request to the web server of a search engine with a mention of the object of interest can be done using the “Requests” library. You need to specify the object in the request; for example, it can be passed in the URL query string.

  2. Obtaining a link to a page on the Internet about the object of interest from the web server's response can be done with “Beautiful Soup”: save the “href” attribute of the “a” tag that leads to a new page. An HTTP request to get the code of that page can then again be sent with “Requests”.

  3. Selecting the information of interest from the received page code can be done either by manually locating the relevant segment with “Beautiful Soup”, or by checking for mentions of the object in the text in a loop using Python's built-in string operations.

  4. Digital footprint data preprocessing:

  • Text: for simple cleaning you can use Python's built-in string methods and the “string” module; for tokenization (breaking the text into units) it is convenient to use “NLTK”.

  • Numbers: you can use “NumPy” and “pandas” to remove missing values or replace them with the mean or median; for normalization you can use “scikit-learn” or “pandas”.

  • Images: simple formatting can be done with “Pillow”.
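
Steps 1 and 2 can be sketched as follows, assuming “beautifulsoup4” is installed. The HTML string here is a hardcoded stand-in for a real search engine response (live, it would come from `requests.get(url).text`), and the URLs are hypothetical:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by the search engine (the step 1 response).
search_results_html = """
<div class="result"><a href="https://example.com/profile">Profile page</a></div>
<div class="result"><a href="https://example.com/news">News mention</a></div>
"""

soup = BeautifulSoup(search_results_html, "html.parser")
# Step 2: save the "href" attribute of each "a" tag that leads to a new page.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)

# The code of the first result page could then be fetched with Requests:
# import requests
# page_html = requests.get(links[0], timeout=10).text
```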
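
For step 3, checking for mentions of the object in a loop can be sketched with built-in string operations only; the sentence splitter here is deliberately naive (splitting on periods), and the page text and name are made up for illustration:

```python
def sentences_mentioning(text: str, obj: str) -> list[str]:
    """Collect sentences that mention the object of interest (step 3)."""
    # Naive sentence split on "."; real pages may need a smarter splitter.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if obj.lower() in s.lower()]

page_text = ("Ivan Ivanov works at Example Corp. The weather was fine. "
             "In 2020 Ivan Ivanov published a paper.")
print(sentences_mentioning(page_text, "Ivan Ivanov"))
```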
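
Text preprocessing with the “string” module and “NLTK” might look like this minimal sketch; `wordpunct_tokenize` is a regex-based NLTK tokenizer chosen here because it needs no extra data downloads, and the sample sentence is invented:

```python
import string
from nltk.tokenize import wordpunct_tokenize  # regex tokenizer, no downloads needed

raw = "Ivan Ivanov, developer!!! (since 2015)"
# Cleaning: drop ASCII punctuation using the built-in "string" module.
cleaned = raw.translate(str.maketrans("", "", string.punctuation))
# Tokenization: split the cleaned text into word units.
tokens = wordpunct_tokenize(cleaned)
print(tokens)
```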
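
Numeric preprocessing, gap handling plus normalization, can be sketched with “pandas” and “scikit-learn”; the series of values is invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Numeric footprint data with a gap (np.nan).
values = pd.Series([10.0, np.nan, 30.0, 20.0])
# Gap handling: replace missing values with the mean (the median also works).
filled = values.fillna(values.mean())
# Normalization to the [0, 1] range with scikit-learn.
scaled = MinMaxScaler().fit_transform(filled.to_numpy().reshape(-1, 1))
print(scaled.ravel())
```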
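
Simple image formatting with “Pillow” can be sketched as below; a synthetic image stands in for one collected from the Internet, and the target size and grayscale conversion are arbitrary example choices:

```python
from PIL import Image

# Synthetic stand-in for a collected image (e.g., a profile photo).
img = Image.new("RGB", (640, 480), color=(200, 30, 30))
# Simple formatting: unify size and color mode before further processing.
formatted = img.resize((128, 128)).convert("L")  # "L" = 8-bit grayscale
print(formatted.size, formatted.mode)
```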
