Useful methods for working with data in Pandas. Part 3


Article author: Roman Kozlov

Head of BI-analytics course

Introduction

Nowadays, the amount of information is growing at an incredible pace. Every day, more and more data is generated and stored in computers, smartphones, cloud services, etc.

The growth in data storage volumes in recent years has led to the development and use of more complex and flexible data storage structures. One such structure is JSON (JavaScript Object Notation), which quickly became popular and widely used due to its ease of reading and flexibility. JSON allows data to be organized as nested key-value pairs, allowing for efficient storage and communication of structured data.

Nested JSON data structures are found in a wide variety of areas. For example, they are used in APIs for exchanging information between clients and servers, in NoSQL databases for storing and processing large amounts of semi-structured data, and in various applications and services that require flexibility and efficiency in working with data. All this makes it important to be able to process and parse complex data structures such as JSON and integrate them into data processing and analysis workflows using tools such as Pandas.

In this article, we’ll look at different approaches to working with nested data structures in Pandas, as well as discuss the process of normalizing JSON structures. Using examples, we will demonstrate how you can efficiently extract and process nested data, converting it into a format that is easy to analyze.

Hierarchical structure in JSON data format.

JSON (JavaScript Object Notation) was developed in the early 2000s by Douglas Crockford as a simple and convenient data interchange format.

The main idea was to provide a lightweight and understandable format for representing structured data that can be used both in browsers and servers. JSON originated from a subset of the JavaScript programming language, but has become independent of it over time, and is now supported by many other programming languages.

One of the reasons JSON has become a popular data storage format is its simplicity and readability for both humans and computers. JSON uses a key-value structure and supports nesting, allowing for efficient storage of hierarchical data. This flexibility makes JSON suitable for a variety of applications, including server-client communication, configuration file storage, and even data storage in document-oriented databases as an alternative to traditional relational tables.

JSON provides the ability to combine different types of data, such as numbers, strings, booleans, lists, and objects (dictionaries), to represent complex data structures and facilitate data exchange between different systems and programming languages.
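As a small illustration of how these JSON types appear in Python, the standard-library json module maps objects to dict, arrays to list, strings to str, numbers to int or float, and true/false to bool. The sample document below is made up for this sketch.

```python
import json

# Hypothetical sample document, invented for illustration
raw = (
    '{"id": 1, "name": "Alice", "active": true, "gpa": 4.5, '
    '"tags": ["a", "b"], "address": {"city": "Oslo"}}'
)

parsed = json.loads(raw)

# Collect the Python type name for each top-level key
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type_names)
```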

Because JSON objects represent hierarchical data so flexibly and conveniently, combinations such as lists of dictionaries are encountered often.

Lists of dictionaries in JSON structures have a number of advantages:

  1. Order of elements: Unlike the keys of a JSON object, whose order the format does not guarantee, list elements keep their order, which lets you preserve the sequence of the data.

  2. Flexibility: Lists of dictionaries can contain any number of elements, and the dictionaries can have different sets of keys, which allows you to store data of different structure and complexity.

  3. Nesting support: JSON allows you to store nested data structures such as lists of dictionaries inside other dictionaries or lists. This allows you to represent hierarchical data such as trees or graphs.
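The flexibility point above has a practical consequence once such a list reaches Pandas. A minimal sketch, with records invented for illustration: when dictionaries have different key sets, pd.DataFrame builds columns from the union of the keys and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical records with different key sets, invented for illustration
records = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob"},                     # no "email" key
    {"id": 3, "name": "Carol", "tags": ["new"]},  # extra "tags" key
]

df_mixed = pd.DataFrame(records)

# Columns are the union of all keys; missing values show up as NaN
print(df_mixed.columns.tolist())
print(df_mixed["email"].isna().tolist())
```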

A JSON document that uses a list of dictionaries might look like this:

[
  {
    "id":1,
    "name":"Alice",
    "subjects":[
      {
        "subject":"math",
        "score":85
      },
      {
        "subject":"history",
        "score":90
      }
    ]
  },
  {
    "id":2,
    "name":"Bob",
    "subjects":[
      {
        "subject":"math",
        "score":95
      },
      {
        "subject":"history",
        "score":88
      }
    ]
  }
]

Each element of the list is a dictionary with information about a student, and inside each dictionary the subjects key holds a list of dictionaries with the student’s grades in various subjects. This structure lets you conveniently store and share information about students and their grades.
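To make the nesting concrete, here is a minimal sketch that parses the two-student document above with the standard json module and flattens it by hand. The variable names students_json, students, and rows are chosen for this illustration only.

```python
import json

# The two-student document from the example above, as a JSON string
students_json = """
[
  {"id": 1, "name": "Alice",
   "subjects": [{"subject": "math", "score": 85},
                {"subject": "history", "score": 90}]},
  {"id": 2, "name": "Bob",
   "subjects": [{"subject": "math", "score": 95},
                {"subject": "history", "score": 88}]}
]
"""

students = json.loads(students_json)

# Flatten by hand: one (name, subject, score) tuple per nested dictionary
rows = [
    (student["name"], item["subject"], item["score"])
    for student in students
    for item in student["subjects"]
]
print(rows)
```

The nested loop works for this small case, but it has to be rewritten for every new structure, which is the motivation for the Pandas methods discussed next.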

Method for expanding a hierarchy from a list of dictionaries. explode method

Converting a list of dictionaries to the familiar tabular format of a Pandas DataFrame can be tricky, especially if the data structure is heterogeneous or contains several levels of nesting. In such cases, you can use the explode method to convert the data into an easy-to-analyze format.

The explode method in Pandas transforms a column of list-like values, such as nested lists or lists of dictionaries, into separate rows, one per list element, while repeating the values of the remaining columns. This simplifies the data structure for further analysis. When a JSON structure contains lists of dictionaries, explode can be especially helpful.
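Before turning to dictionaries, the row-copying behaviour can be seen on a column of plain lists. This is a minimal sketch with a toy frame invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame, invented for illustration
toy = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "scores": [[85, 90], [95]],
})

# Each list element gets its own row; "name" is repeated alongside it.
# ignore_index=True renumbers the rows 0..n-1 instead of repeating the old index.
exploded_toy = toy.explode("scores", ignore_index=True)
print(exploded_toy)
```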

Consider how explode works on a JSON structure containing lists of dictionaries. Let’s say we have the JSON document from the previous example, stored as a string:

your_json_string = """
[
  {
    "id": 1,
    "name": "Alice",
    "subjects": [
      {
        "subject": "math",
        "score": 85
      },
      {
        "subject": "history",
        "score": 90
      }
    ]
  },
  {
    "id": 2,
    "name": "Bob",
    "subjects": [
      {
        "subject": "math",
        "score": 95
      },
      {
        "subject": "history",
        "score": 88
      }
    ]
  }
]
"""

First, we import the required libraries and load the JSON data into a Pandas DataFrame:

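import json

import pandas as pd

# Parse the JSON text into Python objects (a list of dictionaries)
data = json.loads(your_json_string)
# Each dictionary becomes a row; "subjects" still holds a list of dictionaries
df = pd.DataFrame(data)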

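We get a DataFrame with two rows, one per student, and three columns: id, name, and subjects. Each cell of the subjects column still holds the full list of dictionaries for that student.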

To expand the subjects column, which contains lists of dictionaries, we apply the explode method:

exploded_df = df.explode("subjects", ignore_index=True)

Now our DataFrame has four rows, one for each combination of student and subject.

As you can see, the column subjects now contains individual dictionaries rather than lists of dictionaries. This simplifies further processing and analysis of the data, since now each row corresponds to a separate dictionary from the original list.
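A quick way to confirm this is to inspect the exploded cells directly. The sketch below rebuilds the same data as Python literals so that it runs on its own; the helper names all_dicts and scores are assumptions made for illustration.

```python
import pandas as pd

# Same two-student data as in the article, written as Python literals
data = [
    {"id": 1, "name": "Alice",
     "subjects": [{"subject": "math", "score": 85},
                  {"subject": "history", "score": 90}]},
    {"id": 2, "name": "Bob",
     "subjects": [{"subject": "math", "score": 95},
                  {"subject": "history", "score": 88}]},
]

exploded_df = pd.DataFrame(data).explode("subjects", ignore_index=True)

# Every cell of "subjects" is now a single dictionary
all_dicts = all(isinstance(cell, dict) for cell in exploded_df["subjects"])

# An individual key can be pulled out with an ordinary function applied per row
scores = exploded_df["subjects"].apply(lambda d: d["score"]).tolist()
print(len(exploded_df), all_dicts, scores)
```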

Data normalization. json_normalize method

The next step is to turn these dictionaries into separate columns, making the data even more convenient for analysis and visualization. For this, you can use the json_normalize function from the Pandas library, which converts JSON structures to a tabular format.

pd.json_normalize(exploded_df['subjects'])

Here pd.json_normalize normalizes the data inside the subjects column, where the nested data on student performance by subject is stored: each dictionary key (subject and score) becomes a separate column.
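The dictionaries in this article are flat, but json_normalize also handles dictionaries nested inside dictionaries. A minimal sketch, with records and a teacher key invented for illustration, shows how nested keys become dotted column names.

```python
import pandas as pd

# Hypothetical records with one extra level of nesting, invented for illustration
records = [
    {"subject": "math", "score": 85, "teacher": {"name": "Smith", "room": 12}},
    {"subject": "history", "score": 90, "teacher": {"name": "Jones", "room": 7}},
]

flat = pd.json_normalize(records)

# Nested keys are joined with "." by default (the separator is set by `sep`)
print(flat.columns.tolist())
```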

To join the original exploded_df DataFrame with the normalized DataFrame created from the subjects column, we use pd.concat().

We combine the data horizontally with the parameter axis=1 and use the drop() method to delete the subjects column from the original DataFrame, because it would simply duplicate the normalized data.

pd.concat(
    [exploded_df.drop('subjects', axis=1),
     pd.json_normalize(exploded_df['subjects'])],
    axis=1)

So, by combining the explode and json_normalize methods, we turned a complex JSON structure, in which some keys hold lists of dictionaries, into a tabular form that is clear and convenient for analytics.
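The steps above can be assembled into one script. This is a sketch under a few assumptions: the names normalized and result are introduced here, and the column is converted with tolist() before normalizing so that json_normalize receives a plain list of dictionaries. Note that pd.concat with axis=1 aligns rows by index, so it relies on ignore_index=True having reset the index after explode.

```python
import json

import pandas as pd

your_json_string = """
[
  {"id": 1, "name": "Alice",
   "subjects": [{"subject": "math", "score": 85},
                {"subject": "history", "score": 90}]},
  {"id": 2, "name": "Bob",
   "subjects": [{"subject": "math", "score": 95},
                {"subject": "history", "score": 88}]}
]
"""

data = json.loads(your_json_string)
df = pd.DataFrame(data)

# One row per student-subject pair, with a fresh 0..n-1 index
exploded_df = df.explode("subjects", ignore_index=True)

# One column per dictionary key; also indexed 0..n-1
normalized = pd.json_normalize(exploded_df["subjects"].tolist())

# Side-by-side join; rows line up because both indexes are 0..n-1
result = pd.concat([exploded_df.drop(columns="subjects"), normalized], axis=1)
print(result)
```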

Note that json_normalize works on dictionary-like data (a dictionary or a sequence of dictionaries). That is why we did not pass the entire DataFrame to it, but only the column whose values are organized this way.
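As an alternative that the article itself does not use, json_normalize can also be given the parsed JSON directly, with its record_path and meta parameters doing the work of explode and concat in one call. The sketch below is one possible way to get the same table; the variable name flat is an assumption for illustration.

```python
import pandas as pd

# Same two-student data as in the article, already parsed into Python objects
data = [
    {"id": 1, "name": "Alice",
     "subjects": [{"subject": "math", "score": 85},
                  {"subject": "history", "score": 90}]},
    {"id": 2, "name": "Bob",
     "subjects": [{"subject": "math", "score": 95},
                  {"subject": "history", "score": 88}]},
]

# record_path: the key holding the list of records to expand into rows
# meta: top-level keys to repeat next to each expanded record
flat = pd.json_normalize(data, record_path="subjects", meta=["id", "name"])
print(flat)
```

The two-step approach from the article remains useful when the data is already in a DataFrame, for example after being loaded from a database column.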

Conclusion

In this series of articles, we have explored many methods and techniques for processing and analyzing data with Python and the Pandas library. We covered a wide range of tasks: splitting data into intervals, quantile splitting, using sliding windows for calculations, shifting data in time series, transforming nested data structures, and normalizing complex JSON structures.

The methods described in this series let you analyze your data in greater depth and make it easier to work with. They provide extensive tools for processing, cleaning and transforming data, which is an important step in any data analysis project.

Data binning and quantile slicing help create segments or groups of data, which can be useful for purposes such as identifying customer groups, defining ranges of metric values, and finding statistical patterns. These methods can also simplify data analysis by converting continuous values into discrete categories.
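As a brief reminder of these binning techniques, a minimal sketch with pd.cut and pd.qcut might look like this. The scores, bin edges, and labels are assumptions made for illustration, not values from the earlier articles.

```python
import pandas as pd

# Hypothetical scores, invented for illustration
scores = pd.Series([55, 62, 71, 78, 85, 90, 95, 99])

# Binning with explicit edges (the right edge of each bin is inclusive by default)
grades = pd.cut(scores, bins=[0, 60, 80, 100], labels=["low", "mid", "high"])

# Quantile binning: four groups with roughly equal counts
quartiles = pd.qcut(scores, q=4, labels=["q1", "q2", "q3", "q4"])

print(grades.tolist())
print(quartiles.value_counts().sort_index().tolist())
```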

Sliding windows and data shifting for time series are key techniques when working with data sequences. They allow you to identify trends, seasonal changes and other features of the time series, and can also be used to create forecasting models and evaluate their accuracy.
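For the time series techniques, a minimal sketch with rolling and shift could look like the following. The daily values and the window size of 3 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical daily values, invented for illustration
sales = pd.Series(
    [10, 12, 11, 15, 14],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Sliding window: mean over the current and the two previous observations
rolling_mean = sales.rolling(window=3).mean()

# Shift: the previous day's value, used here to compute a day-over-day change
daily_change = sales - sales.shift(1)

print(rolling_mean.tolist())
print(daily_change.tolist())
```

The first two rolling values and the first shifted value are NaN because there is not enough history to fill the window.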

Working with nested data structures and JSON structures is becoming increasingly important due to the growing complexity and variety of data sources. The transformation and normalization methods of these structures allow analysts to extract useful information from complex and nested data, making it easier to process and integrate with other datasets.

Mastering these methods and using them in combination with other data analysis tools lets you explore your data more deeply and reveal hidden patterns and relationships. This, in turn, can lead to new insights, better decision-making and data-driven process optimization.

In conclusion, I would like to recommend a free webinar where my colleagues will talk about what hypotheses are and how to formulate them correctly, when A/B tests are suitable and when they are not, and alternative ways to test hypotheses.

The webinar is suitable for: analysts, product managers, marketers.
