Data validation

Optimizing the data model and removing duplicates is great, but how can we make sure that we are working with a valid data model?

This question is easy to answer in a traditional IPAM/CMDB implementation built around an internal database and custom data-processing logic exposed through a REST API, a GUI, or both. The custom logic validates the data before entering it into the database, ensuring that the database contains syntactically and semantically valid data.

In a simpler solution that stores the network data model (also known as the source of truth) in text files, it is difficult to thoroughly validate every transaction, especially if you use a text editor to modify those files. In that case, you need to build your own validation pipeline using tools that check:

  • text file syntax;

  • compliance with the data model schema;

  • referential integrity.

Using our latest data model with per-link prefixes, which is stored in a bunch of Ansible host_vars files and in the network.yml file, the validation pipeline must check that:

  • All files conform to YAML syntax (you can use tools such as yamllint for this; see the example after this list);

  • Host facts contain hostname and bgp_as values for each host;

  • The network data model contains a links value, which is an array of core and edge links;

  • Core links contain a prefix and at least two other values;

  • Edge links contain a single value, which is a dictionary (or object, if you prefer JSON terminology) with a single value.
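For example, the YAML syntax check could be as simple as pointing yamllint at the relevant files (assuming the host facts live in a host_vars directory next to network.yml):

$ yamllint network.yml host_vars/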

You can write a small program in any programming language to perform these tests, or you can use data modeling languages (also called schemas) such as YANG, JSON Schema, or XML Schema, which can enforce most of the required constraints. Since YAML files are easy to convert to JSON, we will use JSON Schema.

Referential integrity is often difficult to verify with a data modeling language; for that you might have to write your own software. At the very least, though, you can offload the tedious checking of data structures and formats to a third-party tool.

Host data validation

The first step in our data model validation logic is to check the Ansible host facts. These facts are often spread across multiple files and directories, or generated on the fly by an external script or Ansible plugin, so the best way to collect them is with the ansible-inventory program, which produces the JSON data structure expected from an external inventory script.

$ ansible-inventory -i ../hosts --list
{
    "_meta": {
        "hostvars": {
            "S1": {
                "bgp_as": 65001,
                "description": "Unexpected",
                "hostname": "S1"
            },
            "S2": {
                "bgp_as": 65002,
                "hostname": "S2"
            }
        }
    },
    "all": {
        "children": [
            "ungrouped"
        ]
    },
...

From the resulting JSON data structure we only need the host variables, and jq is perfect for this job:

$ ansible-inventory -i ../hosts --list|jq ._meta.hostvars
{
  "S1": {
    "bgp_as": 65001,
    "description": "Unexpected",
    "hostname": "S1"
  },
  "S2": {
    "bgp_as": 65002,
    "hostname": "S2"
  }
}

If we want to use the jsonschema command-line utility, we need to save the results in a text file and then call jsonschema with the names of that file and of the JSON Schema file against which the data should be validated.
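Assuming we save the jq output from the previous command in /tmp/hosts.json:

$ ansible-inventory -i ../hosts --list | jq ._meta.hostvars >/tmp/hosts.json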

$ jsonschema -i /tmp/hosts.json hosts.schema.json
{'bgp_as': 65001, 'description': 'Unexpected', 'hostname': 'S1'}: 
Additional properties are not allowed ('description' was unexpected)

Validating the network data model

To validate the network.yml file we will use a similar approach:

  • Convert the YAML file into JSON format with yq;

  • Run jsonschema on the resulting JSON file.

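# $$ expands to the shell's process ID, producing a unique temporary file name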
yq <network.yml . >/tmp/$$.network.json
jsonschema -i /tmp/$$.network.json network.schema.json

As mentioned above, JSON Schema allows us to validate the grammar of the data model, but not its referential integrity. For instance:

  • We cannot check whether the hostnames specified for core or edge links are valid (though see the sketch after this list);

  • While we can check the format of an interface name, we have no way to check whether the devices actually have the interfaces we want to use without connecting to the network devices or extracting data from a network management system.
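The hostname check can at least be approximated with the tools we already have. Here is a hypothetical sketch, assuming (as in our data model) that every key of a link dictionary other than prefix is a node name:

# Collect valid hostnames from the Ansible inventory
ansible-inventory -i ../hosts --list | jq -r '._meta.hostvars | keys[]' | sort >/tmp/hosts.txt

# Collect node names referenced by the links in network.yml
yq <network.yml . | jq -r '.links[] | keys[] | select(. != "prefix")' | sort -u >/tmp/nodes.txt

# Print node names that do not appear in the inventory
comm -13 /tmp/hosts.txt /tmp/nodes.txt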

A few words about JSON Schema

Data modeling languages are not for the faint of heart, and JSON Schema is no exception. The convoluted wording of the specification does not make life any easier (I have had more fun reading ISO or IEEE standards). Fortunately, the online book Understanding JSON Schema explains all the subtleties pretty well.

Just to give you a taste of what JSON Schema looks like, here is a JSON document describing the expected host variable data structure retrieved from the Ansible inventory:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://www.ipSpace.net/hosts.schema.json",
  "title": "Ansible inventory data",
  "definitions": {
    ...
  },
  "type": "object",
  "patternProperties": {
    ".*" : { "$ref" : "#/definitions/router" }
  },
  "minProperties": 1
}

Here is what we can say about this schema:

  • It describes Ansible inventory data;

  • It contains definitions for additional schemas (see below).

  • The top-level element is an object (dictionary) whose properties (we know these are the inventory hostnames) must each match the router schema;

  • The minimum number of properties is one (at least one host in the inventory file).

The router schema definition lives in the definitions property:

"router" : {
  "type" : "object",
  "properties": {
    "bgp_as": {
      "type": "number",
      "minimum": 1,
      "maximum": 65535
    },
    "hostname": {
      "type": "string"
    }
  },
  "required": [ "bgp_as","hostname" ],
  "additionalProperties": false
}

According to this schema, a router (more precisely, the Ansible host facts describing a router) is an object with the following properties:

  • A numeric bgp_as property, which must be between 1 and 65535;

  • A string hostname property;

  • Both properties are required, and no other properties are allowed on the object.
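For example, the S2 entry from our inventory satisfies this schema, while the S1 entry fails because of its extra description property (as the jsonschema run above showed):

{
  "bgp_as": 65002,
  "hostname": "S2"
}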

Roll up our sleeves

The host and network JSON schemas, as well as the source code of the validation script, are available on GitHub. Feel free to clone the repository, change the host_vars files or the network data model, and run the validation script yourself.

You may also want to explore JSON Schema further, in particular:

  • figure out what the network JSON schema does;

  • add an optional description property to the router data model;

  • adjust the bgp_as property validation to allow 4-byte AS numbers in dot notation (see the sketch after this list).
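For the last exercise, one possible approach (a sketch, not the only solution) is to accept either a number or a string matching a dot-notation pattern:

"bgp_as": {
  "oneOf": [
    { "type": "number", "minimum": 1, "maximum": 65535 },
    { "type": "string", "pattern": "^[0-9]+\\.[0-9]+$" }
  ]
}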

You will need the following tools: Ansible (for ansible-inventory), jq, yq, yamllint, and the jsonschema command-line utility.
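One possible way to install them on a Linux machine with Python (package names and managers may differ in your environment; the yq assumed here is the Python jq wrapper used throughout this article):

$ pip install ansible jsonschema yamllint yq
$ sudo apt-get install jq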

More on data validation

We cover data validation and CI/CD pipelines in more detail in the Validation, Error Handling and Unit Tests part of our online course Building Network Automation Solutions.

Coming up next

  • Data Model Hierarchy
