Data governance: adding third-party metadata to Apache Atlas

Governance and resilient data processing are critical success factors in virtually every organization. While the Cloudera Data Platform (CDP) already supports the entire data lifecycle from ‘Edge to AI’, we at Cloudera are fully aware that enterprises run many systems outside of CDP. It is important that CDP does not become yet another standalone platform in your IT landscape. To avoid that, it can be fully integrated into the existing corporate IT environment, however diverse that environment is, and can even track and categorize a wide range of existing data assets to provide a complete end-to-end picture. In this blog, we highlight the key aspects of CDP that provide data management and lineage, and show how they can be extended to include metadata from non-CDP systems across the enterprise.

SDX (Shared Data Experience)

Apache Atlas, a fundamental part of the Shared Data Experience (SDX) in CDP, provides consistent data protection and governance across the full range of analytics tools deployed in a hybrid architecture. Like CDP itself, SDX is built on open source projects, with Apache Ranger and Apache Atlas taking the lead. Atlas provides metadata management and a unified data catalog, as well as the ability to classify and govern these data assets. SDX in CDP takes full advantage of Atlas to automatically track and manage all data assets from all tools on the platform.

Leveraging Atlas Capabilities for Data Assets Outside of CDP

Atlas provides a basic set of predefined type definitions (called typedefs) for various Hadoop and non-Hadoop metadata objects, covering everything CDP needs. But Atlas is also an extremely flexible and customizable metadata framework that allows you to add assets from third-party data sources, even those outside of CDP.

Everything is built around the basic structure of the metadata model, which consists of type definitions and entities (for more details, see the Atlas documentation):

  1. Type definitions (typedefs):

    • can be derived from a supertype definition

    • can be part of a hierarchy, allowing you to build a tree-like, structured catalog of data assets

    • can have any number of attributes to hold all the required descriptive information

    • can define a valid set of classifications that can subsequently be attached to each entity of that typedef. In the example below, this is used to mark a specific server as a ‘database_server’. Classifications can also be used to indicate whether a table contains Personally Identifiable Information (PII).

  2. Entities are instances of a specific typedef and:

    • can be related to each other

    • can be associated with any number of classifications. For example, each application or use case can be given its own classification; the example below uses “xyz” as the application. Once added, related entities can be linked directly to that classification, giving a clear picture of the artifacts involved and how they relate to each other.

Finally, Atlas provides a rich set of REST APIs that can be used for:

  • managing basic typedefs and classifications

  • managing entities (instances of a typedef)

  • managing relationships between objects
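
For example, assuming an Atlas instance on localhost:21000 (as used in the examples later in this post), the already registered typedefs and any individual entity can be inspected with simple GET requests:

# list all registered type definitions (entity types, classifications, relationships)
curl -u admin:admin -H Accept:application/json \
  http://localhost:21000/api/atlas/v2/types/typedefs

# fetch a single entity, including its relationships, by GUID
curl -u admin:admin -H Accept:application/json \
  http://localhost:21000/api/atlas/v2/entity/guid/<guid>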

Extending the Atlas Metadata Model

The following steps describe how Atlas can be extended to capture metadata from third-party sources. At the various stages, ready-made scripts from the accompanying GitHub repository can be used.

1. Sketch of the end-to-end data lineage

Below is a very simple but common ETL pipeline scenario:
The source system (for example, a core banking application) sends a data file in CSV format to some storage location (not HDFS). The ETL process then reads the file, performs some quality checks, and loads the validated records into a DBMS table as well as a Hive table. Problem records are written to a separate error file.

To capture this end-to-end data flow in Atlas, we need the following typedefs:

Subjects:
– Server

Assets (typedefs):
– Files
– Table in the DBMS
– Hive table (* note that this asset type is already available in Atlas as an integral part of CDP. There is no need to create a typedef for it, but we will show how third-party assets can be connected to CDP assets to build end-to-end lineage.)

Processes:
– File transfer process
– ETL / DB loading process

2. Defining the required type definitions (typedefs)

From a design standpoint, a typedef is similar to a class definition. Predefined typedefs exist for all assets used in CDP, for example Hive tables. Definitions that don’t exist out of the box can be declared in a simple JSON file using the following syntax. The file 1_typedef-server.json describes the server typedef used in this blog.

Type: server
Derived from: ENTITY
Specific attributes for this typedef:
– host name (host_name)
– IP address (ip_address)
– zone (zone)
– platform (platform)
– rack (rack_id)
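
A minimal sketch of what 1_typedef-server.json could look like is shown below. The attribute names follow the list above; the supertype (Asset, so that name, description, owner and qualifiedName are inherited) and the attribute flags are assumptions – the actual file in the GitHub repository is authoritative.

{
  "entityDefs": [
    {
      "category": "ENTITY",
      "name": "server",
      "description": "A server outside of CDP that stores or processes data",
      "superTypes": ["Asset"],
      "typeVersion": "1.0",
      "attributeDefs": [
        { "name": "host_name",  "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": true },
        { "name": "ip_address", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": true,  "isIndexable": true },
        { "name": "zone",       "typeName": "string", "isOptional": true,  "cardinality": "SINGLE", "isUnique": false, "isIndexable": false },
        { "name": "platform",   "typeName": "string", "isOptional": true,  "cardinality": "SINGLE", "isUnique": false, "isIndexable": false },
        { "name": "rack_id",    "typeName": "string", "isOptional": true,  "cardinality": "SINGLE", "isUnique": false, "isIndexable": false }
      ]
    }
  ]
}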

3. Adding typedefs to Atlas via the REST API

To improve the reliability of CDP, all Atlas hooks use Apache Kafka as an asynchronous transport layer. However, Atlas also provides its own rich set of RESTful APIs. In this step, we use exactly those REST API v2 endpoints – the documentation for the complete set of REST API endpoints can be found here – and curl will be used to call the REST API.

Note: Optionally, you can use a local Docker-based installation for the first steps:

docker pull sburn/apache-atlas:latest
docker run -d -p 21000:21000 --name atlas \
  sburn/apache-atlas \
  /opt/apache-atlas-2.1.0/bin/atlas_start.py

The typedef JSON request is stored in the file 1_typedef-server.json, and we call the REST endpoint with the following command:

curl -u admin:admin -X POST \
  -H "Content-Type:application/json" -H "Accept:application/json" -H "Cache-Control:no-cache" \
  http://localhost:21000/api/atlas/v2/types/typedefs \
  -d @./1_typedef-server.json

You can also use the bash script create_typedef.sh from the repository to create all the required typedefs for the entire data pipeline.
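
A minimal sketch of the idea behind such a script, assuming the typedef files follow the naming pattern used above, could look like this:

#!/usr/bin/env bash
# post every typedef JSON file in the current directory to Atlas
ATLAS_URL=${ATLAS_URL:-http://localhost:21000}

for f in ./*typedef*.json; do
  echo "creating typedefs from $f"
  curl -s -u admin:admin -X POST \
    -H "Content-Type:application/json" -H "Accept:application/json" \
    "$ATLAS_URL/api/atlas/v2/types/typedefs" -d @"$f"
  echo
done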

4. Checking the Atlas UI after adding the types and classifications for external sources, to make sure the new definitions have been added

The new types are grouped under “3party”.

New classifications have also been added:

5. Creating the “server” entity

To create an entity, use the REST API /api/atlas/v2/entity/bulk and refer to the appropriate typedef (for example, “typeName”: “server”).

Good to know: create vs. modify. Each typedef defines which fields must be unique. If you submit a request in which these values are not unique, the existing instance (with the same values) is updated rather than a new one being inserted.

The following command shows how to create a server entity:

curl -u admin:admin -H Content-Type:application/json -H Accept:application/json http://localhost:21000/api/atlas/v2/entity/bulk -d '
  {
  "entities": [
    {
      "typeName": "server",
      "attributes": {
        "description": "Server: load-node-0 a landing_zone_incoming in the prod environment",
        "owner": "mdaeppen",
        "qualifiedName": "load-node-0.landing_zone_incoming@prod",
        "name": "load-node-0.landing_zone_incoming",
        "host_name": "load-node-0",
        "ip_address": "10.71.68.009",
        "zone": "prod",
        "platform": "darwin19",
        "rack_id": "swiss 1.0"
      },
      "classifications": [
        {"typeName": "landing_zone_incoming"}
      ]
    }
  ]
  }'

The script create_entities_server.sh from the GitHub repository illustrates how to create a server entity using a generic script with a few parameters. The output is the GUID of the created/modified artifact (for example, SERVER_GUID_LANDING_ZONE=f9db6e37-d6c5-4ae8-976c-53df4a55415b).

SERVER_GUID_LANDING_ZONE=$(./create_entities_server.sh 
-ip 10.71.68.009  <-- ip of the server (unique key)
-h load-node-0  <-- host name of the server
-e prod  <-- environment (prod|pre-prod|test)
-c landing_zone_incoming) <-- classification
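
The script in the repository is more complete; the sketch below only illustrates the mechanics, assuming jq is available to extract the GUID from the bulk response (the attribute mapping is simplified):

#!/usr/bin/env bash
# create (or update) a server entity and print its GUID
# flags: -ip <ip> -h <hostname> -e <environment> -c <classification>
while [ $# -gt 0 ]; do
  case "$1" in
    -ip) IP="$2"; shift 2 ;;
    -h)  HOST="$2"; shift 2 ;;
    -e)  ENV="$2"; shift 2 ;;
    -c)  CLASS="$2"; shift 2 ;;
    *)   shift ;;
  esac
done

RESPONSE=$(curl -s -u admin:admin -X POST \
  -H "Content-Type:application/json" -H "Accept:application/json" \
  http://localhost:21000/api/atlas/v2/entity/bulk -d '{
  "entities": [{
    "typeName": "server",
    "attributes": {
      "qualifiedName": "'"$HOST"'.'"$CLASS"'@'"$ENV"'",
      "name": "'"$HOST"'.'"$CLASS"'",
      "host_name": "'"$HOST"'",
      "ip_address": "'"$IP"'",
      "zone": "'"$ENV"'"
    },
    "classifications": [{ "typeName": "'"$CLASS"'" }]
  }]
}')

# the bulk endpoint returns the GUIDs of created or updated entities
echo "$RESPONSE" | jq -r '.mutatedEntities | (.CREATE // .UPDATE)[0].guid'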

6. Creating an entity of the “dataset” type

Similar to creating the server entity, use the /api/atlas/v2/entity/bulk REST API again, this time referring to the “dataset” type.

curl -u admin:admin -H Content-Type:application/json -H Accept:application/json http://localhost:21000/api/atlas/v2/entity/bulk -d '
  {
  "entities": [
    {
      "typeName": "dataset",
      "createdBy": "ingestors_xyz_mdaeppen",
      "attributes": {
        "description": "Dataset xyz-credit_landing.rec is stored in /incommingdata/xyz_landing",
        "qualifiedName": "/incommingdata/xyz_landing/xyz-credit_landing.rec",
        "name": "xyz-credit_landing",
        "file_directory": "/incommingdata/xyz_landing",
        "frequency":"daily",
        "owner": "mdaeppen",
        "group":"xyz-credit",
        "format":"rec",
        "server" : {"guid": "00c9c78d-6dc9-4ee0-a94d-769ae1e1e8ab","typeName": "server"},
        "col_schema":[
          { "col" : "id" ,"data_type" : "string" ,"required" : true },
          { "col" : "scrap_time" ,"data_type" : "timestamp" ,"required" : true },
          { "col" : "url" ,"data_type" : "string" ,"required" : true },
          { "col" : "headline" ,"data_type" : "string" ,"required" : true },
          { "col" : "content" ,"data_type" : "string" ,"required" : false }
        ]
      },
      "classifications": [
        { "typeName": "xyz" }
      ]
    }
  ]
  }'

The script create_entities_file.sh from the GitHub repository shows how to create a dataset entity and return the GUID for each file.

CLASS="systemOfRecord"
APPLICATION_ID="xyz"
APPLICATION="credit"
ASSET="$APPLICATION_ID"-"$APPLICATION"
FILE_GUID_LANDING_ZONE=$(./create_entities_file.sh 
 -a "$APPLICATION_ID" 
 -n "$ASSET"_"landing"  <-- name of the file
 -d /incommingdata/"$APPLICATION_ID"_landing  <-- directory
 -f rec  <-- format of the file
 -fq daily 
 -s "$ASSET" 
 -g "$SERVER_GUID_LANDING_ZONE"  <-- guid of storage server
 -c "$CLASS")

7. Maintaining classifications associated with applications

To keep track of which entity belongs to which application once there are many of them, we can create an additional classification per application. The script create_classification.sh creates such a classification, which can then be used to link all of an application’s assets to it.

CLASS="systemOfRecord"
APPLICATION_ID="xyz"
APPLICATION="credit"
ASSET="$APPLICATION_ID"-"$APPLICATION"
$(./create_classification.sh -a "$APPLICATION_ID")

REST endpoint call:

curl -u admin:admin -H Content-Type:application/json -H Accept:application/json http://localhost:21000/api/atlas/v2/types/typedefs -d '{
  "classificationDefs": [
    {
      "category": "CLASSIFICATION",
      "name": "xyz",
      "typeVersion": "1.0",
      "attributeDefs": [],
      "superTypes": ["APPLICATION"]
    }
  ]
  }'

8. Building a relationship between assets

For the data pipeline assets designed and built above, we need two different process types to connect them:

# add file transfer "core banking" to "landing zone"
FILE_MOVE_GUID=$(./create_entities_dataflow.sh 
 -a "$APPLICATION_ID" 
 -t transfer  <-- type of process
 -ip 192.168.0.102  <-- execution server
 -i "$ASSET"_"raw_dataset"  <-- name of the source file
 -it dataset  <-- type of the source
 -ig "$FILE_GUID_CORE_BANING"  <-- guid of the source
 -o "$ASSET"_"landing_dataset"   <-- name of the target file
 -ot dataset  <-- type of the target
 -og "$FILE_GUID_LANDING_ZONE"  <-- guid of the target
 -c sftp) <-- classification
echo "$FILE_MOVE_GUID"
# add etl "landing zone" to "DB Table"
FILE_LOAD_GUID=$(./create_entities_dataflow.sh 
 -a "$APPLICATION_ID" 
 -t etl_load  <-- type of process
 -ip 192.168.0.102  <-- execution server
 -i "$ASSET"_"landing_dataset"  <-- name of the source file
 -it dataset  <-- type of the source
 -ig "$FILE_GUID_LANDING_ZONE"  <-- guid of the source
 -o "$ASSET"_"database_table"    <-- name of the target file
 -ot db_table  <-- type of the target
 -og "$DB_TABLE_GUID"  <-- guid of the target
 -c etl_db_load) <-- classification
echo "$FILE_LOAD_GUID"

9. Putting It All Together

We now have all the pieces of the puzzle. The script sample_e2e.sh shows how to put them together to build an end-to-end data lineage; a condensed sketch of that chain follows the sequence below. The pipeline can also contain assets that already exist in CDP; you just need to establish the connections between them (as shown above).

Sequencing:

  • Create a unique classification for this application

  • Create the required server entities

  • Create the necessary dataset entities on the previously created servers (Mainframe, Landing zone).

  • Create the required entities of the database tables on the previously created database server

  • Create a process with type ‘transfer’ between Mainframe dataset > Landing zone

  • Create a process with type ‘etl_load’ between Landing zone > DB table

  • Create a process with type ‘etl_load’ between Landing zone > Hive table

  • Create a process with type ‘etl_load’ between Landing zone > Error dataset
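
A condensed sketch of that chain, reusing only the scripts and flags shown above (sample_e2e.sh in the repository is the complete version, including the DB, Hive and error-file steps):

#!/usr/bin/env bash
# condensed end-to-end sketch; GUIDs for the mainframe file and the DB table
# would be created with the same scripts and are omitted here
APPLICATION_ID="xyz"; APPLICATION="credit"; ASSET="$APPLICATION_ID"-"$APPLICATION"

# 1. unique classification for the application
./create_classification.sh -a "$APPLICATION_ID"

# 2. server entity for the landing zone
SERVER_GUID_LANDING_ZONE=$(./create_entities_server.sh \
  -ip 10.71.68.009 -h load-node-0 -e prod -c landing_zone_incoming)

# 3. dataset entity for the landing file on that server
FILE_GUID_LANDING_ZONE=$(./create_entities_file.sh -a "$APPLICATION_ID" \
  -n "$ASSET"_landing -d /incommingdata/"$APPLICATION_ID"_landing \
  -f rec -fq daily -s "$ASSET" -g "$SERVER_GUID_LANDING_ZONE" -c systemOfRecord)

# 4. processes: mainframe file -> landing zone, then landing zone -> DB table
./create_entities_dataflow.sh -a "$APPLICATION_ID" -t transfer -ip 192.168.0.102 \
  -i "$ASSET"_raw_dataset -it dataset -ig "$FILE_GUID_CORE_BANKING" \
  -o "$ASSET"_landing_dataset -ot dataset -og "$FILE_GUID_LANDING_ZONE" -c sftp
./create_entities_dataflow.sh -a "$APPLICATION_ID" -t etl_load -ip 192.168.0.102 \
  -i "$ASSET"_landing_dataset -it dataset -ig "$FILE_GUID_LANDING_ZONE" \
  -o "$ASSET"_database_table -ot db_table -og "$DB_TABLE_GUID" -c etl_db_load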

The scenario described above occurs in this or a similar form in almost every company. Atlas is a highly flexible metadata catalog that can be adapted to all kinds of assets. By integrating third-party assets, it delivers real added value by making existing data flows visible end to end. The connections between all assets are critical for assessing the impact of a change, or simply for understanding what is going on. I recommend taking a “start small” approach and recording the lineage of each dataset as it is connected to CDP or touched during maintenance. Take advantage of what is already there and complete the picture over time.
