The combination of these factors has led to a change in data management processes. The Data Platform is an approach that rethinks the traditional concept of the classic data warehouse using Big Data technologies and the approaches used to build Data Lake platforms. A Data Platform makes it possible to properly account for such important factors as growth in the number of users, time-to-customer (T2C) requirements (the ability to deliver changes quickly), and the cost of the resulting solution, including its further scaling and evolution.
In particular, we consider the experience of automating reporting under the Russian Accounting Standards (RAS), tax reporting, and reporting to Rosfinmonitoring at the National Clearing Center (hereinafter, NCC).
The architecture capable of satisfying the requirements below was chosen with great care. Both classic solutions and several Big Data ones, based on Hortonworks and the Oracle Appliance, took part in the selection.
The main requirements for the solution were:
• Automate the construction of regulatory reporting;
• Increase severalfold the speed of data collection and processing and of building the final reports (there were explicit requirements on the time needed to build all reporting for a day);
• Offload the core banking system (ABS) by moving reporting processes outside the general ledger;
• Choose the best solution in terms of price;
• Provide users with a comfortable, flexible, customizable reporting solution;
• Obtain a scalable, fault-tolerant system compatible with the technology stack of the MB group of companies.
A decision was made in favor of introducing the Neoflex Reporting Big Data Edition product based on the open-source Hadoop Hortonworks platform.
The DBMS of the source systems is Oracle; the sources also include flat files of various formats and images (for tax monitoring purposes), and some data is loaded via REST APIs. The task thus involves working with both structured and unstructured data.
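The "as is" landing of such heterogeneous payloads can be sketched as follows. This is a minimal illustrative Python sketch, not the platform's actual code; the class, method, and path layout are invented for illustration, standing in for an HDFS-backed ODS area with daily partitions:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch of "as is" ingestion into the ODS layer: every payload,
# structured or not, is stored untouched, keyed by source, entity, and load date.
# All names (OdsStore, the path layout) are illustrative, not the actual platform API.

@dataclass
class RawRecord:
    source: str      # e.g. "oracle_abs", "rest_api", "flat_file", "image"
    entity: str      # logical entity name, e.g. "clients"
    load_date: date  # daily partition key
    payload: bytes   # raw bytes, exactly as received from the source

class OdsStore:
    """In-memory stand-in for an HDFS-backed ODS area."""
    def __init__(self):
        self._partitions = {}  # (source, entity, load_date) -> list of raw payloads

    def put(self, rec: RawRecord) -> str:
        key = (rec.source, rec.entity, rec.load_date)
        self._partitions.setdefault(key, []).append(rec.payload)
        # The returned path mirrors a typical partitioned HDFS layout
        return f"/ods/{rec.source}/{rec.entity}/load_date={rec.load_date.isoformat()}"

ods = OdsStore()
path = ods.put(RawRecord("rest_api", "clients", date(2020, 1, 15), b'{"id": 1}'))
# path -> "/ods/rest_api/clients/load_date=2020-01-15"
```

The point of the sketch is that the landing layer imposes no schema: structured rows, file contents, and images all travel the same route and are interpreted only downstream.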
Let’s take a closer look at the storage areas of the Hadoop cluster:
Operational Data Store (ODS) – data is stored “as is”, in the same form and format as defined by the source system. To keep history for a number of entities that require it, an additional archive data layer (ADS) is implemented.
The source systems have no separate audit log suitable for delta extraction. Under these initial conditions, either extracting all records or computing the delta on the Hadoop cluster side looks optimal.
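The cluster-side delta computation mentioned above amounts to comparing two full snapshots by primary key. A minimal illustrative Python sketch (not the platform's actual Spark code; field names are invented):

```python
# Illustrative sketch: with no audit log on the source side, a delta can be
# derived on the cluster by comparing today's full extract with yesterday's,
# keyed by primary key.

def compute_delta(previous, current, key="id"):
    """Return (inserted, updated, deleted) rows between two full snapshots."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    inserted = [r for k, r in curr_by_key.items() if k not in prev_by_key]
    deleted = [r for k, r in prev_by_key.items() if k not in curr_by_key]
    updated = [r for k, r in curr_by_key.items()
               if k in prev_by_key and prev_by_key[k] != r]
    return inserted, updated, deleted

prev = [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}]
curr = [{"id": 1, "amount": 150}, {"id": 3, "amount": 300}]
ins, upd, dele = compute_delta(prev, curr)
# ins  -> [{"id": 3, "amount": 300}]   (new key)
# upd  -> [{"id": 1, "amount": 150}]   (same key, changed row)
# dele -> [{"id": 2, "amount": 200}]   (key gone from the new extract)
```

Note that this still requires both full extracts to be present on the cluster, which is part of why the project ultimately chose not to extract a delta at all.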
When processing a delta (and maintaining history) within the cluster, the following must be considered:
• Because data storage is append-only, historization schemes that require updating existing data are inapplicable or demand complex logic;
• For entities whose history is already maintained in the source system, a complete historical slice must be stored, because changes to past dates are possible and the full state of the history as of the settlement date must be taken into account;
• Working with history that has a single relevance date requires either selecting only the current records within each primary key or building intermediate “caching” data marts;
• In the absence of an industrial CDC tool, loading data from the source in full “slices” is faster than extracting the “delta” during load and then loading only the selected records.
Given the above, it proved optimal not to extract a delta, and the following approach was chosen:
• An ODS layer was implemented that stores source-system data as-is. The source systems thus bear no load beyond the single reads needed to load data into the Hadoop cluster.
• In ODS, history for non-transactional source data is maintained through the archive data layer, while transactional data is loaded once per day into date partitions.
• Data in the Portfolio Data Store (PDS) is built from historical and non-historical source data as “1 slice of actual data for 1 date” for each PDS entity.
Portfolio Data Store (PDS) is an area where critical data is prepared and stored in a unified, centralized format, subject to increased requirements on the quality not only of the data but also of its structure, syntax, and semantics. Examples include customer registers, transactions, and balance sheets.
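Building the “1 slice of actual data for 1 date” view of a PDS entity from an append-only history reduces to keeping, for each primary key, the latest version effective on or before the settlement date. A minimal illustrative Python sketch (not the platform's actual code; field names are invented):

```python
from datetime import date

# Illustrative sketch: for each primary key, keep the latest version whose
# effective date does not exceed the settlement date. Back-dated changes are
# naturally handled because the full history is scanned.

def actual_slice(history, settlement_date, key="id", eff="effective_date"):
    latest = {}
    for row in history:
        if row[eff] > settlement_date:
            continue  # versions effective after the settlement date are ignored
        cur = latest.get(row[key])
        if cur is None or row[eff] > cur[eff]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

history = [
    {"id": 1, "balance": 10, "effective_date": date(2020, 1, 1)},
    {"id": 1, "balance": 20, "effective_date": date(2020, 1, 10)},  # later version wins
    {"id": 2, "balance": 5,  "effective_date": date(2020, 1, 20)},  # after the settlement date
]
slice_ = actual_slice(history, date(2020, 1, 15))
# slice_ -> only id=1, with balance 20
```

In the real platform this selection runs as a Spark SQL job over HDFS partitions; the sketch only shows the per-key selection rule.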
ETL processes are developed on Spark SQL using Datagram. Datagram belongs to the class of “accelerator” solutions: it simplifies development through visual design and lets data transformations be described in the familiar SQL syntax, while the Scala code of the job itself is generated automatically. The development complexity is therefore comparable to building ETL on more traditional and familiar tools such as Informatica or IBM InfoSphere DataStage, so no additional training or involvement of experts with special knowledge of Big Data technologies and languages is required.
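The actual jobs are Scala code generated by Datagram and executed on Spark SQL. As a self-contained stand-in for that declarative, SQL-centric transformation style, the same kind of step can be shown with Python's built-in sqlite3; the table and column names are invented for illustration:

```python
import sqlite3

# Illustrative stand-in, not Datagram output: a typical PDS-building step
# expressed as plain SQL - aggregate raw ODS rows into a per-client portfolio.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_transactions (id INTEGER, client_id INTEGER, amount REAL);
    INSERT INTO ods_transactions VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 20, 70.0);
""")
rows = conn.execute("""
    SELECT client_id, SUM(amount) AS turnover
    FROM ods_transactions
    GROUP BY client_id
    ORDER BY client_id
""").fetchall()
# rows -> [(10, 150.0), (20, 70.0)]
```

The same SELECT, dropped into a visual transformation step, is what the developer writes; the generated Scala/Spark job handles distribution and HDFS I/O.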
At the next stage, the reporting forms are calculated. The results are placed in data marts in an Oracle DBMS, on top of which interactive reports are built with Oracle APEX. At first glance, using commercial Oracle alongside open-source Big Data technologies may seem counterintuitive; Oracle and APEX were chosen for the following reasons:
• The lack of an alternative BI solution compatible with a freely distributed DBMS and meeting the requirements of the NCC business for building on-screen and printed forms of regulatory reporting;
• Oracle is already used for the DWHs that act as source systems for the Hadoop cluster;
• The availability of the flexible Neoflex Reporting platform on Oracle, which already contains the majority of regulatory reports and integrates easily with the Big Data technology stack.
The Data Platform stores all data from the source systems, unlike a classic data warehouse, where data is stored to solve specific problems. At the same time, only useful, needed data is described, prepared, and managed in the Data Platform: if certain data is used on an ongoing basis, it is classified by a number of attributes, placed into separate segments (portfolios, in our case), and managed according to the characteristics of those portfolios. In a classic warehouse, by contrast, all data loaded into the system is prepared regardless of whether it will ever be used. As a result, expanding a classic warehouse to a new class of tasks often turns into what is effectively a new implementation project, with the corresponding time-to-customer, whereas in the Data Platform all the data is already in the system and can be used at any time without prior preparation.

For example, data is taken from ODS, quickly processed, “bolted on” to a specific task, and delivered to the end user. If hands-on use shows that the functionality is correct and will be needed in the future, the full process is launched: target transformations are built, data portfolios are prepared or enriched, the data mart layer is engaged, and full interactive reports or extracts are built.
The project is still being implemented, but a number of achievements and intermediate results can already be noted:
1. The classic tasks of forming regulatory reporting have been solved:
– Multi-functional interactive reports, with exports in various formats, have been implemented to meet the regulators’ requirements;
– A flexible role model with LDAP authorization has been configured;
– The required reporting speed has been achieved: 35 minutes to load data from the sources and build the data portfolios in HDFS, plus another 15 minutes to build all the reporting forms (50 at the time of writing) within one business day;
– An administrator’s console has been set up so that data loading can be managed easily, without knowing the details of HDFS and the entire Big Data zoo;
– The coverage of the data quality control system has been extended to the data portfolio area (PDS) of the Hadoop cluster;
2. The reliability and fault tolerance of the system have been confirmed, thanks to the standard advantages of the Hadoop distributed file system;
3. The use of open-source solutions, including Hadoop and Spark, reduced both hardware costs (big data projects can run on standard mid-range and entry-level servers combined into easily scalable clusters) and software costs. This lowered the total cost of ownership of the big data solution compared with solutions based on a traditional classic data warehouse;
4. Centralizing only the “useful” data, while keeping all other data available for local or operational tasks, also reduced the cost of preparing and maintaining the entire system;
5. Thanks to Datagram and to having all source-system data collected in one place, NCC can quickly build new ETL processes through visual design and carry out development on its own, without involving external vendors.