Methodology and example

Introduction

Some time ago, many Russian companies whose business is heavily tied to processing and analyzing large volumes of data (banks, retail, telecoms) began thinking about how to reduce the cost of owning data warehouses built on Western technologies. Recent events have only accelerated this process. The number of companies for which migrating existing data warehouses from Oracle, MS SQL and other proprietary DBMSs to open source solutions and domestic vendors is relevant has grown sharply, and the GreenPlum DBMS is effectively becoming the industry standard for data warehouses.

At the same time, both the customer company and the implementing organization need to estimate the budget of such a migration project, and the former usually requests this estimate from the latter.

This is exactly the task that our client, a large trading company, set for us. After a brief review of possible methods, the choice fell on the COSMIC method (Common Software Measurement International Consortium [1]), one of the function point approaches to measuring functional size, which has grown into the ISO 19761 standard. An additional argument in favor of COSMIC was the guideline the consortium developed specifically for sizing data warehouses [2].

Briefly about the COSMIC method

The COSMIC method is based on decomposing the system's functionality into minimal functional processes (derived from the Functional User Requirements), within which movements of data groups (each describing an object of interest) are counted. One data movement equals 1 CFP (COSMIC Function Point). There are four types of data movements: Entry, Exit, Read and Write.

[Figure: the four types of COSMIC data movements relative to a functional process and the persistent storage area]

Entry – external input stream of a data group into a functional process.

Exit – external output stream of a data group from a functional process.

Read – internal input flow of a group of data from the persistent storage area to the functional process.

Write – recording a group of data from a functional process into the persistent storage area.

A data warehouse consists of layers, and populating them can be represented as a set of functional processes that move data groups (objects of interest) from layer to layer. The data itself is stored in the tables of each layer.

According to the COSMIC guideline for sizing data warehouses [2], loading a table in a data warehouse layer within a batch process can be represented as follows:

| Movement type | Comment | Number of CFP |
|---|---|---|
| Entry | Process start (time or event trigger) | 1 CFP |
| Entry/Exit | Query and receive the metadata needed for a correct load (for example, information about previous loads for a delta load) | 2 CFP |
| Read or Entry | Read a data group from an external source or from a data warehouse layer table | 1 CFP |
| Write | Transform and write a data group to a target table in the data warehouse layer | 1 CFP |
| Exit | End of process; logging the results of the procedure | 1 CFP |

The following points must be taken into account when estimating:

● If the metadata used in the process resides in the same system, then the metadata request/receive pair can be replaced by a single Read (1 CFP).

● If several tables (objects of interest) are loaded within one batch process, it makes sense to count the start and end of the process only once, not for each table.
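To make these counting rules concrete, below is a minimal Python sketch (our own illustration, not part of the COSMIC guideline) that tallies the CFP of a batch process. It assumes the load metadata lives in the warehouse itself and, as in the project examples later in this article, counts a per-table write of load results to the metadata subsystem:

```python
def batch_load_cfp(n_tables: int) -> int:
    """CFP of one batch process that loads n_tables tables.

    Counting convention used here: the batch Entry/Exit is counted once,
    while the metadata Read, data Read, data Write and metadata-result
    Write are each counted once per table.
    """
    entry_and_exit = 1 + 1      # Entry: process start; Exit: end of process and logging
    per_table = (
        1    # Read: load metadata (e.g. results of the previous load)
        + 1  # Read: the data group from an external source or a layer table
        + 1  # Write: the transformed data group into the target table
        + 1  # Write: the load results into the metadata subsystem
    )
    return entry_and_exit + n_tables * per_table


print(batch_load_cfp(1))   # 6 CFP for a single-table load
```

With this convention a single-table load comes out at 6 CFP, and the per-batch Entry and Exit are not repeated for every table.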

The total size of the entire project is calculated as the sum over the loads of all data groups.

Project effort is estimated using the formula:

Project effort (person-days) = Functional size (CFP) / Productivity,

where productivity is the number of CFP an employee completes in one day.
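As a trivial illustration of the formula (the productivity figure below is purely hypothetical, not a value measured on our projects):

```python
functional_size_cfp = 741         # e.g. the adjusted project size calculated later in this article
productivity_cfp_per_day = 5      # hypothetical: CFP one employee completes per day

effort_person_days = functional_size_cfp / productivity_cfp_per_day
print(round(effort_person_days))  # ~148 person-days under these assumptions
```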

An example of functional size calculation for a migration project from Oracle to Arenadata DB

Over the course of six months, from October 2022 to March 2023, our company migrated a web analytics data warehouse for a large trading company from Oracle to Arenadata DB (a commercial build of GreenPlum, hereinafter simply ADB). More details about the project and its results can be found in the report “How we came to Arenadata DB” [5].

Here we will calculate the functional size of the project. First, let’s decompose the entire project into types of function points. The main types are:

  1. Loading existing input data from the source system of the online store into ADB without changes, with history accumulation (src_hybris table schema).

  2. Loading common reference and fact tables from the Oracle warehouse into target tables, without changes (kdw schema).

  3. Preliminary calculation of additional attributes and indicators necessary for the formation of target tables of a detailed layer, with saving the results in staging (stg_hybris schema).

  4. Loading online store data into target tables of the detailed data layer (hybris schema).

  5. Data enrichment. Calculation of additional characteristics or formation of new calculated indicators.

  6. Loading history.

  7. Testing – comparing data in the old storage on Oracle with the new one on ADB.

Type 1. Loading from the online store’s source system

Table data is copied from the source system without transformation. Metadata about the load type (full load, delta, …) and the results of the previous load are stored in Arenadata DB itself.

According to the calculation method described above, the total functional size is:

  • Entry – start of loading according to schedule – 1 CFP;

  • Read – reading metadata to load 59 tables – 59 CFP;

  • Read – reading data from 59 external tables – 59 CFP;

  • Write – writing data to 59 tables – 59 CFP;

  • Write – writing the results of 59 loads to the metadata subsystem – 59 CFP;

  • Exit – completion of the group of loads – 1 CFP.

Total – 238 CFP.
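The same total follows from the batch_load_cfp sketch given earlier:

```python
print(batch_load_cfp(59))   # 238 CFP: 2 + 59 * 4
```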

Type 2. Loading common reference and fact tables from the old warehouse

Similarly, ready-made tables are loaded from the Oracle warehouse on an event basis: the trigger for these loads is the completion of table preparation at the source. With 44 tables in total, the functional size of this group of loads is 178 CFP.

Type 3. Preliminary calculation of additional attributes and indicators

The group of loads that pre-calculates additional attributes and indicators implements additional business logic. The trigger for this type of load is the completion of the first two types. A sound estimate of the functional size of such loads is possible only with a clear understanding of what data is needed to implement the requirements, and at what points it will be read, transformed and stored in tables. In migration projects these requirements are already implemented in the old system and can be analyzed from the implementation code or from existing documentation.

For a better understanding, I will give two examples:

1. The source system sends the warehouse information about user clicks on the online store website. Each click is represented by a pair of URL strings – the target page and the page from which the transition was made – plus additional attributes: user, company, date and time of the visit, etc. To analyze user behavior on the site and monitor the effectiveness of marketing campaigns, the URL strings need to be parsed (extracting the domain, the site page hierarchy and the GET parameters; an illustrative sketch of such parsing is given after the table). Below is the calculation of the functional size for such a parsing procedure:

| Movement type | Comment | Functional size |
|---|---|---|
| Entry | Start of the calculation procedure | 1 CFP |
| Read | Obtaining data sampling parameters from the metadata subsystem | 1 CFP |
| Read | Reading the array of URL strings into a Python procedure | 1 CFP |
| Write | Writing the parsing results to the staging table of the stg_hybris schema, by generating a text file and then loading it into Arenadata DB via the GPFDIST protocol | 1 CFP |
| Write | Writing the results of the calculation procedure to the metadata subsystem | 1 CFP |
| Exit | Terminating the procedure and logging the results | 1 CFP |
| Total: | | 6 CFP |

NB: The COSMIC sizing guideline states that intermediate storage structures should not be counted, so we do not count the file write/read in this estimate.
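For illustration only, here is a minimal sketch of what the parsing step itself might look like; the URL and the field names are hypothetical, and the real procedure's output via a text file and GPFDIST is omitted:

```python
from urllib.parse import parse_qsl, urlsplit

def parse_click_url(url: str) -> dict:
    """Split a click URL into domain, page hierarchy and GET parameters (illustrative only)."""
    parts = urlsplit(url)
    return {
        "domain": parts.netloc,
        "page_path": [p for p in parts.path.split("/") if p],   # site page hierarchy
        "get_params": dict(parse_qsl(parts.query)),
    }

print(parse_click_url("https://shop.example.com/catalog/shoes/sneakers?utm_source=mail&page=2"))
# {'domain': 'shop.example.com',
#  'page_path': ['catalog', 'shoes', 'sneakers'],
#  'get_params': {'utm_source': 'mail', 'page': '2'}}
```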

2. The online store website is visited not only by real users but also by robots, so the task arises of marking these “mechanical” visits. The company uses several algorithms for this; the simplest is detection by known characteristics (IP address, presence of a known substring in the URL, etc.). A sketch of this simple check is given after the table.

Below is the calculation of the functional size for this procedure:

| Movement type | Comment | Functional size |
|---|---|---|
| Entry | Start of the calculation procedure | 1 CFP |
| Read | Obtaining data sampling parameters from the metadata subsystem | 1 CFP |
| Read | Reading the array of URL strings into a PL/pgSQL procedure | 1 CFP |
| Read | Reading the characteristics of known robots (detection algorithm 1) | 1 CFP |
| Read | Detecting robots by behavioral characteristics (detection algorithm 2) | 1 CFP |
| Write | Writing the results of robot detection by algorithms 1 and 2 to the staging table of the stg_hybris schema | 1 CFP |
| Write | Writing the results of the procedure to the metadata subsystem | 1 CFP |
| Exit | Terminating the procedure and logging the results | 1 CFP |
| Total: | | 8 CFP |
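In the project this check is implemented in a PL/pgSQL procedure; purely to illustrate detection algorithm 1, here is a Python sketch in which the IP addresses and URL substrings are made up:

```python
KNOWN_BOT_IPS = {"66.249.66.1", "40.77.167.2"}            # hypothetical examples
KNOWN_BOT_URL_SUBSTRINGS = ("bot", "crawler", "spider")    # hypothetical examples

def is_robot_visit(ip: str, url: str) -> bool:
    """Mark a visit as 'mechanical' if it matches known robot characteristics."""
    return ip in KNOWN_BOT_IPS or any(s in url.lower() for s in KNOWN_BOT_URL_SUBSTRINGS)

print(is_robot_visit("66.249.66.1", "https://shop.example.com/"))         # True: known IP
print(is_robot_visit("10.0.0.5", "https://shop.example.com/?from=bot"))   # True: URL substring
```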

Types 4, 5. Loading the detailed layer tables and enriching the data

Loading data into the detailed layer tables and the subsequent enrichment are performed either as simple loads or as slightly more complex ones, like those in the previous examples.

Simple loads are those where the source for the target table is:

  • an existing table;

  • a query that can be wrapped in a view, effectively giving a single data source.

In this case, the functional size of such a load is 6 CFP.

More complex procedures are those with multiple data sources and multiple target tables. An example is a procedure that clusters clients separately for each type of user: the number of Reads is determined by the number of user types, and the output consists of two tables – the cluster parameters and the assignment of users to these clusters.
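A worked example of counting such a procedure, assuming three user types (the actual number in the project may differ):

```python
n_user_types = 3     # assumed for illustration only
clustering_cfp = (
    1                # Entry: start of the clustering procedure
    + 1              # Read: sampling parameters from the metadata subsystem
    + n_user_types   # Read: source data, once per user type
    + 1              # Write: cluster parameters table
    + 1              # Write: user-to-cluster assignment table
    + 1              # Write: procedure results to the metadata subsystem
    + 1              # Exit: end of the procedure and logging
)
print(clustering_cfp)   # 9 CFP with three user types
```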

Type 6. Loading History

The history is loaded into the target tables once, in an almost semi-automatic mode, so in the functional size estimate the cost of reading metadata and logging can be neglected. For a rough estimate we use a Read/Write pair per table, so the final estimate equals the number of migrated target tables multiplied by 2.

Type 7. Testing

The results of migration and the operation of routine loading procedures are verified by testing the data loaded into the old and new systems (both systems work in parallel for some time).

Accordingly, each test consists of three data movements (3 CFP):

  • reading from the old system;

  • reading from the new system;

  • comparing the data and writing the comparison results to the persistent storage area.

The final functional size of our project:

| Functionality type | Number of loads/tests | Functional size, CFP |
|---|---|---|
| Extraction from the source system (online store) | 59 | 236 |
| Copying data from the Oracle warehouse | 44 | 176 |
| Preliminary calculation of additional attributes and indicators | 5 | 23 |
| Formation of the detailed layer and data enrichment | 80 | 359 |
| Weekly calculations | 8 | 33 |
| Loading history | 80 | 160 |
| Testing | 189 | 567 |
| Entry/Exit pairs (number of batches)* | 3 | 6 |
| TOTAL: | 468 | 1560 |

* When determining the total size, the number of process Entries/Exits was counted by the number of batch loads, i.e., by the number of Airflow DAGs.

Adaptation of the COSMIC model using an ETL framework and test automation

One of the rules of the COSMIC method states: if the same functionality is called several times with different sets of data within the same functional process, then it needs to be counted only once, and not by the number of calls.

To implement data warehouse projects on GreenPlum, our company has developed an ETL framework ([6], [7]) that allows reducing the amount of work precisely through reuse of functionality.

The ETL framework includes, among others, the following functions:

  • implementation of typical loads;

  • working with the metadata subsystem;

  • logging.

A typical load here means moving data from layer to layer without significant transformations or specific business logic, provided the data moves from one source to one destination. In this case, the developer only needs to create the target table and, if necessary, a source view, and register the corresponding metadata in the ETL framework; the load flow is then generated automatically.

Thus, for typical loads we can estimate the functional size at 1 CFP, and for non-typical loads we can exclude the reads from and writes to the metadata subsystem from the estimate.

Additionally, a small Python procedure that runs two SQL queries – one against the old warehouse and one against the new – compares the returned data and writes the test result to a separate table made the Write operation standard for all tests, effectively reducing each test from 3 to 2 CFP.
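A minimal sketch of such a comparison procedure, written against generic DB-API-style connections; the connection handling, the query texts, the test_results table and the psycopg2-style parameter placeholders are assumptions for illustration, not our framework code:

```python
def run_comparison_test(test_name, old_conn, new_conn, old_sql, new_sql, result_conn):
    """Run the same logical query against the old and the new warehouse and log the verdict."""
    with old_conn.cursor() as cur:
        cur.execute(old_sql)
        old_rows = sorted(cur.fetchall())
    with new_conn.cursor() as cur:
        cur.execute(new_sql)
        new_rows = sorted(cur.fetchall())

    passed = old_rows == new_rows
    with result_conn.cursor() as cur:
        # The Write is identical for every test: one verdict row into a single result table.
        cur.execute(
            "INSERT INTO test_results (test_name, passed) VALUES (%s, %s)",
            (test_name, passed),
        )
    result_conn.commit()
    return passed
```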

Thus, thanks to this know-how, we reduced the functional size by more than 50%:

| Functionality type | Load type | Number of loads/tests | Functional size, CFP | Adjusted functional size, CFP |
|---|---|---|---|---|
| Extraction from the source system (online store) | Typical | 59 | 236 | 59 |
| Copying data from the Oracle warehouse | Typical | 42 | 168 | 42 |
|  | Non-typical | 2 | 8 | 4 |
| Preliminary calculation of additional attributes and indicators | Non-typical | 5 | 23 | 13 |
| Formation of the detailed layer and data enrichment | Typical | 53 | 212 | 53 |
|  | Non-typical | 27 | 147 | 93 |
| Weekly calculations | Typical | 4 | 16 | 4 |
|  | Non-typical | 4 | 17 | 9 |
| Loading history | Typical | 80 | 160 | 80 |
| Testing |  | 189 | 567 | 378 |
| Entry/Exit pairs (number of batches) |  | 3 | 6 | 6 |
| TOTAL: |  | 468 | 1560 | 741 |

Conclusion

Any estimation model is, first of all, a simplification with its own drawbacks. An attentive reader will rightly note that the proposed model does not take into account the complexity of transformations or performance requirements, and that the productivity of employees of different qualifications varies widely. This is certainly true. For example, a study by specialists from the Izmir Institute of Technology [4] of projects implemented by different teams in fact led to the conclusion that the predicted labor costs of different teams are not directly comparable.

At the same time, the COSMIC model provides a standard methodology for estimating the functional size of a project in absolute rather than qualitative terms. In migration projects, when all loading procedures already exist and only need to be transferred to a new technology stack, the estimate can be quite accurate.

If the functional size estimates are supplemented with delivery statistics from a stable team, then for our company, which provides professional services in building data warehouses, the COSMIC methodology allows the upcoming migration work to be estimated with a high degree of accuracy. This is an obvious benefit both for the customer and for us.

List of references/sources

  1. Common Software Measurement International Consortium

  2. Guideline for sizing Data Warehouse and Big Data Software, v1.2

  3. Estimation of Functional Size of a Data Warehouse System using COSMIC FSM Method

  4. Effort Prediction with Limited Data: A Case Study for Data Warehouse Projects

  5. Report How we came to Arenadata DB

  6. Description of the ETL framework: Switch to Greenplum quickly

  7. ETL framework source codes: https://github.com/Sapiens-Solutions/ProPlum
