How should you approach a DWH implementation so that it doesn't "hurt"? Which methodologies should you use, and which stack should you choose?


Does DWH come out of the box?

A DWH can be built on proprietary or open source software, hosted in the cloud or on-premises.

In practice, it is more effective to combine different tools, since there is no ideal out-of-the-box solution for building a warehouse.

Proprietary software often requires deploying additional components. For example, solutions such as Arenadata need an orchestrator, and the vendor itself recommends the open source Airflow for this.

Cloud solutions can provide maximum fault tolerance and scalability; however, whether you can use them depends on the company's security policy.

The most difficult option to support is building your own infrastructure. Despite the financial benefits, it requires regular product updates and a sizeable staff (DataOps and SRE engineers).

Which stack to choose?

The greatest variety of products for building a DWH is offered by open source software, and the same task on a project can often be solved by several different tools.

When developing an enterprise data warehouse project, it is worth using scalable technologies and hybrid methodologies, adapting best practices to your unique business problems.

Tools for building a data warehouse are selected individually, depending on the goals, requirements, and capabilities of each company.

In this article, we share the stack that we use in enterprise data warehouse implementation projects.

We most often choose the following tools:

For extraction

  • Airbyte – an open source tool for replication and data integration between source systems and the staging layer of the warehouse. It provides flexibility, a user-friendly interface, and the ability to build custom API extractions.

  • Apache Kafka – an open source messaging system for processing streaming data in real time. It provides high horizontally scalable throughput, fault tolerance, and performance.
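The core pattern Kafka implements is decoupling producers from consumers through a durable log. A minimal conceptual sketch of that pattern using only the standard library (this is an in-memory illustration, not the Kafka client API; a real deployment would use a client library such as confluent-kafka against a broker):

```python
import json
import queue
import threading

# Conceptual stand-in for a Kafka topic: an in-memory queue.
# A real topic is a durable, partitioned log on the broker.
topic = queue.Queue()

def producer():
    """Publish serialized events (stand-in for a producer's send loop)."""
    for i in range(3):
        event = {"user_id": i, "action": "click"}
        topic.put(json.dumps(event).encode("utf-8"))
    topic.put(None)  # sentinel: end of stream (illustration only)

def consumer(results):
    """Consume and deserialize events until the sentinel (stand-in for a poll loop)."""
    while True:
        msg = topic.get()
        if msg is None:
            break
        results.append(json.loads(msg))

results = []
t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer, args=(results,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(len(results))  # 3 events consumed
```

The point of the pattern is that the producer and consumer never call each other directly: either side can be scaled or restarted independently, which is what gives Kafka its fault tolerance in a DWH ingestion layer.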

  • Debezium – an open source platform, a set of connectors for CDC (Change Data Capture): capturing and delivering data changes from source systems.
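Debezium delivers each change as an event whose payload carries the row state before and after the change plus an operation code. A sketch of consuming such an event with the standard library (the sample event is abridged and hypothetical; real messages also carry a schema section and richer source metadata):

```python
import json

# Abridged, hypothetical Debezium change event for illustration.
event_json = """
{
  "payload": {
    "before": {"id": 1, "email": "old@example.com"},
    "after":  {"id": 1, "email": "new@example.com"},
    "source": {"connector": "postgresql", "table": "users"},
    "op": "u",
    "ts_ms": 1700000000000
  }
}
"""

def describe_change(raw: str) -> str:
    """Map Debezium op codes to a human-readable action."""
    ops = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}
    p = json.loads(raw)["payload"]
    return f'{ops[p["op"]]} on {p["source"]["table"]}'

print(describe_change(event_json))  # update on users
```

Having both `before` and `after` images is what lets the warehouse side apply updates and deletes incrementally instead of re-extracting whole tables.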

For processing

  • Dagster – an open source orchestrator for ETL processes and data pipelines. Modern and easy to use, it allows you to build complex data pipelines.
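At its core, an orchestrator like Dagster models a pipeline as a dependency graph and runs steps in topological order. A minimal sketch of that idea with the standard library's `graphlib` (the step names are hypothetical; this is not the Dagster API, which expresses the same graph with decorated assets or ops):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps mapped to their upstream dependencies,
# mirroring how an orchestrator resolves execution order.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "stage_orders": {"extract_orders"},
    "stage_customers": {"extract_customers"},
    "build_mart": {"stage_orders", "stage_customers"},
}

# static_order() yields each step only after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # the mart is built last, after both staging steps
```

Real orchestrators add scheduling, retries, and observability on top, but the dependency graph is the contract: declare what each step needs, and the tool derives a valid execution order.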

  • dbt – an open source framework for data transformation and for documenting data processing, based on SQL templates; it simplifies building the warehouse model and automates data transformations.
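The idea behind template-driven transformation is that a model is a SQL query with its dependencies substituted in at render time. A simplified stand-in using `string.Template` (dbt itself uses Jinja with `{{ ref(...) }}` and `{{ source(...) }}`; the table and column names here are hypothetical):

```python
from string import Template

# Simplified stand-in for a dbt staging model: SQL with a templated
# source reference that is resolved when the model is rendered.
model_template = Template("""
select
    id as customer_id,
    lower(email) as email
from $source_table
""")

# In dbt the resolved name would come from the project's sources config.
rendered = model_template.substitute(source_table="raw.customers")
print(rendered)
```

Because the dependency is declared in the template rather than hard-coded, the framework can both build the dependency graph between models and redirect references per environment (dev vs. prod schemas).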

For storage

  • PostgreSQL and Greenplum – DBMSs well suited to working with transactional data

  • ClickHouse – a DBMS that has proven itself as a data mart layer for reporting and, when needed, fast calculations. We use it for storage when there are no transactional requirements: data can be loaded in batches and easily overwritten.

For supporting infrastructure

  • MinIO – a high-performance open source server for object storage

  • Docker – a platform for developing, shipping, and running containerized applications, used to organize continuous deployment (CI/CD: Continuous Integration & Continuous Delivery)

  • ELK Stack – a set of tools for collecting, storing, and processing logs

Metadata

To collect metadata and visualize processes, we chose the OpenMetadata platform, which demonstrates maximum flexibility and ease of use among open source data catalogs and makes it possible to quickly create custom integrations.


To avoid "pain" when building or refactoring a DWH, you must:

  • Assess your current situation and data management maturity level (AS IS)

  • Record the true challenges: will a warehouse really solve your problems, and what needs to be fixed before implementing the DWH?

  • Agree on the Definition of Done with stakeholders

  • Calmly design a solution

  • Implement it in a disciplined manner

Where to start, depending on the level of data management maturity in the company:

If you are at the Beginning maturity level:

– Audit of the company's internal and external processes, data sources, and data types

– Master data management to optimize data processing

– Setting goals

– Drawing up a project roadmap

If you are at the Minimum maturity level, DWH implementation:

– Concept creation

– Selection of methodology and stack

– Building the architecture

– Testing

– Support

If you are at the Established maturity level:

– Strengthening the DWH project team

– Audit of the existing warehouse

– Refactoring of individual modules (for example, the MDM system) to improve performance, fix bugs, and expand the warehouse's capabilities

If you are at a Managed or Optimized maturity level:

– Applying new data management practices
