Hello, Habr! My name is Kirill, I am a solution engineer at Cloudera, and today I have the honor of representing the entire team working with the CIS region. We are very glad that we can finally share useful materials and news from the world of big data with you. A lot has changed recently, so when we started writing this article we worried it would turn into an overwhelming longread. We tried to collect only the essentials below. Unfortunately, that means this article does not contain much technical detail, but we will remedy that soon.
What’s new in Cloudera?
Let's start with some background for those who do not follow the Hadoop ecosystem closely: Hortonworks and Cloudera merged in 2019 under the common name Cloudera. That marked a new branch in the history of the Hadoop distribution, as the combined team began work on a new build that takes the best of both worlds. In 2019, the first release of the new distribution, Cloudera Data Platform (hereinafter CDP), took place. It included more than 50 best-in-class open source tools for working with big data.
So what is so interesting about Cloudera Data Platform? Within the platform, we provide an enterprise data cloud for data of any type, in any infrastructure, from the edge to AI. CDP works in a variety of environments: on-premises, private and public clouds, or in a hybrid architecture.
Now, in more detail, the names of the distribution's variants. The version for traditional installation on local hardware is called CDP Private Cloud Base. It is the foundation for expanding an on-premises architecture into a private cloud (hence the name). The full-fledged private cloud architecture, which includes the Base part (storage tier) and analytical applications on Kubernetes (compute tier), is called CDP Private Cloud Plus / Max. The naming of the public cloud version is simpler: CDP Public Cloud. It is a full-fledged PaaS, tightly integrated with the native services of the big three: AWS, Azure and GCP.
Thanks to a single control panel, the Cloudera SDX (Shared Data Experience) framework and an identical set of services, the platform looks the same regardless of the deployment environment, which makes a fully hybrid architecture possible. The set of available services lets you work with data of any type, from the edge to AI, while ensuring enterprise-grade security (encryption of data in transit and at rest, full Kerberization of the cluster) and data governance.
The toolbox itself also has some interesting additions:
– Since December 2020, Spark 3.0 has been available to all CDP users, and version 3.1 is scheduled to be added in the first half of 2021.
– At the end of last summer, a modified, production-ready Apache Ozone was added to the distribution. It is an S3-compatible object store and a kind of successor to HDFS that addresses many of its weak points and allows much denser node configurations (we tested 350 TB per node, and all workloads ran stably).
– After the acquisition of Arcadia Data, a full-fledged BI component, Cloudera Data Visualization, appeared in the stack. It works with all the major data analytics engines: Hive / Impala, Solr, Druid.
– The acquisition of Eventador in 2020 made it possible to add streaming data analytics using SQL on top of Flink. You can now work with data streams from Kafka as if they were standard tables in a DBMS and create materialized views, for example to write transformed streams back to Kafka.
– Earlier this year, Cloudera announced the inclusion of the Apache Iceberg project in the distribution, which will allow even more flexibility in working with huge datasets thanks to snapshots, support for schema evolution and the ability to roll back to earlier versions of a table.
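To make the last item a little more concrete, here is a minimal conceptual sketch of what snapshots, schema evolution and rollback mean for a table. It is a toy in-memory model written for this article, not the Apache Iceberg API: the names `ToyTable`, `Snapshot`, `append`, `add_column`, `read` and `rollback_to` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table at one point in its history."""
    snapshot_id: int
    schema: tuple  # column names visible in this snapshot
    rows: tuple    # copies of the rows as they were at commit time


@dataclass
class ToyTable:
    """Hypothetical toy table; illustrates the ideas only, not Iceberg itself."""
    schema: tuple
    rows: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)

    def _commit(self) -> int:
        # Every change records a new snapshot; older snapshots are never modified.
        snap = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            schema=self.schema,
            rows=tuple(dict(r) for r in self.rows),
        )
        self.snapshots.append(snap)
        return snap.snapshot_id

    def append(self, row: dict) -> int:
        self.rows.append(dict(row))
        return self._commit()

    def add_column(self, name: str) -> int:
        # Schema evolution: existing rows are not rewritten, they read as None.
        self.schema = self.schema + (name,)
        return self._commit()

    def read(self, snapshot_id=None) -> list:
        # Without an id, read the current state; with an id, read that snapshot.
        if snapshot_id is None:
            schema, rows = self.schema, self.rows
        else:
            snap = self.snapshots[snapshot_id - 1]
            schema, rows = snap.schema, snap.rows
        return [{col: r.get(col) for col in schema} for r in rows]

    def rollback_to(self, snapshot_id: int) -> None:
        # Rollback restores the current state from an older snapshot.
        snap = self.snapshots[snapshot_id - 1]
        self.schema = snap.schema
        self.rows = [dict(r) for r in snap.rows]


table = ToyTable(schema=("id", "city"))
s1 = table.append({"id": 1, "city": "Moscow"})
s2 = table.add_column("country")
s3 = table.append({"id": 2, "city": "Minsk", "country": "BY"})

print(table.read())    # current state, including the new column
print(table.read(s1))  # the table as it was after the first commit
table.rollback_to(s1)
print(table.read())    # current state is back to the first snapshot
```

In a real table format the snapshots reference data files rather than copies of rows, but the user-facing idea is the same: every change yields a new readable version, and returning to an older one does not require restoring data from a backup.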
Initially, the private cloud architecture was supported only on the Red Hat OpenShift platform, but CDP Private Cloud Plus is coming out soon with support for embedded vanilla Kubernetes, which will greatly simplify installation and speed up the adoption of the hybrid architecture. Users will be able to start working with data faster and get all the benefits of cloud infrastructure, while the data itself stays in the local data center.
As you can see, Cloudera's Hadoop distribution is actively developing and evolving, and we have big plans for this year. To wrap up, we would like to answer right away a couple of questions you might have had while reading this article.
Is there a free version of the distro like it used to be with HDP / CDH?
A free version of the CDP distribution for commercial use is not planned. At the moment, you can download a trial version from the site or get a temporary license through your account manager. We are also considering releasing a version for educational purposes in the future.
What will happen to everyone's favorite HDP / CDH builds?
These distributions will not be updated and are gradually reaching the end of their support life cycle (HDP 2.x / CDH 5.x reached it at the end of 2020, and the same fate will overtake HDP 3 / CDH 6 soon). Moreover, the repositories for these versions are no longer publicly available. Access to them now also requires a license.
The text mentions AI. What does the platform offer for working with ML models, apart from Zeppelin?
The distribution includes an additional component, Cloudera Machine Learning (also known as Cloudera Data Science Workbench), which covers the full life cycle of work on ML models. It is a full-fledged MLOps platform based on Kubernetes, with a central metadata repository, model versioning, collaborative work in any IDE (Jupyter Lab / Notebook is included by default) with any libraries, a secure connection to the main cluster and the ability to embed ready-made models as functions into business processes through a REST API.
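As a rough illustration of what "a model as a function behind a REST API" looks like from the calling side, here is a minimal sketch using only the Python standard library. The endpoint URL, the `access_key` field and the payload layout are assumptions made up for this example and do not describe the actual Cloudera Machine Learning API; check the product documentation for the real request format.

```python
import json
import urllib.request


def build_model_request(endpoint: str, access_key: str, features: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a deployed model.

    The payload layout is hypothetical and used for illustration only.
    """
    payload = json.dumps({"access_key": access_key, "inputs": features}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_model(endpoint: str, access_key: str, features: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (requires a reachable endpoint)."""
    request = build_model_request(endpoint, access_key, features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and key; only the request is built here, nothing is sent.
request = build_model_request(
    "https://ml.example.com/model",
    "demo-key",
    {"petal_length": 1.4},
)
print(request.get_method(), request.full_url)
```

From the point of view of a business process, `call_model` then behaves like an ordinary function: features go in, a prediction comes out, and the model itself can be retrained and redeployed behind the same endpoint.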
Please leave your comments on the article. What other questions about our products and technologies would you like to discuss?