Top 10 misconceptions about porting Hadoop to the cloud

Many companies want to process data in the cloud for obvious reasons: flexibility, scalability, pay-as-you-go pricing and so on.

In practice, moving a multi-component, petabyte-scale data processing platform from an on-premises environment to the cloud comes with a big “but”. There are many components to migrate: Hadoop, Hive, YARN, Spark, Kafka, ZooKeeper, Jupyter, Zeppelin. Given the fundamental differences between the environments, it is easy to get lost in this variety and make mistakes.

In this article I will go over common misconceptions and give some tips for a quality migration to the cloud. I personally use AWS, but all of the advice applies to other providers with similar offerings, such as Azure or GCP.

1. Copying data to the cloud is easy

Transferring multiple petabytes of data to shared cloud storage (e.g. S3), which in our case will serve as the data lake, is not an easy task. It can be very time-consuming and resource-intensive.

Despite the huge number of commercial and open-source solutions, I did not find a single one that covers all the needs:

  • data transfer
  • data integration
  • data verification
  • reporting

If a sizeable part of the data is mostly static or only moderately dynamic, you can use a solution like AWS Snowball, which lets you copy data sets onto a physical appliance. The data is loaded from your local network, the device is shipped back to an AWS data center, and its contents are uploaded into S3.

Note that, depending on the amount of data, you may need several Snowball devices.

It is good practice to split the transfer into two phases. Once the bulk of the array has been shipped and uploaded to storage, use a direct connection to the cloud provider to sync the remainder. Hadoop DistCp or Kafka MirrorMaker can be used for this, and both have their nuances. DistCp requires constant scheduling and careful tuning, and not every object can be covered by its include/exclude lists. MirrorMaker, besides deep tuning, needs its metrics exported via the JMX management extension to measure throughput, latency and overall stability.
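To give a flavour of the first method, here is a minimal sketch of driving an incremental DistCp copy from a script; the HDFS path, bucket name and tuning values are placeholder assumptions:

    # Minimal sketch: incremental HDFS -> S3 copy via DistCp (placeholder paths).
    import subprocess

    def sync_to_s3(src="hdfs:///data/events", dst="s3a://my-data-lake/events",
                   max_maps=50, bandwidth_mb=100):
        """Run an incremental DistCp copy from HDFS to S3."""
        cmd = [
            "hadoop", "distcp",
            "-update",                        # copy only new or changed files
            "-m", str(max_maps),              # cap the number of map tasks
            "-bandwidth", str(bandwidth_mb),  # per-map bandwidth limit, MB/s
            src, dst,
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        sync_to_s3()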

It is even harder to verify the data once it is in cloud storage. In some cases special tooling may be needed, for example a custom data catalog that stores a hash for each object.
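As an illustration, here is a minimal sketch of such a verification pass. It assumes a manifest of paths and sizes has been exported from HDFS and that comparing keys and sizes is enough for a first check; content hashes could be layered on top:

    # Sketch: compare an HDFS manifest (path,size) against what landed in S3.
    import csv
    import boto3

    def verify_bucket(manifest_csv="hdfs_manifest.csv", bucket="my-data-lake"):
        expected = {}
        with open(manifest_csv) as f:
            for path, size in csv.reader(f):
                expected[path.lstrip("/")] = int(size)

        s3 = boto3.client("s3")
        found = {}
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                found[obj["Key"]] = obj["Size"]

        missing = [k for k in expected if k not in found]
        mismatched = [k for k, v in expected.items()
                      if k in found and found[k] != v]
        return missing, mismatched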

2. The cloud works just like local infrastructure

On-premises and cloud infrastructure are not the same thing. A good example is ZooKeeper and Kafka: the ZooKeeper client library resolves the server addresses once and caches them for its entire lifetime. This is a real problem for cloud deployments, where instances come and go, and it calls for workarounds such as static network interfaces (ENIs) for the ZooKeeper servers so their addresses survive instance replacement.
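A minimal sketch of that workaround with boto3, using placeholder subnet, security group and instance IDs; the ENI keeps its private IP, so the ZooKeeper connect string stays valid when the instance behind it is replaced:

    import boto3

    ec2 = boto3.client("ec2")

    # Create a dedicated network interface with a fixed private IP for a ZK node.
    eni = ec2.create_network_interface(
        SubnetId="subnet-0123456789abcdef0",      # placeholder subnet
        PrivateIpAddress="10.0.1.11",             # stable address in the ZK connect string
        Groups=["sg-0123456789abcdef0"],          # placeholder security group
        Description="zookeeper-1 static interface",
    )["NetworkInterface"]

    # Attach it to the current ZooKeeper instance as a secondary interface.
    ec2.attach_network_interface(
        NetworkInterfaceId=eni["NetworkInterfaceId"],
        InstanceId="i-0123456789abcdef0",         # placeholder instance ID
        DeviceIndex=1,
    )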

It is also worth running a series of non-functional tests (NFT) against the cloud infrastructure to make sure the settings and configuration can cope with your workloads.

Remember that you are working in a shared environment, and other tenants may claim part of the resources.
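One simple non-functional test of this kind, sketched below under the assumption that a set of representative test objects already sits in a placeholder bucket, measures aggregate S3 read throughput with a pool of parallel readers:

    import time
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")        # boto3 clients can be shared across threads
    BUCKET = "my-nft-bucket"       # placeholder bucket with test objects

    def read_object(key):
        return len(s3.get_object(Bucket=BUCKET, Key=key)["Body"].read())

    def throughput_mb_s(keys, workers=16):
        """Read all test objects in parallel, return aggregate throughput in MB/s."""
        start = time.time()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            total_bytes = sum(pool.map(read_object, keys))
        return total_bytes / (time.time() - start) / 1024 ** 2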

3. Object storage 100% replaces HDFS

Separating the storage and compute layers is a great idea, but there is a caveat.

With the exception of Google Cloud Storage, which offers strong consistency, most object stores are eventually consistent. This means they are fine for ingesting raw and processed data and for writing out results, but not for intermediate (temporary) storage.

If you need consistent multi-node access to intermediate data, you will still have to fall back to HDFS.
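A minimal PySpark sketch of that split, assuming an s3a-enabled cluster and placeholder bucket and path names: raw input and final results live in S3, while intermediate data stays on HDFS:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-vs-hdfs-layout").getOrCreate()

    # Ingest raw data from the S3 data lake.
    raw = spark.read.json("s3a://my-data-lake/raw/events/")

    # Keep intermediate results on HDFS rather than in the object store.
    clicks = raw.filter("event_type = 'click'").repartition(200)
    clicks.write.mode("overwrite").parquet("hdfs:///tmp/clicks/")

    # Write the final, curated output back to S3.
    result = spark.read.parquet("hdfs:///tmp/clicks/").groupBy("user_id").count()
    result.write.mode("overwrite").parquet("s3a://my-data-lake/marts/clicks_per_user/")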

4. You can deploy cloud infrastructure from the user interface

For a small test environment this may be true, but the more demanding the infrastructure, the more likely you are to end up writing code. You will probably want several environments (Dev, QA, Prod). This can be implemented with CloudFormation or Terraform, but simply copying someone else’s templates will not work; you will have to adapt a lot of it yourself.
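As a rough sketch, the same CloudFormation template (the path and parameter names below are placeholders) can be rolled out to each environment with boto3, varying only the parameters:

    import boto3

    def deploy(env):
        cf = boto3.client("cloudformation")
        with open("templates/data-platform.yaml") as f:   # placeholder template
            template = f.read()
        cf.create_stack(
            StackName=f"data-platform-{env}",
            TemplateBody=template,
            Parameters=[
                {"ParameterKey": "Environment", "ParameterValue": env},
                {"ParameterKey": "CoreNodeCount",
                 "ParameterValue": "20" if env == "prod" else "3"},
            ],
            Capabilities=["CAPABILITY_NAMED_IAM"],
        )

    for env in ("dev", "qa", "prod"):
        deploy(env)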

A good option is to use CI/CD to run the various tests at each stage and deploy the custom code in the final one. This makes the task easier, but it is clearly nothing like a couple of clicks in a convenient control panel.

5. For proper visibility in the cloud, you just need ${SaaS_name}

Good visibility (logging and monitoring) into both the old and the new environment is critical for a successful migration.

This can be difficult when the two environments use different systems: for instance, Prometheus and ELK on-premises, and New Relic and Sumo Logic in the cloud. And even when a single SaaS solution is used in both environments, it can be hard to scale.

You will still have to work out how to export and process application metrics: extract them (for example via JMX), configure dashboards and aggregation, and create alerts.
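For example, a broker-side JMX metric can be pulled over HTTP if a Jolokia agent is attached to the process; the host, port and MBean below are assumptions to adapt to your own setup:

    # Sketch: read a Kafka broker JMX metric through an assumed Jolokia agent.
    import requests

    JOLOKIA = "http://kafka-broker-1:8778/jolokia/read"           # placeholder host/port
    MBEAN = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

    def messages_in_per_sec():
        resp = requests.get(f"{JOLOKIA}/{MBEAN}/OneMinuteRate", timeout=5)
        resp.raise_for_status()
        return resp.json()["value"]

    if __name__ == "__main__":
        print("MessagesInPerSec (1-min rate):", messages_in_per_sec())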

6. The cloud scales to infinity

Users are often delighted like children when they discover auto scaling and assume they can apply it to their data processing platforms right away. It really is easy to configure for EMR nodes without HDFS, but it takes extra work for anything with persistent state (for example, Kafka brokers). Before switching all traffic to the cloud infrastructure, check your current resource limits: the number of instances of each class, disks and so on; you also need to pre-warm the load balancers. Without this preparation you will not get the capacity you expect.
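A small sketch of that check using the Service Quotas API, printing the EC2 instance-related limits for the current account (the substring filter is just a convenience):

    import boto3

    def ec2_instance_quotas():
        sq = boto3.client("service-quotas")
        for page in sq.get_paginator("list_service_quotas").paginate(ServiceCode="ec2"):
            for quota in page["Quotas"]:
                if "instances" in quota["QuotaName"].lower():
                    print(f'{quota["QuotaName"]}: {quota["Value"]}')

    if __name__ == "__main__":
        ec2_instance_quotas()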

Also remember that scaling is not free, and the budget is not infinite.

7. I’ll just move my infrastructure unchanged

In fact, instead of focusing solely on your own data stores, it is better to look at what a potential provider can offer, for example DynamoDB. And do not forget about API-compatible managed services: you can, for instance, use Amazon RDS for the Hive metastore database.

Another good example is EMR, the cloud-optimized big data platform. Simple at first glance, it still needs fine-tuning through bootstrap and post-installation scripts: heap sizes, third-party JARs, UDFs, security add-ons. Note also that, at the time of writing, there is no built-in way to ensure high availability (HA) for the master node services such as the HDFS NameNode or the YARN ResourceManager.
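A hedged sketch of such a scripted launch with boto3: a bootstrap action for the post-install tuning and a hive-site classification pointing the metastore at an external RDS database (bucket, endpoint, instance types and role names are placeholders):

    import boto3

    emr = boto3.client("emr")

    emr.run_job_flow(
        Name="data-platform-prod",
        ReleaseLabel="emr-5.29.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "r5.2xlarge", "InstanceCount": 10},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        BootstrapActions=[{
            "Name": "custom-tuning",
            # post-install script: extra JARs, UDFs, heap sizes, security add-ons
            "ScriptBootstrapAction": {"Path": "s3://my-bootstrap/tune-cluster.sh"},
        }],
        Configurations=[{
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL":
                    "jdbc:mysql://hive-metastore.example.rds.amazonaws.com:3306/hive",
                "javax.jdo.option.ConnectionUserName": "hive",
                "javax.jdo.option.ConnectionPassword": "REPLACE_ME",
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )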

This means you will have to design the pipeline architecture yourself to achieve the required uptime.

8. Moving Hadoop/Spark jobs to the cloud is easy

Not really. To migrate jobs successfully, you need a clear picture of your business logic and pipelines, from the initial ingestion of raw data to the curated data sets. Things get even more complicated when the outputs of pipelines X and Y are the inputs of pipeline Z. All of the flows and their dependencies should be made as explicit as possible, for example as a DAG.
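For instance, the X/Y-to-Z dependency can be written down as an Airflow DAG; the operator choice and the spark-submit commands below are placeholder assumptions (Airflow 1.x import paths):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG("pipelines", start_date=datetime(2020, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:

        pipeline_x = BashOperator(task_id="pipeline_x",
                                  bash_command="spark-submit jobs/pipeline_x.py")
        pipeline_y = BashOperator(task_id="pipeline_y",
                                  bash_command="spark-submit jobs/pipeline_y.py")
        pipeline_z = BashOperator(task_id="pipeline_z",
                                  bash_command="spark-submit jobs/pipeline_z.py")

        # Z consumes the outputs of both X and Y
        [pipeline_x, pipeline_y] >> pipeline_z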

Only then will you be able to switch the analytical pipelines over within the business SLAs.

9. The cloud will reduce operating costs and the staff budget

Your own hardware comes with physical costs and staff salaries. After moving to the cloud these costs will not all disappear: you still have to respond to the needs of the business and hire people for development, support, troubleshooting and budget planning. You will also need to invest in software and tooling for the new infrastructure.

You will need people on staff who understand how the new technologies work, which means highly qualified employees. So even with a smaller team, you may spend as much, if not more, on the salary of one good specialist.

Another budget item is licensing and service fees (for example, for EMR), which can be quite high. Without detailed planning and modelling, you may find that the cloud ends up costing significantly more than the physical hardware you used before.

10. No-Ops is just around the corner

No-Ops is every business’s dream: a fully automated environment that does not need third-party services and products to keep it running. Is it achievable?

A modest team of a few people works only for small companies whose business is not directly tied to data. Everyone else will need at least one specialist who integrates and packages all the systems, ties them together, automates them, provides visibility and fixes the bugs that come up along the way.

DataOps does not disappear; it just moves to a new level.


To sum up: moving data processing pipelines to the cloud is a good thing. For the migration to go smoothly, plan the process carefully, taking into account all the pitfalls described above. Think a few steps ahead and everything will work out.
