Top 10 misconceptions about porting Hadoop to the cloud

Many companies want to process data in the cloud for obvious reasons: flexibility, scalability, paying only for what you use, and so on.
In practice, though, moving a multi-component, petabyte-scale data processing platform from an on-premises environment to the cloud comes with a big “but”. There are many components to migrate: Hadoop, Hive, YARN, Spark, Kafka, ZooKeeper, Jupyter, Zeppelin. Given the fundamental differences between the environments, it is easy to get lost in this variety and make mistakes.
In this article I will go over common misconceptions and give some tips for a quality migration to the cloud. I personally use AWS, but all of the advice applies to other providers with similar offerings, for example Azure or GCP.
1. Copying data to the cloud is easy
Transferring multiple petabytes of data to shared cloud storage (e.g. S3), which in our case will serve as the data lake, is not an easy task. It can be very time-consuming and resource-intensive.
Despite the huge number of solutions, both commercial and open source, I have not found a single one that covers all of the needs:
- data transfer
- data integration
- data verification
- reporting
If a certain part of the data is mostly static or only moderately dynamic, you can use a solution like AWS Snowball, which lets you copy large datasets onto a physical appliance. The data is loaded from your local network, the device is then shipped back to an AWS data center, and its contents are uploaded into S3.
It is good practice to split the data transfer into two phases. After most of the data has been shipped and loaded into the storage, use a direct connection to the cloud provider to move the remainder. Hadoop DistCp or Kafka mirroring can be used for this, and both have their nuances. DistCp requires constant planning and deep tuning, and not every object can be covered by its include and exclude lists. Kafka MirrorMaker, in addition to deep tuning, needs its metrics exported via the JMX management extension to measure throughput, latency, and overall stability.
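As a rough illustration, here is a minimal Python sketch that drives an incremental DistCp run from a script. The paths, bandwidth cap, and exclusion filter file are hypothetical, and the flags assume a reasonably recent Hadoop distribution.

```python
import subprocess

# Hypothetical source/target and tuning values -- adjust for your cluster.
SRC = "hdfs://prod-nn:8020/data/events"
DST = "s3a://my-datalake/events"            # assumes the S3A connector is configured
FILTERS = "/etc/hadoop/distcp-exclude.txt"  # file with regex patterns of paths to skip

cmd = [
    "hadoop", "distcp",
    "-update",              # copy only files that are missing or have changed
    "-m", "200",            # cap the number of map tasks
    "-bandwidth", "50",     # MB/s per map, to avoid saturating the uplink
    "-filters", FILTERS,    # exclusion list
    SRC, DST,
]

# DistCp returns a non-zero exit code on failure, so surface that to the scheduler.
subprocess.run(cmd, check=True)
```

In practice a run like this is scheduled repeatedly until the delta between the source and the target is small enough to cut over.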
2. The cloud works just like local storage
Cloud storage does not behave the same way as local storage. A good example is ZooKeeper and Kafka: the ZooKeeper client library caches the resolved server addresses for its entire lifetime. This is a big problem when deploying in the cloud, and it calls for a workaround: static network interfaces (ENIs) for the ZooKeeper servers.
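One common way to implement that workaround is to pre-create an Elastic Network Interface with a fixed private IP per ZooKeeper node and attach it to whatever instance currently plays that role, so the address the clients cached stays valid. A minimal boto3 sketch, with placeholder subnet, security group, and instance IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an example

# Create a static ENI with a fixed private IP for one ZooKeeper node.
eni = ec2.create_network_interface(
    SubnetId="subnet-0123456789abcdef0",      # placeholder
    PrivateIpAddress="10.0.1.11",             # the address ZK clients will keep using
    Groups=["sg-0123456789abcdef0"],          # placeholder security group
    Description="zookeeper-1 static interface",
)
eni_id = eni["NetworkInterface"]["NetworkInterfaceId"]

# Attach it to the instance that currently runs this ZooKeeper server.
ec2.attach_network_interface(
    NetworkInterfaceId=eni_id,
    InstanceId="i-0123456789abcdef0",         # placeholder
    DeviceIndex=1,                            # eth1; eth0 is the primary interface
)
```

If the instance is replaced, the same ENI is attached to its successor and the cached address keeps working.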
It is also a good idea to run a series of non-functional tests (NFT) against the cloud infrastructure to make sure that the settings and configuration can cope with your workloads.
3. Object storage 100% replaces HDFS
Separating the storage and compute layers is a great idea, but there is a caveat.
With the exception of Google Cloud Storage, which provides strong consistency, most object stores are eventually consistent. This means they can be used for ingesting raw and processed data and for publishing results, but not as temporary (intermediate) storage.
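If you do have to read objects shortly after another job writes them, a defensive pattern is to wait for the object to become visible before relying on it. A small boto3 sketch, with a hypothetical bucket and key:

```python
import boto3

s3 = boto3.client("s3")

bucket = "my-datalake"                  # placeholder
key = "processed/2019-06-01/part-0000"  # placeholder

# Poll HEAD until the object is visible (or the waiter times out).
waiter = s3.get_waiter("object_exists")
waiter.wait(Bucket=bucket, Key=key, WaiterConfig={"Delay": 5, "MaxAttempts": 24})

obj = s3.get_object(Bucket=bucket, Key=key)
data = obj["Body"].read()
```

This guards a single read; it does not make listings consistent, which is exactly why intermediate shuffle or scratch data is better kept on HDFS or local disks.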
4. You can deploy cloud infrastructure from the user interface
For a small test environment this may be easy, but the higher the infrastructure requirements, the more likely you are to end up writing code. You will probably want several environments (Dev, QA, Prod). This can be implemented with CloudFormation or Terraform, but you will not get far by simply copying the necessary pieces of code; you will have to adapt a lot of it yourself.
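As a sketch of what “writing code” looks like in practice, here is a hedged boto3 example that creates one stack per environment from the same template; the template URL and parameter names are made up for illustration.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")  # example region

TEMPLATE_URL = "https://s3.amazonaws.com/my-templates/emr-platform.yaml"  # placeholder

for env in ("dev", "qa", "prod"):
    cfn.create_stack(
        StackName=f"data-platform-{env}",
        TemplateURL=TEMPLATE_URL,
        Parameters=[
            # Hypothetical template parameters.
            {"ParameterKey": "Environment", "ParameterValue": env},
            {"ParameterKey": "CoreNodeCount", "ParameterValue": "20" if env == "prod" else "3"},
        ],
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed if the template creates IAM resources
        Tags=[{"Key": "environment", "Value": env}],
    )
```

The value of an approach like this is that Dev, QA, and Prod stay structurally identical and differ only in parameters.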
5. For proper visibility in the cloud you just need to use ${SaaS_name}
Good visibility (logging and monitoring) into both the old and the new environment is a critical condition for a successful migration.
This can be difficult when the environments use different systems: for instance, Prometheus and ELK on premises, and New Relic and Sumo Logic in the cloud. And even if a single SaaS solution is used in both environments, it can be difficult to scale.
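One pragmatic approach, whichever backends you end up with, is to tag every metric with the environment it came from, so the same dashboards and alerts cover both sides of the migration. A sketch using the Python prometheus_client library; the metric and label names are illustrative.

```python
import time

from prometheus_client import Counter, start_http_server

# Hypothetical metric; the "environment" label lets one dashboard cover both setups.
RECORDS_PROCESSED = Counter(
    "pipeline_records_processed_total",
    "Records processed by the ingestion pipeline",
    ["environment", "pipeline"],
)

def record_batch(env: str, pipeline: str, count: int) -> None:
    RECORDS_PROCESSED.labels(environment=env, pipeline=pipeline).inc(count)

if __name__ == "__main__":
    start_http_server(8000)                  # expose /metrics for Prometheus to scrape
    record_batch("on-prem", "events", 1000)
    record_batch("cloud", "events", 1000)
    time.sleep(300)                          # keep the endpoint alive (demo only)
```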
6. The cloud scales to infinity
Users are often overjoyed when they learn about auto scaling and assume they can apply it to their data processing platforms right away. It really is easy to configure for EMR nodes without HDFS, but it requires additional knowledge for anything with persistent state (for example, Kafka brokers). Before switching all traffic to the cloud infrastructure, you need to check the current resource limits: the number of instances of a given class, the disks, and so on; you also need to pre-warm the load balancers. Without that preparation, you will not be able to use the capacity as intended.
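Before cutting traffic over, it is worth scripting a check of the relevant quotas rather than discovering them mid-migration. A hedged boto3 sketch that lists the EC2 quotas in one region (the Service Quotas API is region-scoped):

```python
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")  # example region

# Walk the EC2 quotas and print the current limits (instances, EIPs, etc.).
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaName"]}: {quota["Value"]}')
```

The same pattern works for other services (EBS, ELB, EMR) by changing the ServiceCode.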
7. I’ll just move my infrastructure unchanged
In reality, instead of focusing solely on what a given provider can offer, it is better to start from your own data stores and then consider managed services such as DynamoDB. And do not forget about API-compatible managed services: for example, you can use Amazon RDS as the database for the Hive metastore.
Another good example is EMR, the cloud-optimized big data platform. Simple at first glance, it requires fine-tuning through post-installation scripts: you may need to adjust heap sizes, add third-party JARs and UDFs, and configure security add-ons. Note also that, at the time of writing, there was no way to ensure high availability (HA) for the master node services such as the HDFS NameNode or the YARN ResourceManager.
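To give a flavour of that fine-tuning, here is a hedged boto3 sketch that launches an EMR cluster with a post-installation (bootstrap) script and points the Hive metastore at an external RDS database; the script path, JDBC URL, instance types, and release label are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # example region

emr.run_job_flow(
    Name="data-platform",
    ReleaseLabel="emr-5.29.0",  # example release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "r5.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Point Hive at an external metastore in RDS instead of a local database.
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore.example.rds.amazonaws.com:3306/hive",
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "hive",
            "javax.jdo.option.ConnectionPassword": "REPLACE_ME",
        },
    }],
    # Post-installation script: heap sizes, extra JARs, UDFs, security add-ons.
    BootstrapActions=[{
        "Name": "customize-nodes",
        "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/customize.sh"},  # placeholder
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```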
8. Moving Hadoop/Spark jobs to the cloud is easy
Not really. To move jobs successfully, you need a clear picture of your business logic and pipelines, from the initial ingestion of raw data to the curated datasets. Everything becomes even more complicated when the outputs of pipelines X and Y are the inputs of pipeline Z. All of the flows and their dependencies should be mapped out as explicitly as possible, which can be done with a DAG (directed acyclic graph), as in the sketch below.
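For illustration, here is a minimal Airflow-style DAG sketch (assuming Airflow 2.x, which is one common choice rather than the only option) in which pipelines X and Y feed pipeline Z; the task commands are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipelines",
    start_date=datetime(2019, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder commands -- in reality these would submit Spark/Hive jobs.
    pipeline_x = BashOperator(task_id="pipeline_x", bash_command="echo run pipeline X")
    pipeline_y = BashOperator(task_id="pipeline_y", bash_command="echo run pipeline Y")
    pipeline_z = BashOperator(task_id="pipeline_z", bash_command="echo run pipeline Z")

    # Z consumes the outputs of X and Y, so it runs only after both succeed.
    [pipeline_x, pipeline_y] >> pipeline_z
```

Once the dependencies are explicit like this, it becomes much easier to see which pipelines can move to the cloud independently and which must move together.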
9. Cloud will reduce operating costs and staff budget
Running your own hardware means paying for the equipment itself and for the people who operate it. The costs do not all disappear after moving to the cloud: you still have to respond to the needs of the business and hire people for development, support, troubleshooting, and budget planning. You will also need to invest in software and tooling for the new infrastructure.
The team needs people who understand how the new technologies work, which means highly qualified employees. So even with a smaller headcount, you may spend just as much, if not more, on the salary of one good specialist.
10. No-Ops is close…
No-Ops is the dream of any business: a fully automated environment that does not depend on third-party services and products. Is it achievable?
A modest team of a few people is realistic only for small companies whose business is not directly data-related. Everyone else will need at least one specialist who integrates and packages all the systems, evaluates them against each other, automates them, provides visibility, and fixes the bugs that come up along the way.
To sum up: moving data processing pipelines to the cloud is a good thing. For the migration to work as it should, plan the process carefully, taking into account all the pitfalls described above. Think a few steps ahead and everything will work out.