Hi, my name is Victoria, and I am responsible for marketing at CROC Cloud Services. We now regularly host cloud meetups. I recently attended a great talk by Dmitry Anoshin, who now works at Amazon, and I want to share it.
I have had a strong feeling that large commercial companies have decided to collect every piece of data in the world that they can reach. On the one hand, this translates into advanced analytics, increased sales and more attractive products. On the other hand, the data has become so voluminous and comprehensive that jokes about trucks full of CD-ROMs have long been commonplace.
Let's see why it might be necessary to migrate to the cloud, and what Amazon got from moving the internal infrastructure to Redshift and NoSQL DynamoDB. Let's analyze the difference between the concepts of SMP and MPP, ETL and ELT and try to understand why clouds are needed for big data.
If you already know what has been happening in the industry in recent years, skip straight to the specific case. Read on: I have prepared a summary of the main points of the talk.
Telemetry from each light bulb
Large companies show a very noticeable trend towards building integrated ecosystems around their users. You wake up, go to brush your teeth and at the same time look through the news in a multimedia mirror. The Alexa smart speaker plays upbeat music in the morning and reminds you of today's meetings. You order fresh coffee with home delivery right there, because the old pack is running out. You get into the car, and there is Alexa again, integrated with the car's multimedia system and keeping you company on the road. Add a smart bracelet, headphones, phone apps and thousands of other sources of information.
This is both a slightly frightening future that is rapidly arriving from all directions and an attempt by companies to create additional value for the end consumer. You have to agree, it is cool when, for example, under the Amazon Key In-Car program your purchases are delivered directly to the trunk of your car in the parking lot. I now live in Canada, and such integrations make life much more comfortable. For the company, this is also very valuable data for sales targeting, demand forecasting, logistics optimization and more. Win-win.
There is one problem. As I said, there is a strong feeling that companies often collect data on an excessive scale in the hope of monetizing it in the future. And that means terabytes. In reality, terabytes of poorly structured information continuously flow onto the company's servers, devouring network, computing and storage resources. That is why using resources optimally and keeping computation fast is so important. You also need to give business analysts a decent interface that does not require expert knowledge of building cloud infrastructure. This is why many large companies have moved towards the cloud.
There is no cloud
Cloud technology is a buzzword that pretty much everyone is tired of. No doubt it looks solid in a company's financial statements and official presentations. Nevertheless, at the hardware level these are the same good old servers sitting in data centers around the world. However, a cloud needs more than just a convenient virtualization console. The main feature of clouds is fully dynamic resource management with automatic scaling when necessary:
- Compute.
- Network resources and transport.
When you have such an infrastructure, you utilize your resources much more fully, which in large-scale business cases can result in significant savings.
For small companies, this approach can also be very attractive. Imagine that you are planning to purchase new hardware for your infrastructure next year. At the same time, it is very difficult to predict the exact load, which depends on many factors. For example, your product suddenly becomes wildly popular thanks to a successful publication on Habr, a whole crowd of customers rushes in, and they are bitterly disappointed because you did not plan for such peak loads. The reverse can also happen: you overestimate demand, buy excess capacity and end up with idle equipment, which effectively takes much-needed money out of the company's turnover. Betting solely on purchased hardware is almost always an extremely slow process, and it certainly loses in adaptability in a rapidly changing market.
Partial or complete migration to the cloud suits such situations: the cloud serves as a kind of capacitor that smooths out consumption spikes, or even provides all of your infrastructure.
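To make the "capacitor" idea concrete, here is a minimal sketch that compares buying a fixed number of servers with an idealized cloud that follows demand hour by hour. Everything in it is an assumption for illustration: the demand curve, the prices and the function names are hypothetical, and real autoscaling reacts with a delay and has minimum billing units.

```python
def fixed_cost(demand, servers, price_per_server_hour):
    """Cost and unserved demand when capacity is bought up front.

    demand is a list of "servers needed" per hour (hypothetical numbers).
    """
    cost = servers * price_per_server_hour * len(demand)
    unserved = sum(max(0, needed - servers) for needed in demand)
    return cost, unserved


def elastic_cost(demand, price_per_server_hour):
    """Cost when capacity exactly follows demand (idealized autoscaling)."""
    return sum(demand) * price_per_server_hour


# Hypothetical day: 20 quiet hours and a 4-hour spike after a popular post.
demand = [2] * 20 + [10] * 4

print(fixed_cost(demand, servers=4, price_per_server_hour=1.0))   # cheap, but drops load
print(fixed_cost(demand, servers=10, price_per_server_hour=1.0))  # safe, but mostly idle
print(elastic_cost(demand, price_per_server_hour=1.5))            # pricier per hour, no idle time
```

Even with a higher assumed hourly price, the elastic option in this toy example costs less than sizing for the peak and, unlike the small fixed fleet, does not turn customers away.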
Types of clouds
In practice, depending on their business model, companies usually end up with one of three forms of cloud. Small businesses usually use public clouds and save on hiring the corresponding specialists, focusing on their product. Particularly large companies resemble many separate companies united by a common goal and brand, so they often build private clouds to achieve optimal resource utilization. Others use hybrid models, which let them process especially sensitive, legally protected data locally and hand less critical tasks to external clouds.
Pizza as a service
I always really liked this illustration, which shows well the degree of delegation of your company's infrastructure tasks to the vendor.
The traditional On-Premises option is to go buy the food, preheat the oven, and cook the pizza yourself. Perfect! But you need to have all the equipment, ingredients and more.
IaaS is the infrastructure rental option. You rent a kitchen with all the equipment, bring your own ingredients and make an excellent pizza. Specially trained people clean the grease off the oven, and you do not need to worry about how sharp the knives are or other trifles.
PaaS is a platform as a service. The service gives you some extras on top of bare infrastructure. One example is Amazon Redshift as a data warehouse, which lets you save on DBAs and focus on the product. In our pizza example, this could be ready-made shaped dough that you only need to thaw, spread with aromatic sauce and sprinkle with mushrooms, slices of tender bacon and grated parmesan.
The final option is SaaS. In this case, you get the most finished product, on top of which you build your business. For example, you run a blog on someone else's public platform. In our example, this is the most expensive but simplest option: ordering a ready-made pizza delivered to your home.
Data by the truckload. Snowmobile
There is a very old joke from the 2000s: "A team of truck drivers was able to deliver 100,000 CDs from Odessa to Kiev in one night. Thus, they reached a data transfer rate of 2.43 gigabytes per second over a distance of more than 500 km without the use of expensive cables."
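The joke's figure can be checked on the back of an envelope. The sketch below assumes 700 MB per CD and an 8-hour night; neither number is stated in the joke, they are simply the assumptions under which the digits 2.43 come out, and they come out in gigabytes per second.

```python
def sneakernet_throughput(num_discs, disc_bytes, hours):
    """Average throughput, in bytes per second, of physically moving discs."""
    return num_discs * disc_bytes / (hours * 3600)


CD_BYTES = 700 * 10**6  # assumption: a 700 MB CD
NIGHT_HOURS = 8         # assumption: "one night" lasts 8 hours

rate = sneakernet_throughput(100_000, CD_BYTES, NIGHT_HOURS)
print(f"{rate / 10**9:.2f} GB/s")  # prints "2.43 GB/s"
```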
At the time it was just a joke. However, with today's continuous stream of photos, audio, video and other telemetry from every mobile phone, it stops being funny and turns into a real problem. When you do not have a dedicated high-bandwidth fiber link to a data center, moving large amounts of data to the cloud can be a huge problem. This is where services such as Amazon's Snowball come to the rescue.
You are sent a rugged, tamper-resistant case packed with 50 terabytes of high-speed disks and 10-gigabit network interfaces. You connect it directly to your storage and copy all the data at maximum speed. In case of theft or other trouble, the data leaves your server room only in encrypted form. There is a TPM module in the case, and the encryption keys are managed using the AWS Key Management Service (KMS). Encryption keys are not stored on the device itself.
In especially demanding cases, you can call in AWS Snowmobile, a mobile data center on a truck with a capacity of 100 petabytes. When data volumes approach an exabyte, a typical 10-gigabit connection would need about 26 years for the transfer. A convoy of these white trucks can move the same data in about six months.
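The exabyte estimate is easy to reproduce. This is a rough sketch that assumes decimal units (1 EB = 10^18 bytes) and a link that runs at full line rate around the clock, which is why it lands slightly below the quoted 26 years; any real link with protocol overhead would take longer.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600
EXABYTE = 10**18           # bytes, decimal units (assumption)
PETABYTE = 10**15          # bytes, decimal units (assumption)


def transfer_years(data_bytes, link_bits_per_second):
    """Years needed to push data_bytes through a fully utilized link."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_YEAR


years = transfer_years(EXABYTE, 10 * 10**9)      # 10-gigabit link
trucks = math.ceil(EXABYTE / (100 * PETABYTE))   # 100 PB per truck

print(f"{years:.1f} years over the network, or {trucks} truckloads")
```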
Amazon's migration from Oracle to Redshift
What we had
I'll tell you a little about the case I worked on at Amazon. Major trading platforms like Amazon have a very painful part of the job: peak sales events such as Prime Day, Black Friday and the Christmas sales. At these moments the servers are melting under load, the warehouses are packed with loaders, and logistics is choking under a continuous flow of goods. This is a very important time for sales, and every hour of downtime or unavailability of the service means huge losses.
The problem came from the Oracle database. It simply could no longer handle such a volume of simultaneous queries, and the site was practically collapsing under the onslaught of customers. The database had become the scaling bottleneck.
After careful analysis, the company concluded that traditional SQL databases are not suitable as a backend for a trading platform of this magnitude. On top of that, Oracle is extremely expensive in terms of licenses and support. As a result, it was decided to migrate to Amazon's own cloud platform, based on Redshift and the NoSQL database DynamoDB.
DynamoDB was an internal development with synchronous replication between data centers and an extremely effective mechanism for reducing data redundancy, which made it possible to save significantly on storage. A very important feature was Auto Scaling: the database scales dynamically to the required amount of data. Good integration with Hadoop was also worked out.
What is the main problem of a traditional database?
The problem is that the old Oracle-based setup follows the SMP architecture, which only allows vertical scaling. That is, you have one powerful machine with a certain amount of memory and a lot of fast storage, and all queries end up on it one way or another. This is the classic Oracle model, focused on selling its powerful stand-alone servers. At the same time, the company did not particularly believe in clouds, and parallel computing was not considered a promising direction. What we needed was MPP, a parallel architecture that lets you spread a query across many separate machines and process the data faster.
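The difference between the two models can be sketched in a few lines. This is only an illustration of the data flow, not of how Redshift is implemented: the "nodes" are threads in one process, the round-robin distribution and all function names are assumptions, and no real speed-up should be expected from it.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(rows, num_nodes):
    """Distribute rows round-robin across the nodes (hypothetical scheme)."""
    return [rows[i::num_nodes] for i in range(num_nodes)]


def node_partial_sum(slice_rows):
    """Work a single node does on its own slice: filter and pre-aggregate."""
    return sum(amount for region, amount in slice_rows if region == "EU")


def smp_total(rows):
    """SMP style: one big machine scans every row itself."""
    return node_partial_sum(rows)


def mpp_total(rows, num_nodes=4):
    """MPP style: scatter the query to the nodes, then combine partial results."""
    slices = split_into_slices(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(node_partial_sum, slices))
    return sum(partials)


orders = [("EU", 10), ("US", 5), ("EU", 7), ("US", 3), ("EU", 1)]
print(smp_total(orders), mpp_total(orders))  # both give the same answer
```

The point of the design is that each node only touches its own slice, so adding machines shrinks the per-node work, whereas the SMP version can only get faster by buying a bigger server.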
There is another important point: the ETL versus ELT approach to getting data into the database.
ETL stands for Extract -> Transform -> Load. That is, we first receive data from our sources, carefully structure it, and only then load it into our storage. The ELT approach loads raw, noisy data into the storage first, and the processing happens on the storage side. In principle, Redshift supports both approaches, but ETL has an advantage: already filtered data is faster to access and easier to manipulate, although more resources are spent on the initial processing of the raw information.
There is another non-obvious point. ETL reduces GDPR risks under European law by filtering out sensitive information before it reaches the shared repository. This reduces the risk of unauthorized access to the data.
The main tool for primary data processing in the new architecture was Matillion. It has a nice GUI, is highly configurable and comes in a version tailored to Amazon Redshift. Thanks to it, we managed to lower the entry threshold. Now product managers can configure incoming data streams in a visual designer without the help of our data engineers.
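The ETL and ELT approaches described above can be contrasted with a small sketch. It uses Python's built-in sqlite3 as a stand-in for the warehouse, which is purely an assumption for the sake of a runnable example; the table names, the sample records and the filtering rule are all hypothetical.

```python
import sqlite3

# Hypothetical raw events: one record is noisy, all contain personal data.
raw_events = [
    {"email": "alice@example.com", "amount": "19.90"},
    {"email": "bob@example.com", "amount": "oops"},
    {"email": "carol@example.com", "amount": "5.00"},
]


def etl(conn, events):
    """ETL: clean and de-identify in application code, load only the result."""
    conn.execute("CREATE TABLE sales_etl (amount REAL)")
    clean = []
    for event in events:
        try:
            clean.append((float(event["amount"]),))  # email is dropped here
        except ValueError:
            continue  # noisy record never reaches the warehouse
    conn.executemany("INSERT INTO sales_etl VALUES (?)", clean)


def elt(conn, events):
    """ELT: load raw rows as-is, then transform inside the warehouse with SQL."""
    conn.execute("CREATE TABLE raw_events (email TEXT, amount TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [(e["email"], e["amount"]) for e in events],
    )
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT CAST(amount AS REAL) AS amount FROM raw_events "
        "WHERE amount GLOB '[0-9]*.[0-9]*'"
    )


def total(conn, table):
    """Sum the amount column of one of the demo tables."""
    return conn.execute(f"SELECT SUM(amount) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
etl(conn, raw_events)
elt(conn, raw_events)
print(total(conn, "sales_etl"), total(conn, "sales_elt"))
```

Both paths produce the same clean totals, but only the ELT path leaves the raw e-mail addresses sitting in the warehouse, which is exactly the GDPR concern mentioned above.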
This is what our final architecture ended up looking like. We got the flexibility, scalability and smoothing of peak loads that we needed. For example, we were able to solve the problem of processing 50 GB of web server logs per day to predict visitor behavior.
We also introduced Tableau, which allowed us to switch from poorly linked Excel spreadsheets to unified dashboards that are convenient for management.
Just in case, let me clarify: the store has an Oracle OLTP database (the backend) and Oracle DW, an analytical data warehouse. The project was aimed at both, but I am talking specifically about Oracle DW! That is, the diagram and description given are local and concern only our team at Amazon. The same goes for Tableau. When I say "we implemented Tableau", I mean the local project, since at Amazon everything is divided into teams, and each team chooses what to do and what to implement and use.
Clouds, despite the somewhat unhealthy hype around them, are already today's reality. Most likely, the majority of business projects will be built around such infrastructure in one way or another. Perhaps not every company will adopt such solutions. But it is worth planning for further development now, otherwise it will be difficult to respond quickly to rapidly changing market conditions and fierce competition.
If you are interested in cloud analytics and modern solutions, go here. I post useful content there.
Come to our meetups
CROC Cloud Services has already hosted a series of talks by excellent speakers. For example, the topic of one meetup was the practical use of AWS services in real life. We have planned several more events for next year and will announce them in detail. Stay tuned.