# How to learn Big Data – experts answer

Where should a novice programmer start exploring Big Data, and what skills are needed?

To answer this question, we must first decide from which point of view we approach the study of Big Data.

It could be:

1. Collection and storage of data.
2. Data analysis.

Data analysis is largely mathematics. It requires knowledge of algorithms and mathematical methods, sometimes very specific ones. It also requires intuition, including mathematical intuition, if I may put it that way: the ability, for example, to look at the same numbers in a different way.

On the other hand, the entire Big Data field would not be in demand if billions of records were not being collected, processed, and stored. It is therefore important to be able to record, store, and process data efficiently.

A novice developer needs to decide what he can handle and what interests him most. One path is mastering frameworks, writing all kinds of data-collection modules, and working with tools such as Kafka, Spark, the Elastic Stack, Hadoop, and similar technologies, that is, the collection and accumulation of data. The other path is the mathematician's: first learning to build simple models and simple data samples, then going deeper and deeper into the process using one of the languages specialized for data analysis.

Very often data is analyzed with Python or R, and more specialized languages are also used. Some do data analysis in functional languages such as Scala or Haskell. Data is often collected with Java and, more broadly, with whatever has been built on the JVM that is applicable to collecting, analyzing, and processing data: Netty, Akka, Kafka, Spark, and so on.
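As a minimal illustration of the kind of first analysis a beginner might do in Python, here is a sketch using only the standard library. The order totals are invented sample data, not taken from the article:

```python
import statistics

# Hypothetical sample: order totals from an online store
order_totals = [120.0, 85.5, 230.0, 85.5, 410.0, 57.25, 120.0]

mean = statistics.mean(order_totals)      # average order value
median = statistics.median(order_totals)  # less sensitive to outliers
spread = statistics.stdev(order_totals)   # how widely the values vary

print(f"mean={mean:.2f} median={median:.2f} stdev={spread:.2f}")
```

Even this toy example shows the mathematical side of the work: choosing the median over the mean is already a modeling decision about how much outliers should matter.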

I would like to note that a beginner programmer does not need to think about any of this at all. The very concepts of "novice programmer" and "Big Data" are somewhat incompatible.

A programmer must be fairly experienced to engage in data collection, for example. And a data analyst is no longer so much a programmer as a mathematician, someone on friendly terms with systems analysis, probability theory, statistical error, and similar research tools.

Today, the typical Big Data developer's path starts out relatively simply, following a standard scheme: receive the data, write it to a relational database, learn to select, filter, aggregate, and group it, build fact tables, and use columnar NoSQL solutions for accumulation and subsequent sampling. Sometimes the relational database itself can even be omitted, but no one has canceled the SQL language. Why? Because on popular NoSQL solutions (take Spark, which is part of the large Hadoop ecosystem) it often makes sense to present data as a data frame, an analogue of a table in a relational database, and then apply SQL-like queries. Depending on the data and the team's experience, this can be more convenient, faster, and more efficient.
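To make the standard scheme concrete, here is a minimal sketch using Python's built-in `sqlite3`; the table and rows are invented for illustration. It covers the first stage described above: write records to a relational database, then filter, aggregate, and group them with SQL.

```python
import sqlite3

# In-memory relational database standing in for stage one of the pipeline
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "buy", 10.0), (1, "buy", 25.0), (2, "view", 0.0), (2, "buy", 5.0)],
)

# Filter (WHERE), aggregate (SUM), and group (GROUP BY):
# total purchase amount per user
rows = conn.execute(
    """
    SELECT user_id, SUM(amount) AS total
    FROM events
    WHERE action = 'buy'
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(rows)  # [(1, 35.0), (2, 5.0)]
```

Almost the same query would run unchanged against a Spark data frame registered as a temporary view, which is exactly why SQL stays relevant even when the relational database itself is replaced.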

To become a Big Data developer, you need a broad outlook, and you need to join a team that is already doing Big Data effectively.
But, as a rule, these teams invite experienced people from other, related fields rather than novice developers.

Becoming a Big Data programmer is very difficult. It is hard even to want it: you need to understand why you are there and why you want to work with it. And that means experience is needed, and it is experienced programmers who have experience. A play on words, but still.

What skills should one possess, and what tools should one use?

A programmer should know algorithms and data structures well: solid algorithmic training and familiarity with all kinds of structures and data representations. Why? Take the B-Tree. It is a convenient structure for storing data in a file system, that is, on disk, and sooner or later all data ends up on disk. If we propose some other storage solution, we should first check whether the B-Tree family already contains something suitable and well studied, with known lookup time, memory efficiency, and so on, so that our "other solution" does not turn out to be a reinvented wheel or a poorly studied design with unknown problems. Strangely, even experienced developers often use B-Tree indexes in databases without understanding how they work. Not to mention hash tables, Bloom filters, and SSTables, which you simply must understand very well and be able to use effectively.
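Of the structures mentioned, the Bloom filter is small enough to sketch here. This is an illustrative toy, not a production implementation: the bit-array size and the way hash positions are derived are arbitrary choices made for the example.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'possibly present' or 'definitely absent'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means the item was definitely never added;
        # True means it was probably added (false positives are possible)
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True: no false negatives
print(bf.might_contain("user:999"))  # almost certainly False
```

Storage engines use exactly this trick to skip disk reads: before searching an SSTable on disk, check its Bloom filter in memory, and if the answer is "definitely absent", do not touch the disk at all.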

And this comes only with experience, experience, and more experience. Theory is good, but knowing and being able to apply are different things.

In addition, working with big data has an important specificity: as soon as we start talking about Big Data, we immediately get "big" problems. We always have too little RAM, too little disk space, and too few nodes in the cluster. This is the normal state of affairs, and one must be able to turn it from a sad situation into a workable one. And who would entrust an inexperienced person with, say, a 100-node Linux cluster? For the customer and for the company, this is money: the electricity wasted by an inexperienced team is comparable to the cost of experienced big-data developers.

Big Data is not two computers in a company that consume a bit of electricity and whose downtime is, say, uncritical. It is a large coordinated cluster with complex software and high operation and maintenance costs. And, as a rule, this big data and its analysis are intended for someone. If the data is lost or processed inefficiently, the project is doomed to failure. It is important not only to analyze the data correctly, but also to manage to do it within a finite time interval, delivering an answer by a specific date.

For example, suppose marketers at a large online store need analysis-based recommendations within a month. If the month passes and there is still no clear answer to the question, that's it: the discount season is over, and we never found out which products to offer the store's customers on, say, Black Friday. The team prepared, researched, and computed for a year, and computed nothing. That will be very painful for the company. Will it entrust Big Data work to novice developers and researchers?