Which algorithms and data structures should a novice data scientist master? Experts answer

To start, you need to know the basics: how classical machine learning works and how to solve regression and classification problems with linear regression, logistic regression, random forests, gradient boosting on decision trees, and so on. In practice, gradient-boosted trees and random forests are used most often, but some problems are still solved with logistic regression. All of this is supervised learning.
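As an illustration of the simplest supervised classifier mentioned above, here is a minimal sketch of logistic regression trained with batch gradient descent. The data (hours of study versus passing an exam), the function names, and the learning rate are assumptions made up for this example; in real work you would normally rely on a library implementation.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical labelled data: hours of study -> exam passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)


def predict(x):
    """Predicted probability of passing after x hours of study."""
    return sigmoid(w * x + b)


print(f"w={w:.2f}, b={b:.2f}, p(pass | 1h)={predict(1.0):.2f}, p(pass | 4h)={predict(4.0):.2f}")
```

The labels are what make this supervised: the model only adjusts `w` and `b` to reduce the gap between its predictions and the known answers.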

As for unsupervised learning algorithms, which let you do analytics without labelled targets, it is important to know how to reduce dimensionality: these methods often help to get rid of unnecessary factors, for example to make a model lighter before it goes to production. You also need to know clustering methods well; popular ones include k-means and HDBSCAN. The first is suitable when we set the number of clusters manually. For example, we have information about people's height and weight, we know that the production facilities can produce 100 T-shirts, and we know that there are 5 T-shirt sizes, that is, 5 clusters. The task is to understand how many shirts of each size actually need to be made to meet market demand, so that we do not end up with, say, too many L or S shirts. The k-means algorithm shows how many people fall into each cluster. HDBSCAN, by contrast, determines the required number of clusters itself.
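The T-shirt example can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: the synthetic height and weight data, the size labels, and the helper `allocate`, which splits the 100 shirts in proportion to cluster sizes. It is a bare-bones k-means, not a production implementation.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 2-D points; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for h, w in points:
            nearest = min(
                range(k),
                key=lambda i: (h - centroids[i][0]) ** 2 + (w - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((h, w))
        # Update step: move each centroid to the mean of its members.
        updated = [
            (sum(h for h, _ in c) / len(c), sum(w for _, w in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the algorithm has converged
            break
        centroids = updated
    return centroids, clusters


def allocate(sizes, total):
    """Split `total` items in proportion to `sizes` (largest-remainder rounding)."""
    n = sum(sizes)
    quotas = [s * total / n for s in sizes]
    counts = [int(q) for q in quotas]
    by_remainder = sorted(range(len(sizes)), key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in by_remainder[: total - sum(counts)]:
        counts[i] += 1
    return counts


# Hypothetical data: 300 people with height in cm and weight in kg.
rng = random.Random(42)
people = []
for _ in range(300):
    height = rng.gauss(172, 9)
    weight = 0.9 * (height - 100) + rng.gauss(0, 8)
    people.append((height, weight))

centroids, clusters = kmeans(people, k=5)  # 5 sizes -> 5 clusters
order = sorted(range(5), key=lambda i: centroids[i][0])  # shortest cluster first
shirts = allocate([len(clusters[i]) for i in order], total=100)

for label, count in zip(["XS", "S", "M", "L", "XL"], shirts):  # labels are illustrative
    print(label, count)
```

Note that the number of clusters is passed in as `k=5`; the algorithm only decides who belongs where.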

Classical machine learning methods are good, but they often fall short of business expectations, either because of how they work or because of the quality they deliver. For example, if one factor carries 70-80% of the weight in a trained model, the business may decide that the model is effectively linear and reject the results. A young specialist should choose a direction that is interesting to grow in professionally. If predictive analytics appeals to you, for example, you additionally need to study time series analysis, which has many specifics of its own, and deep learning. You also need to know how the classic time series models (ARIMA, SARIMA, and so on) work.
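To give a feel for what sits inside the ARIMA family, here is a minimal sketch of its simplest non-trivial member, ARIMA(1,1,0) without an intercept: difference the series once, fit a single autoregressive coefficient by least squares, and roll the forecast forward. The sales figures and function names are hypothetical, and real libraries estimate the coefficients differently (typically by maximum likelihood).

```python
def difference(series):
    """First differences: the 'I' (integrated) part with d = 1."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares estimate of phi in v[t] = phi * v[t-1] + noise."""
    num = sum(prev * cur for prev, cur in zip(values, values[1:]))
    den = sum(prev * prev for prev in values[:-1])
    return num / den


def forecast(series, steps):
    """Forecast `steps` future values by predicting differences and summing them back."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    level, step = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        step = phi * step      # next predicted change
        level = level + step   # undo the differencing
        predictions.append(level)
    return predictions


# Hypothetical monthly sales whose growth halves every month.
sales = [100, 108, 112, 114, 115]
print(forecast(sales, steps=2))  # [115.5, 115.75]
```

Seasonal variants such as SARIMA add further differencing and lag terms at the seasonal period, which is part of why time series have specifics of their own.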

Another direction is recommendation systems. They are used when you need to recover a relationship: which user will choose which product. A great example is Netflix, which recommends relevant video content based on a viewer's preferences. The basic methods in this area are singular value decomposition (SVD), collaborative filtering, and factorization machines. You can, of course, immerse yourself in the study of quantum Boltzmann machines, but that will most likely be needed only for research, not for day-to-day work.
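A minimal sketch of user-based collaborative filtering shows the idea of recovering "which user will choose which product". The users, items, and ratings below are invented for the example, and cosine similarity is just one common choice of similarity measure.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "eve": {"C": 5, "D": 2, "E": 4},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their_ratings)
        if sim <= 0:
            continue
        for item, rating in their_ratings.items():
            if item in ratings[user]:
                continue  # already seen by this user
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


print(recommend("ann", ratings))  # ['D', 'E']
```

Here "ann" is most similar to "bob", so his favourite unseen item "D" is ranked first. Matrix factorization methods such as SVD pursue the same goal by learning compact user and item vectors instead of comparing users directly.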

Another interesting area is natural language processing (NLP). Those who want to develop in this direction need to know the basic models and keep up with current tools and models: gensim, word2vec, spaCy, BERT. The last of these is quite heavy; it is mainly needed for large-scale development at large IT companies and is difficult to use in everyday work.

Another area that has been very popular lately is computer vision. To develop in it, a specialist needs to study both classical models, for example analytical algorithms based on the OpenCV library, and more advanced and robust tools such as convolutional neural networks.
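Classical filters and convolutional networks share one core operation: sliding a small kernel over an image. The sketch below applies a fixed Sobel kernel, a classical edge detector, to a tiny hand-made image; the image and the function name `conv2d` are assumptions for illustration. A convolutional network performs the same computation, but learns the kernel values from data instead of fixing them by hand.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x6 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Sobel kernel that responds to horizontal changes in brightness (vertical edges).
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

for row in conv2d(image, sobel_x):
    print(row)  # each row is [0, 36, 36, 0]
```

The response is zero in the flat regions and large where the window straddles the dark-to-bright boundary, which is exactly what an edge detector should report.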

Thus, to start a career as a data scientist, you need a solid base, after which you can study in depth the domain areas you want to develop in. Of course, you can change direction at any time, because in principle it is all roughly the same “under the hood”, just approached with different tools and from different angles.
