Seven tricky questions from teachers at the MTS School of Data Analysts

The School of Data Analysts from MTS has prepared seven tricky questions that novice specialists in Data Science, ML, and Big Data may encounter. Let's go!

Maxim Shalankin

Data Science Team Lead in the Fintech team at the MTS Big Data Center

1. When can and should you retrain a deep learning model?

a) When there is a free GPU and it needs to be put to use.

b) When you need to solve a specific problem: find toxic words, detect a customer's intent when they contact the bank, and so on.

c) In almost any task where there are clear requirements for the result.

Comment

The first answer is the worst. On the contrary, you should not use a sledgehammer to crack a nut; look for the most effective approach instead. For example, a simple filter can find a frame in an image.

The correct answer is the last one. The main applications of deep learning are text problems (when regular expressions are not enough), graph problems (by the way, we released CoolGraph, an open-source library for getting started with GNNs), and recommendations, both with DSSM architectures and transformer-based approaches, not to mention content embeddings, for example based on CLIP.

In all these cases, to get an acceptable result you need to find data and train your own model, and this is a complex and demanding process. Its importance should not be underestimated.
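For example, detecting toxic comments is usually approached by fine-tuning a pretrained language model rather than training one from scratch. A minimal sketch of that idea, assuming the Hugging Face transformers and datasets libraries; the dataset file, column names, and training settings here are hypothetical placeholders, not a production recipe:

```python
# Sketch only: fine-tuning a pretrained transformer for toxic-text classification.
# Assumes a hypothetical toxic_comments.csv with "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"   # any suitable pretrained encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load the labeled data and split off a validation set
data = load_dataset("csv", data_files="toxic_comments.csv")["train"]
data = data.train_test_split(test_size=0.2)
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length"), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxic-clf", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()   # fine-tune the pretrained weights on the task-specific data
```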

2. Which algorithm should I choose to determine eye color based on purchases on the site?

a) Clustering using the k-means algorithm.

b) Lasso regression.

c) Multiclass classification using the gradient boosting algorithm.

Comment

Determining eye color from browsing history is an inside joke, but there is serious theory behind it. Basic ML includes many different algorithms: there are different classes of problems, and each has a dozen algorithms of its own. Our example is a multiclass classification problem, because eyes can be green, brown, gray, or another color. That is, there are roughly 4–5 possible answers, but you need to choose only one. Different algorithms solve this problem, and gradient boosting is one of them.
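As a toy illustration of that class of problem, here is a minimal sketch of multiclass classification with gradient boosting in scikit-learn. The data is synthetic, standing in for the hypothetical "purchase features":

```python
# Illustrative sketch: multiclass classification with gradient boosting
# on synthetic data (a stand-in for purchase features and eye-color labels).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 20 synthetic features, 4 classes (say: green, brown, gray, other)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```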

Nikita Malykhin

Tech Lead in the AdTech team at the MTS Big Data Center

3. What is the main benefit of using Docker in the context of developing and deploying ML applications?

a) Docker makes it easy to deploy to any environment by creating containers that isolate applications and their dependencies.

b) Docker automatically optimizes ML application code to improve performance without developer intervention.

c) Docker replaces the need to use virtual machines, as it allows you to run applications directly on the hardware level without an operating system.

Comment

The correct answer is the first one. Docker does make deploying and managing applications easier with containers, but it doesn't automatically optimize code and doesn't run without an operating system. Packaging machine learning applications in Docker provides consistency in the environment, ensuring that models run the same from development to deployment.

With its help, you can include all the necessary dependencies in the container, such as specific versions of libraries and frameworks, and eliminate compatibility issues and dependencies on external libraries. This makes applications more portable, allowing them to be easily moved between different environments and platforms without re-configuration.
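As a rough sketch of what such packaging can look like, here is a hypothetical Dockerfile for a small model service. The file names and dependency list are assumptions for illustration, not a fixed recipe:

```dockerfile
# Hypothetical example: packaging an ML service so it runs the same everywhere.
FROM python:3.11-slim

WORKDIR /app

# Pin dependencies (e.g. scikit-learn, fastapi) in requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the serialized model and the serving code into the image
COPY model.pkl app.py ./

# Start the prediction service
CMD ["python", "app.py"]
```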

Students usually think that analysts only work with ML models, but at the school we teach analysts the other technologies they may encounter in their work. If you know and understand the principles of Docker, it becomes easier to communicate with colleagues, and you can build an MVP yourself, which greatly speeds up your work.

4. Why is MLOps needed when developing and deploying machine learning models?

a) MLOps is used solely to improve the accuracy of machine learning models by automatically tuning hyperparameters.

b) MLOps automates and optimizes the life cycle of ML models – development, testing, deployment and monitoring – resulting in more efficient and reliable model management.

c) The main goal of MLOps is to replace developers and data engineers with automated tools that build and deploy models without human intervention. Glory to the robots!

Comment

The correct answer here is the second one. MLOps does focus on automation and model lifecycle management, but it is not limited to hyperparameter tuning and does not aim to eliminate human involvement entirely. In the MLOps unit at the School of Analysts, we dive into integrating DevOps practices with machine learning processes and introduce the key concepts of model lifecycle management.

You will get to know tools and platforms such as Kubernetes for container orchestration, GitLab CI/CD for deployment automation, Airflow for regular model retraining, and version control systems that let you track changes to data and models.
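As an illustration of one of those pieces, here is a minimal sketch of a scheduled retraining pipeline in Airflow (2.x style). The DAG name, schedule, and task functions are hypothetical placeholders:

```python
# Sketch, not a production pipeline: a weekly model-retraining DAG in Apache Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def train_model():
    ...  # load fresh data, fit the model, save the artifact

def evaluate_model():
    ...  # compare new metrics with the current production model

with DAG(
    dag_id="weekly_model_retraining",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",   # regular retraining on new data
    catchup=False,
) as dag:
    train = PythonOperator(task_id="train", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)
    train >> evaluate              # evaluation runs only after training succeeds
```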

5. Why is calibration of model probabilities necessary?

a) Probability calibration is necessary so that the model can produce accurate probability estimates.

b) Probability calibration is necessary to improve the speed of the model. The better calibrated the model, the faster it can make decisions.

c) Probability calibration is needed so that the model can work with any type of data, including text and graphic. Calibration makes the model universal and independent of data type.

Comment

Calibrating the probabilities of machine learning models plays a key role in ensuring the accuracy and reliability of predictions. In classification problems, calibration allows models to produce correct probabilistic estimates. Without it, the model's probabilities may be overconfident or, conversely, underconfident. Therefore, the correct answer is the first one.

There are several typical approaches to calibrating model probabilities. One of the most common is Platt's method, which uses logistic regression to convert model outputs into calibrated probabilities. Isotonic regression is also widely used; it suits more complex cases and does not require assumptions about the functional form of the dependence. Both methods have their advantages and limitations, and the choice of approach depends on the specific problem and type of model.
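A minimal sketch of both approaches using scikit-learn's CalibratedClassifierCV on synthetic data; the base model and settings are arbitrary choices for illustration:

```python
# Sketch: Platt scaling ("sigmoid") vs. isotonic calibration in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

base = RandomForestClassifier(n_estimators=100, random_state=0)

for method in ("sigmoid", "isotonic"):        # "sigmoid" is Platt's method
    calibrated = CalibratedClassifierCV(base, method=method, cv=3)
    calibrated.fit(X_train, y_train)
    proba = calibrated.predict_proba(X_test)[:, 1]
    # Lower Brier score means better-calibrated probabilities
    print(method, "Brier score:", round(brier_score_loss(y_test, proba), 4))
```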

Sergey Korolev

Head of the applied architecture competence center at the MTS Big Data Center

6. You work in a small engineering team. The day will definitely come when the customer wants to receive the model's results in near real time. What will you do?

a) I will write my own API to produce results “on request”.

b) I will push the data to the Kafka message bus with Spark as soon as it is built.

c) I will delve into the business problem and weigh the value and costs of implementation together with the Product Owner.

Comment

Yes, the first two answers are indeed implementation options, and we will discuss them during the course. But in real life even near-real-time interaction is often redundant and expensive, so the right first step is to understand the business problem and weigh the value against the implementation costs together with the Product Owner.
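For reference, the simplest version of option (a) might look like the hypothetical sketch below, assuming FastAPI, uvicorn, and a model serialized to model.pkl; all names here are illustrative:

```python
# Hypothetical sketch of option (a): a tiny "on request" scoring API.
import pickle
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("model.pkl", "rb") as f:
    model = pickle.load(f)          # pre-trained model loaded once at startup

class Features(BaseModel):
    values: list[float]             # raw feature vector sent by the client

@app.post("/predict")
def predict(features: Features):
    score = model.predict_proba([features.values])[0, 1]
    return {"score": float(score)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```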

7. For timely scoring or building features on big data, distributed computing frameworks such as MapReduce and Spark are used. What is the best way to speed up such calculations?

a) Ask a colleague data engineer.

b) Increase the number of computing units (workers) that run in parallel, so everything finishes faster.

c) Reduce the number of data shuffling operations and keep the distribution of data between workers close to uniform.

Comment

The most accurate answer is the last one, and during the course we will figure out why. The strength of the course is its broad scope. We will put big data technologies into practice: MapReduce, Spark, and many others. We will see how to carry out analytical calculations with them optimally.

Yes, data analysis and processing teams do have the role of data engineer, and this question is closer to their area of competence. But in practice such a role may simply not exist in the team, and experience with these tools will be a plus for any employee. There are also cases when you need to build a prototype, and the ability to do it yourself significantly speeds up and simplifies the work of both the analyst and their colleagues.
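A small PySpark sketch of the idea behind the correct answer: avoid unnecessary shuffles (for example with a broadcast join) and keep partitions evenly sized. The table and column names are made up for illustration:

```python
# Sketch of answer (c): fewer shuffles and a more even data distribution.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-tuning-sketch").getOrCreate()

events = spark.read.parquet("events.parquet")   # large fact table
users = spark.read.parquet("users.parquet")     # small dimension table

# 1. A broadcast join ships the small table to every worker and avoids
#    shuffling the large one across the cluster.
joined = events.join(F.broadcast(users), on="user_id")

# 2. Repartitioning by a well-distributed key keeps partitions close to
#    uniform, so no single worker becomes a straggler on skewed data.
balanced = joined.repartition(200, "user_id")

daily = balanced.groupBy("user_id", "event_date").agg(F.count("*").alias("events"))
daily.write.mode("overwrite").parquet("daily_features.parquet")
```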


Write in the comments how many questions you were able to answer correctly and which questions caused the most difficulties. As a rule, theory is not enough to answer such questions; personal experience or communication with more experienced colleagues is needed. At the School, through homework, we give students the opportunity to gain knowledge and skills that will enable them to solve not only these problems, but also more complex ones that arise when working with real projects.

In ten months you will learn the basics of SQL, Python, neural networks, recommender systems, and MLOps. You will learn how data is stored and used in distributed file systems, work with data in NoSQL databases for Big Data such as Apache HBase and Cassandra, master the architecture of a typical Spark application, and much, much more. More information about the course and its details is available on the website. Hurry: applications are accepted until October 20.
