How to Prepare Future Data Analysts and ML Specialists for Real-World Business Challenges

Future data analysts, BI analysts, ML developers and data scientists complete student assignments on ready-made datasets to learn the principles of data processing, to formulate and validate hypotheses, and to build predictive models.

However, the tasks that students solve are often of little use to either the students themselves or the data owners. Students do not gain experience in solving practical problems, nor do they see that the results of their work can be useful to a business. In this article, I will discuss why this happens and how to gain experience that will be useful in your work.

What data do students work with?

In preparation for their future profession, students analyze data, identifying patterns and trends in various types of data, such as:

  1. Synthetic data. These imitate real data and are created artificially using various algorithms and models. ML models trained on them demonstrate the capabilities of machine learning well and are useful for understanding the principles and statistical basis of the approaches being studied. However, they show accuracy that is often unattainable in real life, which can encourage the pursuit of metrics rather than business value.

  2. Pre-filtered data. These are datasets that are partially or completely free of noise and errors, limited in subject area, volume, and time interval. Such data helps improve the quality and accuracy of learning outcomes, but when working with them, important stages of subject area research, data collection, and validation are missed. Often, ready-made solutions to educational problems can be found for such datasets, which hinders the development of skills that will be useful in the workplace.

  3. Raw datasets. These are data that have been collected by researchers or organizations and have not yet undergone preliminary processing or cleaning. They most fully reflect the real world, but often do not have specific tasks and feedback from their owner, which is important for refining and improving hypotheses.

Students learn machine learning on filtered and noise-free data. They achieve high accuracy rates on training tasks, but struggle when working with real data in companies. Using synthetic or pre-filtered data creates the illusion of knowledge and does not prepare students for business problems.
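To make the gap concrete, here is a minimal sketch of the kind of audit that raw data calls for and that pre-filtered datasets let students skip. It uses only the Python standard library. The records, the field names (`company_id`, `revenue`, `founded`) and the `current_year` default are invented for illustration and do not come from any real dataset.

```python
from collections import Counter

# Toy records invented for illustration; the field names are hypothetical.
RAW_ROWS = [
    {"company_id": "A1", "revenue": "1200", "founded": "2015"},
    {"company_id": "A2", "revenue": "", "founded": "2018"},      # missing value
    {"company_id": "A1", "revenue": "1200", "founded": "2015"},  # exact duplicate
    {"company_id": "A3", "revenue": "-50", "founded": "2031"},   # implausible values
    {"company_id": "A4", "revenue": "n/a", "founded": "2009"},   # text placeholder
]


def to_number(text):
    """Return float(text), or None when the cell is empty or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def audit(rows, current_year=2024):
    """Count common data-quality issues in a list of dict records."""
    issues = Counter()
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            issues["duplicate_rows"] += 1
            continue  # skip duplicates so their problems are not counted twice
        seen.add(key)

        revenue = to_number(row.get("revenue"))
        if revenue is None:
            issues["revenue_missing_or_invalid"] += 1
        elif revenue < 0:
            issues["revenue_negative"] += 1

        founded = to_number(row.get("founded"))
        if founded is None or founded > current_year:
            issues["founded_missing_or_future"] += 1
    return dict(issues)


report = audit(RAW_ROWS)
print(report)
```

Every counter in such a report raises a question for the data owner (is an empty revenue cell "zero" or "unknown"?), which is exactly the kind of feedback that is missing when the dataset arrives already cleaned.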

“Everything was easy for me during my studies. I didn't have to dive into the domain area, I could just work with the machine learning model as a black box – I threw in the data and it calculated. When I came to my first job and tried to apply my skills, the result was terrible. Now I teach students myself and try to convey to them the importance of real practice, for example, at hackathons. I tell them that in reality they will have many more problems than an insufficiently good score; problems with quality, resources and interaction with stakeholders will matter more.”

Artem Galimyanov, Data Scientist, lecturer at RANEPA and Skillbox

Let me highlight the characteristics of the data that I consider important for solving business problems:

  • Realism and completeness – for the analysis results to be accurate and reliable, the data must reflect the real world, including its distortions.

  • Statement of the problem and formulation of hypotheses – these elements define the goals and direction of the research and justify its significance and relevance.

  • Feedback – an opportunity for a student to clarify hypotheses with the data owner, receive additional information to improve the result, and better understand the subject area.

What are the sources and opportunities for working with open data?

Students need datasets of different formats that meet the specifics of the tasks and learning goals. Synthetic data and pre-filtered data help to master basic skills. However, to become a sought-after specialist, it is necessary to be able to work with raw datasets that are as close to real conditions as possible.

Raw datasets help students:

  • Diversify the learning tasks they perform. Raw data leave room for setting tasks and formulating hypotheses independently.

  • Test their skills in practice. Working with raw data helps to test the acquired knowledge in conditions close to real projects.

  • Build a portfolio. Successful projects using such data will be a great addition to a resume.

Sources of this data may be:

  1. Project activities in universities. Some universities collect databases and datasets for students to use for educational purposes. For example:

    A database of available datasets collected by the National Research University Higher School of Economics

    National Olympiad in Data Analysis for Schoolchildren of Grades 9-11

    These sources can provide high-quality datasets, but the question of correctly formulating hypotheses and tasks and obtaining feedback remains.

  2. Participation in hackathons. A hackathon is a competition where participants create innovative technology projects or solutions. Companies provide real data and give participants the opportunity to immerse themselves in the domain area in a couple of days. As a result, participants learn to understand the domain area, make decisions based on real data, and demonstrate their abilities to potential employers.

    All hackathons in Russia

Pros of hackathons:

  • Data almost like in the real world

  • Setting practical tasks

  • Fast feedback

Disadvantages of hackathons:

  • Time limits

  • High entry threshold

  • High costs for the organization

  3. Searching for data in open sources. These are usually raw data. For example:

    The Center for Diagnostics and Telemedicine provides sets of anonymized X-ray diagnostic images, from which algorithms learn to find pathologies on their own.

    Moscow datasets, which contain information about city sports and cultural events, as well as data on city facilities: courtyards, container sites, roads and others.

The problem with open data is that using it for learning requires a correctly formulated task and feedback from a teacher. The main obstacle when working with open datasets is therefore the lack of ready-made tasks that are as close as possible to those analysts perform in business.

How Real Data Can Help You Learn and Solve Business Problems

I work at DataNewton, a counterparty screening service and a platform for working with information about counterparties. The data we provide is taken from more than 50 official sources. We have a lot of information on legal entities and individual entrepreneurs in Russia, and we are ready to share it with students and universities.

Using DataNewton data, you can solve the following problems:

  1. Developing and training a machine learning model to predict the probability of bankruptcy of enterprises based on time series of financial indicators.

  2. Forecasting business success and building recommendation systems that help entrepreneurs select partners and contractors.

  3. Tasks related to geodata about companies. For example, legal registration in one of Russia's federal subjects can indirectly indicate the sphere of activity.

  4. OSINT (open source intelligence) tasks: collecting and analyzing publicly available data to obtain additional information.
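As an illustration of the first task in this list, the sketch below turns a company's yearly financial indicators into a few features that a bankruptcy model could consume. It is only one possible starting point. The field names (`revenue`, `assets`, `liabilities`), the toy numbers and the chosen features are assumptions made for the example; they are not DataNewton's actual schema or a recommended feature set.

```python
def growth_rates(values):
    """Year-over-year relative change; None where it cannot be computed."""
    rates = []
    for prev, curr in zip(values, values[1:]):
        if prev in (None, 0) or curr is None:
            rates.append(None)
        else:
            rates.append((curr - prev) / abs(prev))
    return rates


def build_features(history):
    """Flatten one company's yearly indicators into a feature dict.

    `history` is a list of yearly dicts ordered from oldest to newest.
    """
    revenue = [year.get("revenue") for year in history]
    valid_growth = [r for r in growth_rates(revenue) if r is not None]

    latest = history[-1]
    assets = latest.get("assets")
    liabilities = latest.get("liabilities")
    if assets and liabilities is not None:
        debt_to_assets = liabilities / assets
    else:
        debt_to_assets = None  # ratio is undefined without both values

    return {
        "last_revenue_growth": valid_growth[-1] if valid_growth else None,
        "mean_revenue_growth": (
            sum(valid_growth) / len(valid_growth) if valid_growth else None
        ),
        "debt_to_assets": debt_to_assets,
        "years_observed": len(history),
    }


# Hypothetical three-year history for a single company (made-up numbers).
history = [
    {"year": 2020, "revenue": 100.0, "assets": 200.0, "liabilities": 80.0},
    {"year": 2021, "revenue": 120.0, "assets": 220.0, "liabilities": 110.0},
    {"year": 2022, "revenue": 90.0, "assets": 200.0, "liabilities": 150.0},
]

features = build_features(history)
print(features)
```

Keeping missing values as `None` instead of silently filling them is deliberate: with real company data, deciding how to treat gaps is itself a question to discuss with the data owner.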

Instead of a conclusion

When students solve business problems, they develop practical skills and become more motivated in their studies. Completed projects that solve company problems will create interaction between universities and businesses, with benefits for each side:

  • Students will receive real data and learn to solve the problems a business actually has, under real working conditions. A completed solution can attract an employer's attention and helps the employer identify motivated students who are ready for further cooperation.

  • Teachers will not waste time coming up with projects and topics for term papers and theses. They will be able to provide students with a list of ready-made problems and tasks from which they can choose the most interesting one.

  • The business will provide students not only with data but also with feedback, and in return will receive solutions to its own problems and a list of potential employees.

This practice will help in the development and training of qualified specialists ready to work in modern market conditions.

Please write if you would like to use our data to solve educational tasks or projects.
