5 approaches to data labeling for machine learning projects

When we revised our Deep Learning course to make it more visual and grounded in cases from real business practice, we added a new module on data labeling with the Yandex.Toloka crowdsourcing platform.

But since crowdsourcing is not the only way to label data, we have prepared for new students of the course a translation of this article from the Lionbridge blog, which surveys the main approaches to data labeling. We hope you find it useful too.

The quality of a machine learning project depends directly on how you approach three core tasks: data collection, preprocessing, and labeling.

Labeling is usually a complex and time-consuming process. For example, image recognition systems often require drawing bounding boxes around objects, while product recommendation and sentiment analysis systems may require knowledge of cultural context. Keep in mind, too, that a dataset can contain tens of thousands of samples or more, each of which needs a label.
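To make the bounding-box example concrete: a box is commonly stored as its corner coordinates, and agreement between two annotators' boxes can be measured with intersection-over-union (IoU). A minimal sketch (the function name and the `(x1, y1, x2, y2)` box format are our own choices, not from the article):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Zero if the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

An IoU near 1.0 means two annotators drew nearly the same box; a low IoU flags an item for review.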

Thus, your approach to a machine learning project will depend on the complexity of the task at hand, the scope of the project, and its implementation schedule. With these factors in mind, we have identified five main approaches to data labeling and weighed the arguments for and against each.

The various ways to label data for machine learning fall into the following categories:

In-house: as the name suggests, this means labeling data with your own team of analysts. This approach has a number of obvious advantages: the process is easy to control, and you can be confident in the accuracy and quality of the work. However, it is usually practical only for large companies with their own data analysis staff.

Outsourcing: a good option when a data labeling team is needed only for a limited period of time. By posting an ad on recruiting sites or on your social networks, you can build a pool of potential annotators; interviews and test tasks then identify those with the necessary skills. This is a great way to form a temporary team, but it requires clear planning and organization: new employees will need training before they can work at the required level, and if you don't already have a data labeling tool, you'll need to purchase one.

Crowdsourcing: crowdsourcing platforms let you solve a specific task with the help of a large number of performers. Since these platforms draw performers from a wide variety of countries and allow filtering by skill level, they are a fast and fairly inexpensive option. That said, platforms vary considerably in performers' qualifications, quality control, and project management tools, so take all of these parameters into account when choosing one.
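A common quality-control technique on crowd platforms is to give the same item to several performers and aggregate their answers. A minimal majority-vote sketch (the item names and labels are invented for illustration):

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label and the fraction of annotators who chose it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Three crowd performers labeled each image; aggregate their answers.
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}
consensus = {item: majority_label(votes) for item, votes in annotations.items()}
```

Items with low agreement (e.g. a 2/3 vote) can be routed back for an extra opinion; real platforms such as Toloka offer built-in overlap and aggregation settings along these lines.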

Synthetic method: synthetic labeling means creating or generating new data containing the attributes your specific project requires. One way to do this is with a generative adversarial network (GAN), which pits two neural networks (a generator and a discriminator) against each other: the generator creates fake data, and the discriminator learns to distinguish real data from fake. The result is highly realistic new data. GANs and other synthetic methods let you derive completely new data from existing datasets. This approach is very time-efficient and excellent for obtaining high-quality data, but at present synthetic generation methods require a large amount of computing power, which makes them expensive.
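A GAN is too large for a short snippet, but the basic idea of synthetic labeled data can be sketched with a much simpler generative model: sample points around per-class means, so the label is known by construction. This toy example (function name, class means, and spread are all our own assumptions) uses NumPy:

```python
import numpy as np

def synth_dataset(n_per_class, class_means, spread=0.5, seed=0):
    """Generate labeled 2-D points by sampling Gaussian clouds around class means."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for label, mean in enumerate(class_means):
        # Each class is a Gaussian blob; the label is free, by construction.
        xs.append(rng.normal(loc=mean, scale=spread, size=(n_per_class, 2)))
        ys.append(np.full(n_per_class, label))
    return np.vstack(xs), np.concatenate(ys)

# 100 samples each for two classes centered at (0, 0) and (3, 3).
X, y = synth_dataset(100, class_means=[(0.0, 0.0), (3.0, 3.0)])
```

The same principle scales up: a GAN replaces the hand-chosen Gaussians with a learned generator, at much higher computational cost.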

Programmatic method: uses scripts to label data automatically. This can automate tasks such as labeling images and text, significantly reducing the number of human annotators needed. A program also doesn't take breaks, so you get results much faster. However, the method is still far from perfect, and programmatic labeling usually requires a quality-control team to verify the correctness of the labels along the way.
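In its simplest form, programmatic labeling is a set of hand-written rules. A toy sentiment labeler (the keyword lists are invented; a real project would build rules with domain experts and route ambiguous items to humans):

```python
# Hypothetical rule sets; a real project would curate these with domain experts.
POSITIVE = {"great", "excellent", "love", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def rule_label(text):
    """Label a text positive/negative/unknown by counting keyword hits."""
    words = set(text.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unknown"  # no clear signal: route to the quality-control team

labeled = [(t, rule_label(t)) for t in
           ["I love this", "terrible service", "it arrived"]]
```

The "unknown" bucket is where the human quality-control team mentioned above comes in: rules label the easy bulk, people handle the remainder.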

In this table, we provide a visual comparison of the above methods:




| Method | Pros | Cons |
| --- | --- | --- |
| In-house | Process control; high quality; predictable outcome | Time-consuming |
| Outsourcing (temporary team) | Ability to assemble a team for a specific task | Time for training; planning and organizing the process |
| Crowdsourcing | Suited to large-scale tasks; low cost of work | Quality is difficult to control; resources required to collect data on the platform |
| Specialized data-processing companies | High quality; suited to large-scale tasks | High price |
| Synthetic generation and augmentation | Time efficiency; large amounts of training data can be produced | High computing power required |
| Programmatic | Fast; automated | Low quality |

Each labeling method has its own strengths and weaknesses. Which one is best for you depends on a number of factors: the complexity of the use case, the size of the training dataset, the size of your company and analyst team, your budget, and your deadlines. Be sure to consider all of these factors when planning your data labeling project.


Deep Learning Course 6.0 from Newprolab started on November 9.

The next course, Deep Learning 7.0, will run from March 30 to April 22, 2021.
