Where to look for datasets and what to do with them – experts answer


The main sources of information on data science are books. To beginners I would recommend The Elements of Statistical Learning and Machine Learning: A Probabilistic Perspective (Murphy). You need to read them cover to cover – it will take a couple of months. They describe in detail the whole classical theory: how all of this works and why it is needed.


There are typical datasets for each area of specialization: ImageNet and COCO (computer vision), Wikipedia dumps (NLP), OpenTTD / Freesound (audio). Tabular data has a lot of nuances; the classic is Titanic. But a great many types of tasks are solved on tabular data, each with its own data quirks. The more datasets you work through, the easier it will be later. There is also Kaggle, which has datasets for almost any kind of task.
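The first steps on any tabular dataset are the same regardless of which one you pick. A minimal sketch with pandas, using a tiny inline sample that mimics the Titanic competition's columns (the sample values here are made up for illustration):

```python
import pandas as pd

# Tiny inline sample mimicking the Kaggle Titanic columns;
# in practice you would load the competition CSV instead.
df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 2, 1],
    "Sex":      ["female", "male", "male", "female", "male"],
    "Age":      [29.0, 22.0, None, 35.0, 54.0],
    "Survived": [1, 0, 0, 1, 0],
})

# The usual first checks on a fresh dataset:
print(df.shape)            # rows x columns
print(df.isna().sum())     # missing values per column

# A first "feature of the data": survival rate by group.
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

These three checks (shape, missing values, a group-wise target rate) are where most of the nuances the text mentions first show up.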

What to do with open datasets

Follow the generally accepted canons: implement architectures, compare against a baseline, try to improve them or tune parameters, speed them up (the MLE path). Or do EDA, show it to someone experienced, try to model different factors and look for heuristics (the DS path). Groupings and aggregations can be picked up along the way – they are not difficult.
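"Compare against a baseline" can be sketched concretely. A minimal example with scikit-learn: a majority-class dummy model sets the bar, and an actual model must beat it to mean anything (the dataset and models here are just illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# The model you are actually evaluating.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"baseline: {baseline_acc:.3f}, model: {model_acc:.3f}")
```

An improvement over a dummy baseline is the minimum evidence that the model learned something; the same skeleton works when comparing a new architecture against a published standard.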


Visualizations come in two kinds: working plots for analysis and polished dashboards for business. The first you can make and show to someone experienced; the second you will have to master on real projects, showing them to less technical people. The bottom line is that good visualization simplifies perception: if a layperson gets the point, the essence is conveyed well. Kaggle has visualization and data-analysis contests – you can study the best entries and learn from them. It is harder to prescribe a plan here, because there is no clear curriculum for this in DS; usually the emphasis is on whatever will be useful in the near future.
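A working "plot for analysis" of the first kind can be as simple as two overlaid histograms. A sketch with matplotlib on synthetic data (the distributions and labels here are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
ages_survived = rng.normal(28, 12, 300)  # synthetic, for illustration
ages_perished = rng.normal(32, 14, 300)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist([ages_survived, ages_perished], bins=20,
        label=["survived", "perished"])
ax.set_xlabel("Age")
ax.set_ylabel("Count")
ax.set_title("Age distribution by outcome (synthetic data)")
ax.legend()
fig.savefig("age_hist.png", dpi=100)
```

Labeled axes, a title, and a legend are what separate a plot a layperson can read from a raw `hist()` dump.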

Regarding Kaggle

Within a competition there are clearly defined tasks with metrics and a leaderboard. This will teach you maybe 20–40% of building pipelines. But seasoned competitors play there too, so good results are not guaranteed – don't let that discourage you.
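The pipeline skeleton that Kaggle drills into you looks roughly like this: preprocessing and model chained in one object, scored with cross-validation so the local score tracks the leaderboard. A minimal sketch with scikit-learn (the dataset and estimators are illustrative stand-ins for competition data):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Preprocessing + model in one Pipeline: fitting the scaler inside
# each CV fold avoids leaking test-fold statistics into training.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The point of the Pipeline is exactly the leakage comment: a scaler fit on the full dataset before splitting would inflate the local score relative to the leaderboard.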

Statement of problems, hypotheses, metrics

This is very difficult to learn on your own. Google the literature on the scientific method and learn it by heart, so that you could be woken up at night and asked to recite the methodology. You will have to break metrics down: which ones exist and what their properties are. Picking the right one comes with experience. Hypotheses will also come only with experience, but you need to practice on those same datasets.
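A quick illustration of why metric properties matter: on imbalanced data, accuracy can look excellent for a model that is useless. A toy sketch with scikit-learn (the labels are synthetic):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Imbalanced toy labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)  # looks great: 0.95
rec = recall_score(y_true, y_pred)    # catches zero positives: 0.0
f1 = f1_score(y_true, y_pred)         # 0.0

print(acc, rec, f1)
```

Knowing that accuracy rewards majority-class guessing while recall and F1 punish it is exactly the kind of metric property the text says you have to break down.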

ML systems

You will have to piece together, bit by bit, how ML systems work in business. Here I can't even give a specific link – good compilations are very rare. You need to google, watch talks, and communicate with colleagues: go to conferences, listen, and absorb.


Practice with datasets provides only a small share of the necessary knowledge. Beyond it, it is important to understand precisely why a given method A and metric B are used. Theory is needed for that, and it is shaky for almost everyone, so revisiting it will not be superfluous. Even a graduate of a strong university should retake courses in linear algebra and mathematical statistics.

Finally, it is important to understand who you see yourself as. The typical DS is close to an analyst: less code, more immersion in the business. The best way to pick up business skills is an internship at a company with established processes.

The typical MLE is strong at programming, immersed in the math as deeply as possible, and communicates little with the business. For them it is more important to settle on a direction (vision, text, audio, time series, etc.; tabular data fits less well here) and to study all the key architectures on their own – implement them yourself, read a lot.
There are rare people who can do both, but they are very few.
