# The most important idea in Data Science

### Tips for separating distractions from useful information

If you take an introductory statistics course, you will learn that data can be used either to inspire a theory or to test one, but never both. Why is that?

People are a little too good at finding patterns in everything: real patterns, invented patterns, you name it. We are the kind of creatures that find Elvis’ face in a potato chip. If you are tempted to equate patterns with insights, remember that there are three kinds of patterns:

• Patterns that exist both in your dataset and beyond.
• Patterns that exist only in your dataset.
• Patterns that exist only in your imagination (apophenia).

A data pattern can exist (1) in the entire population of interest, (2) only in the sample, or (3) only in your head.

Which data patterns are useful to you? It depends on your goals.

### Inspiration

If you need pure inspiration, data can work wonders. Even apophenia (the human tendency to mistakenly perceive connections and meaning between unrelated things) can get your creative juices flowing. Creativity has no right answers, so all you have to do is look at your data and play with it. As a bonus, try not to waste too much time (yours or your stakeholders’) along the way.

### Facts

When your government wants to collect taxes from you, it could not care less about patterns that go beyond your financial data for the year. The tax service needs to make a factual decision about how much you owe, and the way to make that decision is to analyze last year’s data. In other words, look at the data and apply the formula. What is needed here is purely descriptive analytics, tied to the data at hand. Either of the first two kinds of pattern works well for this.

Descriptive analytics tied to existing data.

(I never hid my financial statements, but I suspect the United States government would not be thrilled if I used the statistical methods I learned in graduate school to replace my actual tax figures with statistical estimates.)

### Decisions under uncertainty

Sometimes the facts you have are not the facts you wish you had. When you do not have all the information you need to make a decision, you have to navigate uncertainty while trying to choose a reasonable course of action.

That is exactly what statistics is: the science of changing your mind under uncertainty. The game is to leap into the unknown like Icarus … and not smash into smithereens.

This is the main challenge of data science: how not to end up *less informed* as a result of studying data.

Before jumping off that cliff, you had better hope that the patterns you found in your limited view of reality actually hold beyond it. In other words, to be useful to you, patterns must generalize.

Of the three kinds of patterns, only the first (the kind that generalizes) is safe for making decisions under uncertainty. Unfortunately, you will find the other kinds in your data too, and this is the big problem at the heart of data science: how not to end up less informed as a result of studying the data.

### Generalization

If you think that finding useless patterns in data is a purely human privilege, think again! If you are not careful, machines can commit the same foolishness for you automatically.

The whole point of machine learning and AI is to generalize correctly to new situations.

Machine learning is an approach to making many similar decisions: it searches your data for patterns algorithmically and uses them to respond correctly to completely new data. In machine learning and AI jargon, generalization is your model’s ability to perform well on data it has not seen before. What good is a pattern-based model that works only on old data? For that, you could simply use a lookup table. The whole point of machine learning and AI is to generalize correctly to new situations.

That is why the first kind of pattern on our list is the only one that is good for machine learning. Patterns of this kind are signal; everything else is just noise (quirks that exist only in your old data and get in the way of building a model that generalizes).

Signal: patterns that exist both in your data set and beyond.

Noise: patterns that exist only in your data set.

In fact, ending up with a solution that handles old noise rather than new data is what machine learning calls overfitting (we say this term in the same tone you reserve for your favorite curse word). In machine learning, almost everything we do is aimed at avoiding overfitting.
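To make the lookup-table point concrete, here is a minimal sketch on made-up data. The data generator, the two toy models, and all names are assumptions chosen for illustration, not anyone’s real pipeline. One model memorizes the old data exactly; the other fits a single straight line through it.

```python
import random

def make_data(n, rng):
    """Hypothetical data: y is roughly 2 * x, plus random noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 2 * x + rng.gauss(0, 1)
        rows.append((x, y))
    return rows

def mean_abs_error(predict, rows):
    """Average absolute gap between predictions and actual values."""
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)

rng = random.Random(0)
old_data = make_data(200, rng)   # data the models are allowed to see
new_data = make_data(200, rng)   # data from the same source, never seen

# Model A: a lookup table that memorizes every old (x, y) pair.
# For an x it has never seen, it can only fall back to the old average.
table = dict(old_data)
fallback = sum(y for _, y in old_data) / len(old_data)

def lookup_model(x):
    return table.get(x, fallback)

# Model B: a straight line through the origin, fitted by least squares.
slope = sum(x * y for x, y in old_data) / sum(x * x for x, _ in old_data)

def line_model(x):
    return slope * x

print("lookup table, old data:", mean_abs_error(lookup_model, old_data))
print("lookup table, new data:", mean_abs_error(lookup_model, new_data))
print("line model,   old data:", mean_abs_error(line_model, old_data))
print("line model,   new data:", mean_abs_error(line_model, new_data))
```

The lookup table is flawless on the old data because it has memorized the noise along with the signal, and it falls apart on new data. The line model is slightly wrong everywhere, but about equally wrong on old and new data, which is what generalization looks like.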

### So which kind of pattern is *this* one?

Suppose that the pattern you (or your computer) extracted from your data exists outside your imagination. Which category does it belong to? Is it a real phenomenon that exists in your population of interest (signal), or is it a quirk of your dataset (noise)? How can you tell which kind of pattern you have found?

If you have already examined all the available data, you cannot. You are stuck, with no way to tell whether your pattern exists anywhere else. The whole rhetoric of statistical hypothesis testing depends on surprise, and pretending to be surprised by a pattern you already know is in your data is bad taste (in fact, it is essentially p-hacking).

It’s like seeing a rabbit-shaped cloud and then checking whether all clouds look like rabbits … by looking at the same cloud. I hope you see that you will need new clouds to test your theory.

Any data used to formulate a theory or question cannot be used to verify the same theory.
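Here is the rabbit cloud as a small simulation. It is a sketch on pure noise; the feature count, sample sizes, and names are all assumptions made up for the demo. We hunt through 50 meaningless yes/no features for the one that best matches a meaningless yes/no label, then "test" that discovery twice: once on the same data and once on data that played no part in the hunt.

```python
import random

rng = random.Random(42)
N_FEATURES = 50

def make_noise_rows(n):
    """Each row is (50 random 0/1 features, a random 0/1 label). No signal."""
    return [
        ([rng.randint(0, 1) for _ in range(N_FEATURES)], rng.randint(0, 1))
        for _ in range(n)
    ]

def agreement(rows, j):
    """Fraction of rows in which feature j equals the label."""
    return sum(1 for feats, label in rows if feats[j] == label) / len(rows)

first_cloud = make_noise_rows(100)      # the data we stare at for inspiration
new_clouds = make_noise_rows(10_000)    # data that could not influence us

# "Inspiration": pick whichever feature happens to match the label best.
best = max(range(N_FEATURES), key=lambda j: agreement(first_cloud, j))

same_score = agreement(first_cloud, best)   # testing on the same cloud
fresh_score = agreement(new_clouds, best)   # testing on new clouds

print("best-looking feature:", best)
print("agreement on the data that inspired us:", same_score)
print("agreement on fresh data:", fresh_score)
```

On the data that inspired the theory, the chosen feature looks better than a coin flip, because it was selected precisely for looking good there. On fresh data it drops back to roughly a coin flip, which is all it ever was.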

What would you do if you knew you had access to only one cloud? Meditate in a closet, that’s what. Ask your question before you look at the data.

Mathematics never runs counter to common sense.

Here we come to the saddest conclusion. If you use your dataset in search of inspiration, then you cannot use it again to thoroughly test the theory that it inspired (no matter what tricks of mathematical jujitsu you use – mathematics never contradicts common sense).

### Hard choice

The point is that you have to make a choice! If you have only one dataset, you are forced to ask yourself: “Do I meditate in a closet, formulate my hypotheses for statistical testing, and then carefully take a rigorous approach, all so that I can take myself seriously? Or do I just mine the data for inspiration, accepting that I may be fooling myself and remembering to use phrases like ‘I feel’, ‘this inspires’, or ‘I’m not sure’?” Hard choice!

Or is there a way to have your cake and eat it too? The problem is that you have only one dataset, and you need more than one. If you have enough data, then I have a trick that will. Blow. Your. Mind.

### Tricky trick

To succeed in data science, just turn one dataset into (at least) two by splitting your data. Then use one part for inspiration and the other for rigorous testing. If the pattern that originally inspired you also shows up in data that could not have influenced your opinion, then it is likely that the pattern is a general rule in whatever source you are drawing your data from.

If the same phenomenon is observed in both sets of data, it is probably a general rule that holds wherever these data come from.
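In code, the trick can be as small as this. It is a minimal sketch: the function name `split_dataset`, the 30% test fraction, and the toy `records` list are assumptions for illustration, and a real project might prefer a library routine or a split that respects time or groups.

```python
import random

def split_dataset(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows, then cut it into two non-overlapping parts.

    Returns (exploration_rows, test_rows). The input list is left untouched.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 1000 record ids standing in for real rows.
records = list(range(1000))
exploration, test = split_dataset(records)

print(len(exploration), "rows for inspiration")
print(len(test), "rows locked away for testing")
```

Shuffling before cutting matters: if the rows arrive sorted (by date, by customer, by anything), slicing without a shuffle would give the two parts different character, and a pattern could fail the test for reasons that have nothing to do with whether it is real.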

### SYDD!

Since the unexamined life is not worth living, here are four words to live by: split your damn data.

The world would be better if everyone split their data. We would have better answers (thanks to statistics) and better questions (thanks to analytics). The only reason people do not treat data splitting as a mandatory habit is that in the last century it was a luxury very few could afford. Datasets were so small that if you tried to split them, there might be nothing left.

Split your data into an exploration dataset that everyone can dig through for inspiration, and a test dataset that experts will later use to rigorously confirm any “insights” found during exploration.

Some projects still face this problem, especially in medical research (I used to work in neurobiology, so I have great respect for how hard it is to work with small datasets), but many of you have so much data that you need to hire engineers just to move it around … what excuse do you have?! Do not skimp, split your data.

If you don’t have the habit of splitting data, you may be stuck in the 20th century.

If you have a lot of data and you are not splitting it, you are living in an outdated paradigm. People in this paradigm have settled into archaic thinking and refused to move with the times.

### Machine learning is a descendant of data splitting

In the end, the idea is simple. Use one dataset to form a theory and get to know the data, and then do the magic: show that your ideas hold up on a completely new dataset.

Splitting data is the easiest quick fix for a healthier data culture.

This way you can safely use statistical methods and insure yourself against overfitting. In fact, the history of machine learning is the history of data splitting.

### How to use the best idea in data science

To take advantage of the best idea in data science, all you have to do is make sure you keep test data out of the reach of prying eyes, and then let your analysts go crazy over everything else.

To succeed in data science, just turn one dataset into (at least) two by splitting your data.

When you think your analysts have brought you an insight that generalizes beyond the data they explored, use your secret stash of test data to verify their conclusions.
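Put together, the workflow might look like the sketch below. Everything in it is an assumption made for illustration: the two made-up groups "A" and "B", the built-in difference between them, and the choice of a simple permutation test as the rigorous check. The shape is what matters: explore freely on one part, write the hypothesis down, then open the test data once.

```python
import random

def mean(values):
    return sum(values) / len(values)

def group_values(rows, group):
    """Pull out the measurements that belong to one group."""
    return [value for g, value in rows if g == group]

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """One-sided p-value: how often does shuffling the group labels produce
    a gap mean(b) - mean(a) at least as large as the one observed?"""
    rng = random.Random(seed)
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[len(a):]) - mean(pooled[:len(a)])
        if diff >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (n_permutations + 1)

# Hypothetical data: group B really does run higher than group A.
rng = random.Random(7)
rows = [("A", rng.gauss(0.0, 1.0)) for _ in range(500)]
rows += [("B", rng.gauss(1.0, 1.0)) for _ in range(500)]
rng.shuffle(rows)

# Lock half of the data away before anyone looks at anything.
test_rows, exploration_rows = rows[:500], rows[500:]

# Step 1: analysts go wild on the exploration data.
explore_diff = (mean(group_values(exploration_rows, "B"))
                - mean(group_values(exploration_rows, "A")))
print("gap seen while exploring:", explore_diff)

# Step 2: the hypothesis is fixed BEFORE opening the test data:
# "group B has a higher mean than group A".

# Step 3: open the test data once and check that hypothesis.
p_value = permutation_p_value(group_values(test_rows, "A"),
                              group_values(test_rows, "B"))
print("p-value on the held-out test data:", p_value)
```

The test data earns its keep only if it is used once, for a question that was settled before anyone looked at it. Peek at it during exploration and it quietly turns into more exploration data.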
