One of the world’s first data storage and sharing technologies.
In the 19th century, doctors might prescribe mercury for mood swings and arsenic for asthma. It might not have occurred to them to wash their hands before surgery. Of course, they were not trying to kill anyone; they simply did not know any better.
These early doctors scratched valuable data into their notebooks, but each of them held only one piece of a big puzzle. Without modern tools for sharing and analyzing information (and a science for making sense of the data), nothing stopped superstition from coloring what could be seen through the “keyhole” of the observed facts.
People have come a long way with technology since then, but today’s boom in machine learning and artificial intelligence is not a break with the past. It is a continuation of a basic human instinct: making sense of the world around us so that we can make smarter decisions. We simply have dramatically better technology than we ever had before.
One way to describe the pattern running through the ages is as a revolution in datasets, not data points. The difference is nontrivial. Datasets have helped shape the modern world. Consider the Sumerian scribes (in what is now Iraq) who pressed their styluses into clay tablets more than 5,000 years ago. In doing so, they invented not only the first writing system but also the first technology for storing and sharing data.
If you are inspired by promises that AI can surpass human abilities, consider that stationery gives us superhuman memory. Although it is easy to take the recording of information for granted today, the ability to store datasets reliably was a groundbreaking first step towards higher intelligence.
Unfortunately, extracting information from clay tablets and their pre-electronic counterparts is a pain. You cannot click on a book to count the words in it. Instead, you have to load every word into your brain to process it. Problems like this made early data analysis labor-intensive, so early attempts rarely got past the basics. While a kingdom might analyze its tax revenues, only a fearless soul would attempt to reason as rigorously in a field like medicine, where millennia-old traditions encouraged improvisation.
Fortunately, humanity has produced some incredible pioneers. For example, John Snow’s map of deaths, compiled during a cholera outbreak in London in 1854, inspired physicians to reconsider the superstition that the disease was caused by miasma (toxic air) and to pay attention to drinking water instead.
If you know Florence Nightingale, the Lady with the Lamp, for her heroic compassion as a nurse, you may be surprised to learn that she was also a pioneer of analytics. Her ingenious infographics from the Crimean War saved many lives: they showed that poor hygiene was the main cause of death in hospitals and inspired the government to take sanitation seriously.
The era of the single dataset took hold as the value of information began to assert itself in a growing number of fields, which led to the emergence of computers. No, not the electronic buddy you are used to today. “Computer” began as a human profession, whose practitioners performed calculations and processed data by hand to assess its significance.
All these people were computers! This photograph, taken in the 1950s, shows the staff of the Supersonic Pressure Tunnel.
The beauty of data is that it lets you form a judgment from something more substantial than thin air. Looking at data inspires you to ask new questions, following in the footsteps of Florence Nightingale and John Snow. This is the discipline of analytics: inspiring models and hypotheses through exploration.
From datasets to data splitting
At the beginning of the 20th century, the desire to make better decisions under uncertainty led to the birth of a parallel profession: statistics. Statisticians help you check whether it is sensible to act as though the phenomenon an analyst found in the current dataset also holds beyond it.
A famous example comes from Ronald A. Fisher, who developed the world’s first textbook on statistics. Fisher describes running a hypothesis test in response to a friend’s claim that she could tell whether milk had been added to tea before or after the water. He hoped to prove her wrong, but the data forced him to conclude that she really could.
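The arithmetic behind a test like this can be sketched in a few lines of Python. The sketch assumes the design usually associated with the tea experiment (eight cups, four with milk poured first, and a taster who must pick out the four milk-first cups). Those numbers are not given in the text above, and the function name `chance_of_at_least` is purely illustrative.

```python
from math import comb

def chance_of_at_least(correct, cups=8, milk_first=4):
    """Probability of identifying at least `correct` milk-first cups by guessing.

    Assumes the taster picks exactly `milk_first` cups out of `cups`,
    so the number of hits follows a hypergeometric distribution.
    The 8-cup / 4-cup design is an assumption, not taken from the article.
    """
    total = comb(cups, milk_first)  # all possible ways to pick the cups
    favourable = sum(
        comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
        for k in range(correct, milk_first + 1)
    )
    return favourable / total

p_all_right = chance_of_at_least(4)   # 1 way out of 70
p_three_plus = chance_of_at_least(3)  # 17 ways out of 70
print(f"all four right by luck: {p_all_right:.4f}")
print(f"three or more right by luck: {p_three_plus:.4f}")
```

Under these assumptions a perfect score would happen by luck only about once in 70 attempts, which is why a flawless result counts as evidence that the taster is not simply guessing.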
Analytics and statistics share a big Achilles heel: if you use the same data point both to generate a hypothesis and to test it, you are cheating. Statistical rigor requires you to declare your intentions before you act on the data, whereas analytics is more like a game of extended hindsight. The two were sadly incompatible until the next major revolution, data splitting, changed everything.
Splitting data is a simple idea, but it is one of the most important ideas for scientists like me. If you have only one dataset, you must choose between analytics (unproven inspiration) and statistics (rigorous conclusions). Want a trick? Split your dataset into two parts, and you can have your cake and eat it too!
The era of two datasets replaces the tension between analytics and statistics with coordinated work between two different kinds of data specialist. Analysts use one dataset to help you formulate questions, and statisticians use the other to give rigorous answers.
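As a minimal sketch of the two-part idea, the snippet below shuffles a dataset and cuts it into an exploration part for analysts and a held-back part for statisticians. The function name `split_in_two`, the 50/50 proportion, and the stand-in data are all assumptions made for illustration.

```python
import random

def split_in_two(records, explore_fraction=0.5, seed=0):
    """Shuffle a copy of `records` and cut it into two non-overlapping parts."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))  # stand-in for real observations
exploration, testing = split_in_two(records)

# Analysts look for patterns in `exploration` only;
# `testing` stays untouched until there is a hypothesis to check.
print(len(exploration), len(testing))
```

The important property is that no record appears in both parts, so a hypothesis inspired by one part can be tested honestly on the other.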
This luxury comes with a stringent requirement for data volume. Splitting is easier said than done, as you will know if you have ever struggled to collect enough information for even one decent dataset. The era of two datasets is a recent development that goes hand in hand with better data processing hardware, lower storage costs, and the ability to share collected information over the Internet.
In fact, the technological innovations that led to the era of two datasets quickly ushered in the next stage: the era of three datasets and automated inspiration.
There is a more familiar term for this: machine learning.
Using a dataset destroys its purity as a source of statistical rigor. You get only one shot, so how do you know which “insight” from analytics is most worth testing? If you had a third dataset, you could use it to take your idea for a test drive. This process is called validation, and it is at the heart of what makes machine learning work.
Once you are free to put everything to the test and see which ideas hold up, you can entrust the search for solutions to anyone: experienced analysts, trainees, tea-leaf readers, and even algorithms that know nothing about the context of your business problem. Whichever solution performs best in validation becomes the candidate for a proper statistical test. You have just given yourself the ability to automate inspiration!
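The selection step described here can be sketched as follows. Everything in the example is an assumption made for illustration: the synthetic data (y is roughly 2x plus noise), the candidate names, and the choice of mean squared error as the score. Candidates come from different sources, the validation part picks a winner, and the test part is touched exactly once.

```python
import random

rng = random.Random(42)

# Synthetic stand-in data: y is roughly 2 * x plus noise (an assumption for the demo).
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

# Three non-overlapping parts: one to learn from, one to validate, one to test.
train, validation, test = data[:100], data[100:200], data[200:]

def fit_slope(points):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

def mean_squared_error(slope, points):
    return sum((y - slope * x) ** 2 for x, y in points) / len(points)

# Ideas can come from anywhere; validation decides which one deserves a test.
candidates = {
    "analyst": 2.0,
    "trainee": 1.5,
    "tea_leaves": -1.0,
    "algorithm": fit_slope(train),
}

validation_scores = {
    name: mean_squared_error(slope, validation) for name, slope in candidates.items()
}
winner = min(validation_scores, key=validation_scores.get)

# The test part is used once, and only for the winning candidate.
test_error = mean_squared_error(candidates[winner], test)
print(winner, round(test_error, 3))
```

Because the test part plays no role in choosing the winner, its error estimate stays an honest check rather than one more round of hindsight.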
That’s why machine learning is a revolution in datasets, not just in data. It depends on the luxury of having enough data for a three-way split.
How does AI fit into this picture? Machine learning with multilayer neural networks is technically called deep learning, but it has picked up another nickname that has stuck: AI. Although AI once had a different meaning, today it is most often used as a synonym for deep learning.
Deep neural networks caused a stir because they outperformed traditional machine learning algorithms on many complex tasks. However, training them requires much more data, and the processing power they need goes beyond what a typical laptop can offer. That is why the rise of modern AI is tied to cloud technology. The cloud lets you rent someone else’s data center instead of assembling the hardware yourself, so you can try out modern AI technologies before you commit to investing in them.
With this piece of the puzzle, we have a complete set of professions: machine learning and AI experts, analysts, and statisticians. The umbrella term for all of them is data scientist, an expert in Data Science, the discipline of making data useful.
Data Science is the product of our era of three datasets. Many modern industries routinely generate more than enough data. So is a four-dataset approach possible?
What is your next step if the model you just trained gets a poor validation score? If you are like most people, you will immediately demand to know why! Unfortunately, there is no dataset you can ask. You may be tempted to dig into your validation dataset, but alas, debugging with it destroys its ability to validate your models effectively.
By analyzing your validation dataset, you essentially turn three datasets back into two. Instead of helping yourself, you have unwittingly stepped back into the past!
The solution lies outside the three datasets you are already using. To unlock smarter training iterations and hyperparameter tuning, you will want to move towards the best practice: the era of four datasets.
If three datasets give you inspiration, training iterations, and rigorous testing, the fourth accelerates your AI development cycle through advanced analytics aimed at revealing which approaches are worth trying at each iteration. With a four-way data split, you can take full advantage of an abundance of data! Welcome to the future.
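A four-way split can be sketched with a small helper that cuts a shuffled dataset into named parts. The part names (in particular `debugging` for the fourth set), the proportions, and the function name `split_by_fractions` are hypothetical choices for this example, not terms fixed by the text.

```python
import random

def split_by_fractions(records, fractions, seed=0):
    """Shuffle `records` and cut them into named, non-overlapping parts.

    `fractions` maps a part name to its share of the data; shares must sum to 1.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    names = list(fractions)
    parts, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last part takes whatever is left, so rounding never drops a record.
            end = len(shuffled)
        else:
            end = start + int(len(shuffled) * fractions[name])
        parts[name] = shuffled[start:end]
        start = end
    return parts

# Hypothetical names and proportions for the four parts.
parts = split_by_fractions(
    range(1000),
    {"training": 0.6, "validation": 0.15, "debugging": 0.1, "testing": 0.15},
)
for name, part in parts.items():
    print(name, len(part))
```

With a split like this, a disappointing validation score can be investigated on the debugging part, while the validation and testing parts keep their independence.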