Data Science for the Humanities: What is Data

Reflections on information, memory, analytics and distributions

All that our senses perceive is data, although its storage in our skulls leaves much to be desired. Recording it is a little more reliable, especially when we record it on a computer. When these records are well organized, we call them data … although I have seen some badly organized electronic scribbles get the same name. I’m not sure why some people pronounce the word data as if it had a capital D.

Why do we pronounce data with a capital letter?

We need to learn to be disrespectfully pragmatic about data, so this article will help beginners look behind the scenes and help practitioners explain the basics to beginners who show symptoms of data worship.

Meaning and meanings

If you start your journey by purchasing datasets online, you risk forgetting where they come from. I will start from scratch to show you that you can make data anytime, anywhere.

Here are a few permanent inhabitants of my pantry, spread out on the floor.

image

This photo is data – it is stored as information that your device uses to display beautiful colors.

Let’s make sense of what we are looking at. We have endless options for what to pay attention to and remember. This is what I see when I look at the food.

image

If you close your eyes, do you remember every detail of what you just saw? No? Me neither. That is why we collect data. If we could remember and process it flawlessly in our heads, that would not be necessary. The Internet could be one hermit in a cave, recounting all of humanity’s tweets and perfectly rendering each of our billions of cat photos.

Writing and Durability

Since human memory is a leaky bucket, it would be better to write the information down the way we did when I was studying statistics, back in the distant past. That’s right, my friends, I still have paper around here somewhere! Let’s record these 27 data points.

image

What’s good about this version, compared with what’s in my hippocampus or on my floor, is that it’s more durable and reliable.

Human memory is a leaky bucket.

We take the memory revolution for granted, since it began millennia ago with merchants who needed reliable records of who sold how many bushels of what to whom. Take a moment to appreciate how wonderful it is to have a universal writing system that stores numbers better than our brains do. When we record data, we distort our richly perceived realities, but after that we can pass imperishable copies of the result to other members of our species with perfect accuracy. Writing is awesome! Small pieces of mind and memory that live outside our bodies.

When we analyze the data, we gain access to other people’s memories.

Worried about machines that transcend our brains? Even paper can do it! These 27 small numbers are a large amount for your brain, but durability is guaranteed if you have a writing instrument at hand.

Although this is a gain in longevity, working with paper is annoying. For example, what if it suddenly occurs to me to rearrange the numbers from largest to smallest? Abracadabra, paper, show me a better order! No? Darn.

Computers and magic spells

Do you know what is surprising about software? Abracadabra actually works! So, let’s go from paper to computer.

image

Spreadsheets leave me indifferent. They are very limited compared to modern data processing tools. I prefer to alternate between R and Python, so let’s take R this time. You can follow along in your browser using Jupyter: select the “with R” tab, then click the scissors icon several times until everything is removed. Congratulations, it took 5 seconds and you are ready to paste my code snippets and run them with [Shift + Enter].

# Weights in grams of the 27 pantry items
weight <- c(50, 946, 454, 454, 110, 100, 340, 454, 200, 148, 355, 907, 454, 822, 127, 750, 255, 500, 500, 500, 8, 125, 284, 118, 227, 148, 125)
# Reorder from largest to smallest
weight <- weight[order(weight, decreasing = TRUE)]
print(weight)

You will notice that the R gibberish for sorting your data is not obvious if you are new to this.
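To peek inside the sorting spell, here is a minimal sketch on a tiny made-up vector (x and the other names are my own illustration, not the pantry data). It shows that order() returns positions rather than values, and that base R’s sort() gets the same result in one step.

```r
# A tiny made-up vector so the steps are easy to follow
x <- c(30, 10, 20)

# order() returns the positions of the largest value, the next largest, and so on
positions <- order(x, decreasing = TRUE)
print(positions)

# Indexing by those positions rearranges the values themselves
sorted_by_order <- x[positions]
print(sorted_by_order)

# sort() does both steps at once
sorted_by_sort <- sort(x, decreasing = TRUE)
print(identical(sorted_by_order, sorted_by_sort))
```

The two-step version is handy when you want to reorder one thing by another, but for a single vector sort() is the shorter spell.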

Well, this is true for the word abracadabra itself, as well as for menus in spreadsheet software. You know these things only because you were exposed to them, not because they are universal laws. To do something with the computer, you need to ask your local sage for the magic words or gestures, and then practice using them. My favorite sage is called the Internet and knows everything.

image

To speed up learning, don’t just paste magic words; try changing them and see what happens. For example, what will change if you turn TRUE into FALSE in the snippet above?

Isn’t it amazing how fast you get the answer? One of the reasons I love programming is because it is a cross between magic spells and LEGO.

If you ever wanted to work miracles, just learn how to write code.

Here’s a quick summary about programming: ask the Internet how to do something, take the magic words that you just learned, see what happens when you adjust them, and then put them together like LEGO blocks to execute your code.

Analytics and generalization

The problem with these 27 numbers is that even if they are sorted, they mean little to us. Reading them, we forget what we read a second ago. This is the human brain for you; ask us to read a sorted list of a million numbers, and at best we will remember the last few. We need a quick way to sort and summarize so that we can understand what we’re looking at.

That’s what analytics is for!

median(weight)

With the right spell, we can instantly find out what the median weight is. (Median means “the middle value”: half of the items weigh less and half weigh more.)

It turns out the answer is 284 g. Who doesn’t like instant gratification? There are all kinds of summary options: min(), max(), mean(), median(), var(), sd() … try them all! (One catch: R’s mode() reports how a value is stored, not the most common value.) Or try this magic word to see several summaries at once.

summary(weight)
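If you would rather cast the summary spells one at a time, here is a minimal sketch using the same 27 weights. The names stats, counts and most_common are my own; the last two lines are just one possible way to find the most common value, since base R has no ready-made function for it.

```r
weight <- c(50, 946, 454, 454, 110, 100, 340, 454, 200, 148, 355, 907, 454,
            822, 127, 750, 255, 500, 500, 500, 8, 125, 284, 118, 227, 148, 125)

# Collect several single-number summaries in one named vector
stats <- c(min = min(weight), max = max(weight), mean = mean(weight),
           median = median(weight), variance = var(weight), sd = sd(weight))
print(round(stats, 1))

# Tally how often each value appears, then pick the value with the highest tally
counts <- table(weight)
most_common <- as.numeric(names(counts)[which.max(counts)])
print(most_common)
```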

By the way, these things are called statistics. A statistic is any way of boiling your data down. That is not what the field of statistics is about, though; here is an 8-minute introduction to the academic discipline.

image

Plotting and visualization

This section is not about the kind of plotting that involves world domination (stay tuned for that article). It’s about summarizing data with pictures. It turns out that a picture can be worth more than a thousand words.

image

If we want to know how weights are distributed in our data (for example, are there more items between 0 and 200 g or between 600 and 800 g?), the histogram is our best friend.

image

Histograms are one way (among many) of summarizing and displaying our sample data. Taller bars mean more popular data values.

Think of histograms as popularity contests.

To create one in a spreadsheet application, the magic spell is a long series of clicks on various menus. In R, this is faster:

hist(weight)

Here’s what we get with a single line:

image

What are we looking at?

On the horizontal axis we have bins. By default, they are set to 200 g increments, but we will change this in a moment. Counts are on the vertical axis: how many times have we seen a weight from 0 to 200 g? The graph says 11. How about between 600 g and 800 g? Only one (this is salt, if memory serves).
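To see where those bar heights come from, here is a minimal sketch that tallies the bins by hand instead of drawing them. I am assuming the same 200 g edges described above; bins and bin_counts are my own names, and dig.lab = 4 is only there to keep the labels readable.

```r
weight <- c(50, 946, 454, 454, 110, 100, 340, 454, 200, 148, 355, 907, 454,
            822, 127, 750, 255, 500, 500, 500, 8, 125, 284, 118, 227, 148, 125)

# Assign every weight to a 200 g bin; each bin includes its upper edge, e.g. (0,200]
bins <- cut(weight, breaks = seq(0, 1000, by = 200), dig.lab = 4)

# Count how many weights landed in each bin; these are the bar heights
bin_counts <- table(bins)
print(bin_counts)
```

A histogram is just this table turned into a picture.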

We can choose the size of our bins. The default, which we got without fiddling with the code, is 200 g, but maybe we want to use 100 g instead. No problem! Apprentice mages can tinker with my spell to find out how it works.

hist(weight, col = "salmon2", breaks = seq(0, 1000, 100))

Here is the result:

image

Now we can clearly see that the two most common categories are 100–200 and 400–500. Does anyone care? Probably not. We did this only because we could. A true analyst, on the other hand, excels at the science of looking through data quickly and the art of looking where the interesting nuggets lie. If they are good at their craft, they are worth their weight in gold.
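For anyone who wants to double-check the picture, here is a minimal sketch that asks hist() for the numbers without drawing anything (plot = FALSE) and then picks out the two fullest 100 g bins. The names breaks, h, bin_counts and top_two are my own.

```r
weight <- c(50, 946, 454, 454, 110, 100, 340, 454, 200, 148, 355, 907, 454,
            822, 127, 750, 255, 500, 500, 500, 8, 125, 284, 118, 227, 148, 125)

# seq() builds the bin edges: 0, 100, 200, ..., 1000
breaks <- seq(0, 1000, by = 100)

# Compute the histogram without plotting it
h <- hist(weight, breaks = breaks, plot = FALSE)

# Label each count with the lower and upper edge of its bin
bin_counts <- setNames(h$counts, paste(head(breaks, -1), tail(breaks, -1), sep = "-"))

# Sort the bins from fullest to emptiest and keep the top two
top_two <- sort(bin_counts, decreasing = TRUE)[1:2]
print(top_two)
```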

What is distribution?

If these 27 items are all we care about, then the sample histogram I have just made is also the population distribution.

That is pretty much what a distribution is: it is the histogram you would get if you applied hist() to the entire population (all the information that interests you), and not just to the sample (the data you happen to have on hand). There are a few footnotes, such as the scale of the y-axis, but we will leave them for another blog post. Please don’t hurt me, mathematicians!

image

If our population is all the packaged food ever produced, the distribution would be the shape of the histogram of all their weights. Such a distribution exists only in our imagination as a theoretical idea, since some packaged food products have been lost to the centuries. We could not make this dataset even if we wanted to, so the best we can do is guess using a good sample.
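Since the real population is out of reach, here is a sketch with a pretend one. Everything in it is an assumption made for illustration: the 100,000 weights are simulated, the gamma shape and scale are arbitrary, and the variable names are my own. It only shows how a sample of 27 gives a rough guess at the shape of the whole.

```r
set.seed(42)  # make the random draws repeatable

# Hypothetical population: 100,000 made-up package weights in grams
population <- rgamma(100000, shape = 2, scale = 175)

# The sample is the small slice we actually have on hand
sample_weights <- sample(population, size = 27)

# Use the same 100 g bins for both so the shapes can be compared
breaks <- seq(0, ceiling(max(population) / 100) * 100, by = 100)
pop_hist <- hist(population, breaks = breaks, plot = FALSE)
samp_hist <- hist(sample_weights, breaks = breaks, plot = FALSE)

# Share of items in each bin, for the population and for the sample
shares <- rbind(population = pop_hist$counts / sum(pop_hist$counts),
                sample = samp_hist$counts / sum(samp_hist$counts))
print(round(shares[, 1:5], 2))  # first five bins only
```

Try changing the sample size from 27 to 2,700 and watch the sample row creep closer to the population row.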

What is Data Science

There are many opinions, but I prefer the following definition: “Data science is the discipline that makes data useful.” Its three subfields are mining large amounts of information for insights (analytics), making reasonable decisions based on limited information (statistics), and using patterns in data to automate tasks (ML / AI).

All of data science comes down to the following: knowledge is power.

The universe is full of information awaiting collection and use. Although our brains are good at navigating our realities, they are not so good at storing and processing some types of very useful information.

This is why humanity turned first to clay tablets, then to paper, and, ultimately, to silicon for help. We have developed software to look through information quickly, and today people who know how to use it call themselves data scientists or data analysts. The true heroes are those who create tools that allow these practitioners to get a better and faster grip on the information. By the way, even the Internet is an analytical tool. We just rarely think about it that way, because even children can conduct that kind of data analysis.

image

Memory upgrade for everyone

Everything that we perceive is stored somewhere, at least temporarily. There is nothing magical about data, except that it is recorded more reliably than the brain manages. Some information is useful, some is misleading, and the rest is somewhere in the middle. The same goes for data.

We are all data analysts and have always been.

We take our amazing biological capabilities for granted and exaggerate the difference between our innate processing of information and the machine-assisted variety. The difference is longevity, speed and scale … but in both cases the same rules of common sense apply. Why do these rules go out the window at the first sign of an equation?

I am glad that we call information fuel for progress, but it makes no sense to worship data as something mystical. It’s better to talk about data plainly, since we are all data analysts and always have been. Let’s give everyone the opportunity to see themselves that way.

image
