We found out what people copy from Stack Overflow and how often

They say there is some truth in every joke. In the case of our April Fool’s joke, that share was close to one hundred percent. We wanted to play on the classic Stack Overflow meme and stray a little from one of our fundamental principles. The inspiration came from the sites that so irritated the company’s founders: the ones that reveal answers to programming questions only to paying users. How would the world change if we suddenly made copying text from Stack Overflow a paid feature?

Well, that’s enough joking. We hope everyone laughed and no one got too scared. But wait, we’re not done yet. Having set up the system to react to every Command+C keypress, we realized we had a chance to learn more about what people do on the site. We captured every copy made on Stack Overflow over two weeks, and here is what we found.

You are not alone

One in four people who open a question page on Stack Overflow copy something from it within five minutes of arriving. In total, we counted 40,623,987 copies across 7,305,042 posts between March 26th and April 9th. People copy from answers about ten times more often than from questions, and about thirty-five times more often than from comments. Code blocks are copied ten times more often than the surrounding text and, surprisingly, more copies come from questions without an accepted answer than from questions with one.

So if you have ever felt ashamed of copying ready-made code instead of writing it from scratch, put your conscience at ease! Why reinvent the wheel if someone has already solved the hard parts for you? We call this knowledge reuse: what someone else once learned, built, and proved now serves you. And there is nothing wrong with that. This way you learn faster, get working code sooner, and worry less about it. Our entire site is built around the concept of knowledge reuse, and the Stack Overflow community is strong above all because of its altruistic approach to mentoring.

It is entirely permissible to stand on the shoulders of giants and borrow the lessons they learned before you in order to create something new and valuable. That said, it’s worth sticking to a few proven practices when copying, to avoid inadvertently introducing bugs or security holes: make sure you understand a snippet well before grabbing it and pasting it in. And, of course, remember that some code fragments can only be used under a license. Otherwise, we fully support anyone who wants to benefit from the work the community has created.

As someone who has been lifting code from Stack Overflow for years without a twinge of conscience, I was not surprised when copy events started pouring in by the millions. What did surprise me was how many different questions this data could answer. How many people actually copy content from Stack Overflow? Do they copy just the code, or other things too? Do they copy more from questions with accepted answers? To give our analysis some direction, my team and I drew up a list of the questions that interested us. What started as a simple joke turned into a serious study that shed light on many things and sparked numerous discussions about how to develop and improve the platform.

Data

Using an in-house web tracking tool, we created custom events that fire every time a user copies something from the site. These events let us track a range of attributes: tags, content type (question, answer or comment; code block or plain text), the reputation of the person copying, post rating, region, and post status (accepted or not). In short, we stored almost everything except the copied text itself.
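To make the shape of such an event concrete, here is a minimal sketch in Python. The class and field names are hypothetical (the article does not describe the real schema), and `post_id` is an assumed identifier added only for illustration. The one thing taken directly from the text is that the copied text itself is not part of the record.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class CopyEvent:
    """One copy action. All field names are hypothetical, not the real schema."""
    post_id: int            # assumed identifier, not mentioned in the article
    tags: Tuple[str, ...]   # tags of the question the post belongs to
    post_type: str          # "question", "answer" or "comment"
    is_code_block: bool     # True for a code block, False for plain text
    user_reputation: int    # 0 for anonymous visitors
    post_rating: int
    region: str
    is_accepted: bool
    # Deliberately no field for the copied text: it was not stored.


event = CopyEvent(
    post_id=1,
    tags=("python", "pandas"),
    post_type="answer",
    is_code_block=True,
    user_reputation=0,
    post_rating=12,
    region="Europe",
    is_accepted=False,
)
print(asdict(event))
```

Keeping the record immutable (`frozen=True`) is just a design preference for event data; nothing in the article depends on it.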

We collected data for a full two weeks, from March 26th to April 9th. All figures below describe user behavior during this period.

The top-level results confirmed what has long sounded like a joke: on Stack Overflow, everybody copies. We also quickly saw that copying follows the same patterns already known from site traffic. People copy most on weekdays, during working hours. The regions where the site is most popular account for the most copies: Asia with 33%, Europe with 30% and North America with 26%. And finally, 86% of the people copying are anonymous users (that is, they have zero reputation). Things got more interesting once we dug into who copies and what exactly they copy.

Does high reputation correlate with heavy copying?

To begin with, we wanted to check whether users with high reputation are the most active copiers.

The graph shows that most copying is done by users with zero reputation, that is, anonymous users, because anyone who creates an account immediately receives one reputation point. Some of these events probably come from users who have an account but were not logged in. Unfortunately, there is no way to verify this.

Since the bulk of our users have low reputation, raw totals per group are misleading, so let’s normalize the data. Instead of the total number of copies, let’s look at the number of copies per user and see how the average varies with reputation.
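The normalization described here can be sketched as follows. This is an illustration under assumptions, not the analysis actually run: the bucket boundaries are invented, and each copy event is assumed to carry some identifier that distinguishes one visitor from another, even an anonymous one.

```python
from collections import defaultdict


def reputation_bucket(reputation):
    """Map a reputation value to a coarse group (boundaries are illustrative)."""
    if reputation == 0:
        return "0 (anonymous)"
    if reputation < 100:
        return "1-99"
    if reputation < 1000:
        return "100-999"
    return "1000+"


def copies_per_user(events):
    """events: iterable of (user_id, reputation) pairs, one pair per copy."""
    copies = defaultdict(int)
    users = defaultdict(set)
    for user_id, reputation in events:
        bucket = reputation_bucket(reputation)
        copies[bucket] += 1          # total copies in the group
        users[bucket].add(user_id)   # distinct users in the group
    return {bucket: copies[bucket] / len(users[bucket]) for bucket in copies}


# Made-up sample: two anonymous visitors, one low-rep and one high-rep user.
sample = [("a", 0), ("a", 0), ("b", 0), ("c", 50),
          ("d", 5000), ("d", 5000), ("d", 5000)]
averages = copies_per_user(sample)
print(averages)
```

Dividing by the number of distinct users in each group is what keeps a very large group, such as anonymous visitors, from dominating simply because of its size.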

If you examine this visualization, a pattern emerges: as reputation grows, the number of copies per user declines. The correlation is there but not very strong, so I cannot say with certainty that either high-reputation or low-reputation users copy more. Developers who are still building their skills often have low reputation and tend to look for resources that speed up learning. As they accumulate knowledge, they build reputation and start working on problems that need carefully tailored solutions, which are not always found on Stack Overflow.

Are accepted answers copied more often?

The reasoning goes like this: if an answer was accepted, it is probably the best one, so it should be copied more eagerly. However, the statistics show that 52.4% of copies come from answers that were not accepted. Looking at averages tells a different story: an accepted answer gets seven copies per unique post, and an unaccepted one only five. So unaccepted answers generate more copies in total, but accepted answers are reused more intensively per post.
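Both statements can hold at once as long as unaccepted answers outnumber accepted ones. The toy calculation below uses invented numbers chosen only to reproduce the averages of seven and five; it is not the study’s data, and the function name is hypothetical.

```python
def summarize(copies_by_post):
    """copies_by_post: dict mapping post_id -> (is_accepted, copy_count).

    Assumes at least one accepted and one unaccepted post are present.
    """
    totals = {True: 0, False: 0}
    posts = {True: 0, False: 0}
    for is_accepted, count in copies_by_post.values():
        totals[is_accepted] += count
        posts[is_accepted] += 1
    all_copies = totals[True] + totals[False]
    return {
        "unaccepted_share": totals[False] / all_copies,
        "accepted_mean": totals[True] / posts[True],
        "unaccepted_mean": totals[False] / posts[False],
    }


# Invented data: two accepted answers, three unaccepted ones.
toy = {
    1: (True, 7), 2: (True, 7),
    3: (False, 5), 4: (False, 5), 5: (False, 5),
}
result = summarize(toy)
print(result)
```

Here the unaccepted answers account for 15 of 29 copies, a little over half, while each accepted answer is still copied more often on average.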

It is worth noting that some questions have no accepted answer at all. Take this answer, for example: 4,984 unique users voted for it, and it was copied 7,943 times during our study. But the asker never accepted it, or any other answer, perhaps because they have not been seen on the site since 2010. Many other helpful answers are in the same position.

Are highly rated posts copied more?

So accepted answers hold no advantage in total copies, but a high rating should surely have an effect, right? Let’s check.

As we can see, for answers the copies are spread fairly evenly across the groups from one to a thousand votes. For questions, however, most copying happens on posts rated from one to five. I suspect this is because people copy them to repost until they finally get a response.

As with users, the bulk of the posts on the site have a fairly low rating. To normalize, let’s look at how many copies are made per post.

Here you can clearly see that the number of copies per post rises with the rating. That makes sense: the community is more willing to pick up content that has already been rated highly.

Does anyone copy posts with a bad rating?

Well, what about those blue dots, which represent negatively rated posts? Why copy something that no one approves of at all? Well, let’s not jump to conclusions.

Look at this answer… Of all the answers with a negative rating, it collected the most copies: 288, with a rating of -2. If you read the text, you will notice that it says, more concisely, the same thing as the most popular answer, which has a rating of 29 and 493 copies in total. Even though the negatively rated answer did not come out ahead in number of copies, the “TL;DR” principle clearly worked in its favor.

What tags are copied from most often?

This is the question I most wanted answered. Unfortunately, given the scale of the study and the resources available, we could not break tag combinations down into their individual tags. For example, the figures for the html tag do not include posts tagged with the combination | html | css |.

Unsurprisingly, content was copied most often from the most popular and active tags on the site. Only one thing caught my eye: python appears in four of the top ten tag groups. Three of them relate directly to data analysis: | python | pandas |, | python | pandas | dataframe | and | python | matplotlib |. I have a soft spot for this topic myself, so I am very glad that so many people are learning these tools.

Top 10 Tags, Now With Copies Per Post

In addition to the tags with the highest total copy count, I wanted to find the tags with the highest copy-to-post ratio. I set a minimum threshold of ten posts and, as you can see, the more specific the tag combination, the more copies it collects per post.
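A copy-to-post ratio with a minimum-post threshold could be computed roughly like this. It is a sketch under assumptions: the function name and input shape are hypothetical, and whether the original analysis treated tag order as significant is not stated, so tags are sorted here to form the group key. The threshold of ten posts comes from the text.

```python
from collections import defaultdict

MIN_POSTS = 10  # minimum number of posts per tag group, as stated in the text


def copies_per_post_by_tag_group(posts, min_posts=MIN_POSTS):
    """posts: iterable of (tags, copy_count) pairs, one pair per post.

    The whole tag combination is the key, so ("html",) and ("css", "html")
    are counted as separate groups.
    """
    post_counts = defaultdict(int)
    copy_counts = defaultdict(int)
    for tags, copy_count in posts:
        group = tuple(sorted(tags))  # assumption: tag order does not matter
        post_counts[group] += 1
        copy_counts[group] += copy_count
    ratios = {
        group: copy_counts[group] / post_counts[group]
        for group in post_counts
        if post_counts[group] >= min_posts  # drop groups with too few posts
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up sample: the html+css group has only three posts and is filtered out.
sample = (
    [(("python", "pandas"), 8)] * 10
    + [(("html",), 2)] * 12
    + [(("html", "css"), 50)] * 3
)
ranking = copies_per_post_by_tag_group(sample)
print(ranking)
```

Without the threshold, a rarely used tag combination with a single heavily copied post would jump to the top of the ranking.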

Which posts were copied the most?

Now let’s move on to what I suspect many of you are curious about. Which post got the most copies?

Answer with a code block

I am glad to announce that the winner is the answer to the question “How to iterate over rows in a DataFrame in Pandas”, with 3,497 votes and 11,829 copies. It was posted in 2013 and still bails out thousands of people every week.

Plain text answer

Among content that contains no code, the leader is a post on the question “TypeError: this.getOptions is not a function [closed]”, with 218 votes and 1,570 copies. There is no way to check, but I assume people are copying the `sass-loader@10.1.1` snippet.

Code block question

Among questions, the leader is “How to create an HTML button that acts like a link?”, with 2,147 votes and 3,665 copies.

Plain text question

And finally, the most copied question without code turned out to be “Updates were rejected because the tip of your current branch is behind its remote counterpart”, with 322 votes and 261 copies. This one is tricky, because the text contains many git commands that are not formatted as code blocks, and those may well be what people copy. But since we did not store the copied text itself, no one will ever know.

Comments

It’s important to remember that Stack Overflow isn’t only about questions and answers. Sometimes one sensible comment is enough. Here are a couple that were copied especially often!

The first is the absolute leader among comments across the whole site, and the second is a dark horse: it collected only five votes but ranks sixth by number of copies.
