I’ve parsed over 1,000 top machine learning GitHub profiles, and here’s what I learned.

When I searched GitHub for the keyword “machine learning”, I found 246,632 matching repositories. Since they are all related to this field, I expected their owners to be experts, or at least reasonably competent, in machine learning. So I decided to analyze these users’ profiles and share the results.

How I worked

Tools

I used three scraping tools:

  • Beautiful Soup to fetch the URLs of all repositories tagged “machine learning”. It is a Python library that makes scraping much easier.
  • PyGithub to retrieve information about users. This is another Python library, built for GitHub API v3, that lets you manage various GitHub resources (repositories, user profiles, organizations, etc.) from Python scripts (see the sketch after this list).
  • Requests to retrieve information about repositories and the links to contributor profiles.
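As a rough sketch of how PyGithub can pull basic user data (the access token and user name below are placeholders, and my actual script collects more fields):

from github import Github

# Hypothetical access token and user name, for illustration only
g = Github("YOUR_ACCESS_TOKEN")
user = g.get_user("llSourcell")

print(user.name, user.followers, user.following)

# Each repository object exposes stars, forks, description, language, etc.
for repo in user.get_repos():
    print(repo.full_name, repo.stargazers_count, repo.forks_count)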

Methods

I did not parse all of them, only the owners and the 30 most active contributors of the top 90 repositories that appeared in the search results.
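My script scraped the search pages with Beautiful Soup, but the same list of top repositories can also be sketched through the GitHub search API with Requests; this minimal example (query, pagination, and authentication are simplified) is only an illustration, not my exact code:

import requests

# Hypothetical request: the real script pages through more results and authenticates
url = "https://api.github.com/search/repositories"
params = {"q": "machine learning", "per_page": 30}
items = requests.get(url, params=params).json().get("items", [])

for repo in items:
    print(repo["full_name"], repo["stargazers_count"], repo["contributors_url"])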

After removing duplicates and the profiles of organizations such as Udacity, I got a list of 1,208 users. For each of them, I collected 20 key parameters.

new_profile.info()

The first 13 parameters were obtained from here.

The rest I derived from the user’s repositories (a sketch of how these can be aggregated follows this list):

  • total_stars: total stars across all repositories
  • max_star: maximum number of stars among all repositories
  • forks: total number of forks across all repositories
  • descriptions: descriptions of all the user’s repositories
  • contribution: number of contributions over the last year
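A sketch of how these repository-level parameters could be aggregated with PyGithub (the helper name and exact fields are illustrative; the contribution count comes from a separate page and is not shown here):

def repo_stats(user):
    '''Aggregate per-repository numbers for one user (illustrative helper).'''
    repos = list(user.get_repos())
    stars = [r.stargazers_count for r in repos]
    return {
        'total_stars': sum(stars),
        'max_star': max(stars, default=0),
        'forks': sum(r.forks_count for r in repos),
        'descriptions': [r.description for r in repos],
    }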

Visualizing data

Histograms

After cleaning the data, the most interesting stage came: data visualization. I used Plotly for this.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd            # for working with the scraped data
import plotly.express as px    # for plotting
import altair as alt           # for plotting
import datapane as dp          # for creating a report of the findings

# Sort profiles by follower count, most-followed first
top_followers = new_profile.sort_values(by='followers', axis=0, ascending=False)

fig = px.bar(top_followers,
             x='user_name',
             y='followers',
             hover_data=['followers'],
            )
fig.show()

Here is the result.

The histogram is a little awkward because it has a very long tail of users with fewer than 100 followers, so it is better to zoom in.
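One simple way to zoom in, assuming the same top_followers frame as above, is to plot only the first hundred users (the exact cutoff is an arbitrary choice here):

# Plot only the 100 most-followed users to cut off the long tail
fig = px.bar(top_followers.iloc[:100],
             x='user_name',
             y='followers',
             hover_data=['followers'])
fig.show()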

As you can see, llSourcell (Siraj Raval) has the most followers (36,261). The second most popular profile has roughly a third as many (12,682).

We can go further and find out that 1% of profiles hold 41% of all followers!

>>> top_n = int(len(top_followers) * 0.01)
>>> top_n
12
>>> sum(top_followers.iloc[0:top_n, :].loc[:, 'followers']) / sum(top_followers.followers)
0.41293075864408607

Next, we visualize total_stars, max_star, forks, and the other metrics, using a logarithmic scale for everything except contribution.

figs = [] # list to save all the plots and tables

features = ['followers',
            'following',
            'total_stars',
            'max_star',
            'forks',
            'contribution']

for col in features:
    top_col = new_profile.sort_values(by=col, axis=0, ascending=False)

    # Use a log scale on the y-axis for every feature except contribution
    log_y = False
    if col != 'contribution':
        log_y = True

    fig = px.bar(top_col,
                 x='user_name',
                 y=col,
                 hover_data=[col],
                 log_y=log_y
                )

    fig.update_layout({'plot_bgcolor': 'rgba(36, 83, 97, 0.06)'}) # change background color

    fig.show()

    figs.append(dp.Plot(fig))

The result looks like this.

The resulting picture is very close to a Zipf-law distribution. Zipf’s law is the empirical pattern originally observed for word frequencies in natural language: if all words are ordered by decreasing frequency of use, a word’s frequency is roughly inversely proportional to its rank. We see a similar dependence here.
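A quick way to eyeball this, assuming the new_profile frame from above, is a rank-versus-value plot on log-log axes: under Zipf’s law it should be roughly a straight line.

# Rank users by followers and plot rank vs. followers on log-log axes
ranked = new_profile['followers'].sort_values(ascending=False).reset_index(drop=True)
zipf_fig = px.scatter(x=ranked.index + 1, y=ranked,
                      log_x=True, log_y=True,
                      labels={'x': 'rank', 'y': 'followers'})
zipf_fig.show()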

Correlation

But what about the dependencies between the key metrics, and how strong are they? I used a scatter matrix to find out.

correlation = px.scatter_matrix(new_profile, dimensions=['forks', 'total_stars', 'followers',
                                 'following', 'max_star','contribution'],
                               title="Correlation between datapoints",
                               width=800, height=800)
correlation.show()
 
corr = new_profile.corr()
 
figs.append(dp.Plot(correlation))
figs.append(dp.Table(corr))
corr

The scatter matrix and the correlation table look like this.

The strongest positive correlations are between:

  • Maximum number of stars and total number of stars (0.939)
  • Forks and total stars (0.929)
  • The number of forks and the number of followers (0.774)
  • Followers and total stars (0.632)

Programming languages

To find out which programming languages are most common among the GitHub profile owners, I did some additional analysis.

# Collect languages from all repos of all users
languages = []
for language in list(new_profile['languages']):
    try:
        languages += language
    except TypeError:
        languages += ['None']

# Count the frequency of each language
from collections import Counter
occ = dict(Counter(languages))

# Keep only languages that appear more than 10 times
top_languages = [(language, frequency) for language, frequency in occ.items() if frequency > 10]
top_languages = list(zip(*top_languages))

language_df = pd.DataFrame(data = {'languages': top_languages[0],
                           'frequency': top_languages[1]})

language_df.sort_values(by='frequency', axis=0, inplace=True, ascending=False)

language = px.bar(language_df, y='frequency', x='languages',
      title="Frequency of languages")

figs.append(dp.Plot(language))

language.show()

Accordingly, the top 10 languages include:

  • Python
  • JavaScript
  • HTML
  • Jupyter Notebook
  • Shell, etc.

Location

To understand in which parts of the world the profile owners are located, I visualized the users’ locations. A location is specified in 31% of the analyzed profiles. For geocoding I used geopy.geocoders.Nominatim.

from geopy.geocoders import Nominatim
import folium

geolocator = Nominatim(user_agent="my_app")

locations = list(new_profile['location'])

# Extract latitudes and longitudes
lats = []
lons = []
exceptions = []

for loc in locations:
    try:
        location = geolocator.geocode(loc)
        lats.append(location.latitude)
        lons.append(location.longitude)
        print(location.address)
    except Exception:
        print('exception', loc)
        exceptions.append(loc)

print(len(exceptions)) # output: 17

# Remove the locations that could not be geocoded
location_df = new_profile[~new_profile.location.isin(exceptions)].copy()

location_df['latitude'] = lats
location_df['longitude'] = lons

Then, to build the map, I used Plotly’s scatter_geo.

# Visualize with Plotly's scatter_geo
m = px.scatter_geo(location_df, lat="latitude", lon='longitude',
                 color="total_stars", size="forks",
                 hover_data=['user_name','followers'],
                 title="Locations of Top Users")
m.show()
 
figs.append(dp.Plot(m))

The original map, with zoom, is available at this link.

Repository descriptions and user bios

Many users add descriptions to their repositories and also fill in their bio. To visualize all of this, I used the WordCloud library for Python.

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')


def process_text(features):
    '''Function to process texts'''

    features = [row for row in features if row != None]

    text = " ".join(features)

    # lowercase
    text = text.lower()

    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # remove stopwords
    stop_words = set(stopwords.words('english'))

    # tokenize
    tokens = word_tokenize(text)
    new_text = [i for i in tokens if not i in stop_words]

    new_text = " ".join(new_text)

    return new_text


def make_wordcloud(new_text):
    '''Function to make a wordcloud'''

    wordcloud = WordCloud(width=800, height=800,
                          background_color="white",
                          min_font_size=10).generate(new_text)

    fig = plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)

    plt.show()

    return fig
    
descriptions = []
for desc in new_profile['descriptions']:
    try:
        descriptions += desc
    except TypeError:
        # Skip profiles whose descriptions are missing
        pass

descriptions = process_text(descriptions)

cloud = make_wordcloud(descriptions)

figs.append(dp.Plot(cloud))

And the same for the bios.

bios = []
for bio in new_profile['bio']:
    # Keep only string bios and skip missing values
    if isinstance(bio, str):
        bios.append(bio)

text = process_text(bios)

cloud = make_wordcloud(text)

figs.append(dp.Plot(cloud))

As you can see, the keywords are quite consistent with what you would expect from machine learning specialists.
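Since every plot was also appended to figs as a Datapane block along the way, the whole analysis can be assembled into one shareable HTML report. With the Datapane versions from around the time of writing, that could look roughly like this (the API has changed between releases, so treat it as a sketch):

# Assemble all collected Datapane blocks into a single HTML report
report = dp.Report(*figs)
report.save(path='github_ml_profiles.html', open=True)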

Findings

The data was collected from the owners and contributors of the 90 repositories that best match the keyword “machine learning”. But there is no guarantee that everyone on the list is actually a machine learning expert.

Nevertheless, this article is a good example of how collected data can be cleaned and visualized. The results may well surprise you, and that is not strange: data science helps you apply your knowledge to analyze the world around you.

Well, if you want, you can fork the code from this article and do whatever you like with it; here is the repo.

Commentary on the article by Vyacheslav Arkhipov, Data Science Specialist at the AR startup Banuba and Curriculum Consultant at Skillbox Online University.

An interesting study showing all the steps of a statistical analysis of the “market” of GitHub machine learning profiles. One could analyze social networks, demand for goods, moviegoers’ preferences, and so on in a similar way.

I would like to draw your attention to two facts:

1) Despite the author’s assertion that a logarithmic scale is used to visualize the profile metrics, linear scales are in fact used. This can be seen from the listing above (it follows from it that the logarithmic scale was applied only to ‘contribution’). All the graphs look as if they were drawn on a linear scale.

However, the distribution of the ‘contribution’ metric itself really does resemble a Pareto distribution, known as the Zipf distribution when applied to word frequencies. The graphs of the other metrics do not look much like a Pareto distribution to me.

Here I would like to see Pearson’s chi-squared test used to assess how plausible it is that the data follow a particular distribution. Moreover, there is already a ready-made hypothesis: that it is a Pareto (Zipf) distribution.
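A sketch of such a check, assuming the followers column from the analysis above and a Pareto shape fitted with SciPy (the binning and the number of bins are arbitrary choices here):

import numpy as np
from scipy import stats

# Followers per profile; the Pareto support is strictly positive
followers = new_profile['followers'].values
followers = followers[followers > 0]

# Fit the Pareto shape parameter with the location fixed at zero
b, loc, scale = stats.pareto.fit(followers, floc=0)

# Pearson's chi-squared goodness-of-fit: observed vs. expected counts per bin
edges = np.logspace(np.log10(followers.min()), np.log10(followers.max()), 11)
observed, _ = np.histogram(followers, bins=edges)
expected = np.diff(stats.pareto.cdf(edges, b, loc=loc, scale=scale)) * len(followers)
expected *= observed.sum() / expected.sum()   # make the totals match exactly

chi2, p_value = stats.chisquare(observed, f_exp=expected, ddof=2)  # 2 fitted parameters
print(f"chi2 = {chi2:.1f}, p-value = {p_value:.4f}")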

2) It seems to me that it would be interesting to dig deeper into the correlation of the metrics. The author noted the metrics that are clearly highly correlated. But after a small transformation, the correlation coefficient can be turned into a statistic that follows Student’s t distribution. This means you can test the significance of the correlation coefficient under the null hypothesis of linear independence of the parameters. Given the large amount of data (degrees of freedom), the threshold for rejecting the null hypothesis will be quite low, and it will become apparent that many more parameter pairs have a small but statistically significant correlation.
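For a single pair of metrics this could look as follows; scipy.stats.pearsonr already returns the two-sided p-value, and the manual t-statistic shows the transformation mentioned above (the column pair is just an example):

import numpy as np
from scipy import stats

# Example pair of metrics; drop rows where either value is missing
pair = new_profile[['followers', 'following']].dropna()
n = len(pair)

r, p = stats.pearsonr(pair['followers'], pair['following'])
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)      # t-statistic with n-2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided p-value, should match p

print(f"r = {r:.3f}, t = {t:.2f}, p = {p:.3g}")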

And the next step would be to try to estimate the dimensionality: to understand how many parameters are enough to describe one data point, and to find these hidden parameters using principal component analysis.
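A sketch of that step with scikit-learn, assuming the same numeric columns used in the scatter matrix; standardization matters because the metrics live on very different scales:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cols = ['forks', 'total_stars', 'followers', 'following', 'max_star', 'contribution']
X = StandardScaler().fit_transform(new_profile[cols].fillna(0))

# Share of variance explained by each principal component
pca = PCA().fit(X)
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.2%}")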
