How Python Is Used: An Analysis Of GitHub Trends

In this article I will show you the results of a case study I created that looks at how Python is used based on stats retrieved from GitHub.

Python is an object-oriented language that keeps growing in popularity because it lets you create a wide variety of tools and applications, from automation tools to web apps and from data science to artificial intelligence applications. One factor that contributes to the versatility of Python is the number of third-party modules and frameworks available.

In this case study I will do a comparison between multiple programming languages that will show which languages are the most popular based on GitHub repository data.

Then we will look at trends related to Python modules and web frameworks.

And the best part is that we will automate the retrieval of data from GitHub using Python 🙂

Let’s get started!

Retrieve Data from GitHub With the Requests Module

We will start by writing a simple Python program to retrieve the number of code repositories on GitHub that match a specific search.

Below you can see the output of a search in GitHub for the word “python”:

How can we do this programmatically?

We can use the GitHub API and the Python requests module.

Here you can see how we can retrieve the same information in the screenshot above from the GitHub API:

https://api.github.com/search/repositories?q=python

To perform the search, we are using the /search/repositories endpoint of the API and we are passing the query string q=python.

The API returns a JSON object and the only thing we are interested in for this case study is the number of repositories returned by the search: the total_count attribute.
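
To make the shape of that JSON object concrete, here is a minimal sketch that parses a trimmed, hand-written sample of a search response. The values in the sample are made up for illustration; only the keys total_count, incomplete_results and items reflect what the API returns.

```python
import json

# Hand-written, heavily trimmed sample of a search response (illustrative values only).
sample_response = """
{
  "total_count": 2,
  "incomplete_results": false,
  "items": [
    {"full_name": "example-user/example-repo-1"},
    {"full_name": "example-user/example-repo-2"}
  ]
}
"""

data = json.loads(sample_response)

# total_count is the number of repositories matching the search,
# not the number of items included in this page of results.
repos_count = data["total_count"]
print("Number of repositories: {}".format(repos_count))
```

In a real response the items list holds one page of repositories, while total_count refers to the whole search.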

The following Python code gets the response from the API using the requests module and prints the value of the total_count:

import requests
   
url = "https://api.github.com/search/repositories?q=python"
data = requests.get(url).json()
repos_count = data['total_count']
print("Number of repositories: {}".format(repos_count))

[output]
Number of repositories: 1803899 

Nice!

Which Programming Language Is The Most Used?

Now, it’s time to find out which programming languages are the most used based on the number of results from GitHub.

To do that I have created a list that contains the programming languages we will be comparing. We will use this list to get the number of repository results from GitHub for each language.

import requests, time 

def search_github(keyword):
    url = "https://api.github.com/search/repositories?q={}".format(keyword)
    data = requests.get(url).json()
    repos_count = data['total_count']
    return repos_count

def retrieve_repositories_results(keywords):
    repos_results = {}

    for keyword in keywords:
        repos_count = search_github(keyword)
        print("{} repositories results found: {}".format(keyword, repos_count))
        repos_results[keyword] = repos_count
        time.sleep(3)

    return repos_results 

IMPORTANT: The 3-second sleep at every iteration of the for loop in the retrieve_repositories_results() function is needed because GitHub limits the number of requests allowed in a given period of time. If requests start failing, increase the sleep time.
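
Instead of tuning a fixed sleep by hand, you could retry with a growing delay whenever the API reports rate limiting. The following is only a sketch under a few assumptions: call_with_backoff() and its fetch argument are hypothetical names that are not part of the program in this article, and a 403 or 429 status code is treated as "rate limited". The demo uses a fake fetch so it runs without any network access.

```python
import time

def call_with_backoff(fetch, max_attempts=4, base_delay=3.0, sleep=time.sleep):
    """Call fetch() again, waiting longer each time, while it reports rate limiting.

    fetch is a hypothetical zero-argument callable returning (status_code, payload).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status_code, payload = fetch()
        # 403 and 429 are treated as "rate limited"; anything else is returned as-is.
        if status_code not in (403, 429):
            return status_code, payload
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # double the wait before the next attempt
    return status_code, payload

# Demo with a fake fetch that is rate limited twice before succeeding.
responses = iter([(403, None), (403, None), (200, {"total_count": 42})])
waits = []  # records the delays instead of really sleeping
status_code, payload = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(status_code, payload["total_count"], waits)
```

With the requests module, fetch could be a small function that calls requests.get(url) and returns response.status_code together with response.json().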

The function retrieve_repositories_results() takes a list as a parameter, in this case a list of programming languages, and for each of them retrieves the number of repository results from GitHub.

The results for all the languages are then stored in the Python dictionary repos_results. The keys of the dictionary are the programming languages and the values are the number of repository results for each language.

We can call the retrieve_repositories_results() function using the code below:

languages = ['Python', 'Java', 'Ruby', 'Javascript', 'PHP', 'Objective-C', 'Golang', 'Bash', 'Rust', 'Powershell']
languages_results = retrieve_repositories_results(languages) 

This is the output we get back:

Python repositories results found: 1803956
Java repositories results found: 1704611
Ruby repositories results found: 339333
Javascript repositories results found: 879907
PHP repositories results found: 658894
Objective-C repositories results found: 24158
Golang repositories results found: 153858
Bash repositories results found: 94572
Rust repositories results found: 113532
Powershell repositories results found: 43552 
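
The output follows the order of the input list, which makes it hard to see the ranking at a glance. As a small sketch, the dictionary can be sorted by its values before printing. The numbers below are copied from the output above so the example runs without calling the API.

```python
# Results copied from the output above.
languages_results = {
    'Python': 1803956, 'Java': 1704611, 'Ruby': 339333, 'Javascript': 879907,
    'PHP': 658894, 'Objective-C': 24158, 'Golang': 153858, 'Bash': 94572,
    'Rust': 113532, 'Powershell': 43552,
}

# Sort the (language, count) pairs by count, highest first.
ranking = sorted(languages_results.items(), key=lambda item: item[1], reverse=True)

for position, (language, repos_count) in enumerate(ranking, start=1):
    print("{}. {}: {}".format(position, language, repos_count))
```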

Then we can use the Pandas module to print this data as a table. Add an import for pandas and a function that prints the Pandas dataframe created from our dictionary.

import requests, time
import pandas as pd

[ No changes required for the search_github() and retrieve_repositories_results() functions ]

def print_repos_results(repos_results):
    df = pd.DataFrame(repos_results, index=['Repository results'])
    print(df)

languages = ['Python', 'Java', 'Ruby', 'Javascript', 'PHP', 'Objective-C', 'Golang', 'Bash', 'Rust', 'Powershell']
languages_results = retrieve_repositories_results(languages)
print_repos_results(languages_results)

I will be using Jupyter Notebook to output a table that contains all the stats.

That’s cool, but how can we make these results easier to read?

Creating a Bar Chart with Matplotlib

We will use the Matplotlib library to create a bar chart of the data we have collected so far.

To generate bars with random colours we will use the Python random module.

Define the following functions to generate random colours and draw the graph:

import random
import matplotlib.pyplot as plt

def generate_random_colors(number_of_colors):
    colors = []

    for x in range(number_of_colors):
        rgb = (random.random(), random.random(), random.random())
        colors.append(rgb)

    return colors

def print_graph(repos_results, graph_type, title):
    # Convert the dictionary views to lists so that Matplotlib can plot them
    keywords = list(repos_results.keys())
    results = list(repos_results.values())

    plt.figure(figsize=(9, 3))
    colors = generate_random_colors(len(keywords))

    if graph_type == "bar":
        plt.bar(keywords, results, color=colors)
    else:
        plt.scatter(keywords, results, color=colors)

    plt.suptitle(title)
    plt.show() 

To see the graph we will call the print_graph() function:

print_graph(languages_results, "bar", "Programming Languages") 

You can see that, based on the number of GitHub search results, Python is the most popular programming language, followed by Java.

It’s very interesting to see the difference between Python / Java and other programming languages. It can give you a rough idea of today’s programming trends.

You can update the list of the programming languages passed to our program to get stats related to any languages you are interested in.
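
The generate_random_colors() function picks new colours on every run, so the same data never produces the same chart twice. If you prefer reproducible charts, one option is a seeded random generator. This is a sketch only: generate_seeded_colors() and its seed parameter are hypothetical names, not part of the original program.

```python
import random

def generate_seeded_colors(number_of_colors, seed=0):
    """Return a list of RGB tuples that is identical for the same seed."""
    # A private generator leaves the global random state untouched.
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random()) for _ in range(number_of_colors)]

colors = generate_seeded_colors(3, seed=42)
print(colors)
```

The returned list can be passed to the color argument of plt.bar() or plt.scatter() in the same way as the output of generate_random_colors().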

What Are The Most Popular Python Modules?

In the next part of this research we focus on Python.

We want to know which Python modules are the most popular.

The list of modules used in this case study is just an example and it can contain as many modules as you want.

The idea is to have enough data to understand which Python modules might be worth learning to get up to speed with market trends.

This time we will apply a small change to the search done via the GitHub API. We will pass a search term in the same way we have done before and we will also specify the language we are interested in:

https://api.github.com/search/repositories?q=pandas+language:python

Let’s update our code to make it more generic, so it can handle searches with and without filtering based on the language.

Update the search_github() and retrieve_repositories_results() functions to handle an optional parameter called language_filter:

def search_github(keyword, language_filter=None):
    if language_filter:
        url = "https://api.github.com/search/repositories?q={}+language:{}".format(keyword, language_filter)
    else:
        url = "https://api.github.com/search/repositories?q={}".format(keyword)

    data = requests.get(url).json()
    repos_count = data['total_count']
    return repos_count

def retrieve_repositories_results(keywords, language_filter=None):
    repos_results = {}

    for keyword in keywords:
        repos_count = search_github(keyword, language_filter)
        print("{} repositories results found: {}".format(keyword, repos_count))
        repos_results[keyword] = repos_count
        time.sleep(3)

    return repos_results 
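
The two branches above build the URL with string formatting, which works for simple keywords. A keyword that contains special characters needs escaping: in a query string a literal + is read as a space, so a search for C++ would not behave as expected. A minimal sketch of one way to handle this with the standard library is shown below; build_search_url() is a hypothetical helper, not part of the original program.

```python
from urllib.parse import urlencode

def build_search_url(keyword, language_filter=None):
    """Build a repository search URL, optionally restricted to one language."""
    query = keyword
    if language_filter:
        # A space separates the search term from the language qualifier.
        query = "{} language:{}".format(keyword, language_filter)
    # urlencode() escapes the query: spaces become '+' and ':' becomes '%3A'.
    return "https://api.github.com/search/repositories?" + urlencode({"q": query})

print(build_search_url("pandas", "python"))
print(build_search_url("python"))
```

The requests module performs similar escaping when the query is passed through its params argument, for example requests.get(url, params={"q": query}).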

And now let’s see which Python modules are the most used…

modules = ['Pandas', 'NumPy', 'Tkinter', 'Pytest', 'Celery', 'Matplotlib', 'SciPy', 'lxml']
modules_results = retrieve_repositories_results(modules, 'Python')

[output]
Pandas repositories results found: 11559
NumPy repositories results found: 11935
Tkinter repositories results found: 20600
Pytest repositories results found: 6894
Celery repositories results found: 4336
Matplotlib repositories results found: 8212
SciPy repositories results found: 1786
lxml repositories results found: 514 

And the winner is…

…Tkinter!

Notice also how similar the usage of the Pandas and NumPy modules is.

Obviously this is a very limited list, but it’s a starting point to show you how to retrieve this type of data.

What Is The Most Popular Python Web Framework?

Let’s do a similar analysis with a list of Python web frameworks to understand which ones are the most commonly used.

The good news is that we don’t need to change anything in our code. We just have to provide a list of frameworks and pass it to the existing functions to:

  • Retrieve the number of repositories in GitHub for the framework name and the Python programming language.
  • Draw a graph that summarises the data (this time we will generate a scatter plot instead of a bar chart).

frameworks = ['Django', 'Flask', 'Tornado', 'CherryPy', 'web2py', 'Pylons', 'AIOHTTP', 'Bottle', 'Falcon']
frameworks_results = retrieve_repositories_results(frameworks, 'Python')

[output]
Django repositories results found: 251326
Flask repositories results found: 114350
Tornado repositories results found: 4603
CherryPy repositories results found: 561
web2py repositories results found: 915
Pylons repositories results found: 157
AIOHTTP repositories results found: 1694
Bottle repositories results found: 2323
Falcon repositories results found: 1210 

And here is the scatter plot that represents the data:

print_graph(frameworks_results, "scatter", "Python Frameworks") 

You can see how popular Django and Flask are compared to other web application frameworks.
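
To put a number on that gap, here is a short sketch that converts the counts into percentages of the total for this list. The numbers are copied from the output above, so the shares only describe the nine frameworks searched here, not Python web development as a whole.

```python
# Results copied from the output above.
frameworks_results = {
    'Django': 251326, 'Flask': 114350, 'Tornado': 4603, 'CherryPy': 561,
    'web2py': 915, 'Pylons': 157, 'AIOHTTP': 1694, 'Bottle': 2323, 'Falcon': 1210,
}

total = sum(frameworks_results.values())

# Percentage of all results in this list that belongs to each framework.
shares = {name: 100 * count / total for name, count in frameworks_results.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print("{:<10} {:5.1f}%".format(name, share))
```

With these numbers, Django and Flask together account for roughly 97% of the results in the list.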

I also want to see the worldwide trend for Django and Flask over the past 5 years. To do that we can use Google Trends.

You can see that Google Trends confirms that Django is more popular than Flask. At the same time, it looks like there has been an increasing interest in Flask over time.

It’s also interesting to see how the popularity of both frameworks seems to be going down recently.

Conclusion

In this case study we have used real data coming from GitHub to compare the popularity of:

  • Programming languages.
  • Python modules.
  • Python web frameworks.

We have seen that Python is the most popular language, closely followed by Java.

Among the modules we searched for, Tkinter is the most used, and Django is the top web framework.

To pull and graph the data we have used the requests module, the Pandas library and the Matplotlib library.

You can download the full code for this case study here.
