This article shows the results of a case study that researches how popular Python is as a programming language based on usage across GitHub.
Python is an object-oriented language that is becoming more and more popular among developers because it can be used to build a wide variety of tools and applications, from automation tools to web apps and from Data Science to Artificial Intelligence applications. One component that contributes to the versatility of Python is the number of third-party modules and frameworks available.
In this case study we will compare multiple programming languages to see which ones are the most popular based on GitHub repository data.
Then we will look at trends related to Python modules and web frameworks.
The best part is that we will automate the retrieval of data from GitHub using Python 😀
Let’s start analyzing how languages are used on GitHub!
Retrieve Data from GitHub API Using the Python Requests Module
We will start by writing a simple Python program to retrieve the number of code repositories in GitHub that match a specific search.
Below you can see the output of a search in GitHub for the word “python”:
How can we do this search programmatically?
We can use the GitHub API and the Python requests module.
Here is how you can retrieve the same information shown in the screenshot above from the GitHub API:
https://api.github.com/search/repositories?q=python
To perform the search, we are using the /search/repositories endpoint of the API and we are passing the query string q=python.
The API returns a JSON object and the only thing we are interested in for this case study is the number of repositories returned by the search: the total_count attribute.
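To make the shape of the response more concrete, here is a minimal sketch that works on a trimmed sample payload instead of a live request. The field names (total_count, incomplete_results, items) follow the GitHub search response, while the values and the extract_total_count() helper are made up for illustration.

```python
import json

# Trimmed, illustrative payload: the field names follow the GitHub search
# response, but the values are invented for this example.
sample_response = """
{
  "total_count": 3,
  "incomplete_results": false,
  "items": [
    {"full_name": "example-user/python-demo"},
    {"full_name": "example-user/learn-python"}
  ]
}
"""

def extract_total_count(payload):
    """Return the total_count field from a decoded search response."""
    # Fail with a clear message when the field is missing, for example
    # when the API returns an error object instead of search results.
    if "total_count" not in payload:
        raise ValueError("Response has no total_count: {}".format(payload))
    return payload["total_count"]

data = json.loads(sample_response)
print("Number of repositories: {}".format(extract_total_count(data)))
```

Checking for the field explicitly is a design choice: it turns an unexpected response into a readable error rather than a bare KeyError.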
The following Python code gets the response from the API using the requests module and prints the value of total_count:
import requests
url = "https://api.github.com/search/repositories?q=python"
data = requests.get(url).json()
repos_count = data['total_count']
print("Number of repositories: {}".format(repos_count))
[output]
Number of repositories: 1803899
Nice, it works!
Which Ones Are the Most Popular Languages on GitHub?
Now, it’s time to find out which programming languages are the most popular based on the number of results from GitHub.
To do that let’s create a list that contains the programming languages we will be comparing. We will use this list to get the number of repository results from GitHub for each language.
import requests, time

def search_github(keyword):
    url = "https://api.github.com/search/repositories?q={}".format(keyword)
    data = requests.get(url).json()
    repos_count = data['total_count']
    return repos_count

def retrieve_repositories_results(keywords):
    repos_results = {}
    for keyword in keywords:
        repos_count = search_github(keyword)
        print("{} repositories results found: {}".format(keyword, repos_count))
        repos_results[keyword] = repos_count
        time.sleep(3)
    return repos_results
IMPORTANT: The 3-second sleep at every iteration of the for loop in the retrieve_repositories_results() function is needed because GitHub limits the number of requests allowed in a given period. If requests fail, increase the sleep time.
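Instead of tuning the sleep time by hand, one option is to retry a failed call with a growing delay. The sketch below is not part of the original program: call_with_backoff() and its parameters (retries, base_delay, sleep) are hypothetical names, and it assumes that a rate-limited request surfaces as an exception, which is likely with search_github() because the error response would not contain total_count.

```python
import time

def call_with_backoff(func, retries=3, base_delay=3, sleep=time.sleep):
    """Call func(), retrying with a growing delay when it raises."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            # Give up once the retry budget is exhausted.
            if attempt == retries:
                raise
            # Wait 3, 6, 12... seconds: base_delay doubles on each attempt.
            sleep(base_delay * 2 ** attempt)

# Hypothetical usage with the search_github() function defined above:
# repos_count = call_with_backoff(lambda: search_github("Python"))
```

The sleep function is passed as a parameter so the retry logic can be tried out without actually waiting.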
The function retrieve_repositories_results() takes a list as a parameter, in this case, a list of programming languages, and for each of them retrieves the number of repository results from GitHub.
The results for all the languages are then stored in the Python dictionary repos_results. The keys of the dictionary are the programming languages and the values are the number of repository results for each language.
You can call the function retrieve_repositories_results() using the code below:
languages = ['Python', 'Java', 'Ruby', 'Javascript', 'PHP', 'Objective-C', 'Golang', 'Bash', 'Rust', 'Powershell']
languages_results = retrieve_repositories_results(languages)
This is the output you get back:
Python repositories results found: 1803956
Java repositories results found: 1704611
Ruby repositories results found: 339333
Javascript repositories results found: 879907
PHP repositories results found: 658894
Objective-C repositories results found: 24158
Golang repositories results found: 153858
Bash repositories results found: 94572
Rust repositories results found: 113532
Powershell repositories results found: 43552
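Because the results are stored in a dictionary, ranking them takes only a few lines. Here is a minimal sketch that uses the counts from the output above; rank_results() is a hypothetical helper and not part of the original program.

```python
# Counts copied from the output above.
languages_results = {
    'Python': 1803956, 'Java': 1704611, 'Ruby': 339333,
    'Javascript': 879907, 'PHP': 658894, 'Objective-C': 24158,
    'Golang': 153858, 'Bash': 94572, 'Rust': 113532, 'Powershell': 43552,
}

def rank_results(repos_results):
    """Return (keyword, count) pairs sorted from most to fewest results."""
    return sorted(repos_results.items(), key=lambda item: item[1], reverse=True)

# Print a numbered ranking, starting from the keyword with the most results.
for position, (keyword, count) in enumerate(rank_results(languages_results), start=1):
    print("{}. {}: {}".format(position, keyword, count))
```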
Then you can use the Pandas module to print this data as a table. Add an import statement for Pandas and a function that prints the Pandas dataframe created from our dictionary.
import requests, time
import pandas as pd

[ No changes required for the search_github() and retrieve_repositories_results() functions ]

def print_repos_results(repos_results):
    df = pd.DataFrame(repos_results, index=['Repository results'])
    print(df)

languages = ['Python', 'Java', 'Ruby', 'Javascript', 'PHP', 'Objective-C', 'Golang', 'Bash', 'Rust', 'Powershell']
languages_results = retrieve_repositories_results(languages)
print_repos_results(languages_results)
We will be using Jupyter Notebook to output a table that contains all the stats.
That’s cool, but how can we make these results easier to read?
Creating a Bar Chart with Matplotlib to Show the Most Popular Programming Languages
We will use the Matplotlib library to create a bar chart of the data we have collected so far.
To generate bars with random colors you can use Python’s random module. Define the following functions to generate random colors and draw the graph:
import random
import matplotlib.pyplot as plt

def generate_random_colors(number_of_colors):
    # Build one random RGB tuple (values between 0 and 1) per bar or point
    colors = []
    for x in range(number_of_colors):
        rgb = (random.random(), random.random(), random.random())
        colors.append(rgb)
    return colors

def print_graph(repos_results, graph_type, title):
    # Convert the dictionary views to lists before passing them to Matplotlib
    keywords = list(repos_results.keys())
    results = list(repos_results.values())
    plt.figure(figsize=(9, 3))
    colors = generate_random_colors(len(keywords))
    if graph_type == "bar":
        plt.bar(keywords, results, color=colors)
    else:
        plt.scatter(keywords, results, color=colors)
    plt.suptitle(title)
    plt.show()
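Random colors change every time the program runs. If you prefer a graph that looks the same on every run, one option is to seed the random generator. This is a minimal sketch: generate_seeded_colors() and its seed parameter are hypothetical and not part of the original code.

```python
import random

def generate_seeded_colors(number_of_colors, seed=42):
    """Return a list of RGB tuples that is identical for a given seed."""
    # A private Random instance keeps the global random state untouched.
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random())
            for _ in range(number_of_colors)]

colors = generate_seeded_colors(3)
print(colors)
```

The returned list has the same structure as the colors used by the graph function, so it can be passed to the color argument in the same way.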
To see the graph you can call the print_graph() function:
print_graph(languages_results, "bar", "Programming Languages")
You can see that Python returns the highest number of repository results, followed by Java.
It’s very interesting to see the difference between Python / Java and other programming languages. It gives you a rough idea of programming trends at this moment in time.
You can update the list of the programming languages passed to our program to get stats related to any languages you are interested in.
What Are The Most Popular Python Modules on GitHub?
In the next part of this research, we will focus on Python. We want to know which Python modules are the most popular.
The list of modules used in this case study is just an example and it can contain as many modules as you want.
The idea is to have enough data to understand which Python modules might be worth learning to keep up with market trends.
This time we will apply a small change to the search done via the GitHub API. We will pass a search term in the same way we have done before and we will also specify the language we are interested in:
https://api.github.com/search/repositories?q=pandas+language:python
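As an alternative to formatting the URL by hand, the query string can be assembled with urlencode() from the standard library, which also takes care of escaping special characters. This is only a sketch: build_search_url() and SEARCH_URL are hypothetical names, and note that urlencode() writes the colon as %3A, which is the percent-encoded form of the same query.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"

def build_search_url(keyword, language_filter=None):
    """Build the search URL, optionally restricted to one language."""
    query = keyword
    if language_filter:
        # The language qualifier is separated from the keyword by a space,
        # which urlencode() turns into a plus sign.
        query = "{} language:{}".format(keyword, language_filter)
    return "{}?{}".format(SEARCH_URL, urlencode({"q": query}))

print(build_search_url("pandas", "python"))
print(build_search_url("python"))
```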
Let’s update our code to make it more generic, so it can handle searches with and without filtering based on the language.
Update the search_github() and retrieve_repositories_results() Python functions to handle an optional parameter called language_filter:
def search_github(keyword, language_filter=None):
    if language_filter:
        url = "https://api.github.com/search/repositories?q={}+language:{}".format(keyword, language_filter)
    else:
        url = "https://api.github.com/search/repositories?q={}".format(keyword)
    data = requests.get(url).json()
    repos_count = data['total_count']
    return repos_count

def retrieve_repositories_results(keywords, language_filter=None):
    repos_results = {}
    for keyword in keywords:
        repos_count = search_github(keyword, language_filter)
        print("{} repositories results found: {}".format(keyword, repos_count))
        repos_results[keyword] = repos_count
        time.sleep(3)
    return repos_results
And now let’s see which Python modules are among the most used:
modules = ['Pandas', 'NumPy', 'Tkinter', 'Pytest', 'Celery', 'Matplotlib', 'SciPy', 'lxml']
modules_results = retrieve_repositories_results(modules, 'Python')
Pandas repositories results found: 11559
NumPy repositories results found: 11935
Tkinter repositories results found: 20600
Pytest repositories results found: 6894
Celery repositories results found: 4336
Matplotlib repositories results found: 8212
SciPy repositories results found: 1786
lxml repositories results found: 514
And the winner is Tkinter!
Notice also how similar the usage of the Pandas and NumPy modules is.
This is a short list of modules, but it’s a starting point to show you how to retrieve this type of data.
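To compare the modules more easily, you can turn the raw counts into percentages of the total. The sketch below uses the counts from the output above; compute_shares() is a hypothetical helper and not part of the original program.

```python
# Counts copied from the output above.
modules_results = {
    'Pandas': 11559, 'NumPy': 11935, 'Tkinter': 20600, 'Pytest': 6894,
    'Celery': 4336, 'Matplotlib': 8212, 'SciPy': 1786, 'lxml': 514,
}

def compute_shares(repos_results):
    """Return each keyword's share of the total results as a percentage."""
    total = sum(repos_results.values())
    return {keyword: 100 * count / total
            for keyword, count in repos_results.items()}

for keyword, share in compute_shares(modules_results).items():
    print("{}: {:.1f}%".format(keyword, share))
```

The percentages are relative to the modules in this list only, so they change if you add or remove modules.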
What is The Most Popular Python Web Framework?
Let’s do a similar analysis with a list of Python web frameworks to understand which ones are the most commonly used.
The good news is that we don’t need to change anything in our code. We just have to provide a list of frameworks and pass it to the existing functions to:
- Retrieve the number of repositories in GitHub for the framework name and the Python programming language.
- Draw a graph that summarises the data (this time we will generate a scatter plot instead of a bar chart).
frameworks = ['Django', 'Flask', 'Tornado', 'CherryPy', 'web2py', 'Pylons', 'AIOHTTP', 'Bottle', 'Falcon']
frameworks_results = retrieve_repositories_results(frameworks, 'Python')
Django repositories results found: 251326
Flask repositories results found: 114350
Tornado repositories results found: 4603
CherryPy repositories results found: 561
web2py repositories results found: 915
Pylons repositories results found: 157
AIOHTTP repositories results found: 1694
Bottle repositories results found: 2323
Falcon repositories results found: 1210
Here is the scatter plot that represents the data:
print_graph(frameworks_results, "scatter", "Python Frameworks")
You can see how popular Django and Flask are compared to other web application frameworks.
Let’s also analyze the trends for Django and Flask in the past 5 years worldwide. To do that you can use Google Trends.
You can see that Google Trends confirms that Django is more popular than Flask. At the same time, it looks like there has been an increasing interest in Flask over time.
It’s also interesting to see how the popularity of both frameworks seems to be going down recently.
Conclusion
In this case study we have used real GitHub statistics to compare the popularity of:
- Programming languages
- Python modules
- Python web frameworks
GitHub language trends, at the time this data was retrieved, suggest that:
- Python is the top language (followed by Java)
- Tkinter is the most used module among the ones we selected
- Django is the top web framework among the ones we selected
To retrieve and visualize the data we have used the requests module, the Pandas module, and the Matplotlib library.
You can download the full code for this case study here.
Claudio Sabato is an IT expert with over 15 years of professional experience in Python programming, Linux Systems Administration, Bash programming, and IT Systems Design. He is a professional certified by the Linux Professional Institute.
With a Master’s degree in Computer Science, he has a strong foundation in Software Engineering and a passion for robotics with Raspberry Pi.