Identifying the Top 10 Most Popular Python Standard Libraries Using GitHub Commit Data
This article describes how to collect commit data from five popular Python GitHub repositories with PyDriller, parse source files into abstract syntax trees, extract standard library imports, compare pre-commit and post-commit usage, and visualize the ten most frequently used standard libraries.
Python is widely used in AI and data science, and its extensive standard library makes it accessible even to those without a software-engineering background. The goal of this study is to identify the ten most frequently used Python standard libraries by analyzing commit data from five well-known Python projects on GitHub.
Data collection is performed with the pydriller package, which efficiently extracts commit information. The following command installs the library:

    pip install pydriller

Five repositories (Django, Pandas, NumPy, Home‑Assistant, and system‑design‑primer) are mined for a one‑year period, and for each commit the source code before and after the change is stored in a DataFrame tf_source with columns commit_ID, before_Commit, and after_Commit.
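The collection step can be sketched roughly as follows. This is a minimal illustration, not the article's actual script: it assumes the PyDriller 2.x API (Repository, traverse_commits, modified_files), and the helper names collect_rows and mine_repository, as well as the dates in the usage comment, are hypothetical.

```python
def collect_rows(commits):
    """Flatten commits into one row per modified Python file.

    `commits` is any iterable of objects exposing `hash` and `modified_files`,
    as PyDriller's Commit objects do.
    """
    rows = []
    for commit in commits:
        for mod in commit.modified_files:
            if not mod.filename.endswith(".py"):
                continue  # skip documentation, configuration files, etc.
            rows.append({
                "commit_ID": commit.hash,
                # None when the file was newly added by the commit
                "before_Commit": mod.source_code_before,
                # None when the file was deleted by the commit
                "after_Commit": mod.source_code,
            })
    return rows


def mine_repository(url, since, to):
    """Collect rows for one repository between two datetime objects."""
    # Imported lazily so collect_rows stays usable without PyDriller installed.
    from pydriller import Repository

    repo = Repository(
        url,
        since=since,
        to=to,
        only_modifications_with_file_types=[".py"],
    )
    return collect_rows(repo.traverse_commits())


# Hypothetical usage (clones the repository, so it needs network access):
#   from datetime import datetime
#   import pandas as pd
#   rows = mine_repository("https://github.com/django/django",
#                          datetime(2020, 1, 1), datetime(2021, 1, 1))
#   tf_source = pd.DataFrame(rows)
```

Keeping the row-building logic separate from the PyDriller call makes it possible to check the flattening step with stand-in commit objects before mining a large repository.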
To identify imported standard libraries, source files are parsed into an abstract syntax tree (AST). A custom ast.NodeVisitor subclass collects the module names from import and from ... import statements:

    # Import libraries
    import ast

    # api_name (the standard-library module names to look for) and
    # file_contents (the list that collects matches) are assumed to be
    # defined elsewhere in the script.

    class FuncParser(ast.NodeVisitor):
        def visit_Import(self, node):
            # "import a, b" carries one alias per module, so check each one
            for alias in node.names:
                if alias.name in api_name:
                    file_contents.append(alias.name)
            self.generic_visit(node)

        def visit_ImportFrom(self, node):
            # node.module is None for relative imports such as "from . import x"
            if node.module in api_name:
                file_contents.append(node.module)
            self.generic_visit(node)

For each commit, the AST of the pre-commit and post-commit source files is traversed with FuncParser(), producing two token frequency tables (tokens_before and tokens_after). The difference between these tables reveals which libraries were added or removed by the commit:

    # fill_value=0 treats a library missing from one table as a count of zero
    diff = tokens_after.subtract(tokens_before, fill_value=0)
    # Keep only the rows whose count changed
    diff_token = diff[(diff.select_dtypes(include=['number']) != 0).any(axis=1)]
    diff_token = diff_token.fillna(0).abs().reset_index()

The differences are aggregated across all commits into a dictionary py_lib that holds the total count for each library:

    py_lib = {}
    for j in range(len(diff_token)):
        word = diff_token['index'][j].lower()
        # Add the change count; a library seen for the first time starts from zero
        py_lib[word] = py_lib.get(word, 0) + diff_token['token'][j]

The ten most frequent libraries are then extracted and displayed:

    from operator import itemgetter

    # Sort by count, largest first, and keep the first ten entries
    d = sorted(py_lib.items(), key=itemgetter(1), reverse=True)[:10]
    print(d)
    # Example result
    # [('warnings', 96.0), ('sys', 73.0), ('datetime', 28.0), ('test', 27.0), ('os', 22.0), ('collections', 18.0), ('io', 16.0), ('gc', 10.0), ('functools', 9.0), ('threading', 7.0)]

Finally, a word-cloud visualisation of library frequencies is generated using matplotlib and wordcloud:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Font size in the cloud scales with each library's count in py_lib
    wordcloud = WordCloud(background_color='black', max_font_size=50)
    wordcloud.generate_from_frequencies(frequencies=py_lib)
    plt.figure(figsize=(8, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

The analysis shows that warnings, sys, and datetime lead the top-10 list, which suggests that they are widely used across major Python projects.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.