How to Build a GitHub Code Leak Detector with Python – Real‑World Security Monitoring
This tutorial walks you through creating a Python‑based GitHub monitoring tool that logs in, crawls code search results for sensitive keywords, extracts repository details, writes findings to CSV, and sends email alerts, providing a practical approach to detecting accidental source‑code leaks.
0×00 Background
GitHub is a popular hunting ground for both security teams and attackers, because many developers unintentionally publish code there that exposes their employers to security risks.
For example, code may contain sensitive information such as usernames, passwords, database credentials, internal IP addresses, or even personal details. Monitoring GitHub for such leaks is therefore necessary. The existing open‑source tools did not fit our needs, so we built a custom one.
0×01 Roll Up Your Sleeves
Life is short, I use Python!
Python’s powerful libraries, concise syntax and rapid development make it ideal for this project.
Principle and Steps
This tool does not rely on a GitHub search API. Instead, we use a web crawler to fetch the search pages, parse the results, and extract the needed information.
The tool has four parts: logging in to GitHub, querying keyword results, sending email alerts, and reading the configuration file.
Development Environment and Python Libraries
Environment: macOS 10.12.6, Python 3.6.5
Libraries: requests, lxml, csv, tqdm, email, smtplib, configparser, time
0×02 Step Analysis
1. Login to GitHub
Login requires a POST request to https://github.com/session with parameters including authenticity_token, login and password. The token is extracted from the login page using XPath.
import requests
from lxml import etree

def login_github(username, password):
    """Log in to GitHub and return the session, or None on failure."""
    login_url = 'https://github.com/login'
    session_url = 'https://github.com/session'
    try:
        s = requests.session()
        resp = s.get(login_url).text
        dom_tree = etree.HTML(resp)
        # xpath() returns a list of matches; use the first token on the page
        key = dom_tree.xpath('//input[@name="authenticity_token"]/@value')[0]
        user_data = {
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': key,
            'login': username,
            'password': password
        }
        s.post(session_url, data=user_data)
        s.get('https://github.com/settings/profile')
        return s
    except Exception:
        print('Exception, check network and credentials')
        return None
2. Query Keyword and Render Results
After login, construct a search URL like https://github.com/search?p={page}&q={keyword}&type=Code, fetch the page, and parse repository URLs, usernames, upload times, and filenames using XPath.
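The URL construction described above can be sketched as follows. This is a minimal illustration rather than the tool's actual code: the helper name `build_search_urls` and the `max_pages` parameter are hypothetical, and only the URL pattern comes from the text.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://github.com/search'

def build_search_urls(keyword, max_pages=5):
    """Return one code-search URL per result page (hypothetical helper)."""
    urls = []
    for page in range(1, max_pages + 1):
        # urlencode escapes spaces and special characters in the keyword
        query = urlencode({'p': page, 'q': keyword, 'type': 'Code'})
        urls.append(f'{SEARCH_URL}?{query}')
    return urls
```

Each URL would then be fetched with the logged-in session and handed to the XPath parsing step. Encoding the keyword matters once it contains spaces or characters such as `@`.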
# Example snippet: extract repository URLs, users, upload times and filenames
urls = dom_tree_code.xpath('//div[@class="d-inline-block col-10"]/a[2]/@href')
users = dom_tree_code.xpath('//a[@class="text-bold"]/text()')
upload_times = dom_tree_code.xpath('//relative-time/text()')
filenames = dom_tree_code.xpath('//div[@class="d-inline-block col-10"]/a[2]/text()')
3. Email Alert
The tool sends an email with the list of leaked repositories. The email body includes the matched payload, URL, and code snippet.
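One possible way to fill in the `send_warning` function outlined in this section is sketched here, using only the standard library. The `build_warning` helper, the subject line, and the use of port 25 without TLS are assumptions for illustration; many SMTP providers require `smtplib.SMTP_SSL` or `starttls()` instead.

```python
import smtplib
from email.mime.text import MIMEText

def build_warning(sender, receivers, content, subject='GitHub leak alert'):
    """Build a plain-text MIME message (hypothetical helper)."""
    msg = MIMEText(content, 'plain', 'utf-8')
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = ', '.join(receivers)
    return msg

def send_warning(host, username, password, sender, receivers, content):
    """Send the alert over SMTP; returns True on success, False otherwise."""
    msg = build_warning(sender, receivers, content)
    try:
        # Port 25 is an assumption; adjust to your mail server
        with smtplib.SMTP(host, 25, timeout=30) as smtp:
            smtp.login(username, password)
            smtp.sendmail(sender, receivers, msg.as_string())
        return True
    except (smtplib.SMTPException, OSError) as exc:
        print(f'Failed to send alert: {exc}')
        return False
```

Separating message construction from delivery makes the message body easy to check without a mail server.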
def send_warning(host, username, password, sender, receivers, content):
    # Build MIME message and send via SMTP
    ...
4. Configuration File Reading
A simple INI file stores the keyword, GitHub credentials, email settings, and custom payloads. The main function reads this file and passes the values to the hunter function.
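Reading such a file could look roughly like this with the standard `configparser` module. The `load_config` helper and the shape of its return value are hypothetical; the section names match the sample configuration in this section.

```python
import configparser

def load_config(path):
    """Read the INI file and return the values the tool needs (hypothetical helper)."""
    config = configparser.ConfigParser()
    # read() silently skips missing files and returns the list it parsed
    if not config.read(path, encoding='utf-8'):
        raise FileNotFoundError(f'Cannot read config file: {path}')
    return {
        'keyword': config.get('KEYWORD', 'keyword'),
        'email': dict(config['EMAIL']),
        # every value in [PAYLOADS] is one string to look for in the results
        'payloads': list(config['PAYLOADS'].values()),
    }
```

If a password contains a `%` character, the parser's default interpolation will reject it; creating the parser with `configparser.ConfigParser(interpolation=None)` avoids that.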
[KEYWORD]
keyword = your main keyword here
[EMAIL]
host = smtp.example.com
user = user@example.com
password = your_password
[PAYLOADS]
p1 = password
p2 = username
0×03 Monitoring Result
1. Run Output
2. Email Alert
0×04 Summary
The tool first searches GitHub with a main keyword (e.g., company domain, email, name), then scans the results for user‑defined payloads such as password, username, database, etc. Combined with cron, it can run daily and send alerts. The full source code is available on GitHub.
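The payload scan and CSV output summarized above might be sketched like this. The function names, the `(url, snippet)` input shape, and the CSV column names are assumptions for illustration, not the tool's actual code.

```python
import csv

def find_payloads(snippet, payloads):
    """Return the payloads that occur in a code snippet (case-insensitive)."""
    lowered = snippet.lower()
    return [p for p in payloads if p.lower() in lowered]

def scan_results(results, payloads):
    """results: iterable of (url, snippet) pairs taken from the search pages."""
    rows = []
    for url, snippet in results:
        for payload in find_payloads(snippet, payloads):
            rows.append([payload, url, snippet])
    return rows

def write_findings(rows, out_file):
    """Write findings as CSV to an open text file; column names are assumptions."""
    writer = csv.writer(out_file)
    writer.writerow(['payload', 'url', 'snippet'])
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=''` as the `csv` module documentation recommends, for example `open('findings.csv', 'w', newline='', encoding='utf-8')`.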
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
