Python Web Scraping of Fund Holdings Data and Analysis Using Requests, Selenium, and MongoDB
This tutorial demonstrates how to analyze a fund ranking website, construct dynamic URLs, extract six‑digit fund codes, crawl fund holding pages with requests and Selenium, store the results in MongoDB, and finally process the data to identify the most frequently held stocks across thousands of funds.
The article explains how to obtain fund holding data by first analyzing the target website (天天基金网) and discovering that fund detail URLs are composed of the fund code combined with a base URL, while holding‑detail pages follow a similar pattern using http://fundf10.eastmoney.com/ccmx_ plus the six‑digit code.
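The holding-detail URL pattern described above can be sketched as a small helper. The function name `build_holding_url` and the trailing `.html` suffix are assumptions for illustration; the article only states that the page address is `http://fundf10.eastmoney.com/ccmx_` plus the six-digit fund code.

```python
def build_holding_url(fund_code: str) -> str:
    """Build a holding-detail URL from a six-digit fund code.

    The fixed prefix comes from the article; the ".html" suffix is an
    assumption about the full page path.
    """
    return f"http://fundf10.eastmoney.com/ccmx_{fund_code}.html"

print(build_holding_url("000001"))
# → http://fundf10.eastmoney.com/ccmx_000001.html
```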
Because the data is loaded via JavaScript, the author shows that the request URL captured in the browser’s network panel can be fetched directly with requests, and a regular expression is used to extract the six‑digit fund codes from the response.
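A minimal sketch of the code-extraction step, assuming the article's approach of matching six-digit runs in the raw response text. The pattern and the sample string format are illustrative assumptions, not the article's exact regular expression.

```python
import re

def extract_fund_codes(text: str) -> list:
    """Extract six-digit fund codes from raw response text.

    \\b word boundaries keep longer digit runs (e.g. timestamps)
    from matching as codes.
    """
    return re.findall(r"\b\d{6}\b", text)

# Hypothetical fragment of the JS-style payload returned by the list page.
sample = '"000001,HXCZHH,华夏成长混合","000003,ZHKZZZQA"'
print(extract_fund_codes(sample))  # → ['000001', '000003']
```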
The crawling workflow is broken into three steps: (1) retrieve the list page and extract fund codes, (2) build the holding‑detail URLs and fetch each page (using Selenium with explicit waits when necessary), and (3) store the fund URL, name, and retrieved stock list into a MongoDB collection.
Required libraries are listed and imported as follows:

    import re
    from lxml import etree
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pymongo
Helper functions are provided: is_contain_chinese checks for Chinese characters, is_element verifies element existence with Selenium, get_one_page performs a GET request with headers and optional proxy, page_url builds the list of fund URLs and names, and hold_a_position uses Selenium to extract stock names from the holding‑detail page.
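The article does not show the body of is_contain_chinese, so the following is a reconstruction using the common CJK Unified Ideographs range; the exact check the author used may differ.

```python
def is_contain_chinese(text: str) -> bool:
    """Return True if the string contains at least one Chinese character.

    Checks the CJK Unified Ideographs block (U+4E00..U+9FFF); this is a
    reconstruction, as the article does not show the original body.
    """
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

print(is_contain_chinese("招商银行"))  # → True
print(is_contain_chinese("ABC123"))   # → False
```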
The main script connects to MongoDB, iterates over the fund URLs, calls hold_a_position, assembles a dictionary with fund_url, fund_name, and stock_name, inserts each document into the tb_stock collection, and finally queries for non‑null stock entries.
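The document assembled per fund can be sketched as below. The field names (fund_url, fund_name, stock_name) follow the article; the helper name build_fund_document and the convention of storing an empty stock list as None (so the later non-null query can filter out funds whose holdings failed to load) are assumptions. In the real script each document would then be passed to PyMongo's collection.insert_one.

```python
def build_fund_document(fund_url, fund_name, stock_names):
    """Assemble the per-fund document described in the article.

    An empty holdings list is stored as None (an assumption) so the
    later query for non-null stock entries can skip failed pages.
    """
    return {
        "fund_url": fund_url,
        "fund_name": fund_name,
        "stock_name": stock_names or None,
    }

doc = build_fund_document(
    "http://fundf10.eastmoney.com/ccmx_000001.html",
    "华夏成长混合",  # hypothetical fund name for illustration
    ["中国石化", "招商银行"],
)
print(doc["stock_name"])  # → ['中国石化', '招商银行']
```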
Data processing is then performed: all stock name arrays are merged into list_stock_all, duplicates are removed to create list_stock_repetition, and stocks appearing more than ten times are stored in a new collection tb_data with fields name and numbers. Sample output documents show counts such as "中国石化" appearing in 54 funds and "招商银行" in 910 funds.
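The merge-and-count step can be expressed compactly with collections.Counter rather than the article's explicit list manipulation; this is a functionally equivalent sketch, with the function name most_held_stocks and the min_count parameter being assumptions (the article hard-codes the more-than-ten threshold, and the output fields "name" and "numbers" match its tb_data documents).

```python
from collections import Counter

def most_held_stocks(stock_lists, min_count=10):
    """Flatten per-fund stock lists, count occurrences, and keep stocks
    held by more than min_count funds, as {"name": ..., "numbers": ...}
    documents mirroring the article's tb_data collection.
    """
    # Skip None entries (funds whose holdings could not be retrieved).
    merged = [s for stocks in stock_lists if stocks for s in stocks]
    counts = Counter(merged)
    return [{"name": name, "numbers": n}
            for name, n in counts.most_common() if n > min_count]

# Tiny illustrative sample; min_count lowered so the threshold triggers.
sample = [["中国石化", "招商银行"], ["中国石化"], None, ["招商银行", "中国石化"]]
print(most_held_stocks(sample, min_count=1))
```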
The conclusion highlights that the project combines basic web‑scraping techniques (requests, regular expressions, XPath) with Selenium for dynamically loaded pages and MongoDB for persistence, providing a practical example of how to gather and analyze fund holding data to help users make more informed investment decisions.