Python Web Scraping of Fund Holdings Data and Analysis Using Requests, Selenium, and MongoDB
This tutorial demonstrates how to analyze a fund ranking website, construct dynamic URLs, extract six‑digit fund codes, crawl fund holding pages with requests and Selenium, store the results in MongoDB, and finally process the data to identify the most frequently held stocks across thousands of funds.
The article explains how to obtain fund holding data by first analyzing the target website (天天基金网) and discovering that fund detail URLs are composed of the fund code combined with a base URL, while holding‑detail pages follow a similar pattern using http://fundf10.eastmoney.com/ccmx_ plus the six‑digit code.
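The holding-detail URL pattern described above can be sketched as a small helper. The function name `build_holding_url` and the trailing `.html` suffix are assumptions for illustration; the article only states that the page address is `http://fundf10.eastmoney.com/ccmx_` plus the six-digit fund code.

```python
def build_holding_url(fund_code: str) -> str:
    """Build a holding-detail URL from a six-digit fund code.

    The fixed prefix comes from the article; the ".html" suffix is an
    assumption about the full page path.
    """
    return f"http://fundf10.eastmoney.com/ccmx_{fund_code}.html"

print(build_holding_url("000001"))
# → http://fundf10.eastmoney.com/ccmx_000001.html
```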
Because the data is loaded via JavaScript, the author shows that the request URL captured in the browser’s network panel can be fetched directly with requests, and a regular expression is used to extract the six‑digit fund codes from the response.
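A minimal sketch of the code-extraction step, assuming the article's approach of matching six-digit runs in the raw response text. The pattern and the sample string format are illustrative assumptions, not the article's exact regular expression.

```python
import re

def extract_fund_codes(text: str) -> list:
    """Extract six-digit fund codes from raw response text.

    \\b word boundaries keep longer digit runs (e.g. timestamps)
    from matching as codes.
    """
    return re.findall(r"\b\d{6}\b", text)

# Hypothetical fragment of the JS-style payload returned by the list page.
sample = '"000001,HXCZHH,华夏成长混合","000003,ZHKZZZQA"'
print(extract_fund_codes(sample))  # → ['000001', '000003']
```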
The crawling workflow is broken into three steps: (1) retrieve the list page and extract fund codes, (2) build the holding‑detail URLs and fetch each page (using Selenium with explicit waits when necessary), and (3) store the fund URL, name, and retrieved stock list into a MongoDB collection.
Required libraries are listed and imported as follows:

    import re
    from lxml import etree
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pymongo
Helper functions are provided: is_contain_chinese checks for Chinese characters, is_element verifies element existence with Selenium, get_one_page performs a GET request with headers and optional proxy, page_url builds the list of fund URLs and names, and hold_a_position uses Selenium to extract stock names from the holding‑detail page.
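The article does not show the body of is_contain_chinese, so the following is a reconstruction using the common CJK Unified Ideographs range; the exact check the author used may differ.

```python
def is_contain_chinese(text: str) -> bool:
    """Return True if the string contains at least one Chinese character.

    Checks the CJK Unified Ideographs block (U+4E00..U+9FFF); this is a
    reconstruction, as the article does not show the original body.
    """
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

print(is_contain_chinese("招商银行"))  # → True
print(is_contain_chinese("ABC123"))   # → False
```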
The main script connects to MongoDB, iterates over the fund URLs, calls hold_a_position, assembles a dictionary with fund_url, fund_name, and stock_name, inserts each document into the tb_stock collection, and finally queries for non‑null stock entries.
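The document assembled per fund can be sketched as below. The field names (fund_url, fund_name, stock_name) follow the article; the helper name build_fund_document and the convention of storing an empty stock list as None (so the later non-null query can filter out funds whose holdings failed to load) are assumptions. In the real script each document would then be passed to PyMongo's collection.insert_one.

```python
def build_fund_document(fund_url, fund_name, stock_names):
    """Assemble the per-fund document described in the article.

    An empty holdings list is stored as None (an assumption) so the
    later query for non-null stock entries can skip failed pages.
    """
    return {
        "fund_url": fund_url,
        "fund_name": fund_name,
        "stock_name": stock_names or None,
    }

doc = build_fund_document(
    "http://fundf10.eastmoney.com/ccmx_000001.html",
    "华夏成长混合",  # hypothetical fund name for illustration
    ["中国石化", "招商银行"],
)
print(doc["stock_name"])  # → ['中国石化', '招商银行']
```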
Data processing is then performed: all stock name arrays are merged into list_stock_all, duplicates are removed to create list_stock_repetition, and stocks appearing more than ten times are stored in a new collection tb_data with fields name and numbers. Sample output documents show counts such as "中国石化" appearing in 54 funds and "招商银行" in 910 funds.
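The merge-and-count step can be expressed compactly with collections.Counter rather than the article's explicit list manipulation; this is a functionally equivalent sketch, with the function name most_held_stocks and the min_count parameter being assumptions (the article hard-codes the more-than-ten threshold, and the output fields "name" and "numbers" match its tb_data documents).

```python
from collections import Counter

def most_held_stocks(stock_lists, min_count=10):
    """Flatten per-fund stock lists, count occurrences, and keep stocks
    held by more than min_count funds, as {"name": ..., "numbers": ...}
    documents mirroring the article's tb_data collection.
    """
    # Skip None entries (funds whose holdings could not be retrieved).
    merged = [s for stocks in stock_lists if stocks for s in stocks]
    counts = Counter(merged)
    return [{"name": name, "numbers": n}
            for name, n in counts.most_common() if n > min_count]

# Tiny illustrative sample; min_count lowered so the threshold triggers.
sample = [["中国石化", "招商银行"], ["中国石化"], None, ["招商银行", "中国石化"]]
print(most_held_stocks(sample, min_count=1))
```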
The conclusion highlights that the project combines basic web‑scraping techniques (requests, regular expressions, XPath) with Selenium for dynamically loaded pages and MongoDB for persistence, providing a practical example of how to gather and analyze fund holding data to help users make more informed investment decisions.