Web Data Mining and Analysis of the “Da Gai Er” Section of the Caoliu Forum Using PHP
This article presents a PHP‑based web‑scraping experiment that collects and visualizes several months of data from the “Da Gai Er” board of the Caoliu forum, revealing user activity patterns, image hosting distribution, registration trends, and overall forum health through charts and statistical summaries.
The author, a part‑time developer and Unix enthusiast, conducted a web‑data collection and analysis experiment on the Caoliu forum’s “Da Gai Er” section, which contains user‑generated content valuable for research.
The dataset covers 8,537 threads (title, posting time, and reply count) started by 576 users between 2015‑06‑05 and 2015‑09‑07, along with 12,884 image URLs, 13,070 reply records, and 11,250 user profiles (username, registration time, and last login).
Key observations include an average of 14.8 threads per user, a top contributor with 276 posts, and a rapid increase in new users starting in August, accounting for 50.6% of registrations in the most recent year.
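The per-user average follows directly from the totals quoted in this summary. The short check below is only a sketch in Python (the original project was written in PHP), and it assumes that the top contributor's 276 posts are counted among the 8,537 threads, which the summary does not state explicitly.

```python
# Totals quoted in the summary.
threads = 8537
posting_users = 576
top_contributor_posts = 276

# Average number of threads started per posting user.
avg_threads_per_user = threads / posting_users
print(round(avg_threads_per_user, 1))

# Share of all threads from the top contributor.
# Assumption: the 276 posts are threads within the 8,537 total.
top_share = top_contributor_posts / threads
print(f"{top_share:.1%}")
```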
Image hosting analysis shows that the majority of images are stored on ihostimg.com, with a notable presence of Sina’s cloud storage (sinaimg) as a secondary host.
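A host distribution of this kind can be computed by grouping image URLs by domain. The following is a minimal sketch in Python rather than the author's PHP code; the function names and the sample URLs are made up for illustration, and the "last two labels" rule is a simplification that would misgroup hosts under multi-part suffixes such as co.uk.

```python
from collections import Counter
from urllib.parse import urlparse


def host_group(url: str) -> str:
    """Return the last two labels of the URL's host name, e.g. 'sinaimg.cn'."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def count_hosts(urls):
    """Count how many image URLs fall under each host group."""
    return Counter(host_group(u) for u in urls)


# Made-up sample URLs, for illustration only.
sample_urls = [
    "http://www.ihostimg.com/a/1.jpg",
    "http://ihostimg.com/b/2.jpg",
    "http://ww1.sinaimg.cn/large/3.jpg",
    "http://example.org/4.png",
]

counts = count_hosts(sample_urls)
for host, n in counts.most_common():
    print(host, n)
```

Grouping by registered domain rather than full host name keeps numbered subdomains of the same service in one bucket, which is what a "which host dominates" chart needs.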
Forum activity metrics indicate that as the number of registered users grew, so did the number of new threads and replies, with peak reply activity occurring around 10 AM rather than the expected evening hours.
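Finding the peak hour amounts to bucketing reply timestamps by hour of day. The sketch below, in Python rather than the project's PHP, assumes timestamps arrive as "YYYY-MM-DD HH:MM" strings in a single time zone; the actual storage format is not given in the summary, and the sample values are invented.

```python
from collections import Counter
from datetime import datetime


def replies_per_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return a 24-element list: number of replies posted in each hour of day."""
    hours = Counter(datetime.strptime(t, fmt).hour for t in timestamps)
    return [hours.get(h, 0) for h in range(24)]


# Invented sample timestamps, for illustration only.
sample = [
    "2015-08-01 10:05",
    "2015-08-01 10:40",
    "2015-08-02 21:15",
    "2015-08-03 10:59",
]

histogram = replies_per_hour(sample)
# Hour with the most replies (the earliest hour wins on a tie).
peak_hour = max(range(24), key=histogram.__getitem__)
print(peak_hour, histogram[peak_hour])
```

The 24-element list maps directly onto the categories of a column chart, one bar per hour.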
The underlying data are stored in a MySQL database, and the project’s technical stack consists of PHP with cURL for crawling, Simple HTML DOM for HTML parsing, PSCWS4 for Chinese word segmentation, Bootstrap for front‑end display, and Highcharts for data visualization.
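The parsing step of such a pipeline can be illustrated without the PHP libraries. The sketch below uses Python's standard html.parser as a stand-in for Simple HTML DOM; the markup, the "thread" class name, and the paths are all hypothetical, since the summary does not describe the forum's real page structure.

```python
from html.parser import HTMLParser


class ThreadLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a class="thread"> elements."""

    def __init__(self):
        super().__init__()
        self.threads = []
        self._in_thread = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "thread" in classes:
            self._in_thread = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Accumulate text only while inside a thread link.
        if self._in_thread:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_thread:
            self.threads.append((self._href, "".join(self._text).strip()))
            self._in_thread = False


# Hypothetical listing markup, for illustration only.
sample_html = """
<table>
  <tr><td><a class="thread" href="/read/1.html">First title</a></td><td>12</td></tr>
  <tr><td><a href="/user/9">someone</a></td></tr>
  <tr><td><a class="thread" href="/read/2.html">Second <b>title</b></a></td><td>3</td></tr>
</table>
"""

parser = ThreadLinkParser()
parser.feed(sample_html)
parser.close()
for href, title in parser.threads:
    print(href, title)
```

In a real crawler the HTML would come from an HTTP fetch and the extracted rows would be written to the database; both steps are left out here to keep the example self-contained.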
The author suggests that deeper analysis could yield further insights, for example by downloading the collected images to train a porn‑image detection model or by incorporating additional user metrics such as reputation and contribution.
