Building a Zhihu User Data Crawler and Large‑Scale Analysis with SpringBoot, SeimiCrawler, RabbitMQ, ElasticSearch, and Kibana
This article describes how to build a Java‑based crawler to collect millions of Zhihu user profiles, handle anti‑crawling measures with rotating user‑agents and a proxy pool, deduplicate data using a Bloom filter, import the results into ElasticSearch, and analyze the dataset with Kibana and ECharts visualizations.
The author, a Zhihu newcomer, created a project to crawl and analyze Zhihu user data, with the source code available on GitHub.
The project breaks down into two main tasks: crawling Zhihu user information and analyzing the collected data.
The crawler parses each user profile page, extracts the key fields, and uses the user's URL token (url_token) to request that user's personal page, followees, and followers. Every newly discovered URL token is crawled in turn, so the collection expands iteratively across a large user graph.
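The iterative expansion described above can be sketched as a breadth-first traversal over URL tokens. This is a minimal illustration and not the project's actual code: the class name CrawlFrontier and the fetchNeighbors function (a stand-in for the HTTP calls that return followee and follower tokens) are assumptions, and the visited set is kept in plain memory for simplicity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: breadth-first expansion of a user graph keyed by url_token. */
class CrawlFrontier {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Enqueue a token only the first time it is seen. */
    boolean offer(String urlToken) {
        if (urlToken == null || urlToken.isEmpty() || !seen.add(urlToken)) {
            return false;
        }
        queue.addLast(urlToken);
        return true;
    }

    /**
     * Visit up to maxUsers tokens starting from seed, in breadth-first order.
     * fetchNeighbors is an assumed stand-in for the real requests that return
     * the followee and follower tokens of one user.
     */
    List<String> crawl(String seed, int maxUsers,
                       Function<String, List<String>> fetchNeighbors) {
        List<String> visited = new ArrayList<>();
        offer(seed);
        while (!queue.isEmpty() && visited.size() < maxUsers) {
            String token = queue.pollFirst();
            visited.add(token);
            for (String next : fetchNeighbors.apply(token)) {
                offer(next); // already-seen tokens are skipped
            }
        }
        return visited;
    }
}
```

In a real crawler the in-memory set would not scale to millions of users, which is where a more compact deduplication structure becomes useful.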
The chosen crawler framework is SpringBoot combined with SeimiCrawler, providing near‑zero configuration for Java web crawling.
To bypass Zhihu’s anti‑crawling mechanisms, the project rotates common User‑Agent strings and maintains a high‑availability free proxy pool (primarily Xici proxies). RabbitMQ with ten consumers checks proxy health, stores valid proxies in a database, periodically refreshes the pool, and promotes the most reliable proxies to Redis.
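As a rough sketch of the two ideas above, the following code rotates User-Agent strings in round-robin order and ranks proxies by the success rate of their health checks. It is an illustration under stated assumptions, not the project's implementation: the class names UserAgentRotator and ProxyScoreboard are hypothetical, and RabbitMQ, the database, and Redis are deliberately left out so that only the selection logic is shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

/** Hypothetical sketch: cycle through a fixed list of User-Agent strings. */
class UserAgentRotator {
    private final List<String> agents;
    private final AtomicInteger cursor = new AtomicInteger();

    UserAgentRotator(List<String> agents) {
        if (agents.isEmpty()) {
            throw new IllegalArgumentException("at least one User-Agent is required");
        }
        this.agents = new ArrayList<>(agents);
    }

    /** Returns the next User-Agent in round-robin order. */
    String next() {
        // floorMod keeps the index valid even after the counter overflows.
        int i = Math.floorMod(cursor.getAndIncrement(), agents.size());
        return agents.get(i);
    }
}

/** Hypothetical sketch: record health-check results and rank proxies by success rate. */
class ProxyScoreboard {
    private static final class Stats {
        int ok;
        int total;

        double rate() {
            return total == 0 ? 0.0 : (double) ok / total;
        }
    }

    private final Map<String, Stats> stats = new ConcurrentHashMap<>();

    /** Record one health-check outcome for a proxy written as "host:port". */
    void record(String proxy, boolean success) {
        stats.compute(proxy, (key, existing) -> {
            Stats s = (existing == null) ? new Stats() : existing;
            s.total++;
            if (success) {
                s.ok++;
            }
            return s;
        });
    }

    /** The n proxies with the highest success rate (candidates for promotion to a fast cache). */
    List<String> mostReliable(int n) {
        return stats.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, Stats> e) -> e.getValue().rate()).reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

In the pipeline described above, the calls to record would come from the consumers that test each proxy, and the result of mostReliable would be what gets written to Redis.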
After setting up Redis and RabbitMQ, the service can be started and data fetched via HTTP endpoints such as localhost:8980/users?url_token=…, localhost:8980/users/followees, and localhost:8980/users/followers.
For deduplication, the project uses a Bloom filter. The author reports a footprint of only about 16 KB of memory for 1.67 M records, far less than Java Set or Map structures would need to hold the same keys.
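To make the deduplication idea concrete, here is a minimal Bloom filter sketch built on java.util.BitSet. It is not the implementation used in the project: the class name SimpleBloomFilter, the two hash functions, and the double-hashing scheme are assumptions chosen for brevity. The constructor uses the standard sizing formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. With those formulas, 1.67 M items at a 1% false-positive rate come out at roughly 16 million bits (about 2 MB), so the 16 KB figure quoted above may reflect a different unit or a looser error target.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical sketch of a Bloom filter for url_token deduplication. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int expectedItems, double falsePositiveRate) {
        if (expectedItems <= 0 || falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0) {
            throw new IllegalArgumentException("need expectedItems > 0 and 0 < rate < 1");
        }
        double ln2 = Math.log(2);
        // Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        double m = -expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2);
        this.numBits = Math.max(1, (int) Math.ceil(m));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedItems * ln2));
        this.bits = new BitSet(numBits);
    }

    void add(String value) {
        long[] h = baseHashes(value);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String value) {
        long[] h = baseHashes(value);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) {
                return false;
            }
        }
        return true;
    }

    int sizeInBits() {
        return numBits;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod m.
    private int index(long[] h, int i) {
        return (int) Math.floorMod(h[0] + (long) i * h[1], (long) numBits);
    }

    private static long[] baseHashes(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        long h1 = 0xcbf29ce484222325L; // FNV-1a 64-bit offset basis
        long h2 = 7L;                  // simple polynomial hash as the second function
        for (byte b : data) {
            h1 ^= (b & 0xff);
            h1 *= 0x100000001b3L;      // FNV-1a 64-bit prime
            h2 = h2 * 31 + (b & 0xff);
        }
        return new long[] {h1, h2 | 1L}; // force an odd step so it is never zero
    }
}
```

A crawler would call mightContain before enqueueing a URL token and add after fetching it; an occasional false positive only means that a user is skipped, never that one is fetched twice.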
Because MySQL is inefficient for large‑scale analysis, the project imports data into ElasticSearch. The required Maven dependency is added as follows:
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>
With ElasticSearch configured, the endpoint localhost:8980/users/transfer migrates the data into an ES index.
Kibana is then used for analysis, with examples including querying users with over 1 M followers, aggregating gender ratios, identifying the top 10 cities by residence, performing fuzzy searches, and visualizing results.
For richer visualizations, the author uses ECharts to create charts for follower tier distribution, industry breakdown, top companies, job positions, universities, majors, and residence cities, revealing that most active users are programmers from major tech firms in developed cities.
In conclusion, the collected dataset shows that Zhihu users with complete profiles are predominantly tech professionals in large cities, and the end‑to‑end pipeline demonstrates a practical big‑data workflow from crawling to analysis.
