How Yidun Achieves Real-Time, High-Performance Public-Opinion Data Cleaning with Groovy and the JVM
Yidun's public-opinion monitoring platform converts massive volumes of raw web data into a unified format. It separates dynamic, Groovy-script-driven cleaning from static processing, which lets it onboard new sources in real time while keeping throughput, scalability, and availability high. The design addresses three problems: diverse data formats, coordination across teams, and the trade-off between performance and flexibility.
Introduction
NetEase Yidun, a one-stop digital content risk-control brand, monitors public opinion across the web. Its monitoring platform first converts massive raw data into a unified format, a step called public-opinion data cleaning.
Challenges
Onboarding new sources quickly despite diverse data formats, while respecting the open-closed principle (a new source should be added without modifying existing code).
Coordinating work across crawlers, ETL processes, and development teams.
Balancing flexible cleaning logic against high-performance execution.
Dynamic‑Static Separation in Business Architecture
Yidun splits the cleaning pipeline into "dynamic" (flexible) and "static" (fixed) parts. The dynamic part consists of configurable Groovy scripts that are loaded at runtime. The static part, such as deduplication and assembly, is fixed code that runs afterward and acts on the results of the dynamic stage. This separation has four consequences:
ETL and system development are decoupled, with clear cleaning‑process specifications.
Cleaning scripts are dynamically configurable for real‑time new‑source integration.
The cleaning-script chain is managed and reusable.
Static modules (deduplication, assembly) select processing strategies based on dynamic outcomes.
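The separation described above can be sketched in plain Java. This is a minimal illustration, not Yidun's implementation: the class name `CleaningPipeline`, the field names `title`, `url`, and `dedupStrategy`, and the strategy values `byTitle` and `byUrl` are all hypothetical. Cleaning steps stand in for Groovy scripts, can be registered or replaced per source at runtime, and can be shared between sources. The static deduplication code never changes; it only reads the strategy that the dynamic stage wrote into the record.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Sketch of dynamic/static separation; every name here is hypothetical. */
final class CleaningPipeline {
    // Dynamic part: one chain of cleaning steps per source, replaceable at runtime.
    private final Map<String, List<UnaryOperator<Map<String, String>>>> chains =
            new ConcurrentHashMap<>();

    /** Registers or replaces the chain for a source without restarting the service. */
    void register(String sourceId, List<UnaryOperator<Map<String, String>>> steps) {
        chains.put(sourceId, new ArrayList<>(steps));
    }

    /** Runs the chain for sourceId in order; unknown sources pass through unchanged. */
    Map<String, String> clean(String sourceId, Map<String, String> raw) {
        Map<String, String> record = new HashMap<>(raw);
        List<UnaryOperator<Map<String, String>>> steps =
                chains.getOrDefault(sourceId, Collections.emptyList());
        for (UnaryOperator<Map<String, String>> step : steps) {
            record = step.apply(record);
        }
        return record;
    }

    /** Reusable step: trims whitespace from one field if it is present. */
    static UnaryOperator<Map<String, String>> trimField(String field) {
        return record -> {
            Map<String, String> copy = new HashMap<>(record);
            copy.computeIfPresent(field, (k, v) -> v.trim());
            return copy;
        };
    }

    /** Reusable step: records which deduplication strategy the static stage should use. */
    static UnaryOperator<Map<String, String>> withDedupStrategy(String strategy) {
        return record -> {
            Map<String, String> copy = new HashMap<>(record);
            copy.put("dedupStrategy", strategy);
            return copy;
        };
    }

    // Static part: fixed code that only reads the strategy chosen by the dynamic stage.
    static String dedupKey(Map<String, String> cleaned) {
        String strategy = cleaned.getOrDefault("dedupStrategy", "byUrl");
        return "byTitle".equals(strategy) ? cleaned.get("title") : cleaned.get("url");
    }
}
```

Under these assumptions, onboarding a new source is a call to `register` with a list of existing or new steps, and `dedupKey` stays untouched, which is the open-closed property the design aims for.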
Dynamic Layer: Groovy‑Based Cleaning
Groovy scripts are compiled to JVM bytecode and loaded as ordinary classes, so frequently executed cleaning code benefits from JIT optimization. Deploying the cleaning service on multiple nodes provides high throughput, high availability, and straightforward horizontal scaling.
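For compiled scripts to benefit from JIT optimization, each script has to be compiled once and then reused across many records rather than recompiled per record. The sketch below shows one way to do that with a compile-once cache. It is an assumption about how such a layer could look, not the article's code: `ScriptCache` is a hypothetical name, and the `compiler` function is a placeholder for whatever turns script text into a callable. In a Groovy setup that role could be filled by something like `GroovyClassLoader.parseClass`, which is left out here so the example stays runnable with the Java standard library alone.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Compile-once cache for cleaning scripts; all names are hypothetical. */
final class ScriptCache<T, R> {
    // Placeholder for the real compiler (for example, a Groovy-backed one).
    private final Function<String, Function<T, R>> compiler;
    // Compiled scripts keyed by their source text.
    private final Map<String, Function<T, R>> compiled = new ConcurrentHashMap<>();

    ScriptCache(Function<String, Function<T, R>> compiler) {
        this.compiler = compiler;
    }

    /** Returns the compiled script, invoking the compiler only on first use of this text. */
    Function<T, R> get(String scriptText) {
        return compiled.computeIfAbsent(scriptText, compiler);
    }

    /** Convenience: look up (or compile) the script, then apply it to one input. */
    R run(String scriptText, T input) {
        return get(scriptText).apply(input);
    }
}
```

Keying the cache by script text means an edited script is compiled as a new entry on first use, while unchanged scripts keep hitting the already-loaded class. A production version would likely also evict entries for scripts that are no longer configured.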
Static Layer: Unified Data Processing Chain
After cleaning, raw crawler data undergoes deduplication, sentiment tagging, assembly, and keyword extraction. The unified format and defined deduplication/assembly strategies guide the downstream processing chain, which:
Executes logic according to the cleaning‑produced strategies.
Uses a custom routing thread group with Kafka partition ordering to guarantee sequential consumption and assembly.
Partitions crawler data so that order is preserved within each partition while different partitions are processed in parallel, balancing correctness and performance.
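The routing idea in the list above can be sketched as follows. Kafka preserves order within a single partition; to keep that order once records are handed to worker threads, records with the same key must always go to the same single-threaded worker. The code is a minimal sketch under stated assumptions: `KeyedRouter` is a hypothetical name, the routing key (for example an article ID) is an assumption, and the Kafka consumer itself is omitted so the example needs only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Routes tasks with the same key to the same single-threaded worker; names are hypothetical. */
final class KeyedRouter {
    private final List<ExecutorService> workers = new ArrayList<>();

    KeyedRouter(int workerCount) {
        for (int i = 0; i < workerCount; i++) {
            // One thread per worker: tasks queued on a worker run strictly in submission order.
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    /** Same key always maps to the same worker index; floorMod avoids negative indices. */
    int workerFor(String key) {
        return Math.floorMod(key.hashCode(), workers.size());
    }

    /** Queues the task on the worker that owns this key. */
    void submit(String key, Runnable task) {
        workers.get(workerFor(key)).execute(task);
    }

    /** Stops accepting work and waits up to ten seconds per worker for queued tasks. */
    void close() {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        try {
            for (ExecutorService w : workers) {
                w.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

Records for different keys spread across workers and run in parallel, while records for one key are assembled sequentially. The ordering guarantee holds only if each key is submitted from a single thread, which matches a consumer reading one partition in order.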
Conclusion
By combining Groovy scripts with JVM dynamic class loading, Yidun achieves flexible yet high‑performance data cleaning. Multi‑node deployment provides scalability and high availability, enabling real‑time integration of new data sources and faster response in public‑opinion systems.
NetEase Smart Enterprise Tech+