How Yidun Achieves Real-Time, High-Performance Public-Opinion Data Cleaning with Groovy and JVM

Yidun’s public-opinion monitoring platform converts massive volumes of raw web data into a unified format by separating dynamic, Groovy-script-driven cleaning from static processing. This design allows new sources to be integrated in real time and provides high throughput, scalability, and high availability, while addressing format diversity, team coordination, and the trade-off between performance and flexibility.

NetEase Smart Enterprise Tech+

Jan 14, 2021

Introduction

NetEase Yidun, a one‑stop digital content risk‑control brand, monitors public opinion across the web. Its monitoring platform first converts massive volumes of raw data into a unified format, a step called public-opinion data cleaning.

Challenges

Diverse data formats require rapid onboarding of new sources while respecting the open‑closed principle.

Coordinating crawlers, ETL processes, and development teams.

Balancing flexibility of cleaning logic with high‑performance execution.

Dynamic‑Static Separation in Business Architecture

Yidun separates the cleaning pipeline into “dynamic” (flexible) and “static” (fixed) parts. The dynamic part consists of configurable Groovy scripts that can be loaded at runtime; the static part, such as deduplication and assembly, runs afterward and acts on the results of the dynamic stage. This separation has the following properties:

ETL and system development are decoupled, with clear cleaning‑process specifications.

Cleaning scripts are dynamically configurable for real‑time new‑source integration.

Cleaning script chain is managed and reusable.

Static modules (deduplication, assembly) select processing strategies based on dynamic outcomes.
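The split described above can be illustrated with a minimal Java sketch. It is not Yidun's implementation: the class name, the field names (`title`, `url`, `dedupStrategy`), and the strategy values (`BY_TITLE`, `BY_URL`) are all hypothetical, and plain Java lambdas stand in for the runtime-loaded Groovy scripts. The point is only that the dynamic chain writes a strategy into the unified record and the fixed static code reads it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Sketch only: every name in this class is hypothetical.
class CleaningPipelineSketch {

    // Dynamic part: an ordered, reusable chain of cleaning steps.
    // In the article's design each step would be a Groovy script loaded at
    // runtime; here ordinary Java lambdas play that role.
    static Map<String, String> runChain(List<UnaryOperator<Map<String, String>>> chain,
                                        Map<String, String> raw) {
        Map<String, String> record = new HashMap<>(raw);
        for (UnaryOperator<Map<String, String>> step : chain) {
            record = step.apply(record);
        }
        return record;
    }

    // Static part: fixed code that derives a deduplication key according to
    // the strategy the dynamic stage wrote into the record.
    static String dedupKey(Map<String, String> record) {
        String strategy = record.getOrDefault("dedupStrategy", "BY_URL");
        if ("BY_TITLE".equals(strategy)) {
            return "title:" + record.get("title");
        }
        return "url:" + record.get("url");
    }

    // Keeps only the first record seen for each deduplication key.
    static List<Map<String, String>> deduplicate(List<Map<String, String>> cleaned) {
        Set<String> seen = new HashSet<>();
        List<Map<String, String>> unique = new ArrayList<>();
        for (Map<String, String> record : cleaned) {
            if (seen.add(dedupKey(record))) {
                unique.add(record);
            }
        }
        return unique;
    }
}
```

Because the static code only reads the strategy field, onboarding a new source in this sketch means adding or reconfiguring chain steps, not changing `deduplicate`.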

Dynamic Layer: Groovy‑Based Cleaning

Groovy scripts are compiled to JVM bytecode and loaded at runtime like ordinary classes, so they benefit from JIT optimization in the same way as other JVM code. Deploying the cleaning service on multiple nodes provides high throughput, high availability, and straightforward horizontal scaling.
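One way to picture runtime loading is a registry that maps each data source to its current cleaning function and lets that function be swapped while the service keeps running. The sketch below is an assumption-laden illustration, not the article's code: `ScriptRegistrySketch`, `register`, and `clean` are hypothetical names, and Java lambdas stand in for compiled scripts. In a Groovy deployment the function would typically come from a class compiled at runtime (for example with Groovy's `GroovyClassLoader`); the source does not say which mechanism Yidun uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: a hot-swappable registry of per-source cleaning functions.
class ScriptRegistrySketch {

    // Thread-safe map so scripts can be replaced while other threads clean data.
    private final Map<String, Function<String, String>> cleaners = new ConcurrentHashMap<>();

    // Installs or replaces the cleaner for a source.
    // Returns true if an earlier version was replaced.
    boolean register(String sourceId, Function<String, String> cleaner) {
        return cleaners.put(sourceId, cleaner) != null;
    }

    // Applies whichever cleaner is currently registered for the source.
    String clean(String sourceId, String raw) {
        Function<String, String> cleaner = cleaners.get(sourceId);
        if (cleaner == null) {
            throw new IllegalArgumentException("No cleaning script registered for source: " + sourceId);
        }
        return cleaner.apply(raw);
    }
}
```

A new source becomes available as soon as `register` is called for it, and re-registering an existing source changes the behavior of subsequent `clean` calls without a restart.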

Static Layer: Unified Data Processing Chain

After cleaning, raw crawler data undergoes deduplication, sentiment tagging, assembly, and keyword extraction. The unified format and defined deduplication/assembly strategies guide the downstream processing chain, which:

Executes logic according to the cleaning‑produced strategies.

Uses a custom routing thread group on top of Kafka's per-partition ordering, so that related records are consumed and assembled sequentially.

Partitions crawler data to preserve order and parallelism, balancing performance and correctness.
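The ordering idea in the list above can be sketched without Kafka at all: route every record with the same key to the same single-threaded lane, so records for one key are processed in submission order while different keys proceed in parallel. This is a hedged illustration of the general technique, not the article's routing thread group; `KeyedRouterSketch`, `laneFor`, and the choice of `hashCode` for routing are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch only: key-based routing onto single-threaded lanes.
class KeyedRouterSketch implements AutoCloseable {

    // Each lane is one thread with a FIFO queue, so tasks in a lane run in order.
    private final List<ExecutorService> lanes = new ArrayList<>();

    KeyedRouterSketch(int laneCount) {
        for (int i = 0; i < laneCount; i++) {
            lanes.add(Executors.newSingleThreadExecutor());
        }
    }

    // The same key always maps to the same lane; floorMod avoids negative indexes.
    int laneFor(String key) {
        return Math.floorMod(key.hashCode(), lanes.size());
    }

    // Tasks sharing a key run sequentially; tasks in different lanes run in parallel.
    Future<?> submit(String key, Runnable task) {
        return lanes.get(laneFor(key)).submit(task);
    }

    // Stops accepting work and waits (up to 10 seconds per lane) for queued tasks.
    @Override
    public void close() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        for (ExecutorService lane : lanes) {
            try {
                lane.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

The lane count sets the trade-off the list describes: more lanes give more parallelism, while ordering is guaranteed only among records that share a key.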

Conclusion

By combining Groovy scripts with JVM dynamic class loading, Yidun achieves flexible yet high‑performance data cleaning. Multi‑node deployment provides scalability and high availability, enabling real‑time integration of new data sources and faster response in public‑opinion systems.
