How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices
This article details MaFengWo's practical experience with Kafka in its big‑data platform, covering three core usage scenarios, a four‑stage evolution roadmap—including version upgrades, resource isolation, security and monitoring—and future plans such as transaction‑based deduplication and consumer throttling.
Application Scenarios
Kafka is used in three major ways on the platform:
Real‑time storage layer : stores business logs, monitoring logs and client‑side event logs (H5, Web, App, Mini‑program) for immediate consumption.
Data source for analytics : feeds offline data warehouses, real‑time Druid OLAP and other analytical systems.
Subscription service for downstream services : provides data streams to recommendation, search, anti‑fraud, monitoring and other core business units.
Evolution Roadmap
The platform addressed stability and scalability through four progressive stages:
Version upgrade : migrated from Kafka 0.8.3 to 1.1.1 to obtain security (SASL/SCRAM, ACL), quota, idempotent producers, transactional messaging, leader‑epoch handling and simplified controller shutdown.
Resource isolation : built three physical clusters (Log, Full‑Subscription, Custom) and isolated topics per business line to avoid broker overload and guarantee predictable performance.
Permission control & monitoring : introduced SASL/SCRAM + ACL for authentication, and a unified monitoring & alerting platform (named “Radar”) based on Kafka JMX metrics, OpenFalcon, Grafana and Eagle.
Application extensions : created a real‑time subscription platform that automates request approval, user authorization, usage monitoring, and added capabilities such as topic management, data replay and data sharding.
Key Practices
Version Upgrade
Problems with the 0.8.3 version included lack of security, under‑replicated brokers, missing features (transactions, idempotence, timestamps), heavy Zookeeper dependency and insufficient monitoring. Selecting 1.1.1 provided:
Quota and ACL for fine‑grained access control.
Idempotent producers and transactional messaging (exactly‑once semantics for downstream Flink jobs).
Leader‑epoch handling to avoid data loss during leader changes.
Simplified controller shutdown process.
Better integration with Camus (LinkedIn’s Kafka‑to‑HDFS dump tool).
Resource Isolation
Three clusters were defined:
Log cluster : receives raw event data from all sources. It is high‑availability, does not expose subscription interfaces, and dumps data to HDFS via Camus for offline processing.
Full‑Subscription cluster : mirrors the Log cluster in real time and serves internal analytics and real‑time jobs.
Custom cluster : hosts business‑specific topics that require dedicated isolation or custom retention policies.
Topic‑level isolation is also enforced: high‑volume topics (e.g., server-event and mobile-event) are placed on separate brokers to prevent load skew.
Permission Control & Monitoring
Authentication & Authorization : The platform uses SASL/SCRAM with ACLs. Users are created dynamically; permissions are granted per topic or consumer group, eliminating the previous “bare‑run” state where any client could read/write.
Monitoring Stack : Kafka JMX Metrics: exposed by brokers; 1.1.1 provides rich metrics (bytes‑in/out, request rates, under‑replicated partitions, leader‑epoch, etc.). OpenFalcon: collects JMX metrics via a Falcon‑agent deployed on each broker. Grafana: visualizes metrics for clusters, brokers, topics and consumer groups. Eagle: extracts consumer lag and active status, exposing an API for the alerting system. Radar (internal alert system): defines thresholds, sends alerts to owners via an enterprise WeChat bot.
Application Extensions – Real‑time Subscription Platform
The platform provides an end‑to‑end workflow for data production and consumption:
Business users submit a subscription request (topic, consumer group, access mode) through a ticketing system.
Platform operators review and approve the request.
Upon approval, the system creates a dedicated user account, assigns ACLs, and returns the broker address and credentials.
All granted resources are automatically registered in the Radar monitoring system for lifecycle tracking.
Additional features:
Topic Management : UI‑driven creation, quota assignment, and metadata management without direct SSH access.
Data Replay : supports resetting consumer offsets to any timestamp or offset, enabling both Lambda and Kappa style reprocessing.
Data Sharding : custom routing rules allow merging or splitting topics based on business keys (e.g., app‑code, event‑code).
Future Plans
Exactly‑once processing : combine Kafka’s transaction mechanism with Flink’s two‑phase commit to eliminate data duplication during failure recovery.
Consumer quota & dynamic throttling : use Kafka Quota to protect producers from heavy read‑disk workloads and allow runtime adjustment of thresholds.
SDK & API expansion : expose Kafka client libraries and HTTP APIs for multiple programming languages to broaden subscription and production scenarios.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
