Big Data 17 min read

How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo's practical experience with Kafka in its big‑data platform, covering three core usage scenarios, a four‑stage evolution roadmap—including version upgrades, resource isolation, security and monitoring—and future plans such as transaction‑based deduplication and consumer throttling.

dbaplus Community

Mar 3, 2020

How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices

Application Scenarios

Kafka is used in three major ways on the platform:

Real‑time storage layer : stores business logs, monitoring logs and client‑side event logs (H5, Web, App, Mini‑program) for immediate consumption.

Data source for analytics : feeds offline data warehouses, real‑time Druid OLAP and other analytical systems.

Subscription service for downstream services : provides data streams to recommendation, search, anti‑fraud, monitoring and other core business units.

Evolution Roadmap

The platform addressed stability and scalability through four progressive stages:

Version upgrade : migrated from Kafka 0.8.3 to 1.1.1 to obtain security (SASL/SCRAM, ACL), quota, idempotent producers, transactional messaging, leader‑epoch handling and simplified controller shutdown.

Resource isolation : built three physical clusters (Log, Full‑Subscription, Custom) and isolated topics per business line to avoid broker overload and guarantee predictable performance.

Permission control & monitoring : introduced SASL/SCRAM + ACL for authentication, and a unified monitoring & alerting platform (named “Radar”) based on Kafka JMX metrics, OpenFalcon, Grafana and Eagle.

Application extensions : created a real‑time subscription platform that automates request approval, user authorization, usage monitoring, and added capabilities such as topic management, data replay and data sharding.

Key Practices

Version Upgrade

Problems with the 0.8.3 version included lack of security, under‑replicated brokers, missing features (transactions, idempotence, timestamps), heavy Zookeeper dependency and insufficient monitoring. Selecting 1.1.1 provided:

Quota and ACL for fine‑grained access control.

Idempotent producers and transactional messaging (exactly‑once semantics for downstream Flink jobs).

Leader‑epoch handling to avoid data loss during leader changes.

Simplified controller shutdown process.

Better integration with Camus (LinkedIn’s Kafka‑to‑HDFS dump tool).

Resource Isolation

Three clusters were defined:

Log cluster : receives raw event data from all sources. It is high‑availability, does not expose subscription interfaces, and dumps data to HDFS via Camus for offline processing.

Full‑Subscription cluster : mirrors the Log cluster in real time and serves internal analytics and real‑time jobs.

Custom cluster : hosts business‑specific topics that require dedicated isolation or custom retention policies.

Topic‑level isolation is also enforced: high‑volume topics (e.g., server-event and mobile-event) are placed on separate brokers to prevent load skew.

Permission Control & Monitoring

Authentication & Authorization : The platform uses SASL/SCRAM with ACLs. Users are created dynamically; permissions are granted per topic or consumer group, eliminating the previous “bare‑run” state where any client could read/write.

Monitoring Stack : Kafka JMX Metrics: exposed by brokers; 1.1.1 provides rich metrics (bytes‑in/out, request rates, under‑replicated partitions, leader‑epoch, etc.). OpenFalcon: collects JMX metrics via a Falcon‑agent deployed on each broker. Grafana: visualizes metrics for clusters, brokers, topics and consumer groups. Eagle: extracts consumer lag and active status, exposing an API for the alerting system. Radar (internal alert system): defines thresholds, sends alerts to owners via an enterprise WeChat bot.

Application Extensions – Real‑time Subscription Platform

The platform provides an end‑to‑end workflow for data production and consumption:

Business users submit a subscription request (topic, consumer group, access mode) through a ticketing system.

Platform operators review and approve the request.

Upon approval, the system creates a dedicated user account, assigns ACLs, and returns the broker address and credentials.

All granted resources are automatically registered in the Radar monitoring system for lifecycle tracking.

Additional features:

Topic Management : UI‑driven creation, quota assignment, and metadata management without direct SSH access.

Data Replay : supports resetting consumer offsets to any timestamp or offset, enabling both Lambda and Kappa style reprocessing.

Data Sharding : custom routing rules allow merging or splitting topics based on business keys (e.g., app‑code, event‑code).

Future Plans

Exactly‑once processing : combine Kafka’s transaction mechanism with Flink’s two‑phase commit to eliminate data duplication during failure recovery.

Consumer quota & dynamic throttling : use Kafka Quota to protect producers from heavy read‑disk workloads and allow runtime adjustment of thresholds.

SDK & API expansion : expose Kafka client libraries and HTTP APIs for multiple programming languages to broaden subscription and production scenarios.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data data pipeline Kafka security Resource Isolation

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.