How MaFengWo Scales Kafka for Real‑Time Big Data: Lessons and Best Practices
This article details MaFengWo’s practical experience using Kafka across three core scenarios—real‑time storage, analytical data source, and business data subscription—while describing a four‑stage evolution that includes version upgrades, resource isolation, security and monitoring enhancements, and a comprehensive subscription platform, followed by future improvement plans.
Application Scenarios
Kafka is used in three core ways on the platform:
Real‑time storage layer – Kafka stores incoming event streams (business DB data, monitoring logs, client‑side logs, server logs) as the primary source for downstream processing.
Analytical data source – The same streams feed offline warehouses, real‑time Druid OLAP, and ad‑hoc query systems.
Subscription service for downstream systems – Recommendation, anti‑fraud, monitoring, and other business services consume Kafka topics via a subscription model.
Evolution Roadmap
The Kafka migration is divided into four stages.
Stage 1 – Version Upgrade
The cluster was upgraded from the legacy 0.8.3 release to 1.1.1. Intermediate releases were evaluated:
0.9 – introduced quotas and security (SASL/ACL).
0.10 – finer‑grained timestamps for efficient replay.
0.11 – idempotent producers, transactions, and leader‑epoch handling.
Version 1.1.1 was selected because it satisfied the required features (quota, idempotence, transactions, leader‑epoch) and remained compatible with the Camus‑to‑HDFS pipeline used for data dump.
Stage 2 – Resource Isolation
Three physical clusters were created to isolate workloads:
Log cluster – receives raw event data directly from producers; it is not exposed for external subscription and feeds Camus for hourly HDFS dumps.
Full‑subscription cluster – mirrors the Log cluster and provides real‑time subscription for internal analytics jobs.
Custom cluster – hosts business‑specific topics that may be merged or split according to downstream requirements.
Within each cluster, topic‑level isolation is enforced by assigning high‑traffic topics (e.g., server-event, mobile-event) to distinct brokers to avoid load skew.
Stage 3 – Permission Control & Monitoring
Authentication & Authorization
SASL/SCRAM is used for authentication, combined with Kafka ACLs for fine‑grained access control. The solution avoids Kerberos (over‑engineered for the internal network) and SSL (not required inside the private LAN).
Monitoring & Alerting
Metrics are collected via the Kafka JMX interface and exported by falcon‑agent to OpenFalcon. Grafana dashboards visualise broker, topic, and consumer metrics (lag, throughput, under‑replicated partitions, data size). An internal alert engine named Radar evaluates thresholds and pushes notifications to owners through enterprise‑WeChat bots.
Stage 4 – Application Extension
A real‑time subscription platform was built to automate the full lifecycle of Kafka usage:
Ticket‑based request workflow for topic creation and consumer group provisioning.
Automatic user and credential provisioning linked to ACLs.
Unified topic‑management UI for creation, isolation configuration, and metadata editing.
Data replay capability that can reset consumer offsets to any timestamp or offset, supporting both Lambda (batch + stream) and Kappa (pure stream) architectures.
Data sharding and cross‑source merging based on business‑defined filters (e.g., app‑code + event‑code combinations).
Key Practices
Version Upgrade – Migrated to 1.1.1 after feature‑by‑feature comparison of 0.9‑0.11 releases.
Resource Isolation – Deployed three logical clusters and enforced broker‑level topic isolation to prevent noisy‑topic interference.
Permission & Monitoring – Implemented SASL/SCRAM + ACL, Falcon agents, Grafana dashboards, and Radar alerts with WeChat bot notifications.
Application Extension – Delivered an end‑to‑end subscription platform with ticket workflow, UI‑driven topic management, and flexible replay/sharding features.
Future Plans
Combine Kafka transactions with Flink’s two‑phase commit to achieve exactly‑once semantics and eliminate duplicate records.
Introduce consumer‑side quota throttling and dynamic threshold adjustment to protect producer latency.
Expand SDKs and HTTP endpoints to support additional programming languages and use‑cases.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
