Big Data 16 min read

How MaFengWo Scales Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo’s practical experience using Kafka across three core scenarios—real‑time storage, analytical data source, and business data subscription—while describing a four‑stage evolution that includes version upgrades, resource isolation, security and monitoring enhancements, and a comprehensive subscription platform, followed by future improvement plans.

ITPUB
ITPUB
ITPUB
How MaFengWo Scales Kafka for Real‑Time Big Data: Lessons and Best Practices

Application Scenarios

Kafka is used in three core ways on the platform:

Real‑time storage layer – Kafka stores incoming event streams (business DB data, monitoring logs, client‑side logs, server logs) as the primary source for downstream processing.

Analytical data source – The same streams feed offline warehouses, real‑time Druid OLAP, and ad‑hoc query systems.

Subscription service for downstream systems – Recommendation, anti‑fraud, monitoring, and other business services consume Kafka topics via a subscription model.

Evolution Roadmap

The Kafka migration is divided into four stages.

Stage 1 – Version Upgrade

The cluster was upgraded from the legacy 0.8.3 release to 1.1.1. Intermediate releases were evaluated:

0.9 – introduced quotas and security (SASL/ACL).

0.10 – finer‑grained timestamps for efficient replay.

0.11 – idempotent producers, transactions, and leader‑epoch handling.

Version 1.1.1 was selected because it satisfied the required features (quota, idempotence, transactions, leader‑epoch) and remained compatible with the Camus‑to‑HDFS pipeline used for data dump.

Stage 2 – Resource Isolation

Three physical clusters were created to isolate workloads:

Log cluster – receives raw event data directly from producers; it is not exposed for external subscription and feeds Camus for hourly HDFS dumps.

Full‑subscription cluster – mirrors the Log cluster and provides real‑time subscription for internal analytics jobs.

Custom cluster – hosts business‑specific topics that may be merged or split according to downstream requirements.

Within each cluster, topic‑level isolation is enforced by assigning high‑traffic topics (e.g., server-event, mobile-event) to distinct brokers to avoid load skew.

Stage 3 – Permission Control & Monitoring

Authentication & Authorization

SASL/SCRAM is used for authentication, combined with Kafka ACLs for fine‑grained access control. The solution avoids Kerberos (over‑engineered for the internal network) and SSL (not required inside the private LAN).

Monitoring & Alerting

Metrics are collected via the Kafka JMX interface and exported by falcon‑agent to OpenFalcon. Grafana dashboards visualise broker, topic, and consumer metrics (lag, throughput, under‑replicated partitions, data size). An internal alert engine named Radar evaluates thresholds and pushes notifications to owners through enterprise‑WeChat bots.

Stage 4 – Application Extension

A real‑time subscription platform was built to automate the full lifecycle of Kafka usage:

Ticket‑based request workflow for topic creation and consumer group provisioning.

Automatic user and credential provisioning linked to ACLs.

Unified topic‑management UI for creation, isolation configuration, and metadata editing.

Data replay capability that can reset consumer offsets to any timestamp or offset, supporting both Lambda (batch + stream) and Kappa (pure stream) architectures.

Data sharding and cross‑source merging based on business‑defined filters (e.g., app‑code + event‑code combinations).

Key Practices

Version Upgrade – Migrated to 1.1.1 after feature‑by‑feature comparison of 0.9‑0.11 releases.

Resource Isolation – Deployed three logical clusters and enforced broker‑level topic isolation to prevent noisy‑topic interference.

Permission & Monitoring – Implemented SASL/SCRAM + ACL, Falcon agents, Grafana dashboards, and Radar alerts with WeChat bot notifications.

Application Extension – Delivered an end‑to‑end subscription platform with ticket workflow, UI‑driven topic management, and flexible replay/sharding features.

Future Plans

Combine Kafka transactions with Flink’s two‑phase commit to achieve exactly‑once semantics and eliminate duplicate records.

Introduce consumer‑side quota throttling and dynamic threshold adjustment to protect producer latency.

Expand SDKs and HTTP endpoints to support additional programming languages and use‑cases.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataReal-time ProcessingKafkaResource IsolationData Replay
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.