
How We Scaled Kafka for Real‑Time Big Data at Mafengwo: Lessons and Practices

This article details Mafengwo's practical experience using Kafka within its big‑data platform, covering application scenarios, evolution through version upgrades, resource isolation, security and monitoring enhancements, and future plans for data duplication handling and consumer throttling.

Mafengwo Technology

Introduction

Kafka is a popular message-queue middleware capable of processing massive volumes of data in real time. Its high throughput, low latency, and reliable asynchronous delivery make it well suited to exchanging data between disparate systems.

At Mafengwo, Kafka supports many core services. The following sections share the application scenarios, challenges encountered at different stages, solutions adopted, and future plans.

Part.1 Application Scenarios

Kafka is used in three main ways within the big‑data platform:

As a database: Provides real-time data storage for business DB data, monitoring logs, client-side logs (H5, WEB, APP, mini-program), and server logs.

As a data source for analytics: Feeds embedded logs to offline data, real-time data warehouses, and analysis systems such as multidimensional queries and real-time Druid OLAP.

For business data subscription: Supplies real-time features, user profiles, anti-fraud, and monitoring alerts to recommendation, traffic, hotel, and content services.

Part.2 Evolution Roadmap

The platform’s Kafka journey consists of four stages:

Version Upgrade: Migrated from the legacy 0.8.3 to 1.1.1 to obtain features such as quotas, security, timestamps, idempotence, transactions, and improved controller shutdown handling.

Resource Isolation: Built multiple physical clusters and isolated topics within clusters to prevent load imbalance and cross-business interference.

Permission Control & Monitoring: Implemented SASL/SCRAM + ACL for authentication and fine-grained authorization; established a unified monitoring and alerting system (named "Radar") using JMX metrics, Open-Falcon, and Grafana.

Application Expansion: Created a real-time subscription platform that automates request, approval, user authorization, and monitoring workflows, forming a closed-loop management solution.

Core Practices

1. Version Upgrade

Older versions lacked security features, suffered from under-replicated partitions during broker failures, missed newer capabilities (transactions, idempotence, message timestamps), and relied heavily on ZooKeeper for offset management. Upgrading to 1.1.1 brought quota control, SASL/SCRAM authentication, ACLs, leader-epoch-based replica recovery, and better operational tooling.
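As an illustration of what the upgrade unlocked, the client-side settings for idempotent and transactional production (available from Kafka 0.11 onward, hence absent in 0.8.3) can be sketched as a plain configuration map. The broker address and transactional ID below are placeholders, not Mafengwo's actual values:

```python
# Illustrative producer settings unlocked by the 0.8.3 -> 1.1.1 upgrade.
# Broker address and transactional.id are placeholders for this sketch.
IDEMPOTENT_PRODUCER_CONFIG = {
    "bootstrap.servers": "kafka-broker:9092",     # placeholder address
    "enable.idempotence": True,                   # dedupes producer retries (0.11+)
    "acks": "all",                                # required when idempotence is on
    "transactional.id": "order-events-producer",  # enables atomic multi-partition writes
}

def validate(config: dict) -> None:
    """Check the invariants Kafka itself enforces: idempotence needs
    acks=all, and transactions need idempotence."""
    if config.get("enable.idempotence") and config.get("acks") != "all":
        raise ValueError("idempotent producers must use acks=all")
    if "transactional.id" in config and not config.get("enable.idempotence"):
        raise ValueError("transactions require enable.idempotence=true")

validate(IDEMPOTENT_PRODUCER_CONFIG)  # passes: the map above is consistent
```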

2. Resource Isolation

Clusters were split by functional domains (Log cluster, Full‑subscription cluster, Customized cluster) and topics were isolated to avoid hotspot brokers. This reduced load skew and improved fault isolation.
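The split by functional domain can be pictured as a routing rule from topic name to physical cluster. The prefixes below are hypothetical, invented for illustration; the article does not specify Mafengwo's actual naming scheme:

```python
# Hypothetical topic-prefix -> cluster routing, mirroring the
# Log / Full-subscription / Customized split described above.
CLUSTER_BY_PREFIX = {
    "log.": "log-cluster",
    "sub.": "full-subscription-cluster",
}
DEFAULT_CLUSTER = "customized-cluster"

def cluster_for_topic(topic: str) -> str:
    """Route a topic to its physical cluster by name prefix."""
    for prefix, cluster in CLUSTER_BY_PREFIX.items():
        if topic.startswith(prefix):
            return cluster
    return DEFAULT_CLUSTER
```

Keeping the mapping explicit like this makes cross-business interference impossible by construction: a hot log topic can never land on the cluster serving customized business subscriptions.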

3. Permission Control

Early clusters ran without authentication, exposing data to any client. The platform now uses SASL/SCRAM + ACL to dynamically create users and enforce fine‑grained access control.
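Conceptually, ACL enforcement is a deny-by-default lookup over (principal, operation, resource) triples. The following is a toy model of that idea, not Kafka's actual authorizer implementation; the principal and topic names are made up:

```python
from typing import List, NamedTuple

class AclRule(NamedTuple):
    principal: str   # e.g. "User:recsys" (hypothetical principal)
    operation: str   # "Read" or "Write"
    topic: str

def is_authorized(rules: List[AclRule], principal: str,
                  operation: str, topic: str) -> bool:
    """Deny by default; allow only if an explicit rule matches exactly."""
    return AclRule(principal, operation, topic) in rules

# Example: a consumer granted read-only access to one topic.
rules = [AclRule("User:recsys", "Read", "user-profile")]
```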

4. Monitoring & Alerting

Metrics are exposed via Kafka JMX, collected by Falcon‑agent, visualized in Grafana, and fed into the Radar alert system. Lag, throughput, and broker health are monitored, with alerts sent through enterprise‑WeChat bots.
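The lag portion of that monitoring reduces to comparing each partition's log-end offset with the consumer group's committed offset, then alerting past a threshold. A minimal sketch of the computation (the threshold value is arbitrary, chosen for illustration):

```python
from typing import Dict

def partition_lag(log_end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag = log-end offset minus the committed consumer
    offset (partitions with no commit count from offset 0)."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

def should_alert(lags: Dict[int, int], threshold: int) -> Dict[int, int]:
    """Return only the partitions whose lag meets or exceeds the threshold."""
    return {p: lag for p, lag in lags.items() if lag >= threshold}
```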

Application Expansion

Real-time Data Subscription Platform: Provides end-to-end management of Kafka production and consumption requests, user authorization, and monitoring.

Standardized Application Process: Users submit subscription tickets, which are approved before credentials and broker addresses are provisioned.

Integrated Monitoring & Alerting: Resources are automatically registered in Radar for lifecycle monitoring.

Data Replay: Supports resetting consumer offsets to arbitrary timestamps or positions, enabling Kappa-style replay.

Topic Management: Offers a UI for creating topics, assigning isolation policies, and managing metadata.

Data Sharding: Allows custom topic creation and cross-source data merging based on business-defined filters.
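Of the capabilities above, data replay hinges on translating a timestamp into an offset, analogous to Kafka's offsetsForTimes lookup. A minimal sketch of that lookup over a sorted (timestamp, offset) index:

```python
import bisect
from typing import List, Optional, Tuple

def offset_for_timestamp(index: List[Tuple[int, int]],
                         target_ms: int) -> Optional[int]:
    """Given an index of (timestamp_ms, offset) pairs sorted by timestamp,
    return the offset of the earliest record whose timestamp is >= target_ms,
    mirroring the semantics of Kafka's offsetsForTimes. Returns None when
    every record is older than the target (no position to resume from)."""
    timestamps = [ts for ts, _ in index]
    i = bisect.bisect_left(timestamps, target_ms)
    return index[i][1] if i < len(index) else None
```

Resetting a consumer group then amounts to looking up each partition's offset for the requested timestamp and committing it before restarting the consumers.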

Part.3 Future Plans

Eliminate Data Duplication: Combine Kafka transactions with Flink's two-phase commit to achieve exactly-once semantics.

Consumer Throttling: Apply Kafka quota mechanisms to limit consumer I/O and dynamically adjust thresholds.

Scenario Expansion: Extend SDKs, HTTP APIs, and other interfaces to support more languages and use cases.
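The consumer-throttling plan can be sketched as a token bucket over bytes read, in the spirit of Kafka's byte-rate quotas. This is a standalone illustration with arbitrary rates, not Kafka's broker-side implementation:

```python
class TokenBucket:
    """Simplified byte-rate limiter: a consumer may read at most `rate`
    bytes per second, with bursts up to `capacity` bytes. The rate can be
    adjusted at runtime, matching the plan to tune thresholds dynamically."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate, bytes per second
        self.capacity = capacity  # maximum burst size, bytes
        self.tokens = capacity    # start full
        self.last = 0.0           # timestamp of the previous call, seconds

    def set_rate(self, rate: float) -> None:
        """Dynamically adjust the throttling threshold."""
        self.rate = rate

    def allow(self, nbytes: float, now: float) -> bool:
        """Refill tokens for the elapsed time, then admit the read only
        if enough tokens remain."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```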

The article concludes with an invitation for feedback and suggestions.

Tags: big data, Kafka, security, resource isolation, data streaming
Written by

Mafengwo Technology

External communication platform of the Mafengwo Technology team, regularly sharing articles on advanced tech practices, tech exchange events, and recruitment.
