How Lianjia Built a Low‑Latency Real‑Time Data Platform with Spark Streaming
This article details Lianjia's journey of designing and implementing a low‑latency, stable real‑time computing platform using Spark Streaming on YARN, covering technical selection, architecture components, version compatibility challenges, exactly‑once semantics, graceful shutdown, Kafka tuning, and future enhancements.
Background
As Lianjia's business rapidly expanded, the need for timely, accurate, and stable data processing grew, prompting the team to build an easy‑to‑use, low‑latency real‑time computing platform with comprehensive monitoring and alerting.
Technical Selection
Among available real‑time engines—Storm/JStorm, Spark Streaming, Flink—the team chose Spark Streaming because it could run on the existing Hadoop YARN cluster, reducing maintenance costs and aligning with the long‑term goal of unifying real‑time and batch processing.
Platform Overview
The platform consists of several key components:
Jobs on Yarn : collection of ETL and other real‑time jobs running on YARN. Monitor : monitoring system providing registration interfaces, tracking job status, message backlog, and issuing real‑time alerts with automatic restart for failed tasks. Script : wrapper for YARN calls and authentication, exposing job start, stop, and status APIs. Proxy : data read/write service acting as a node proxy, handling permission masking and data joins. Config : web‑based configuration system supporting extensible features.
Experience Summary
Cluster Environment
Hadoop 2.7.3
Spark client: spark-1.6.2-bin-hadoop2.6.tgz
Kafka clusters: 0.9.0.1 and 0.10.2
Usage Pattern
Spark Streaming Direct Approach.
Key Challenges and Solutions
Problem 1 – Component Version Compatibility
Spark and Kafka are both Scala‑based, requiring strict version alignment. The team settled on the following configuration:
<scala.version>2.10.5</scala.version>
<scala.binary.version>2.10</scala.binary.version>
<spark.version>1.6.2</spark.version>
<kafka.version>0.8.2.1</kafka.version>Problem 2 – Exactly‑Once Semantics
Ensuring exactly‑once processing involves source, processing, and storage layers. While most storage systems (HDFS, HBase, Redis, ES) support idempotent writes, Spark Streaming itself is a micro‑batch engine and does not provide native exactly‑once guarantees. Checkpointing proved unreliable, so the team implemented a custom offset management strategy:
Record untilOffsets for each batch and store them in ZooKeeper after successful processing.
On restart, merge ZooKeeper offsets with Kafka broker offsets to resume from the correct position.
If a batch fails before storing offsets, duplicate processing may occur; downstream idempotent handling or UUID‑based deduplication is required.
Offsets are saved on the driver side, keyed by timestamp, and can be retrieved in order for later use.
Problem 3 – Graceful Shutdown
Setting spark.streaming.stopGracefullyOnShutdown=true and sending SIGTERM to the ApplicationMaster works but locating the AM PID is cumbersome. The team adopted a flag‑based approach: an external marker is set, a background thread polls the marker, and upon change it calls ssc.stop(stopSparkContext=true, stopGracefully=true) to shut down cleanly.
Problem 4 – Kafka High‑Throughput Tuning
With a topic ingesting >50,000 messages per second, pause‑and‑resume scenarios cause large backlogs. The following tuning parameters were applied: spark.streaming.concurrentJobs=10 – increase job concurrency. spark.streaming.kafka.maxRatePerPartition=2000 – cap per‑partition fetch rate. spark.streaming.kafka.maxRetries=50 – increase retries when fetching offsets.
Application‑level retries: spark.yarn.maxAppAttempts=5 and spark.yarn.am.attemptFailuresValidityInterval=1h. Note that spark.yarn.maxAppAttempts must not exceed the YARN cluster's yarn.resourcemanager.am.max-attempts setting.
Future Outlook
The current platform still lags behind industry leaders like Alibaba and Tencent. Planned improvements include:
Further latency reduction and stability enhancements to meet growing real‑time data demands.
Providing a platform‑as‑a‑service with user‑defined SQL configurations and multi‑engine support (Spark, Flink).
Enhanced monitoring and management with custom metrics and one‑click task control.
Closer collaboration with cluster teams to isolate real‑time node tags, ensuring high availability and better resource utilization.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Beike Product & Technology
As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
