Big Data 11 min read

How Lianjia Built a Low‑Latency Real‑Time Data Platform with Spark Streaming

This article details Lianjia's journey of designing and implementing a low‑latency, stable real‑time computing platform using Spark Streaming on YARN, covering technical selection, architecture components, version compatibility challenges, exactly‑once semantics, graceful shutdown, Kafka tuning, and future enhancements.

Beike Product & Technology

Mar 9, 2018

How Lianjia Built a Low‑Latency Real‑Time Data Platform with Spark Streaming

Background

As Lianjia's business rapidly expanded, the need for timely, accurate, and stable data processing grew, prompting the team to build an easy‑to‑use, low‑latency real‑time computing platform with comprehensive monitoring and alerting.

Technical Selection

Among available real‑time engines—Storm/JStorm, Spark Streaming, Flink—the team chose Spark Streaming because it could run on the existing Hadoop YARN cluster, reducing maintenance costs and aligning with the long‑term goal of unifying real‑time and batch processing.

Platform Overview

The platform consists of several key components:

Jobs on Yarn : collection of ETL and other real‑time jobs running on YARN. Monitor : monitoring system providing registration interfaces, tracking job status, message backlog, and issuing real‑time alerts with automatic restart for failed tasks. Script : wrapper for YARN calls and authentication, exposing job start, stop, and status APIs. Proxy : data read/write service acting as a node proxy, handling permission masking and data joins. Config : web‑based configuration system supporting extensible features.

Experience Summary

Cluster Environment

Hadoop 2.7.3

Spark client: spark-1.6.2-bin-hadoop2.6.tgz

Kafka clusters: 0.9.0.1 and 0.10.2

Usage Pattern

Spark Streaming Direct Approach.

Key Challenges and Solutions

Problem 1 – Component Version Compatibility

Spark and Kafka are both Scala‑based, requiring strict version alignment. The team settled on the following configuration:

<scala.version>2.10.5</scala.version>
<scala.binary.version>2.10</scala.binary.version>
<spark.version>1.6.2</spark.version>
<kafka.version>0.8.2.1</kafka.version>

Problem 2 – Exactly‑Once Semantics

Ensuring exactly‑once processing involves source, processing, and storage layers. While most storage systems (HDFS, HBase, Redis, ES) support idempotent writes, Spark Streaming itself is a micro‑batch engine and does not provide native exactly‑once guarantees. Checkpointing proved unreliable, so the team implemented a custom offset management strategy:

Record untilOffsets for each batch and store them in ZooKeeper after successful processing.

On restart, merge ZooKeeper offsets with Kafka broker offsets to resume from the correct position.

If a batch fails before storing offsets, duplicate processing may occur; downstream idempotent handling or UUID‑based deduplication is required.

Offsets are saved on the driver side, keyed by timestamp, and can be retrieved in order for later use.

Problem 3 – Graceful Shutdown

Setting spark.streaming.stopGracefullyOnShutdown=true and sending SIGTERM to the ApplicationMaster works but locating the AM PID is cumbersome. The team adopted a flag‑based approach: an external marker is set, a background thread polls the marker, and upon change it calls ssc.stop(stopSparkContext=true, stopGracefully=true) to shut down cleanly.

Problem 4 – Kafka High‑Throughput Tuning

With a topic ingesting >50,000 messages per second, pause‑and‑resume scenarios cause large backlogs. The following tuning parameters were applied: spark.streaming.concurrentJobs=10 – increase job concurrency. spark.streaming.kafka.maxRatePerPartition=2000 – cap per‑partition fetch rate. spark.streaming.kafka.maxRetries=50 – increase retries when fetching offsets.

Application‑level retries: spark.yarn.maxAppAttempts=5 and spark.yarn.am.attemptFailuresValidityInterval=1h. Note that spark.yarn.maxAppAttempts must not exceed the YARN cluster's yarn.resourcemanager.am.max-attempts setting.

Future Outlook

The current platform still lags behind industry leaders like Alibaba and Tencent. Planned improvements include:

Further latency reduction and stability enhancements to meet growing real‑time data demands.

Providing a platform‑as‑a‑service with user‑defined SQL configurations and multi‑engine support (Spark, Flink).

Enhanced monitoring and management with custom metrics and one‑click task control.

Closer collaboration with cluster teams to isolate real‑time node tags, ensuring high availability and better resource utilization.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Real-time Processing Kafka YARN Spark Streaming Exactly-once

Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.