Design and Evolution of Lianjia's Big Data Platform: Architecture, Challenges, and Solutions
This article details Lianjia's journey from a Hadoop‑based 0.0 data platform to a sophisticated 2.0 architecture, describing the three‑layer design, OLAP engine choices, transparent compression techniques, operational challenges, and practical recommendations for building and maintaining large‑scale big data systems.
Zhao Guoxian, leader of the Lianjia big data architecture team, introduces the evolution of the company's data platform from its initial Hadoop‑centric 0.0 version to the current 2.0 architecture, highlighting the need for performance optimization, distributed storage, and real‑time processing.
The platform is organized into three layers: the cluster layer (Hadoop, YARN, Spark, Presto, HBase, Oozie) provides distributed storage, resource scheduling, and compute engines; the tool‑chain layer features a self‑developed scheduler, metadata management (Meta), and an intelligent query engine that selects the most suitable engine (Presto, SparkSQL, Hive) based on SQL analysis; the API layer abstracts data access for internal analytics, business services, and generic consumption.
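The engine-selection step in the tool-chain layer can be sketched as a small rule-based router. The following is a minimal illustration only, assuming routing on the SQL text and an estimated input size; the function name, thresholds, and heuristics are hypothetical, not Lianjia's actual implementation.

```python
def choose_engine(sql: str, input_rows: int) -> str:
    """Pick an execution engine for a query (illustrative heuristics).

    Assumed rules:
      - writes and very large batch jobs -> Hive (disk-based, degrades gracefully)
      - medium analytical scans         -> SparkSQL (in-memory execution)
      - small interactive aggregations  -> Presto (low-latency MPP)
    """
    s = sql.strip().lower()
    # ETL-style statements and huge scans are safest on Hive.
    if s.startswith(("insert", "create table")) or input_rows > 10_000_000_000:
        return "hive"
    # Mid-sized analytical queries benefit from Spark's in-memory model.
    if input_rows > 100_000_000:
        return "sparksql"
    # Everything else gets Presto's interactive latency.
    return "presto"
```

A real router would also consult metadata (partition sizes, engine health) rather than a single row-count estimate, but the shape of the decision is the same.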
Key challenges included a tightly coupled architecture, long development cycles driven by ad-hoc data requests, frequent failures in Hive/SQL jobs, and the risk of big-data engineers being reduced to mere data-retrieval clerks. To address these, Lianjia introduced a unified scheduling system, dependency visualization, and a middleware layer that routes each query to the optimal engine.
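At the core of any unified scheduler is dependency resolution over a job DAG. As a minimal sketch (using Kahn's topological-sort algorithm; the function and job names are illustrative, not the self-developed scheduler's API):

```python
from collections import deque

def schedule_order(jobs, deps):
    """Return a run order for jobs, given deps as (upstream, downstream) pairs.

    A minimal Kahn's-algorithm sketch of the dependency resolution a unified
    scheduler performs; raises on cycles so a bad DAG fails fast instead of
    hanging downstream jobs.
    """
    indegree = {j: 0 for j in jobs}
    children = {j: [] for j in jobs}
    for up, down in deps:
        children[up].append(down)
        indegree[down] += 1
    ready = deque(j for j in jobs if indegree[j] == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(jobs):
        raise ValueError("cycle detected in job dependencies")
    return order
```

The same adjacency structure that drives scheduling can feed the dependency-visualization front end, which is one reason to centralize DAG metadata in one system.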
For OLAP processing, the article compares ROLAP (real‑time aggregation on raw data) and MOLAP (pre‑computed cubes). After evaluating options, the team selected Apache Kylin for its high concurrency and sub‑second query performance on billions of rows, complemented by Druid for real‑time ingestion and a hybrid OLAP approach that routes queries to the appropriate engine.
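The hybrid routing idea can be summarized in a few lines. This is a simplified sketch under stated assumptions: a query whose dimensions are covered by a pre-built Kylin cube is served MOLAP-style, real-time queries go to Druid, and anything else falls back to ROLAP aggregation on raw data; the function signature is hypothetical.

```python
def route_olap(query_dims, needs_realtime, cubes):
    """Route an OLAP query in a hybrid MOLAP/ROLAP setup (illustrative).

    cubes: list of dimension sets that have been pre-computed as Kylin cubes.
      - real-time freshness requirements go to Druid (streaming ingestion);
      - a cube covering all queried dimensions answers from Kylin, sub-second;
      - otherwise, aggregate raw data on demand with Presto (ROLAP).
    """
    if needs_realtime:
        return "druid"
    if any(set(query_dims) <= cube for cube in cubes):
        return "kylin"
    return "presto"
```

The trade-off this encodes is the classic MOLAP/ROLAP one: pre-computation buys concurrency and latency at the cost of cube-build time and storage, while on-demand aggregation stays flexible but slower.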
To curb rapid data growth and storage costs, a transparent compression strategy was implemented. Cold data is migrated from HDFS to a ZFS file system using gzip compression, while hot data remains on SSD or traditional disks. The solution includes hot‑cold tiering, ZFS features (ARC/L2ARC), and a migration workflow that periodically moves identified cold data.
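The periodic migration step reduces, in essence, to "find files untouched for N days and move them onto the compressed tier." A simplified local-filesystem sketch follows; the 90-day threshold is an assumption, and a real pipeline would move HDFS data rather than local files, with compression handled transparently by the target ZFS dataset (e.g. one configured with gzip compression) instead of by this script.

```python
import os
import shutil
import time

COLD_AFTER = 90 * 24 * 3600  # assumed threshold: not accessed for 90 days

def migrate_cold(src_dir, zfs_dir, now=None):
    """Move files whose last access is older than COLD_AFTER seconds
    from src_dir (hot tier) to zfs_dir (gzip-compressed ZFS mount).

    Returns the list of migrated file names. Illustrative only: error
    handling, atomicity, and HDFS specifics are omitted.
    """
    now = time.time() if now is None else now
    moved = []
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        # st_atime is the last-access timestamp; stale files are "cold".
        if os.path.isfile(path) and now - os.stat(path).st_atime > COLD_AFTER:
            shutil.move(path, os.path.join(zfs_dir, name))
            moved.append(name)
    return moved
```

Because the compression lives in the filesystem layer, readers that follow data onto the cold tier see ordinary files; that transparency is what lets the tiering stay invisible to upstream jobs.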
Future work aims to offload compression to hardware accelerators (QAT), combine erasure coding with compression for higher reliability, and implement intelligent hot‑data warming using SSD caches. The article concludes with practical advice: perform thorough requirement analysis and technology selection, maintain stable iterative development, prioritize monitoring, and continuously optimize online performance.
Beike Product & Technology
As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.