
Beike One‑Stop Big Data Development Platform: Architecture, Evolution, and Future Outlook

This article summarizes Beike's one‑stop big data development platform: its data business background, the evolution from a simple Hadoop‑Kafka‑Hive stack to a metadata‑driven, asset‑oriented platform, its current capabilities in data management, integration, scheduling, quality, and openness, and its future plans.

Beike Product & Technology

1. Introduction

In this talk, senior engineer Yang Zongqiang shares the practical experience of building Beike's one‑stop big data development platform, covering four main parts: data business background, platform exploration journey, overall platform overview, and future plans.

2. Beike's Data Business Background

Initially, data acquisition was handled by individual business units. As the company grew, a dedicated big data department was created in 2014, focusing on three data domains: property data (over 200 million records), behavior data (online logs, offline visits), and people data (agents, customers, brand owners). The department follows three principles: cost reduction, efficiency improvement, and standardization.

3. Platform Exploration Journey

3.1 Early Stage

The first platform used a classic Kafka + Sqoop + HDFS + Hive stack, with Oozie for scheduling and MySQL for reporting. Its advantages were open‑source components, mature data‑warehouse design, and easy talent cultivation, but problems soon emerged: slow report development, a growing backlog of ad‑hoc requests, a heavy maintenance burden, and data‑security concerns.

3.2 Platform‑ization Stage

To address early‑stage issues, a platform‑centric solution was introduced, featuring a metadata management system for unified table creation, data access control, and searchable catalogs. A new visual scheduling system replaced Oozie/Airflow, offering drag‑and‑drop configuration, zero‑code ETL components, one‑click data repair, and dual‑engine query (Presto + Spark SQL). This shifted data engineers' focus from cluster ops to building data services and business data products.

Drawbacks included a rapid increase in tables/tasks causing resource pressure and quality control challenges.

3.3 Data Asset Stage

Further enhancements added visual programming tools, automated operations, data standardization, quality control, and extensive data exposure, turning data into valuable assets.

4. Overall Platform Overview

4.1 Data Management

A unified metadata model manages structured, semi‑structured, and unstructured data, linking sources, warehouses, and applications to provide end‑to‑end data lineage and enable internal asset management and external value release.
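The end‑to‑end lineage described above can be pictured as a directed graph linking sources, warehouse tables, and applications. The sketch below is illustrative only (the class and dataset names are hypothetical, not Beike's actual metadata model); it shows how downstream impact analysis falls out of a simple edge store:

```python
from collections import defaultdict, deque

class LineageGraph:
    """Minimal lineage store: nodes are datasets, edges point downstream."""

    def __init__(self):
        self.downstream = defaultdict(set)

    def add_edge(self, source, target):
        self.downstream[source].add(target)

    def trace_downstream(self, node):
        """All datasets transitively derived from `node` (impact analysis)."""
        seen, queue = set(), deque([node])
        while queue:
            for nxt in self.downstream[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

# Hypothetical lineage: source table -> ODS layer -> aggregate -> report
g = LineageGraph()
g.add_edge("mysql.orders", "ods.orders")
g.add_edge("ods.orders", "dws.daily_gmv")
g.add_edge("dws.daily_gmv", "report.gmv_dashboard")
print(sorted(g.trace_downstream("mysql.orders")))
```

With upstream edges stored symmetrically, the same structure answers the reverse question (where did this report's data come from), which is what makes lineage useful for both asset management and incident triage.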

4.2 Data Integration

Early crude Sqoop/Kafka ingestion was replaced by tools like DataX and DataBus, enabling automated MySQL ingestion, change detection, and online approval/notification for major schema changes.
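The change‑detection step might work along these lines: diff the previously registered schema against the current one and route "major" changes (anything that can break downstream jobs) into the approval/notification flow. This is a sketch under assumed semantics, not Beike's actual implementation:

```python
def diff_schemas(old, new):
    """Compare column dicts {name: type}; classify changes for approval routing."""
    added = [c for c in new if c not in old]
    dropped = [c for c in old if c not in new]
    retyped = [c for c in old if c in new and old[c] != new[c]]
    # Dropped or retyped columns can break downstream consumers,
    # so treat them as major changes requiring online approval.
    major = bool(dropped or retyped)
    return {"added": added, "dropped": dropped, "retyped": retyped, "major": major}

old = {"id": "bigint", "name": "varchar"}
new = {"id": "bigint", "name": "text", "city": "varchar"}
print(diff_schemas(old, new))
```

Purely additive changes can then be auto‑applied to the ingestion config, while major ones block until approved.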

4.3 Job Scheduling

The platform offers a complete workflow from table creation through testing to deployment, allowing a data engineer to finish a typical job in about ten minutes. Robustness is ensured through SQL testing, data‑accuracy checks, and a central scheduler that acts as the heart of batch processing.
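At its core, such a central scheduler executes tasks in dependency order and retries failures before giving up. A minimal sketch using Python's standard‑library topological sorter (the DAG shape and retry policy here are assumptions for illustration):

```python
from graphlib import TopologicalSorter

def run_batch(dag, run_task, max_retries=2):
    """Execute tasks in dependency order; retry each task before failing the batch.

    `dag` maps each task to the set of tasks it depends on.
    """
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        for attempt in range(max_retries + 1):
            try:
                run_task(task)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries: fail the batch run
    return order

# Hypothetical three-step pipeline: extract -> aggregate -> report
dag = {"agg": {"extract"}, "report": {"agg"}}
print(run_batch(dag, print))
```

A production scheduler would add parallel execution of independent branches, backfill ("data repair") runs, and alerting, but the dependency‑ordering core is the same.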

4.4 Data Quality

Quality safeguards include monitoring core KPI availability, report correctness, and logic changes, with automated alerts and remediation mechanisms.
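Monitoring of this kind usually reduces to evaluating threshold rules against metrics computed per table per run and emitting alerts on violations. A minimal sketch, with hypothetical metric names and thresholds:

```python
def check_rules(metrics, rules):
    """Evaluate threshold rules over computed table metrics; return alert messages."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        # A missing metric is itself a quality failure (the check job didn't run).
        if value is None or not rule["predicate"](value):
            alerts.append(f"{rule['metric']}: got {value}, expected {rule['desc']}")
    return alerts

rules = [
    {"metric": "row_count", "predicate": lambda v: v > 0, "desc": "> 0"},
    {"metric": "null_rate", "predicate": lambda v: v < 0.01, "desc": "< 1%"},
]
print(check_rules({"row_count": 120_000, "null_rate": 0.03}, rules))
```

The returned alert list feeds the notification and remediation machinery; rules that guard core KPIs can additionally block downstream tasks from consuming a bad partition.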

4.5 Data Openness

Data value is realized through external sharing via two channels: the self‑built Odin platform (similar to Tableau) for ad‑hoc analysis, and a data open platform for downstream systems. Large‑scale APIs serve both consumer‑facing (C‑end) and business‑facing (B‑end) services, handling over 200 million daily calls, with a Hadoop + Hive backend complemented by a data mid‑layer for performance.
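The role a data mid‑layer typically plays at this call volume is a cache‑aside read path: serve precomputed aggregates from a fast key‑value store and fall back to the slow warehouse only on a miss. A minimal sketch (the function names and the dict‑as‑cache are assumptions for illustration, not Beike's actual serving stack):

```python
def get_metric(key, cache, compute_from_warehouse):
    """Cache-aside read: serve hot keys from the mid-layer, fall back to the warehouse."""
    value = cache.get(key)
    if value is None:
        value = compute_from_warehouse(key)  # slow path: batch-layer query
        cache[key] = value                   # populate the mid-layer for next time
    return value
```

In practice the cache would be an external store with TTLs refreshed by the nightly batch, so API latency stays decoupled from Hive query latency.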

5. Future Planning and Outlook

Future work focuses on full data‑asset management with end‑to‑end tracing, comprehensive security (encryption, masking, monitoring), and consolidating reusable components to build an enterprise‑grade big data platform.

Tags: data engineering, big data, data platform, ETL, data governance, metadata management
Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
