Beike's Data Development Platform: Evolution, Architecture, and Future Outlook
In this talk, Beike senior engineer Yang Zongqiang traces the evolution of the company's data development platform: the background, three architecture upgrades, the platform's capabilities (metadata management, data integration, job scheduling, data quality, and data open access), and future directions for building an enterprise‑grade big‑data system.
01 Background
Initially, Beike's data volume was small and business teams handled their own data needs. As the business grew, data requirements became more complex, and in 2014 the company established a big‑data department to research and build data solutions, focusing on three kinds of data: property, user, and behavior.
Property data: a property dictionary built up since 2008, now holding over 200 million property records.
User data: buyers, tenants, owners, agents, and later brand and renovation personnel.
Behavior data: online browsing and offline viewing activities.
Design principles: cost reduction, efficiency improvement, and standardization.
02 Exploration Journey
The first stage (2014) used the Hadoop ecosystem (Hadoop, Hive, Sqoop) with a layered data‑warehouse model (ingestion, warehouse, reporting). This approach suffered from inflexible custom development, simplistic scheduling (Zeus + Python + Shell), and data‑security issues.
The subsequent move to a platform introduced a data‑management platform and an ad‑hoc query platform, integrating metadata management, data quality, security, and a unified scheduling system.
Advantages: resolved warehouse bottlenecks, enabled business‑driven data product development, and provided fast, visualized troubleshooting.
Remaining challenges: increased task load, resource contention, and difficulty controlling data‑development quality.
03 Platform Overview
Data Management
Unified metadata model covering relational, non‑relational, log, and semi‑structured data.
End‑to‑end data lineage and asset management.
Capability to expose data assets via data maps and APIs.
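End‑to‑end lineage is commonly modeled as a directed graph from upstream tables to the tables derived from them. The following is a minimal sketch of that idea, not Beike's implementation: the `lineage` dictionary, the table names, and the `downstream_tables` function are all hypothetical. It shows how such a graph answers the question "which assets are affected if this table changes?"

```python
from collections import deque


def downstream_tables(lineage, start):
    """Return every table reachable downstream of `start`, sorted by name.

    lineage: dict mapping a table name to the list of tables built from it
             (hypothetical structure, assumed for illustration).
    """
    seen = set()
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            # Skip tables already visited so cycles cannot loop forever.
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

A breadth‑first walk is used here because it visits each table once; the same graph could equally be walked in the upstream direction to trace where a report's data came from.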
Data Integration
Supports MySQL, Oracle, SQL Server, TiDB, MongoDB, Kafka, etc., achieving >99% coverage of business data ingestion with configurable, automated pipelines, including data migration and split scenarios.
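One way to picture "configurable, automated pipelines" is a declarative task description from which the extraction query is generated, so that adding a new source table needs configuration rather than custom code. The sketch below rests on assumptions: `IngestionTask`, its fields, and `build_extract_query` are hypothetical names, and the `%s` placeholder follows the style used by common MySQL drivers (other drivers use different placeholders).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionTask:
    """Hypothetical declarative description of one ingestion job."""
    source_type: str              # e.g. "mysql", "tidb"
    source_table: str
    target_table: str
    columns: tuple
    incremental_column: str = ""  # empty string means full extraction


def build_extract_query(task, last_value=None):
    """Build (sql, params) for a full or incremental extraction.

    The incremental bound is passed as a bound parameter instead of being
    formatted into the SQL string.
    """
    sql = f"SELECT {', '.join(task.columns)} FROM {task.source_table}"
    if task.incremental_column and last_value is not None:
        return f"{sql} WHERE {task.incremental_column} > %s", (last_value,)
    return sql, ()
```

With this shape, a full load and an incremental load differ only in whether `last_value` is supplied, which keeps the per‑table configuration small.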
Job Scheduling
Provides visual workflow configuration, dependency management, alerting, and scheduling algorithms to prioritize critical jobs and reduce mean‑time‑to‑recovery from hours to minutes.
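The talk does not describe the scheduling algorithm itself. As an illustration of how dependency management and job prioritization can be combined, here is a minimal sketch that assumes a topological ordering (Kahn's algorithm) with a priority queue over the jobs that are ready to run. The function name, the `dependencies` and `priority` structures, and the job names in the usage note are all hypothetical.

```python
import heapq


def schedule_order(dependencies, priority):
    """Return a run order that respects dependencies and, among jobs that
    are ready at the same time, runs the highest-priority job first.

    dependencies: dict mapping job -> list of upstream jobs it waits for.
    priority: dict mapping job -> int (larger means more critical).
    """
    jobs = set(dependencies)
    for upstreams in dependencies.values():
        jobs.update(upstreams)

    remaining = {job: 0 for job in jobs}    # unfinished upstream count
    downstream = {job: [] for job in jobs}  # reverse edges
    for job, upstreams in dependencies.items():
        for up in upstreams:
            remaining[job] += 1
            downstream[up].append(job)

    # heapq is a min-heap, so priorities are negated; ties break by job name.
    ready = [(-priority.get(job, 0), job) for job in jobs if remaining[job] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, job = heapq.heappop(ready)
        order.append(job)
        for nxt in downstream[job]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                heapq.heappush(ready, (-priority.get(nxt, 0), nxt))

    if len(order) != len(jobs):
        raise ValueError("dependency cycle detected")
    return order
```

Given a critical chain `ods_orders -> dw_orders -> report_core` and a low‑priority chain `ods_logs -> report_misc`, this ordering finishes the critical chain before starting the low‑priority one. A production scheduler would also account for concurrency limits and resource quotas, which this sketch leaves out.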
Data Quality
Implements SQL syntax validation, execution‑plan analysis, runtime monitoring, timeliness checks, and accuracy verification to ensure reliable data delivery.
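Two of the checks named above, timeliness and accuracy, can be illustrated with very small rules. The sketch below is an assumption about what such rules might look like, not the platform's actual checks: timeliness is modeled as "the job finished before its deadline", and accuracy as "the target row count is within a tolerance of the source row count". All function and parameter names are hypothetical.

```python
from datetime import datetime


def check_timeliness(finished_at, deadline):
    """True if the job finished at or before its agreed deadline."""
    return finished_at <= deadline


def check_row_count(source_rows, target_rows, tolerance=0.0):
    """True if the target row count is within `tolerance` (a fraction,
    e.g. 0.01 for 1%) of the source row count."""
    if source_rows == 0:
        return target_rows == 0
    return abs(source_rows - target_rows) / source_rows <= tolerance


def failed_checks(results):
    """Given a dict of check name -> bool, return the failed names, sorted."""
    return sorted(name for name, passed in results.items() if not passed)
```

Collecting results by name makes it straightforward to attach the failing check to an alert, so whoever receives it can see whether the data was late or wrong.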
Data Open‑Access
Offers self‑service Ad‑hoc queries, BI visualizations, API‑based data services, and change‑notification mechanisms to deliver data to downstream applications and users.
04 Summary & Outlook
Asset‑based data management and full‑link tracking enhance data value.
Encryption, masking, and sensitive‑data monitoring protect data throughout its lifecycle.
Standardized components form a reusable enterprise‑grade big‑data platform.
Future work includes integrating IDE capabilities, advanced data‑governance, and AI‑driven management to further improve development efficiency and security.
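The masking mentioned in the summary can be illustrated with a small example. This sketch assumes a simple rule of keeping the first and last few characters of a phone number and hiding the rest. The function name and the default lengths are hypothetical, and the talk does not say which masking rules the platform applies.

```python
def mask_phone(phone, keep_prefix=3, keep_suffix=4):
    """Mask the middle of a phone number, keeping a short prefix and suffix.

    Values too short to mask safely are hidden entirely.
    """
    digits = str(phone)
    if len(digits) <= keep_prefix + keep_suffix:
        return "*" * len(digits)
    hidden = len(digits) - keep_prefix - keep_suffix
    suffix = digits[len(digits) - keep_suffix:]
    return digits[:keep_prefix] + "*" * hidden + suffix
```

For example, `mask_phone("13812345678")` returns `"138****5678"`, which lets a user recognize the record without exposing the full number.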
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
