Introducing the Star River Big Data Development Platform: Architecture, Core Capabilities, and Future Plans
This article presents an in‑depth overview of 58.com’s self‑built Star River big data platform, covering its evolution across three eras, resource management hierarchy, core technical capabilities such as metadata services, data maps and lineage, governance practices, and the roadmap for further enhancements.
1. Platform Overview The Star River platform is a one‑stop big data solution integrating data integration, development, operation, governance, and asset management, aimed at improving development efficiency, reducing operational costs, and empowering business decisions.
2. Evolution Stages The platform has progressed through three eras: 1.0 (basic exploration with Hive and MySQL), 2.0 (core scheduling, diversified data sources, and a custom metadata service), and 3.0 (a full‑link closed loop with comprehensive data lineage, data maps, and governance).
3. Core Capabilities The platform emphasizes unified data standards, fine‑grained security (row/column level permissions), rich data‑exchange tasks, comprehensive data governance, and global data integration across many sources.
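To make the idea of row/column level permissions concrete, here is a minimal sketch in Python. It is not Star River's implementation: the `Policy` class, the `apply_policy` function, and the sample `ORDERS` data are all hypothetical names chosen for illustration. The sketch assumes a policy is a set of visible columns plus a predicate over rows.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: the columns a user may see and a row predicate."""
    allowed_columns: frozenset[str]
    row_filter: Callable[[Row], bool]


def apply_policy(rows: list[Row], policy: Policy) -> list[Row]:
    """Drop rows the policy rejects, then project to the allowed columns."""
    visible = []
    for row in rows:
        # The row predicate sees the full row, so it may test hidden columns.
        if not policy.row_filter(row):
            continue
        visible.append(
            {col: val for col, val in row.items() if col in policy.allowed_columns}
        )
    return visible


# Illustrative data only.
ORDERS = [
    {"order_id": 1, "city": "Beijing", "phone": "555-0101", "amount": 120},
    {"order_id": 2, "city": "Shanghai", "phone": "555-0102", "amount": 80},
]

beijing_analyst = Policy(
    allowed_columns=frozenset({"order_id", "amount"}),
    row_filter=lambda row: row["city"] == "Beijing",
)
print(apply_policy(ORDERS, beijing_analyst))
```

In a real platform this check would more likely be pushed down into the query engine than applied to result rows in application code; the sketch only shows the two dimensions (rows and columns) being restricted independently.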
4. Architecture The platform uses a layered architecture: an engine layer at the bottom, a data‑development layer above it, and a monitoring layer that spans the entire data lifecycle.
5. Resource Management Resources are organized into three tiers (top‑level business groups, first‑level and second‑level organizations) with flexible allocation and 1‑to‑N relationships to support stable task execution and hand‑over.
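The three-tier hierarchy can be sketched as a small tree. This is an assumption-laden illustration, not the platform's data model: `OrgNode`, `Task`, `hand_over`, and the queue names are hypothetical. The sketch assumes that each organization can hold several resource queues (the 1‑to‑N relationship) and that a task is bound to an organization's queue rather than to a person, so changing the owner does not disturb execution.

```python
from dataclasses import dataclass, field


@dataclass
class OrgNode:
    """Hypothetical node: tier 0 = business group, 1 = first-level org, 2 = second-level org."""
    name: str
    tier: int = 0
    queues: list[str] = field(default_factory=list)   # one org, N resource queues
    children: list["OrgNode"] = field(default_factory=list)

    def add_child(self, name: str, queues: tuple[str, ...] = ()) -> "OrgNode":
        if self.tier >= 2:
            raise ValueError("the hierarchy has only three tiers")
        child = OrgNode(name=name, tier=self.tier + 1, queues=list(queues))
        self.children.append(child)
        return child


@dataclass
class Task:
    """Hypothetical task bound to an organization's queue, with a human owner."""
    name: str
    owner: str
    org: OrgNode
    queue: str


def hand_over(task: Task, new_owner: str) -> None:
    """Change only the owner; the org and queue binding stay untouched."""
    task.owner = new_owner


group = OrgNode("business-group-a", queues=["group_a_shared"])
dept = group.add_child("data-dept", queues=["dept_etl", "dept_adhoc"])
team = dept.add_child("warehouse-team", queues=["team_daily"])

task = Task("daily_orders", owner="alice", org=team, queue="team_daily")
hand_over(task, "bob")
print(task.owner, task.queue)
```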
6. Technical Analysis The platform addresses five technical pain points: implementing data standards, managing permissions across organizations, barriers to data search, inconsistent metadata, and lineage accuracy. The solutions are a unified metadata protocol, a metadata service architecture, and support for more than twenty data‑source types (MySQL, Hive, HBase, SQL Server, Oracle, Doris, ClickHouse, Redis, and others).
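One way to picture a unified metadata protocol is as a common record that every source-specific description is normalized into. The sketch below is an assumption, not the platform's actual protocol: `TableMeta`, `ColumnMeta`, `to_table_meta`, the `UNIFIED_TYPES` mapping, and the `source://database/table` naming scheme are all hypothetical, and only two of the twenty-plus source types are covered.

```python
import re
from dataclasses import dataclass

# Hypothetical mapping from native base types to a small unified type set.
UNIFIED_TYPES = {
    "mysql": {"varchar": "string", "text": "string", "int": "int",
              "bigint": "long", "datetime": "timestamp"},
    "hive": {"string": "string", "int": "int",
             "bigint": "long", "timestamp": "timestamp"},
}


@dataclass(frozen=True)
class ColumnMeta:
    name: str
    unified_type: str
    native_type: str
    comment: str = ""


@dataclass(frozen=True)
class TableMeta:
    source_type: str
    database: str
    table: str
    columns: tuple[ColumnMeta, ...]

    @property
    def qualified_name(self) -> str:
        # One naming scheme for every source keeps search and lineage consistent.
        return f"{self.source_type}://{self.database}/{self.table}"


def normalize_type(source_type: str, native_type: str) -> str:
    """Map e.g. 'varchar(64)' to 'string'; unmapped types become 'unknown'."""
    base = re.match(r"[a-z]+", native_type.strip().lower())
    mapping = UNIFIED_TYPES[source_type]
    if base is None or base.group() not in mapping:
        return "unknown"
    return mapping[base.group()]


def to_table_meta(source_type: str, database: str, table: str,
                  raw_columns: list[tuple[str, str, str]]) -> TableMeta:
    """Build a unified record from (name, native_type, comment) triples."""
    columns = tuple(
        ColumnMeta(name, normalize_type(source_type, native), native, comment)
        for name, native, comment in raw_columns
    )
    return TableMeta(source_type, database, table, columns)


orders = to_table_meta("mysql", "shop", "orders",
                       [("id", "BIGINT", "primary key"), ("title", "varchar(64)", "")])
print(orders.qualified_name, [c.unified_type for c in orders.columns])
```

Keeping the native type alongside the unified one is a deliberate choice in this sketch: downstream tools can compare columns across sources without losing the detail needed to write back to the original system.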
7. Data Map & Lineage The data map provides searchable metadata (tables, fields, descriptions, domains) via Elasticsearch, while the lineage system tracks dependencies and aims to improve coverage and timeliness.
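Dependency tracking in a lineage system is commonly modeled as a directed graph over tables. The following is a minimal sketch under that assumption; the `LineageGraph` class and the table names are hypothetical and say nothing about how Star River extracts or stores its lineage.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical table-level lineage: an edge means 'source feeds target'."""

    def __init__(self) -> None:
        self._downstream: dict[str, set[str]] = defaultdict(set)
        self._upstream: dict[str, set[str]] = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start: str, edges: dict[str, set[str]]) -> set[str]:
        """Breadth-first traversal returning every node reachable from start."""
        seen: set[str] = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # only relevant if the graph contains a cycle
        return seen

    def downstream(self, table: str) -> set[str]:
        """Tables affected if `table` changes (impact analysis)."""
        return self._walk(table, self._downstream)

    def upstream(self, table: str) -> set[str]:
        """Tables that `table` is derived from (root-cause tracing)."""
        return self._walk(table, self._upstream)


graph = LineageGraph()
graph.add_edge("ods.orders", "dwd.orders")
graph.add_edge("dwd.orders", "dws.orders_daily")
graph.add_edge("dim.city", "dws.orders_daily")

print(sorted(graph.downstream("ods.orders")))
print(sorted(graph.upstream("dws.orders_daily")))
```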
8. Data Governance Governance is driven by metadata, offering rule configuration, asset‑level governance, and a feedback loop that reduces costs and enhances data security.
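Metadata-driven rule configuration can be illustrated with a small rule evaluator. This is a sketch under stated assumptions: the `TableStats` fields, the two rules (`missing-owner`, `cold-table`), and the 90-day default threshold are hypothetical examples, not rules taken from the platform.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TableStats:
    """Hypothetical slice of table metadata that governance rules inspect."""
    name: str
    owner: str
    days_since_last_access: int
    size_gb: float


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    violates: Callable[[TableStats], bool]


def build_rules(max_idle_days: int = 90) -> list[Rule]:
    """Rules are configured by parameters rather than hard-coded thresholds."""
    return [
        Rule("missing-owner", "table has no owner",
             lambda t: not t.owner),
        Rule("cold-table", f"not accessed for more than {max_idle_days} days",
             lambda t: t.days_since_last_access > max_idle_days),
    ]


def evaluate(tables: list[TableStats], rules: list[Rule]) -> dict[str, list[str]]:
    """Return {table name: [violated rule ids]}, omitting clean tables."""
    findings: dict[str, list[str]] = {}
    for table in tables:
        hit = [rule.rule_id for rule in rules if rule.violates(table)]
        if hit:
            findings[table.name] = hit
    return findings


tables = [
    TableStats("dwd.orders", "alice", 1, 320.0),
    TableStats("tmp.backup_2019", "", 400, 950.0),
]
print(evaluate(tables, build_rules()))
```

The findings would then feed the feedback loop the section describes, for example by notifying owners or archiving cold tables to reduce storage cost.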
9. Future Plans Upcoming work includes flexible governance configuration, full‑link lineage coverage for all task types, dedicated data‑quality inspection, enhanced data‑service query strategies, and broader real‑time data integration.
10. Q&A Highlights Answers address configurable governance rules, real‑time scheduling capabilities, lineage success rates, and the mix of offline and real‑time data ingestion.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
