Inside Xinghuan Tech’s Next‑Gen Big Data 3.0 Architecture: Unified, Cloud‑Native, Real‑Time
This article details Xinghuan Technology’s evolution from 2013 to the present, describing its self‑developed Big Data 3.0 stack—including a unified data platform, SQL‑centric development, cloud‑native resource scheduling, distributed storage managed by Raft, DAG‑based compute engines, and real‑time stream processing—while highlighting key milestones and design principles that differentiate it from traditional Hadoop‑based solutions.
Overview
Since its founding in 2013, Xinghuan Technology has focused on integrating big‑data foundational technologies with enterprise data services, creating a series of world‑class breakthroughs tailored to China’s complex data‑application scenarios.
Big Data 3.0 Technology Stack
To meet new data‑business demands and resolve legacy issues, Xinghuan redesigned its stack into a highly unified platform that addresses the four V’s of big data (volume, velocity, variety, and veracity), enabling a value chain that runs from data persistence through to the application ecosystem.
Design Considerations & Overall Architecture
Unified data platform: replaces mixed architectures (data lake, warehouse, marts, search) with a one‑stop solution that eliminates data redundancy and cross‑system latency.
SQL as a unified interface: leverages the mature, widely adopted language to support data warehouses, OLTP, search, and spatio‑temporal databases, reducing development difficulty.
Cloud‑native deployment: uses containers and Kubernetes to provide elastic, on‑demand resources across CPU, GPU, network, and storage.
Data‑business integration: creates a unified data warehouse, model marketplace, and application market to support both data‑centric and application‑centric workflows.
Layered Architecture
Resource Scheduling Layer
Built on Kubernetes, this layer manages configuration, physical resource pools, distributed storage, and cloud networking, enabling precise scheduling of big‑data, AI, and database workloads.
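As a rough illustration of what "precise scheduling" of mixed big‑data, AI, and database workloads means at this layer, here is a minimal, hypothetical sketch of resource‑aware placement in Python. The names (`Node`, `Workload`, `schedule`) and the greedy first‑fit strategy are illustrative assumptions, not Xinghuan's actual scheduler, which builds on Kubernetes.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu: int     # free CPU cores
    gpu: int     # free GPUs
    mem_gb: int  # free memory in GB

@dataclass
class Workload:
    name: str
    cpu: int
    gpu: int
    mem_gb: int

def schedule(workloads, nodes):
    """Greedy first-fit placement: the simplest form of the
    resource-aware bin-packing a Kubernetes-style layer performs.
    GPU-hungry workloads are placed first."""
    placement = {}
    for w in sorted(workloads, key=lambda w: (w.gpu, w.cpu), reverse=True):
        for n in nodes:
            if n.cpu >= w.cpu and n.gpu >= w.gpu and n.mem_gb >= w.mem_gb:
                n.cpu -= w.cpu
                n.gpu -= w.gpu
                n.mem_gb -= w.mem_gb
                placement[w.name] = n.name
                break
        else:
            placement[w.name] = None  # unschedulable right now
    return placement

nodes = [Node("node-a", cpu=8, gpu=0, mem_gb=32),
         Node("node-b", cpu=16, gpu=2, mem_gb=64)]
jobs = [Workload("ai-train", cpu=8, gpu=2, mem_gb=32),
        Workload("sql-batch", cpu=4, gpu=0, mem_gb=16)]
print(schedule(jobs, nodes))  # ai-train lands on the GPU node
```

A real scheduler also accounts for affinity, priorities, and preemption; the point of the sketch is only that heterogeneous resources (CPU, GPU, memory) are matched per workload.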
Unified Storage Management Layer
Abstracts common storage functions (consistency, MVCC, transaction, metadata, partitioning, fault‑tolerance) behind a Raft‑based control plane, allowing plug‑in storage engines to become highly available distributed systems.
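The key idea here is that a storage engine only needs to implement a local state machine, and the Raft‑based control plane handles replication and ordering. The following is a toy Python sketch of that separation; the class names (`StorageEngine`, `KVEngine`, `RaftGroup`) are invented for illustration, and real Raft additionally involves leader election, terms, and quorum acknowledgements.

```python
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """Plug-in state machine: the engine implements only local apply/read;
    replication, ordering, and fault tolerance come from the control plane."""
    @abstractmethod
    def apply(self, command): ...
    @abstractmethod
    def read(self, key): ...

class KVEngine(StorageEngine):
    """A trivial key-value engine standing in for graph/GIS/columnar engines."""
    def __init__(self):
        self._data = {}
    def apply(self, command):
        op, key, value = command
        if op == "put":
            self._data[key] = value
    def read(self, key):
        return self._data.get(key)

class RaftGroup:
    """Toy stand-in for a Raft replication group: commands enter a shared
    ordered log and are applied to every replica's engine in the same order."""
    def __init__(self, engines):
        self.log = []
        self.engines = engines
    def propose(self, command):
        self.log.append(command)   # leader appends; majority would ack here
        for e in self.engines:     # apply at the commit index
            e.apply(command)

replicas = [KVEngine() for _ in range(3)]
group = RaftGroup(replicas)
group.propose(("put", "table:orders", "partition-map-v1"))
assert all(e.read("table:orders") == "partition-map-v1" for e in replicas)
```

Because consistency lives in the replication layer, a new specialized engine becomes highly available by implementing the state-machine interface rather than its own consensus.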
Distributed Block Storage Layer
Provides unified block storage with strong consistency guarantees via Raft, supporting various specialized engines (graph, GIS, high‑dimensional features) without reinventing core mechanisms.
Compute Engine Layer
Adopts a DAG‑based execution model with vectorized, batch‑at‑a‑time processing, delivering superior scalability and performance for batch, interactive, and real‑time workloads.
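To make the DAG execution model concrete, here is a minimal sketch in which operators form a directed acyclic graph and pass data as column batches rather than row by row. The operator names (`Scan`, `Filter`, `SumAgg`) are generic illustrations, not the engine's actual API.

```python
class Op:
    """A DAG node: pulls column batches from its inputs and emits batches."""
    def __init__(self, *inputs):
        self.inputs = inputs
    def batches(self):
        raise NotImplementedError

class Scan(Op):
    """Leaf of the DAG: yields pre-materialized column batches."""
    def __init__(self, column_batches):
        super().__init__()
        self._batches = column_batches
    def batches(self):
        yield from self._batches

class Filter(Op):
    """Batch-at-a-time filter: processes a whole batch per call,
    amortizing per-tuple overhead (the point of vectorized execution)."""
    def __init__(self, child, predicate):
        super().__init__(child)
        self.predicate = predicate
    def batches(self):
        for batch in self.inputs[0].batches():
            yield [v for v in batch if self.predicate(v)]

class SumAgg(Op):
    """Terminal aggregation: folds all upstream batches into one value."""
    def __init__(self, child):
        super().__init__(child)
    def batches(self):
        yield [sum(sum(b) for b in self.inputs[0].batches())]

# Scan -> Filter -> SumAgg, driven as a pull-based pipeline.
plan = SumAgg(Filter(Scan([[1, 5, 9], [2, 8]]), lambda v: v > 4))
result = next(iter(plan.batches()))[0]
# 5 + 9 + 8 = 22
```

A production engine additionally pipelines independent DAG branches in parallel and operates on packed column buffers instead of Python lists, but the dataflow shape is the same.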
Development Interface Layer
Offers a SQL compiler, optimizer suite, and distributed transaction unit, enabling developers to work with a familiar SQL interface while the system handles warehouses, OLTP, search, and graph queries.
RBO (Rule‑Based Optimizer): Hundreds of expert rules for IO reduction (filter push‑down, partition pruning, etc.).
ISO (Inter‑SQL Optimizer): Merges similar SQL statements inside stored procedures into a single DAG for parallel execution.
MBO (Materialize‑Based Optimizer): Leverages materialized views or cubes to reduce computation.
CBO (Cost‑Based Optimizer): Chooses plans based on estimated IO, network, and compute costs; ML‑driven cost estimation is planned.
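Filter push‑down, the classic RBO rule mentioned above, can be sketched on a toy logical plan. Everything here is an illustrative assumption: plans are plain dicts, and the `side` field marks which join input a predicate references (a real optimizer derives this from column lineage).

```python
def push_down_filter(plan):
    """One classic rule-based rewrite:
        Filter(Join(L, R)) -> Join(Filter(L), R)
    when the predicate only touches the left input (symmetric for right).
    Fewer rows then cross the join, reducing IO and network cost."""
    if plan["op"] == "filter" and plan["child"]["op"] == "join":
        join = plan["child"]
        side = plan["side"]  # which join input the predicate references
        pushed = {"op": "filter", "pred": plan["pred"], "side": side,
                  "child": join[side]}
        new_join = dict(join)
        new_join[side] = pushed
        return new_join
    return plan  # rule does not apply; leave the plan unchanged

plan = {"op": "filter", "pred": "orders.amount > 100", "side": "left",
        "child": {"op": "join",
                  "left":  {"op": "scan", "table": "orders"},
                  "right": {"op": "scan", "table": "users"}}}
optimized = push_down_filter(plan)
# The filter now sits directly above the 'orders' scan, below the join.
```

A real RBO applies hundreds of such rewrites repeatedly until a fixpoint, but each rule has this same shape: match a plan pattern, emit a cheaper equivalent.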
Real‑Time Stream Processing
Designed a low‑latency (<5 ms) stream engine with a custom StreamSQL extension, a CEP engine for complex event patterns, a rule engine for business logic, and an in‑memory distributed cache for fast metric storage. The StreamSQL example below flags robot arms that reach location A but fail to reach location B within one minute, writing the misses into a result table:
USE APPLICATION cep_example;
CREATE STREAM robotarm_2(armid STRING, location STRING) tblproperties(
"topic"="arm_t2",
"kafka.ZooKeeper"="localhost:2181",
"kafka.broker.list"="localhost:9092"
);
CREATE TABLE coords_miss(armid STRING, location STRING);
INSERT INTO coords_miss
SELECT e1.armid, e1.location
FROM PATTERN(
e1=robotarm_2[e1.location='A'] NOTNEXT
e2=robotarm_2[e2.armid=e1.armid AND e2.location='B'] ) WITHIN ('1' minute);

Historical Milestones
2015: First Hadoop‑based distributed analytical DB supporting full SQL, stored procedures, and distributed transactions; launched low‑latency (<5 ms) stream engine with StreamSQL.
2017: Early adoption of Docker & Kubernetes for cloud‑native big‑data services, predating Cloudera’s similar effort.
2018: Released trillion‑scale distributed graph DB and flash‑based columnar analytical DB using Raft and custom storage, boosting interactive analysis performance.
Conclusion
Xinghuan Technology will continue to enrich this architecture with new storage and compute capabilities, machine‑learning‑driven data governance, and data‑service publishing, aiming to bridge the gap between data and business and unlock greater value from big data.
StarRing Big Data Open Lab
Focused on big data technology research, exploring the Big Data era | [email protected]
