Data Intelligence Expert Interview – Maturity, Trends, and Practices of Data Middle Platforms
The interview gathers insights from data‑platform experts on the maturity stages, technology trends, implementation methodologies, open‑source ecosystems, system architectures, governance, security, and assessment criteria of modern data middle platforms, offering a comprehensive guide for practitioners.
DataFun interviewed senior data‑platform engineers to discuss the current focus, challenges, and future directions of data middle platforms, helping readers grasp key technical priorities and improve their own implementations.
1. Technical Maturity Stages
Mature Phase: Offline and real‑time processing pipelines, dominated by Spark and Flink ecosystems.
Hot Phase: LakeHouse technologies (Iceberg, Hudi, Delta Lake) and OLAP engines such as Kylin, Druid, ClickHouse, Doris.
Growth Phase: Data security and governance, still in early, rule‑based stages.
Forward Phase: Data observability, increasingly combined with machine‑learning capabilities.
2. Implementation Methodology
Fast‑growing companies need higher‑level toolchains and focus on data quality and timeliness.
Companies with saturated data growth shift attention to governance and security.
Large enterprises tend to build custom middle‑platforms and later migrate to their own cloud services; SMEs either self‑build or adopt cloud‑vendor solutions.
3. Open‑Source Ecosystem
Open‑source data tooling has long been led by organizations outside China, but Chinese companies have recently contributed projects such as Apache InLong, Apache SkyWalking, and Apache DolphinScheduler. Many commercial tools are also emerging, though their market maturity still varies.
4. Technical System
Data Integration & Modeling: Emphasis on ETL/Reverse‑ETL, with tools such as Airbyte, Fivetran, and dbt, plus workflow schedulers such as Apache Airflow and DolphinScheduler.
Offline Development: Common stacks include MySQL, MongoDB, Redis, DataX, BitSail, Kafka, RocketMQ, Airflow, Azkaban, Git/SVN for code management, and Apache Griffin for data quality (limited adoption).
Real‑time Development: Flink, Spark Streaming, Storm for compute; Kafka for messaging; ClickHouse, Doris/StarRocks, Druid, HBase, Kudu, data lakes for storage; Impala/Presto for querying.
5. Data System
Real‑time storage options include ClickHouse (suited to single‑table queries), Doris/StarRocks (upsert and multi‑table joins), and Druid (time‑series aggregation). Data lake table formats (Iceberg, Hudi, Delta Lake) provide a table abstraction that supports upsert, partitioning, and schema evolution.
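To make the term "upsert" concrete, here is a minimal sketch in plain Python of the merge semantics: a change whose key already exists updates that row, and a change with a new key is inserted. It illustrates the behavior only and is not how Doris, StarRocks, or any lake table format implements it; the `upsert` function, the `order_id` key, and the sample rows are all hypothetical.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def upsert(table: List[Row], changes: Iterable[Row], key: str) -> List[Row]:
    """Merge `changes` into `table` by primary key: update matches, insert the rest."""
    # Copy each row so the input table is left untouched.
    merged = {row[key]: dict(row) for row in table}
    for change in changes:
        existing = merged.get(change[key], {})
        # Values in the change win; columns absent from the change are preserved.
        merged[change[key]] = {**existing, **change}
    return list(merged.values())


# Hypothetical sample data.
orders = [
    {"order_id": 1, "status": "created", "amount": 30},
    {"order_id": 2, "status": "created", "amount": 45},
]
changelog = [
    {"order_id": 2, "status": "paid"},                   # existing key: update
    {"order_id": 3, "status": "created", "amount": 12},  # new key: insert
]
result = upsert(orders, changelog, key="order_id")
```

Real engines apply the same idea at storage level (for example by merging versions of a row during reads or compaction), which is why upsert support is a deciding factor when choosing real‑time storage.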
6. Service System
BI dashboards and reports.
OLAP ad‑hoc queries (HUE, Zeppelin, Impala, Presto, ClickHouse, Doris).
Data products (A/B testing, user‑profile, DMP, recommendation platforms).
Data‑as‑a‑service APIs built with Spring Boot, backed by MySQL, MongoDB, HBase, Redis.
7. Operation System
The operation system focuses on three areas: data availability (accuracy, completeness, consistency, timeliness), usability (clear data definitions, metadata, data maps, indicator systems), and security (data classification, permission approval, audit trails).
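The four availability dimensions can each be expressed as a simple rule, which matches the rule‑based stage described earlier. The sketch below is one possible reading, not a method given in the interview: the function names, the age range of 0 to 120, the 5‑minute freshness window, and the sample rows are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

Row = Dict[str, object]


def completeness(rows: List[Row], column: str) -> float:
    """Share of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def accuracy(rows: List[Row], column: str, low: float, high: float) -> float:
    """Share of non-null values that fall inside an expected [low, high] range."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if low <= v <= high) / len(values)


def consistent(source_count: int, target_count: int) -> bool:
    """Row counts match between the upstream source and the loaded table."""
    return source_count == target_count


def timely(latest_event: datetime, now: datetime, max_lag: timedelta) -> bool:
    """The newest record is no older than the agreed freshness window."""
    return now - latest_event <= max_lag


# Hypothetical sample: one missing age and one implausible age.
rows = [
    {"user_id": 1, "age": 34, "event_time": datetime(2024, 1, 1, 9, 0)},
    {"user_id": 2, "age": None, "event_time": datetime(2024, 1, 1, 9, 30)},
    {"user_id": 3, "age": 210, "event_time": datetime(2024, 1, 1, 9, 55)},
    {"user_id": 4, "age": 27, "event_time": datetime(2024, 1, 1, 9, 58)},
]
now = datetime(2024, 1, 1, 10, 0)
report = {
    "completeness": completeness(rows, "age"),
    "accuracy": accuracy(rows, "age", 0, 120),
    "consistency": consistent(source_count=4, target_count=len(rows)),
    "timeliness": timely(max(r["event_time"] for r in rows), now, timedelta(minutes=5)),
}
```

In practice such rules would run against the warehouse rather than in‑memory rows, with thresholds agreed per table; the point here is only that each dimension reduces to a measurable check.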
8. Security Management
Data classification and tiered access control.
Permission approval workflows.
Audit logging and compliance with regulations such as the Data Security Law.
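The three items above fit together naturally: classification sets a tier per table, an approval can grant access above a user's default tier, and every decision is written to an audit log. The following is a minimal in‑memory sketch under those assumptions. The `Level` tiers, the `AccessController` class, and the table and user names are hypothetical, and a production system would persist approvals and logs rather than hold them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import Dict, List, Set, Tuple


class Level(IntEnum):
    """Hypothetical classification tiers; a higher value is more restricted."""
    PUBLIC = 1
    INTERNAL = 2
    SENSITIVE = 3


@dataclass
class AccessController:
    table_levels: Dict[str, Level]      # data classification per table
    user_clearance: Dict[str, Level]    # highest tier a user may read by default
    approvals: Set[Tuple[str, str]] = field(default_factory=set)  # (user, table)
    audit_log: List[dict] = field(default_factory=list)

    def _audit(self, actor: str, action: str, table: str, allowed: bool) -> None:
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor, "action": action, "table": table, "allowed": allowed,
        })

    def approve(self, approver: str, user: str, table: str) -> None:
        """Record an approved permission request and who approved it."""
        self.approvals.add((user, table))
        self._audit(approver, f"approve:{user}", table, True)

    def can_read(self, user: str, table: str) -> bool:
        # Unclassified tables default to the most restricted tier.
        level = self.table_levels.get(table, Level.SENSITIVE)
        clearance = self.user_clearance.get(user, Level.PUBLIC)
        allowed = clearance >= level or (user, table) in self.approvals
        self._audit(user, "read", table, allowed)
        return allowed


acl = AccessController(
    table_levels={"dim_city": Level.PUBLIC, "user_profile": Level.SENSITIVE},
    user_clearance={"analyst": Level.INTERNAL},
)
before_approval = acl.can_read("analyst", "user_profile")  # tier too low
acl.approve("data_owner", "analyst", "user_profile")
after_approval = acl.can_read("analyst", "user_profile")   # explicit approval
public_read = acl.can_read("analyst", "dim_city")
```

Logging denied attempts as well as granted ones is a deliberate choice in this sketch, since audits usually need both.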
9. Maturity Assessment
Evaluation criteria include breadth (number of business lines using the platform and variety of services) and depth (extent to which services support business needs, from simple reporting to real‑time strategy optimization and intelligent analytics).
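As a rough illustration of these two axes, the sketch below counts business lines and service types for breadth and takes the most advanced use case for depth. The three‑step `Depth` ladder is an assumption derived from the examples in the text, and the `Usage` records and `assess` function are hypothetical; the interview does not prescribe a scoring formula.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Depth(IntEnum):
    """Assumed ordering of use cases, from shallow to deep."""
    REPORTING = 1
    REALTIME_STRATEGY = 2
    INTELLIGENT_ANALYTICS = 3


@dataclass
class Usage:
    business_line: str
    service: str
    depth: Depth


def assess(usages: List[Usage]) -> Dict[str, object]:
    """Summarize breadth (lines, services) and depth (deepest use case)."""
    lines = {u.business_line for u in usages}
    services = {u.service for u in usages}
    deepest = max((u.depth for u in usages), default=None)
    return {
        "business_lines": len(lines),
        "service_types": len(services),
        "deepest_use": deepest.name if deepest is not None else None,
    }


# Hypothetical usage records.
usages = [
    Usage("ads", "bi_report", Depth.REPORTING),
    Usage("ads", "ab_testing", Depth.REALTIME_STRATEGY),
    Usage("ecommerce", "bi_report", Depth.REPORTING),
    Usage("ecommerce", "data_api", Depth.REPORTING),
]
report = assess(usages)
```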
Overall, the interview provides a detailed roadmap for building, operating, and evolving a data middle platform in today’s rapidly changing big‑data landscape.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.