How HiSilicon Uses Cloud‑Native Architecture to Build a Multi‑Modal Data Lake
Amid the AI wave, HiSilicon's digital transformation tackles fragmented industrial data with a cloud‑native, open‑source stack centered on Paimon. A unified metadata model, knowledge graph, and elastic scheduling balance performance and cost while powering AI‑ready services across nine business domains.
Overview
HiSilicon built a cloud‑native, multi‑modal data lake platform to handle fragmented industrial data from production lines and device protocols. The platform aims to reduce costs, improve efficiency, and optimize manufacturing processes by providing an "AI‑Ready" data foundation for AI training and downstream intelligent agents.
Key Technical Challenges
Industrial data originates from diverse equipment with many protocols and formats, leading to high fragmentation.
Data is tightly coupled with production‑line devices, making standardization and unified governance difficult.
Business value depends on deep integration of multimodal data for cost reduction, yield improvement, energy saving, and process optimization.
Architecture and Open‑Source Foundations
The solution adopts a cloud‑native stack (Kubernetes, service mesh, declarative APIs) as the base layer. The open‑source lake storage engine Apache Paimon is extended to support large‑scale multimodal data, including:
GPU‑aware scheduling operators for unstructured data processing.
Enhanced caching layer for high‑throughput access.
SDKs and Spark‑like extensions for custom data pipelines.
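To make the caching extension concrete, here is a minimal, illustrative sketch of a read‑through LRU cache sitting in front of lake storage; the class and the in‑memory backend are hypothetical stand‑ins, not Paimon's actual API.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Illustrative LRU read-through cache in front of lake storage
    (a stand-in for the kind of caching layer described above)."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch            # backend read, e.g. an object-store GET
        self.capacity = capacity
        self.store = OrderedDict()    # key -> payload, ordered by recency
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)      # mark as most recently used
            return self.store[key]
        self.misses += 1
        value = self.fetch(key)              # cache miss: read from backend
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used
        return value

# Tiny in-memory "backend" standing in for remote storage.
backend = {"img/001.png": b"...", "img/002.png": b"..."}
cache = ReadThroughCache(backend.__getitem__, capacity=2)
cache.get("img/001.png")
cache.get("img/001.png")   # second read is served from cache
```

Repeated reads of hot unstructured objects (images, sensor dumps) are served from memory, which is where the high‑throughput gain comes from.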
Cloud‑Native Benefits and DataOps
All ETL jobs, data‑processing workflows, and operators are containerized as micro‑services, enabling DataOps practices similar to DevOps. An internally developed intelligent Horizontal Pod Autoscaler (HPA) performs load‑driven elastic scaling, and hybrid‑cloud bursting to public clouds absorbs peak loads without over‑provisioning on‑premises capacity.
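The scaling decision behind such an autoscaler can be sketched with the standard HPA formula (desired = ceil(current × currentMetric / targetMetric), clamped to a replica range); the function below is an illustrative model of that logic, not HiSilicon's implementation.

```python
import math

def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=20):
    """Load-driven scaling decision, modeled on the standard
    Kubernetes HPA formula, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))

desired_replicas(4, 150, 100)   # load 50% over target -> scale out to 6
desired_replicas(4, 40, 100)    # load well under target -> scale in to 2
```

An "intelligent" autoscaler would replace the raw load metric with a predicted one, but the clamp‑and‑scale shape stays the same.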
Metadata Management System
High‑performance storage and retrieval: Choose storage components that efficiently handle both structured and unstructured data and expose extensible query APIs.
Unified metadata model: A single modeling approach for all data types enables consistent governance and simplifies access.
Unified semantic layer: Build a knowledge graph from multimodal metadata and entity relationships to support fuzzy semantic queries that accurately retrieve cross‑modal data.
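The pattern above can be sketched as a single metadata record type shared by all modalities, plus a small entity‑relation graph for cross‑modal lookup. All names here (asset IDs, relation labels) are hypothetical examples.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """One record in a unified metadata model covering all modalities."""
    asset_id: str
    modality: str                         # "table", "image", "timeseries", ...
    tags: set = field(default_factory=set)

class KnowledgeGraph:
    """Minimal entity-relation store supporting cross-modal retrieval."""

    def __init__(self):
        self.assets = {}                  # asset_id -> Asset
        self.edges = []                   # (src_id, relation, dst_id)

    def add(self, asset):
        self.assets[asset.asset_id] = asset

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def related(self, asset_id, relation):
        """Follow one relation type from an asset to its neighbours."""
        return [self.assets[d] for s, r, d in self.edges
                if s == asset_id and r == relation]

kg = KnowledgeGraph()
kg.add(Asset("wafer-lot-42", "table", {"yield"}))
kg.add(Asset("defect-scan-7", "image", {"defect"}))
kg.relate("wafer-lot-42", "inspected_by", "defect-scan-7")
hits = kg.related("wafer-lot-42", "inspected_by")   # the linked image asset
```

A semantic query like "find the inspection images for this wafer lot" then reduces to following typed edges, regardless of where each modality is physically stored.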
AI‑Ready Interface Layer
The platform provides a unified API that abstracts data access for AI frameworks such as TensorFlow and PyTorch, supporting computer‑vision and NLP scenarios. This layer delivers high‑throughput, low‑latency data services for model training.
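The interface contract such a layer exposes can be sketched framework‑agnostically: a dataset facade that streams fixed‑size batches, which PyTorch or TensorFlow adapters would then wrap. `LakeDataset` is a hypothetical name for illustration.

```python
def batched(record_iter, batch_size):
    """Yield fixed-size batches from any record stream -- the shape a
    framework-agnostic data service hands to training-framework adapters."""
    batch = []
    for record in record_iter:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                      # final partial batch

class LakeDataset:
    """Hypothetical facade over the platform's unified data API."""

    def __init__(self, records):
        self.records = records

    def batches(self, batch_size):
        return batched(iter(self.records), batch_size)

ds = LakeDataset(list(range(10)))
sizes = [len(b) for b in ds.batches(4)]   # [4, 4, 2]
```

Keeping the facade framework‑neutral is what lets the same data service feed both computer‑vision and NLP training jobs.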
Performance‑Cost Optimization Techniques
Tiered storage and intelligent compression reduce storage footprint.
Lifecycle management and versioning purge obsolete unstructured data.
Deduplication and the "one data, multiple applications" principle lower storage overhead.
Dynamic resource scaling and task prioritization avoid idle compute resources.
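The deduplication point can be made concrete with a content‑addressed store: identical payloads are kept once and shared by reference, which is one way to realize "one data, multiple applications." This is an illustrative sketch, not the platform's actual storage code.

```python
import hashlib

class DedupStore:
    """Content-addressed store: identical payloads are stored once
    and shared by reference across logical paths."""

    def __init__(self):
        self.blobs = {}    # sha256 digest -> payload (stored once)
        self.refs = {}     # logical path -> digest

    def put(self, path, payload: bytes):
        digest = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(digest, payload)   # no-op if already stored
        self.refs[path] = digest
        return digest

    def get(self, path):
        return self.blobs[self.refs[path]]

store = DedupStore()
store.put("train/a.png", b"pixels")
store.put("eval/a.png", b"pixels")    # same bytes: new reference, no new blob
```

Two logical datasets referencing one physical copy is exactly the storage saving the principle describes.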
Future Directions
The team envisions an AI‑native DataOps ecosystem where large‑model capabilities automate quality monitoring, resource allocation, lineage construction, and proactive insight discovery. This will shift governance from manual R&D to an intelligent, self‑optimizing data ecosystem.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
