How HiSilicon Uses Cloud‑Native Architecture to Build a Multi‑Modal Data Lake
Amid the AI wave, HiSilicon's digital transformation tackles fragmented industrial data with a cloud‑native, open‑source stack centered on Paimon. A unified metadata model, knowledge graph, and elastic scheduling balance performance and cost while powering AI‑ready services across nine business domains.
Overview
HiSilicon built a cloud‑native, multi‑modal data lake platform to handle fragmented industrial data from production lines and device protocols. The platform aims to reduce costs, improve efficiency, and optimize manufacturing processes by providing an "AI‑Ready" data foundation for AI training and downstream intelligent agents.
Key Technical Challenges
Industrial data originates from diverse equipment with many protocols and formats, leading to high fragmentation.
Data is tightly coupled with production‑line devices, making standardization and unified governance difficult.
Business value depends on deep integration of multimodal data for cost reduction, yield improvement, energy saving, and process optimization.
Architecture and Open‑Source Foundations
The solution adopts a cloud‑native stack (Kubernetes, service mesh, declarative APIs) as the base layer. The open‑source lake storage engine Apache Paimon is extended to support large‑scale multimodal data, including:
GPU‑aware scheduling operators for unstructured data processing.
Enhanced caching layer for high‑throughput access.
SDKs and Spark‑like extensions for custom data pipelines.
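To make the caching extension concrete, here is a minimal, illustrative sketch of a read‑through LRU cache sitting in front of lake storage; the class and the in‑memory backend are hypothetical stand‑ins, not Paimon's actual API.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Illustrative LRU read-through cache in front of lake storage
    (a stand-in for the kind of caching layer described above)."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch            # backend read, e.g. an object-store GET
        self.capacity = capacity
        self.store = OrderedDict()    # key -> payload, ordered by recency
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)      # mark as most recently used
            return self.store[key]
        self.misses += 1
        value = self.fetch(key)              # cache miss: read from backend
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used
        return value

# Tiny in-memory "backend" standing in for remote storage.
backend = {"img/001.png": b"...", "img/002.png": b"..."}
cache = ReadThroughCache(backend.__getitem__, capacity=2)
cache.get("img/001.png")
cache.get("img/001.png")   # second read is served from cache
```

Repeated reads of hot unstructured objects (images, sensor dumps) are served from memory, which is where the high‑throughput gain comes from.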
Cloud‑Native Benefits and DataOps
All ETL jobs, data‑processing workflows, and operators are containerized as micro‑services, enabling DataOps practices similar to DevOps. An internally developed intelligent Horizontal Pod Autoscaler (HPA) performs load‑driven elastic scaling, and hybrid‑cloud bursting to public clouds absorbs peak loads without over‑provisioning on‑premises capacity.
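The scaling decision behind such an autoscaler can be sketched with the standard HPA formula (desired = ceil(current × currentMetric / targetMetric), clamped to a replica range); the function below is an illustrative model of that logic, not HiSilicon's implementation.

```python
import math

def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=20):
    """Load-driven scaling decision, modeled on the standard
    Kubernetes HPA formula, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))

desired_replicas(4, 150, 100)   # load 50% over target -> scale out to 6
desired_replicas(4, 40, 100)    # load well under target -> scale in to 2
```

An "intelligent" autoscaler would replace the raw load metric with a predicted one, but the clamp‑and‑scale shape stays the same.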
Metadata Management System
High‑performance storage and retrieval: Choose storage components that efficiently handle both structured and unstructured data and expose extensible query APIs.
Unified metadata model: A single modeling approach for all data types enables consistent governance and simplifies access.
Unified semantic layer: Build a knowledge graph from multimodal metadata and entity relationships to support fuzzy semantic queries that accurately retrieve cross‑modal data.
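The pattern above can be sketched as a single metadata record type shared by all modalities, plus a small entity‑relation graph for cross‑modal lookup. All names here (asset IDs, relation labels) are hypothetical examples.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """One record in a unified metadata model covering all modalities."""
    asset_id: str
    modality: str                         # "table", "image", "timeseries", ...
    tags: set = field(default_factory=set)

class KnowledgeGraph:
    """Minimal entity-relation store supporting cross-modal retrieval."""

    def __init__(self):
        self.assets = {}                  # asset_id -> Asset
        self.edges = []                   # (src_id, relation, dst_id)

    def add(self, asset):
        self.assets[asset.asset_id] = asset

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def related(self, asset_id, relation):
        """Follow one relation type from an asset to its neighbours."""
        return [self.assets[d] for s, r, d in self.edges
                if s == asset_id and r == relation]

kg = KnowledgeGraph()
kg.add(Asset("wafer-lot-42", "table", {"yield"}))
kg.add(Asset("defect-scan-7", "image", {"defect"}))
kg.relate("wafer-lot-42", "inspected_by", "defect-scan-7")
hits = kg.related("wafer-lot-42", "inspected_by")   # the linked image asset
```

A semantic query like "find the inspection images for this wafer lot" then reduces to following typed edges, regardless of where each modality is physically stored.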
AI‑Ready Interface Layer
The platform provides a unified API that abstracts data access for AI frameworks such as TensorFlow and PyTorch, supporting computer‑vision and NLP scenarios. This layer delivers high‑throughput, low‑latency data services for model training.
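The interface contract such a layer exposes can be sketched framework‑agnostically: a dataset facade that streams fixed‑size batches, which PyTorch or TensorFlow adapters would then wrap. `LakeDataset` is a hypothetical name for illustration.

```python
def batched(record_iter, batch_size):
    """Yield fixed-size batches from any record stream -- the shape a
    framework-agnostic data service hands to training-framework adapters."""
    batch = []
    for record in record_iter:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                      # final partial batch

class LakeDataset:
    """Hypothetical facade over the platform's unified data API."""

    def __init__(self, records):
        self.records = records

    def batches(self, batch_size):
        return batched(iter(self.records), batch_size)

ds = LakeDataset(list(range(10)))
sizes = [len(b) for b in ds.batches(4)]   # [4, 4, 2]
```

Keeping the facade framework‑neutral is what lets the same data service feed both computer‑vision and NLP training jobs.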
Performance‑Cost Optimization Techniques
Tiered storage and intelligent compression reduce storage footprint.
Lifecycle management and versioning purge obsolete unstructured data.
Deduplication and the "one data, multiple applications" principle lower storage overhead.
Dynamic resource scaling and task prioritization avoid idle compute resources.
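The deduplication point can be made concrete with a content‑addressed store: identical payloads are kept once and shared by reference, which is one way to realize "one data, multiple applications." This is an illustrative sketch, not the platform's actual storage code.

```python
import hashlib

class DedupStore:
    """Content-addressed store: identical payloads are stored once
    and shared by reference across logical paths."""

    def __init__(self):
        self.blobs = {}    # sha256 digest -> payload (stored once)
        self.refs = {}     # logical path -> digest

    def put(self, path, payload: bytes):
        digest = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(digest, payload)   # no-op if already stored
        self.refs[path] = digest
        return digest

    def get(self, path):
        return self.blobs[self.refs[path]]

store = DedupStore()
store.put("train/a.png", b"pixels")
store.put("eval/a.png", b"pixels")    # same bytes: new reference, no new blob
```

Two logical datasets referencing one physical copy is exactly the storage saving the principle describes.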
Future Directions
The team envisions an AI‑native DataOps ecosystem where large‑model capabilities automate quality monitoring, resource allocation, lineage construction, and proactive insight discovery. This will shift governance from manual R&D to an intelligent, self‑optimizing data ecosystem.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
