How Haier Built a Cloud‑Native Multi‑Modal Data Lake for AI‑Ready Manufacturing
Haier’s digital transformation leverages a cloud‑native, open‑source‑based multi‑modal data lake that unifies structured and unstructured industrial data, uses metadata models and knowledge graphs for governance, and provides AI‑ready services that balance performance, cost, and real‑time requirements.
Background
Haier (海尔) needed a data platform that could ingest and govern both internet‑derived user‑behavior data and highly fragmented, protocol‑driven production‑line data. The primary business goals are cost reduction, efficiency improvement, and process optimization.
Strategic Approach
The company adopted an open‑source‑first, cloud‑native strategy, building a next‑generation multi‑modal data lake on top of Kubernetes and the Apache Paimon storage engine.
Key Technical Challenges
Numerous industrial devices produce data in many proprietary protocols and formats, leading to severe fragmentation.
Need to unify structured tables with unstructured assets (images, video, logs) for seamless analytics.
Business‑level KPIs such as yield, energy consumption, and cost must be derived from AI models.
System must deliver real‑time performance and high reliability in a commercial environment.
Solution Architecture
The platform implements a “one‑data‑multiple‑applications” model and an intelligent elastic scheduler that balances performance and cost.
Cloud‑native foundation: Kubernetes orchestrates all services; a service mesh and declarative APIs provide resource pooling, auto‑scaling, and isolation for CPU‑intensive SQL queries and GPU‑accelerated AI workloads.
Metadata & Knowledge Graph: A unified metadata schema and a knowledge graph enable low‑latency cross‑modal queries and semantic search across billions of assets.
AI‑Ready Interface Layer: Standardized API/SDK endpoints expose data to TensorFlow, PyTorch, computer‑vision, and NLP pipelines.
Intelligent Horizontal Pod Autoscaler (HPA): A custom HPA monitors workload characteristics, scales compute pods, and can burst to public‑cloud resources during peak demand.
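The article does not publish Haier's autoscaler internals, but the core decision a custom HPA makes can be sketched as follows. This is a minimal illustration under assumed names: desired replica count follows the standard Kubernetes HPA proportion (current replicas × observed metric ÷ target metric), with a flag signaling when demand exceeds an assumed on‑premises capacity and should burst to public cloud.

```python
# Minimal sketch of a custom-HPA scaling rule; all names and thresholds
# are illustrative assumptions, not Haier's actual controller.
import math

def desired_replicas(current: int, observed: float, target: float,
                     min_replicas: int = 1, max_on_prem: int = 20) -> tuple[int, bool]:
    """Return (replica count, burst_to_public_cloud)."""
    if target <= 0:
        raise ValueError("target metric must be positive")
    # Standard HPA proportion: scale replicas by observed/target utilization.
    want = max(min_replicas, math.ceil(current * observed / target))
    # Burst to public-cloud nodes only once on-prem capacity is exhausted.
    return (want, want > max_on_prem)
```

For example, 10 pods at 1.8× the target utilization would scale to 18 pods on‑premises, while 3.0× the target would request 30 pods and trigger a public‑cloud burst.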
Metadata & Knowledge Graph Design
High‑performance storage & retrieval: The engine stores structured rows and large binary files with scalable indexing (e.g., columnar files for tables, object storage for blobs).
Unified metadata model: All assets share a common identifier and attribute set, allowing uniform governance, lineage tracking, and policy enforcement.
Enhanced semantic layer: Entities, relationships, and provenance are captured in a graph database; fuzzy‑semantic queries can retrieve relevant multimodal data without knowing exact file paths.
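To make the "common identifier and attribute set" idea concrete, here is a toy sketch of a unified metadata record shared by tables and unstructured blobs, plus a tag lookup standing in for the knowledge graph's semantic retrieval. Field names are assumptions for illustration, not Haier's published schema.

```python
# Illustrative unified metadata record for multi-modal assets.
from dataclasses import dataclass, field

@dataclass
class AssetMeta:
    asset_id: str                 # common identifier across modalities
    modality: str                 # "table" | "image" | "video" | "log"
    storage_uri: str              # columnar file or object-store blob
    owner: str
    lineage: list[str] = field(default_factory=list)  # upstream asset_ids
    tags: dict[str, str] = field(default_factory=dict)

def find_by_tag(assets: list[AssetMeta], key: str, value: str) -> list[str]:
    # Toy stand-in for graph-backed semantic search: locate assets by
    # attribute rather than by exact file path.
    return [a.asset_id for a in assets if a.tags.get(key) == value]
```

Because every modality carries the same identifier and attribute set, governance, lineage, and access policies can be enforced by one code path instead of one per format.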
AI‑Ready Data Supply Chain
Data ingestion, ETL, and data‑ops pipelines are containerized micro‑services. The platform provides:
Versioned datasets and lifecycle management to guarantee reproducibility.
Intelligent de‑duplication so that a single physical copy can serve many downstream applications (“one‑data‑multiple‑applications”).
Real‑time streaming ETL for low‑latency model training.
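The de‑duplication point above can be sketched as content‑addressed storage: many logical dataset names reference one physical copy keyed by content hash. This is a minimal illustration of the "one‑data‑multiple‑applications" idea, not Haier's implementation.

```python
# Sketch of content-addressed de-duplication: logical names map to a
# single physical copy keyed by SHA-256 of the content.
import hashlib

class DedupStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}   # content hash -> physical copy
        self._refs: dict[str, str] = {}      # logical name -> content hash

    def put(self, logical_name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # store the bytes at most once
        self._refs[logical_name] = digest
        return digest

    def physical_copies(self) -> int:
        return len(self._blobs)
```

Registering the same bytes under two logical names ("train/v1" and "vision/gold", say) yields two references but a single physical copy.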
Performance‑Cost Optimization
Storage uses tiered policies (hot, warm, cold) and adaptive compression. Compute resources are elastically allocated via the custom HPA and prioritized queues, preventing idle capacity. Automatic data‑lifecycle jobs clean up obsolete versions, reducing total cost of ownership (TCO).
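A hot/warm/cold policy typically keys off access recency. The sketch below uses illustrative thresholds (7 and 90 days); the article does not state Haier's actual cutoffs.

```python
# Sketch of a recency-based storage-tiering rule; thresholds are
# assumptions, not Haier's published policy.
from datetime import datetime, timedelta

def storage_tier(last_access: datetime, now: datetime,
                 hot_days: int = 7, warm_days: int = 90) -> str:
    age = now - last_access
    if age <= timedelta(days=hot_days):
        return "hot"     # fast storage, little or no compression
    if age <= timedelta(days=warm_days):
        return "warm"    # standard object storage, adaptive compression
    return "cold"        # archival tier, aggressive compression
```

A lifecycle job would run such a rule periodically, migrating assets downward as they age and deleting obsolete dataset versions along the way.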
Key Innovations on Paimon
While retaining Paimon’s core table format, Haier extended the engine to:
Support heterogeneous compute pools (CPU for SQL, GPU for deep‑learning operators) through a unified resource abstraction.
Integrate a cache layer optimized for multimodal file metadata.
Provide SDKs for custom operators that can be scheduled on GPU nodes.
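The unified resource abstraction in the first point amounts to a routing decision: SQL and general ETL stay on CPU nodes, while deep‑learning operators go to GPU pools. A toy sketch, with pool names and job fields invented for illustration:

```python
# Sketch of routing jobs across heterogeneous compute pools behind one
# abstraction; "cpu-pool"/"gpu-pool" and the job fields are assumptions.
def schedule(job: dict) -> str:
    """Pick a compute pool for a job described as {'kind': ..., 'needs_gpu': ...}."""
    if job.get("needs_gpu") or job.get("kind") == "deep-learning":
        return "gpu-pool"   # custom operators scheduled on GPU nodes
    return "cpu-pool"       # SQL queries and ETL stay on CPU nodes
```

Keeping this choice inside the scheduler is what lets one table format serve both SQL engines and GPU‑resident deep‑learning operators without per‑workload storage copies.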
Future Directions
The roadmap envisions an AI‑Native DataOps layer where large‑model capabilities automatically monitor data quality, predict lineage impacts, and recommend optimal processing strategies. This will turn the platform into a self‑governing data ecosystem that continuously adapts compute resources and governance policies based on real‑time analytics.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.