How Huolala Built a Cloud‑Native One‑Stop AI Platform on Kubernetes

Huolala’s Big Data Intelligent Platform team describes how they built a cloud‑native, one‑stop AI solution on Kubernetes, integrating Flink‑based feature engineering, a multi‑tenant Zeppelin notebook, GPU‑aware training, and a unified model‑serving platform, while addressing resource isolation, storage persistence, and cross‑cloud deployment.

Huolala Tech

Mar 23, 2023

Background

Data engineering, model training, and online serving are the three pillars of putting machine learning into production. The workflow involves massive data volumes, heavy computation, diverse training frameworks, and complex dependencies, and managing resources and compute by hand slows AI adoption. Since 2020, Huolala’s Big Data Intelligent Platform team has been building a complete cloud‑native, one‑stop AI solution. The team built a Kubernetes cluster, provided on‑demand GPU training, created isolated resource pools for other departments, and added monitoring, alerting, container log collection, and object storage (OBS, S3) so that AI workloads run stably on a cloud‑native foundation.

Overall Framework

The platform covers the end‑to‑end AI pipeline: data ingestion, feature engineering, analysis, visualization, notebook development, model training, and online deployment. It has three main parts: a feature platform that runs Flink jobs on Kubernetes, a multi‑engine notebook for development and training, and an online inference service that registers and publishes models. All big‑data storage and compute components can be packaged into container images to avoid host‑level dependencies.

Feature Platform

The feature platform serves data scientists, data engineers, and ML engineers, solving problems of scattered storage, duplicate features, complex extraction, long pipelines, and difficult usage. It performs fast ETL from HBase, Hive, and relational ODS layers to destinations such as Elasticsearch, Redis, HBase, and Hive, while managing metadata and providing a unified data export for model testing, training, inference, and other applications.

Stream‑Batch Integration

K8s Task Flow

The core of the feature platform uses Flink for ETL, scheduled on a Kubernetes cluster. PyFlink implements the job logic, moving data from sources to sinks with custom Redis and optimized Elasticsearch connectors. CronJob objects create periodic Flink tasks, and an API allows external triggers. Logging is handled by an EFK stack at the platform level.
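As a rough illustration of the CronJob-based scheduling described above, the sketch below builds a `batch/v1` CronJob manifest as a plain Python dict. The function name, image, namespace, and script path are hypothetical, and launching the job with `flink run --python` is an assumption about how the container might start PyFlink, not the team's documented setup.

```python
import json


def build_flink_cronjob(name, schedule, script_path,
                        image="example.registry/pyflink-job:latest",  # hypothetical image
                        namespace="feature-platform"):                # hypothetical namespace
    """Return a batch/v1 CronJob manifest that starts a PyFlink job on a schedule."""
    # Simplified check: only plain 5-field cron expressions are accepted here.
    if len(schedule.split()) != 5:
        raise ValueError("schedule must be a 5-field cron expression")
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": schedule,
            # Skip a run if the previous one is still in progress.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 1,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{
                                "name": "pyflink",
                                "image": image,
                                "command": ["flink", "run", "--python", script_path],
                            }],
                        }
                    },
                }
            },
        },
    }


manifest = build_flink_cronjob("hive-to-redis-etl", "0 2 * * *",
                               "/opt/jobs/hive_to_redis.py")
print(json.dumps(manifest, indent=2))
```

The resulting dict can be serialized to JSON or YAML and submitted through whichever Kubernetes client the platform API uses.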

Interactive Modeling

The notebook is a customized Apache Zeppelin deployment built for data analysts and AI developers. It supports multiple engines (Spark, Python, JDBC, Markdown, Shell, Beeline, Hive SQL, TensorFlow). Deploying Zeppelin on Kubernetes addresses scalability, isolation, and dependency issues. Each user gets an independent Zeppelin server in a dedicated namespace, with persistent storage via NFS or S3 and an init container that copies demo notebooks.

Zeppelin service lifecycle: the platform uses the Java K8s API to create the Namespace, ConfigMap, Service, RBAC, PV, PVC, and Deployment objects, mounts NFS or S3 for persistence, and runs an init container to copy user notebooks.
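The per-user object set can be sketched as follows. The article uses the Java K8s API; this sketch swaps in plain Python dicts to show only the shape of the objects, and covers a subset (Namespace, PVC, Deployment, Service). The naming scheme, the demo image, the `/notebook` mount path, and the storage size are all assumptions.

```python
def zeppelin_manifests(user,
                       image="apache/zeppelin:0.10.1",
                       demo_image="example.registry/zeppelin-demos:latest"):  # hypothetical
    """Return the objects for one user's Zeppelin server, in creation order."""
    ns = f"zeppelin-{user}"  # hypothetical naming scheme: one namespace per user
    labels = {"app": "zeppelin-server", "user": user}
    notebook_mount = {"name": "notebooks", "mountPath": "/notebook"}

    namespace = {"apiVersion": "v1", "kind": "Namespace",
                 "metadata": {"name": ns}}
    pvc = {
        "apiVersion": "v1", "kind": "PersistentVolumeClaim",
        "metadata": {"name": "notebooks", "namespace": ns},
        # ReadWriteMany so the server and its compute pods can share the volume.
        "spec": {"accessModes": ["ReadWriteMany"],
                 "resources": {"requests": {"storage": "10Gi"}}},
    }
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": "zeppelin-server", "namespace": ns},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Runs to completion before Zeppelin starts.
                    "initContainers": [{
                        "name": "copy-notebooks",
                        "image": demo_image,
                        "command": ["sh", "-c", "cp -r /demos/. /notebook/"],
                        "volumeMounts": [notebook_mount],
                    }],
                    "containers": [{
                        "name": "zeppelin",
                        "image": image,
                        "env": [{"name": "ZEPPELIN_NOTEBOOK_DIR",
                                 "value": "/notebook"}],
                        "ports": [{"containerPort": 8080}],
                        "volumeMounts": [notebook_mount],
                    }],
                    "volumes": [{
                        "name": "notebooks",
                        "persistentVolumeClaim": {"claimName": "notebooks"},
                    }],
                },
            },
        },
    }
    service = {
        "apiVersion": "v1", "kind": "Service",
        "metadata": {"name": "zeppelin-server", "namespace": ns},
        "spec": {"selector": labels,
                 "ports": [{"port": 8080, "targetPort": 8080}]},
    }
    return [namespace, pvc, deployment, service]


for obj in zeppelin_manifests("alice"):
    print(obj["kind"], obj["metadata"]["name"])
```

Returning the objects in creation order keeps the caller simple: the Namespace must exist before anything namespaced is created in it.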

Access routing: each user’s namespace gets a unique Zeppelin server and URL, proxied through an Nginx NodePort service that maps URLs to the correct namespace.
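One way to picture the URL-to-namespace mapping is a generated Nginx `location` block per user. This is a minimal sketch: the `/zeppelin/<user>/` URL layout, the Service name, and the namespace naming scheme are assumptions, and the WebSocket upgrade headers are included because Zeppelin's UI relies on WebSockets.

```python
import re


def nginx_location(user, base_path="/zeppelin", port=8080):
    """Render an Nginx location block that proxies one user's URL prefix
    to the Zeppelin Service inside that user's namespace."""
    # Restrict user names so they are safe in both a URL and a namespace name.
    if not re.fullmatch(r"[a-z0-9]+", user):
        raise ValueError("user must be lowercase alphanumeric")
    # Cluster-internal DNS name: <service>.<namespace>.svc.cluster.local
    upstream = f"zeppelin-server.zeppelin-{user}.svc.cluster.local:{port}"
    return (
        f"location {base_path}/{user}/ {{\n"
        f"    proxy_pass http://{upstream}/;\n"
        "    proxy_http_version 1.1;\n"
        "    proxy_set_header Upgrade $http_upgrade;\n"
        "    proxy_set_header Connection \"upgrade\";\n"
        "}\n"
    )


print(nginx_location("alice"))
```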

Compute framework support: Livy bridges Spark jobs to the existing YARN cluster, and JDBC extensions enable Hive SQL and OLAP engines.

Compute‑Storage Integration

Zeppelin bundles required big‑data clients, DNS configuration, authentication, and dependency handling, allowing users to directly access production Hive and HDFS without manual setup.

Resource Isolation and Recycling

CPU, GPU, and memory in the Kubernetes cluster are finite. A ResourceQuota is applied per namespace, and nodes are labeled to separate shared notebook resources from dedicated ones. Users can request specific resource groups, and an approval step gates high‑priority workloads. Idle notebooks are reclaimed by monitoring inactivity and terminating their pods after a configurable idle period.
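The two mechanisms above can be sketched in a few lines: a per-namespace ResourceQuota manifest and an idle check over last-activity timestamps. The quota name, the two-hour default, and the way activity timestamps are collected are assumptions; the article does not specify them.

```python
from datetime import datetime, timedelta, timezone


def resource_quota(namespace, cpu, memory, gpus):
    """Return a ResourceQuota manifest capping CPU, memory, and GPU requests."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "notebook-quota", "namespace": namespace},
        "spec": {"hard": {
            "requests.cpu": str(cpu),
            "requests.memory": memory,
            "limits.cpu": str(cpu),
            "limits.memory": memory,
            # Extended resources such as GPUs are quota'd via the requests. prefix.
            "requests.nvidia.com/gpu": str(gpus),
        }},
    }


def idle_namespaces(last_activity, now, idle_after=timedelta(hours=2)):
    """Return namespaces whose last recorded activity is older than idle_after."""
    return sorted(ns for ns, seen in last_activity.items()
                  if now - seen > idle_after)


now = datetime(2023, 3, 23, 12, 0, tzinfo=timezone.utc)
activity = {
    "zeppelin-alice": now - timedelta(minutes=10),
    "zeppelin-bob": now - timedelta(hours=5),
}
print(resource_quota("zeppelin-alice", cpu=8, memory="32Gi", gpus=1))
print(idle_namespaces(activity, now))
```

A reclaim loop would then scale down or delete the pods in each returned namespace; that step is omitted because it depends on the client library in use.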

Model Training GPU Management

In heterogeneous clusters with mixed cloud‑provider and on‑premise machines, GPU resources are grouped into separate clusters and regions. A dedicated GPU scheduling layer selects appropriate GPU nodes from resource groups. When a GPU task starts, the system binds the GPU to the notebook, powers on the GPU instance via an ops API, and shuts it down when idle or manually stopped.
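The article does not describe the selection policy, so the following is only one plausible sketch of how a scheduling layer might pick a node from a resource group: prefer nodes that are already powered on, then the tightest fit. The `GpuNode` fields and the policy itself are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuNode:
    name: str
    group: str        # resource group the node belongs to
    free_gpus: int
    powered_on: bool  # False means the instance must be started via the ops API


def pick_gpu_node(nodes: List[GpuNode], group: str,
                  gpus_needed: int) -> Optional[GpuNode]:
    """Pick a node in the group with enough free GPUs, or None if none fits."""
    candidates = [n for n in nodes
                  if n.group == group and n.free_gpus >= gpus_needed]
    if not candidates:
        return None
    # Sort key: running nodes first, then least leftover capacity, then name
    # so that the result is deterministic.
    return min(candidates,
               key=lambda n: (not n.powered_on,
                              n.free_gpus - gpus_needed,
                              n.name))


nodes = [
    GpuNode("gpu-a", "training", free_gpus=4, powered_on=False),
    GpuNode("gpu-b", "training", free_gpus=2, powered_on=True),
    GpuNode("gpu-c", "training", free_gpus=8, powered_on=True),
]
chosen = pick_gpu_node(nodes, "training", 2)
print(chosen.name if chosen else "no capacity")
```

If the chosen node is powered off, the caller would request a power-on through the ops API before binding the GPU to the notebook.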

Data Persistence and Multi‑Framework Sharing

To avoid losing notebook files when containers restart, distributed storage is mounted into each Zeppelin server and its compute pods. A DaemonSet runs an s3fs pod on every node that mounts object storage to a host directory; user pods then mount per‑user subpaths of that directory, which gives both isolation between users and shared access across a user’s pods.
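On the consuming side, the per-user mount can be sketched as a `hostPath` volume plus a `subPath` volume mount. The host directory, the `users/<name>` layout, and the mount path are assumptions; the s3fs DaemonSet that populates the host directory is not shown.

```python
import re


def user_storage_mounts(user, host_dir="/mnt/s3fs", mount_path="/notebook"):
    """Return (volume, volumeMount) dicts giving a pod access to one user's
    sub-directory of the node-level object-storage mount."""
    # Reject anything that could escape the per-user directory.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", user):
        raise ValueError("invalid user name")
    volume = {
        "name": "shared-object-store",
        # Host directory where the node's s3fs pod exposes the bucket.
        "hostPath": {"path": host_dir, "type": "Directory"},
    }
    mount = {
        "name": "shared-object-store",
        "mountPath": mount_path,
        # Only this user's sub-directory is visible inside the container.
        "subPath": f"users/{user}",
    }
    return volume, mount


volume, mount = user_storage_mounts("alice")
print(volume)
print(mount)
```

Using the same volume and `subPath` in the Zeppelin server pod and in its compute pods is what lets them see the same files.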

Notebook Task Driver

Notebook scripts can be scheduled via CronJob objects. The platform exposes an API for creating these tasks, allowing external systems to trigger notebook execution at defined times, thus adding periodic and externally driven capabilities to the notebook environment.

Model Online Inference

Online inference is the final stage of the AI workflow. While small models can be embedded directly in backend services, large and frequently updated models require a dedicated serving layer. A unified model‑serving platform built on cloud‑native Kubernetes and service mesh provides model registration, versioning, deployment, monitoring, and auto‑scaling.

Overall Architecture

Model Management

Users upload trained models, register them, and manage versions through a Go‑based CLI that stores model artifacts in OSS. The platform validates model formats to prevent invalid registrations.
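The article does not say which checks the CLI performs. As a hedged illustration, and in Python rather than the Go the CLI is written in, the sketch below checks the directory layout that TensorFlow Serving expects for a SavedModel: numeric version sub-directories, each containing `saved_model.pb` or `saved_model.pbtxt`. The function name and messages are hypothetical.

```python
from pathlib import Path


def validate_saved_model(model_dir):
    """Return a list of layout problems; an empty list means the directory
    looks like a servable SavedModel tree."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    # TensorFlow Serving loads versions from numerically named sub-directories.
    versions = [p for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    if not versions:
        return ["no numeric version sub-directory found"]

    problems = []
    for version in sorted(versions, key=lambda p: int(p.name)):
        has_graph = ((version / "saved_model.pb").is_file()
                     or (version / "saved_model.pbtxt").is_file())
        if not has_graph:
            problems.append(f"version {version.name}: missing saved_model.pb")
    return problems


if __name__ == "__main__":
    import sys
    for problem in validate_saved_model(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(problem)
```

A layout check like this catches the most common upload mistakes early, but it does not prove the model will load; a stricter validator would also parse the graph.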

Model Service

Model services are deployed as Kubernetes Deployments whose pods contain three containers: an init container that mounts the model files from OSS, a Java proxy that collects metrics, and a TensorFlow Serving container that runs inference. Each service creates a LoadBalancer Service that exposes a URL. Metrics such as request QPS, latency, and health are aggregated into the company‑wide monitoring system.
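The three-container layout might look roughly like the manifests below, built as Python dicts. The fetcher and proxy images, their arguments, the port numbers on the proxy, and the use of an `emptyDir` volume to hand the model from the init container to the server are all assumptions; `tensorflow/serving`, its `MODEL_NAME` variable, and REST port 8501 are the public image's defaults.

```python
def model_service_manifests(model, version, namespace="model-serving"):
    """Return (deployment, service) manifests for one served model."""
    labels = {"app": f"model-{model}"}
    model_mount = {"name": "model-store", "mountPath": "/models"}

    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": f"model-{model}", "namespace": namespace},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Hypothetical fetcher that copies the model from object storage.
                    "initContainers": [{
                        "name": "fetch-model",
                        "image": "example.registry/model-fetcher:latest",
                        "args": [f"--model={model}", f"--version={version}",
                                 "--dest=/models"],
                        "volumeMounts": [model_mount],
                    }],
                    "containers": [
                        {   # Hypothetical metrics proxy in front of the server.
                            "name": "metrics-proxy",
                            "image": "example.registry/metrics-proxy:latest",
                            "ports": [{"containerPort": 8080}],
                        },
                        {   # Serves /models/<MODEL_NAME> over REST on 8501.
                            "name": "tf-serving",
                            "image": "tensorflow/serving",
                            "env": [{"name": "MODEL_NAME", "value": model}],
                            "ports": [{"containerPort": 8501}],
                            "volumeMounts": [model_mount],
                        },
                    ],
                    "volumes": [{"name": "model-store", "emptyDir": {}}],
                },
            },
        },
    }
    service = {
        "apiVersion": "v1", "kind": "Service",
        "metadata": {"name": f"model-{model}", "namespace": namespace},
        "spec": {
            "type": "LoadBalancer",
            "selector": labels,
            # External traffic enters through the proxy, not TF Serving directly.
            "ports": [{"port": 80, "targetPort": 8080}],
        },
    }
    return deployment, service


deployment, service = model_service_manifests("eta", 3)
print(deployment["metadata"]["name"], service["spec"]["type"])
```

Routing external traffic through the proxy port is what lets the proxy observe every request for QPS and latency metrics.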

Conclusion

The article outlines Huolala’s cloud‑native AI platform, demonstrating that containerization is now a standard for AI workloads involving machine learning, deep learning, distributed training, and GPU resource management. The team progressed from a self‑built Huawei Cloud K8s cluster to Alibaba Cloud ACK, achieving cross‑cloud operation. While the current platform covers most AI pipeline stages, further work remains on distributed training, GPU‑distributed inference, and broader system capabilities.
