Containerization Journey at Ximalaya: Practices, Tools, and Lessons Learned
This article recounts Ximalaya’s evolution from early Docker adoption to a mature cloud‑native deployment platform, detailing principles, custom tools such as barge and k8s‑sync, health‑check strategies, multi‑process management, and integration with existing middleware to achieve reliable, zero‑downtime service releases.
Ximalaya’s containerization journey grew alongside the company itself and is shaped by the company’s particular environment:
Projects are primarily written in Java.
Java projects are divided into Web projects and RPC projects (based on a self‑developed framework similar to Dubbo).
The company already had its own publishing platform, RPC micro‑service framework, and self‑developed gateway.
During this process we adhered to several principles:
Developers do not need to write Dockerfiles or understand container concepts.
In test environments, developers’ machines can directly access containers; containers across physical machines can communicate via IP.
Kubernetes is split into Test, UAT, and Production clusters; project configuration is separated into environment‑specific and environment‑agnostic parts; the publishing system and project‑info DB are deployed in the test environment and can access all three clusters.
If a project fails to start, the container is left in its failed state for inspection instead of being restarted continuously.
How to Let Developers Publish Code to Container Environments
At the end of 2016, Docker was installed on a Jenkins machine to create a Docker‑based project template. Developers cloned the template and adjusted the source repository address, CPU, memory, and so on; Jenkins then triggered a shell script that performed these steps:
Maven compiles the project into a WAR/JAR package.
Based on the WAR/JAR, a Dockerfile is assembled. docker build creates a Docker image and pushes it to an image registry.
The Marathon deployment command is invoked.
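The three steps above can be sketched roughly as follows. This is an illustration, not the original shell script: the base image names, the registry host, the paths inside the image, and the helper names are all hypothetical, and the Marathon payload only uses the common app-definition fields (id, cpus, mem, container).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Project:
    name: str
    artifact: str      # Maven output, e.g. target/demo.war
    cpu: float
    memory_mb: int


# Hypothetical base images; the article does not name the real ones.
BASE_IMAGES = {
    "war": "registry.example.com/base/tomcat:latest",
    "jar": "registry.example.com/base/java:latest",
}


def render_dockerfile(project: Project) -> str:
    """Assemble a Dockerfile from the artifact type so developers never write one."""
    kind = project.artifact.rsplit(".", 1)[-1].lower()
    if kind not in BASE_IMAGES:
        raise ValueError(f"unsupported artifact type: {kind}")
    lines = [f"FROM {BASE_IMAGES[kind]}"]
    if kind == "war":
        # Assumed webapps path inside the hypothetical Tomcat base image.
        lines.append(f"COPY {project.artifact} /usr/local/tomcat/webapps/ROOT.war")
    else:
        lines.append(f"COPY {project.artifact} /app/app.jar")
        lines.append('CMD ["java", "-jar", "/app/app.jar"]')
    return "\n".join(lines) + "\n"


def build_steps(project: Project, registry: str, tag: str) -> List[List[str]]:
    """Commands for compile, image build, and push, in the order they would run."""
    image = f"{registry}/{project.name}:{tag}"
    return [
        ["mvn", "clean", "package"],
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
    ]


def marathon_app(project: Project, image: str) -> Dict:
    """Minimal Marathon app definition carrying the developer's CPU and memory settings."""
    return {
        "id": f"/{project.name}",
        "cpus": project.cpu,
        "mem": project.memory_mb,
        "instances": 1,
        "container": {"type": "DOCKER", "docker": {"image": image}},
    }
```

The point of the sketch is that the only inputs a developer supplies are the ones held in Project; everything else is derived by the template.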
Building these initial components took about five days, and the containerization effort grew from this bare‑bones first version.
Evolution
Gradual migration from Marathon to Kubernetes. During the transition we built a custom Docker publishing system that abstracts differences between Marathon and Kubernetes.
Implemented a CLI tool barge. Developers add a barge.yaml file to the project to define basic configuration such as project name.
Integrated with the company’s publishing platform (similar to Alibaba Cloud), shielding developers from physical‑machine vs. container differences.
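To illustrate how a small per-project file plus platform defaults can keep container details away from developers, here is a minimal sketch. The real barge.yaml schema is not shown in this article, so every field name other than the project name, and all default values, are hypothetical. The parser handles only flat `key: value` lines and is not a real YAML parser.

```python
from dataclasses import dataclass

# Hypothetical defaults the platform would fill in on the developer's behalf.
DEFAULTS = {"type": "web", "cpu": "1", "memory_mb": "1024"}


@dataclass
class BargeConfig:
    name: str
    type: str
    cpu: float
    memory_mb: int


def parse_flat_yaml(text: str) -> dict:
    """Parse flat `key: value` lines only; enough for this sketch."""
    result = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {raw!r}")
        result[key.strip()] = value.strip()
    return result


def load_config(text: str) -> BargeConfig:
    """Merge the developer's file over the defaults and validate it."""
    merged = {**DEFAULTS, **parse_flat_yaml(text)}
    if not merged.get("name"):
        raise ValueError("the project name is required")
    if merged["type"] not in ("web", "rpc"):
        raise ValueError(f"unknown project type: {merged['type']}")
    return BargeConfig(
        name=merged["name"],
        type=merged["type"],
        cpu=float(merged["cpu"]),
        memory_mb=int(merged["memory_mb"]),
    )
```

With this shape, a file containing only `name: demo-service` is a complete configuration; anything else is an optional override.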
Should a Container Run Multiple Processes
The prevailing container philosophy is “one process per container,” but early adoption required running SSH inside containers to reduce the learning curve and hide differences from physical machines, necessitating a multi‑process manager. After evaluating runit, systemd, and supervisor, we chose runit as the entrypoint.
Because Web service IPs change on each deployment, we started a nile process inside each container to register project information in ZooKeeper, and Nginx picked up the changing upstreams through a modified version of the open‑source upsync plugin (which originally supports etcd and Consul). Later we switched the Nginx upstream source to Consul.
With Kubernetes rollout, we introduced gotty + kubectl exec for browser‑based console access and a dedicated gateway system, changing HTTP flow to nginx → gateway → Web service. The SSH and nile processes were phased out. The entrypoint remains runit, now tuned so that if the business process fails, it does not restart, preserving the failure scene for debugging.
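The "run once, do not restart" behaviour can be illustrated with a small Python sketch. This is not runit configuration and not the actual entrypoint; the function names are hypothetical. It only shows the idea: when the business process exits with an error, the supervisor keeps the container alive rather than relaunching the process, so logs and files remain available for inspection.

```python
import subprocess
import sys
import time
from typing import Callable, List


def hold_forever() -> None:
    """Keep the container alive so the failed state can be inspected."""
    while True:
        time.sleep(3600)


def run_once(cmd: List[str], on_failure: Callable[[], None] = hold_forever) -> int:
    """Run the business process once; on a non-zero exit, hold instead of restarting."""
    code = subprocess.call(cmd)
    if code != 0:
        print(f"process exited with {code}; not restarting", file=sys.stderr)
        on_failure()
    return code
```

A restart loop would overwrite exactly the evidence a developer needs, which is why the failure path here blocks instead of retrying.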
The Inconsistent Health Check
Kubernetes readiness probes are used to detect service health. Over time we refined the configuration:
Each Web project provides a /healthcheck endpoint; passing indicates successful startup.
For RPC services, /healthcheck sometimes does not reflect actual readiness, so we also check whether the RPC listening port is open.
Configuring readiness probes (HTTP or TCP) added burden to developers and often caused false alarms when configurations changed.
We switched to exec‑based probes and let nile perform the HTTP/TCP checks automatically based on project metadata, although this still depended on that metadata being accurate.
Since the scenario “/healthcheck succeeds but RPC fails to start” is rare, we reverted to checking HTTP /healthcheck only.
Initially, liveness probes also used /healthcheck, but a network change in one data center caused widespread probe failures and container restarts. We therefore stopped configuring liveness probes and now rely solely on readiness probes: Kubernetes itself recovers from node failures, and containers that crash from running out of memory are caught through memory alerts.
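The metadata-driven check used in the exec‑probe stage can be sketched as follows. This is an assumption-laden illustration rather than nile’s actual logic: the metadata keys (host, http_port, rpc_port) and the function names are hypothetical. The HTTP check always runs, and the TCP check runs only when the project declares an RPC port.

```python
import socket
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the endpoint answers with a 2xx status; any error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False


def tcp_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ready(
    meta: dict,
    http_check: Callable[[str], bool] = http_ok,
    tcp_check: Callable[[str, int], bool] = tcp_open,
) -> bool:
    """Decide readiness from project metadata (hypothetical keys)."""
    host = meta.get("host", "127.0.0.1")
    if not http_check(f"http://{host}:{meta['http_port']}/healthcheck"):
        return False
    rpc_port = meta.get("rpc_port")
    if rpc_port is not None:
        # RPC projects must also have their listening port open.
        return tcp_check(host, rpc_port)
    return True
```

Used as an exec probe, a wrapper script would exit with status 0 when ready() returns True and non-zero otherwise. The sketch also makes the weakness visible: a wrong rpc_port in the metadata produces a failing probe even though the service is healthy.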
Integration with the Publishing Platform
Ximalaya’s publishing platform enforces a rule: the first instance of a new service must be released alone for observation before scaling out. This can take up to a week, making Kubernetes’ native rolling update insufficient. Therefore we designed a system where each project maps to two Deployments: the new Deployment’s replicas grow while the old Deployment’s replicas shrink. Two common approaches exist:
Create a new Deployment for each release, scaling the new up and the old down.
Alibaba’s OpenKruise provides a CRD named CloneSet that implements similar functionality; we plan to replace the dual‑Deployment design with CloneSet or a custom CRD.
The container publishing system abstracts Kubernetes away from developers and exposes APIs for release, rollback, and scaling. We also optimized the bulky kubernetes‑client‑java library.
To answer everyday developer questions (for example, what a container’s IP is or why it failed to start), we built a backend called “Container Cloud Platform.” It turns recurring support issues into tooling, which reduces manual troubleshooting. One component, wrench, scans Java classpaths and logs to detect JAR conflicts, Tomcat errors, and business‑level log issues; developers can trigger it from a web UI.
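Returning to the dual‑Deployment release described in this section: the replica arithmetic behind "the new Deployment grows while the old one shrinks" can be sketched as below. The function name and the batch sizes are hypothetical; the only constraints taken from the text are that the first batch is a single instance released for observation and that the two Deployments together always serve the project’s full replica count.

```python
from typing import List, Tuple


def release_plan(total: int, batch_sizes: List[int]) -> List[Tuple[int, int]]:
    """Return (new_replicas, old_replicas) after each batch; each pair sums to total."""
    if total <= 0:
        raise ValueError("total must be positive")
    plan = []
    new = 0
    for size in batch_sizes:
        if size <= 0:
            raise ValueError("batch sizes must be positive")
        new = min(total, new + size)
        plan.append((new, total - new))
        if new == total:
            break
    if new < total:
        # Remaining instances are moved over in one final batch.
        plan.append((total, 0))
    return plan
```

For example, `release_plan(10, [1, 4, 5])` yields `[(1, 9), (5, 5), (10, 0)]`. Because each step is an explicit pair of replica counts, the release can pause at `(1, 9)` for as long as the observation period requires, which a native rolling update does not allow.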
Integration with Existing Middleware
Ximalaya’s self‑developed gateway and RPC framework originally did not consider container‑specific concerns such as changing IPs. Service instances must call gateway and RPC “online” APIs before becoming reachable, and must invoke “offline” APIs before termination. We built a k8s‑sync component that watches pod state changes and calls the appropriate online/offline interfaces at the right moments. Future plans include custom CRDs like WebService and RpcService with controllers handling lifecycle management.
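The core decision k8s‑sync has to make can be sketched as a small state tracker. This is a simplified illustration, not the real component: the PodState fields, the Syncer class, and the online/offline callables stand in for the pod watch and the gateway/RPC APIs, whose actual signatures are not given in this article.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PodState:
    ip: str
    ready: bool
    deleting: bool   # True once the pod has been marked for deletion


class Syncer:
    """Call online/offline exactly once per transition, however many events arrive."""

    def __init__(self, online: Callable[[str], None], offline: Callable[[str], None]):
        self._online = online
        self._offline = offline
        self._is_online: Dict[str, bool] = {}   # pod name -> brought online by us

    def handle(self, name: str, pod: PodState) -> None:
        want_online = pod.ready and not pod.deleting
        is_online = self._is_online.get(name, False)
        if want_online and not is_online:
            self._online(pod.ip)
        elif is_online and not want_online:
            self._offline(pod.ip)
        self._is_online[name] = want_online
```

Comparing the desired state with the last recorded state, rather than reacting to each event blindly, keeps repeated or out-of-order watch events from triggering duplicate API calls.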
Zero‑Loss Online/Offline for Web/RPC Services
Online
Pods have a readiness probe that runs the project’s health check. Once passed, the pod becomes Ready; k8s‑sync detects the Ready event and invokes the Web/RPC online API. If k8s‑sync crashes, the pod remains Ready but cannot be brought online; we mitigate this by running k8s‑sync as a managed container or systemd service with automatic restart and Prometheus‑based alerts.
Offline
Kubernetes executes a preStop hook when a pod is terminated. The hook first calls the Web/RPC offline API and then runs Ximalaya’s zero‑traffic check (via xdcs). If the check passes, preStop completes and the pod is destroyed; otherwise the hook waits 10 seconds before letting termination proceed. This gives a graceful shutdown for normal deletions, node drains, and scale‑downs. Zero loss cannot be guaranteed when a physical node fails.
k8s‑sync also syncs pod information to MySQL so that developers can look up a project’s name, IP, and so on. The component has been refactored several times.
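The preStop sequence described above can be sketched as follows. The function and parameter names are hypothetical, and the two callables stand in for the offline API and the xdcs zero‑traffic check, whose real interfaces are not given here. The sleep function is injectable only so the logic can be exercised without actually waiting.

```python
import time
from typing import Callable


def pre_stop(
    offline: Callable[[], None],
    has_traffic: Callable[[], bool],
    grace_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Take the instance offline, then confirm it receives no traffic.

    Returns True if the zero-traffic check passed immediately, and False if
    the hook had to wait out the grace period before letting the pod stop.
    """
    offline()                 # step 1: stop new requests from being routed here
    if not has_traffic():     # step 2: zero-traffic check
        return True
    sleep(grace_seconds)      # step 3: give in-flight requests time to drain
    return False
```

The order matters: calling the offline API before checking traffic means the check measures only draining requests, not newly routed ones.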
Insights
From 2016 to now, the container ecosystem and company middleware evolved together, often requiring us to “build a bridge when encountering water.”
We invested heavily in integrating containers with existing middleware (k8s‑sync, upsync modifications).
We eliminated the need for developers to write Dockerfiles (barge, container cloud platform).
The journey involved many false starts, extensive reading, and community interaction. It forced us to distinguish essential factors from optional ones and to balance proactive safeguards against reactive handling.
Although I started as a Java developer, most of the container components are written in Go. Working with both Java’s shared‑memory model and Go’s CSP model deepened my understanding of concurrency.
To promote container adoption, we took on many tasks that were not container‑specific, such as showing that a container’s failure to start was not caused by Docker. That work led to tools like wrench, which solve problems that at first seem non‑technical.
Collaboration and communication are vital: container technology reshapes infrastructure and development habits, and it requires collective effort. In retrospect, the entire experience is a prelude to future advancements.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
