How Youzan Scaled Development with Containerization: Challenges and Solutions
This article examines Youzan's journey to containerize its development and testing environments using Kubernetes and Docker, detailing the motivations, architectural decisions, network and isolation challenges, image integration, logging, load balancing, debugging, and the ongoing rollout to standard production environments.
Introduction
Containerization was adopted to accelerate delivery of development and testing environments and to address resource contention among parallel projects.
Motivation
Each project needs its own isolated daily (development) and QA environments, created and destroyed in step with the project lifecycle, so environments must be provisioned quickly.
Solution Overview
The platform runs on Kubernetes 1.7.10 with Docker 1.12.6/1.13.1. The following sections describe the main technical challenges and the applied solutions.
Network
Backend services are Java applications built on a customized Dubbo framework. Because not everything could be containerized, containers had to interoperate at the network level with the existing clusters. Overlay networking on public clouds proved unreliable, so the team used a macvlan‑based hosted network, which gives containers direct L2 connectivity with no performance loss. Later, to support multiple clouds, overlay and VPC networking were added to regain elasticity.
Isolation
Containers use kernel namespaces and cgroups, but /proc still reports the host’s physical CPU count and memory size, causing inaccurate resource visibility inside containers.
Memory Issue
Java applications adjust JVM parameters based on detected memory. The team mitigated the mis‑reporting by mounting lxcfs, which virtualizes /proc/meminfo for containers.
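To make the memory issue concrete, here is a minimal sketch of how a launcher might size the JVM heap from /proc/meminfo. With lxcfs mounted, the same parsing returns the container's limit rather than the host's total. The function names and the 50% heap fraction are assumptions for illustration, not Youzan's actual launch logic.

```python
def mem_total_bytes(meminfo_text: str) -> int:
    """Return MemTotal, in bytes, from text in /proc/meminfo format."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like: "MemTotal:        4194304 kB"
            kib = int(line.split()[1])
            return kib * 1024
    raise ValueError("MemTotal not found")


def heap_flag(meminfo_text: str, fraction: float = 0.5) -> str:
    """Build a -Xmx flag sized as a fraction of the visible memory (hypothetical policy)."""
    mib = int(mem_total_bytes(meminfo_text) * fraction) // (1024 * 1024)
    return f"-Xmx{mib}m"
```

Calling heap_flag(open("/proc/meminfo").read()) inside a container without lxcfs would size the heap from the host's memory, which is exactly the mis-sizing described above.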
CPU Count Issue
Because Kubernetes by default limits CPU through shares and allows over‑commit, the CPU count a container sees stays wrong even with lxcfs mounted. JVMs and many Java SDKs size their thread pools from that count, so they create too many threads and use too much memory. The fix was to set an environment variable, NUM_CPUS, in each container and, for Java, to preload a library via LD_PRELOAD that overrides ActiveProcessorCount so that it returns NUM_CPUS.
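The NUM_CPUS convention can be sketched for runtimes other than Java as follows. This is a Python analogue of the idea, not the preloaded native library the article describes; the function name and the fallback behavior are assumptions.

```python
import os
from typing import Mapping, Optional


def effective_cpu_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Prefer NUM_CPUS over the count reported by the operating system."""
    env = os.environ if env is None else env
    try:
        n = int(env.get("NUM_CPUS", ""))
    except ValueError:
        n = 0  # unset or malformed: fall through to the OS value
    if n > 0:
        return n
    # Inside a container this may be the host's CPU count.
    return os.cpu_count() or 1
```

A worker pool sized with effective_cpu_count() then follows the container's allocation instead of the host's core count.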
Application Integration
All services were already integrated with an internal release system, so container adoption required minimal changes. No Dockerfiles were needed from business teams.
Node.js, Python, and PHP‑SOA applications managed by supervisord only need an app.yaml in the Git repository to declare the runtime and start command.
Standardized Java applications run unchanged.
Non‑standard Java applications must be refactored to follow the standard launch model.
Image Integration
Images are built in three layers: stack (OS), runtime (language environment), and application (business code plus auxiliary agents). Initially each environment built its own image, but pod startup order constraints led to packing all services of a pod into a single container.
Image construction is orchestrated by Kubernetes: a packaging pod compiles code, installs dependencies, generates a Dockerfile, and runs Docker‑in‑Docker to build and push the image. PersistentVolumeClaims cache Python virtualenvs, Node.js node_modules, and Maven repositories to speed up builds. Newer Docker CE versions are used to leverage ADD --chown, avoiding extra layers for file ownership changes.
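The Dockerfile generation step might look roughly like the sketch below. The image name, paths, user, and function name are hypothetical; only the use of ADD --chown to set ownership without an extra layer comes from the text.

```python
import json
from typing import List


def render_dockerfile(runtime_image: str, artifact_dir: str,
                      owner: str, start_command: List[str]) -> str:
    """Render an application-layer Dockerfile on top of a runtime image."""
    lines = [
        f"FROM {runtime_image}",
        # --chown sets ownership while copying, so no separate RUN chown layer is needed.
        f"ADD --chown={owner}:{owner} {artifact_dir} /app",
        "WORKDIR /app",
        f"USER {owner}",
        # Exec-form CMD is a JSON array.
        f"CMD {json.dumps(start_command)}",
    ]
    return "\n".join(lines) + "\n"
```

For example, render_dockerfile("registry.example/runtime-node:8", "dist/", "app", ["node", "server.js"]) yields a five-line Dockerfile whose stack and runtime layers come entirely from the base image.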
Load Balancing (Ingress)
The organization already operates a self‑developed service mesh and a unified access system. Instead of a full Ingress controller, a sync program watches the Kubernetes API for Service changes and updates the upstream list in the unified access system, handling external HTTP traffic.
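The core of such a sync program is a small reducer that folds Kubernetes watch events into an upstream table. The sketch below assumes events have already been reduced to a type, a Service name, and a list of backend addresses; the table layout and function name are illustrative, and pushing the result to the unified access system is left out.

```python
from typing import Dict, List

Upstreams = Dict[str, List[str]]


def apply_event(upstreams: Upstreams, event_type: str,
                service: str, endpoints: List[str]) -> Upstreams:
    """Return a new upstream table after applying one watch event."""
    updated = dict(upstreams)
    if event_type in ("ADDED", "MODIFIED"):
        # Sort so repeated syncs produce a stable upstream list.
        updated[service] = sorted(endpoints)
    elif event_type == "DELETED":
        updated.pop(service, None)
    # Other event types (for example BOOKMARK or ERROR) leave the table unchanged.
    return updated
```

Keeping the reducer pure makes it easy to replay a full list of Services after a watch reconnect and arrive at the same table.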
Container Login and Debugging
Because console access was cumbersome, SSH access was enabled for project and continuous‑delivery environments that require frequent debugging. A special debug‑release mode disables health checks, allowing developers to inspect failing pods.
Logging
Logs are collected by an internal system called “Tianwang”. Container stdout is treated as supplemental. Fluentd gathers the output, formats it according to Tianwang’s schema, forwards it to Kafka, and finally indexes it in Elasticsearch.
Canary Release
Canary traffic includes user‑side HTTP requests, inter‑service HTTP calls, and Dubbo calls. Labels (e.g., user, shop) are attached at the unified entry point and propagated through HTTP and Dubbo clients. A dedicated canary deployment is created, and the canary configuration center applies routing rules so downstream services respect the canary logic.
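A minimal sketch of label-based canary routing, assuming labels travel as request headers with a hypothetical x-canary- prefix and that rules map a label name to the set of values routed to the canary deployment. Neither the header format nor the rule shape is taken from Youzan's canary configuration center.

```python
from typing import Dict, Mapping, Set

LABEL_PREFIX = "x-canary-"  # hypothetical header prefix


def extract_labels(headers: Mapping[str, str]) -> Dict[str, str]:
    """Collect canary labels (e.g. user, shop) from incoming request headers."""
    return {
        name.lower()[len(LABEL_PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(LABEL_PREFIX)
    }


def pick_deployment(labels: Mapping[str, str],
                    rules: Mapping[str, Set[str]]) -> str:
    """Route to the canary if any label value matches a rule, else to stable."""
    for key, canary_values in rules.items():
        if labels.get(key) in canary_values:
            return "canary"
    return "stable"
```

A client would call extract_labels on the inbound request, copy those labels onto every outbound HTTP or Dubbo call, and use pick_deployment to select the target.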
Standard Environment Containerization
Rationale
Daily, QA, pre‑release, and production environments often run on under‑utilized servers, wasting resources.
Running these environments on single VMs makes simultaneous releases risky.
VM provisioning is slower, and using VMs for canary releases adds complexity.
Long‑lived VMs create challenges for OS and software version convergence.
Progress
After containerizing project and continuous‑delivery environments, most applications are ready for production containerization. The operational stack (monitoring, release, logging, etc.) is being adapted. Production rollout has started with several front‑end Node.js services, and migration of additional services is ongoing.
Conclusion
The containerization effort improved environment delivery speed, resource utilization, and cost efficiency, while exposing challenges in networking, isolation, image management, and debugging. Production rollout is in early stages, and further experience will be shared.
Youzan Coder
