Inside NetEase’s OpenStack Private Cloud: Architecture, Deployment & Tuning
This article introduces OpenStack, outlines its core components, describes NetEase’s private cloud architecture and services, and shares detailed deployment strategies, configuration settings, performance optimizations, and operational lessons learned from years of production use.
OpenStack Overview
OpenStack is an open‑source IaaS implementation composed of interrelated projects that provide compute, storage and networking services. Released under the Apache 2.0 license, it has attracted over 200 companies and more than 17,000 developers from 139 countries since its inception in 2010.
OpenStack exposes its own RESTful APIs along with compatibility layers for a subset of the AWS APIs. Its loosely coupled, highly scalable distributed architecture, pure Python implementation and active community have made it popular, and a twice‑yearly summit brings together developers, vendors and customers from around the world.
Core OpenStack Projects
Compute (Nova) – manages virtual machines and supports multiple hypervisors.
Object Storage (Swift) – distributed, scalable object store.
Block Storage (Cinder) – provides persistent block devices, supporting back‑ends such as Ceph and EMC.
Networking (Neutron) – pluggable, API‑driven network virtualization.
Dashboard (Horizon) – graphical web interface for resource management.
Image (Glance) – registers and serves VM images.
Telemetry (Ceilometer) – usage metering for billing.
Orchestration (Heat) – template‑driven resource orchestration similar to AWS CloudFormation.
Database (Trove) – database‑as‑a‑service.
Identity (Keystone) – authentication, authorization and the service catalog.
NetEase Private Cloud Platform Overview
NetEase’s private cloud, developed by the Hangzhou Research Institute, is built on the Nova, Glance, Keystone and Neutron components. On top of them it provides IaaS, PaaS and operations‑support services: virtual machines, networks, disks, object storage, caches, relational and distributed databases, search, messaging, video transcoding, load balancing, a container engine, billing, monitoring and management.
The platform has been running stably for over two years, serving more than 30 Internet and gaming products.
Key achievements include raising CPU utilization from under 10 % to about 50 %, cutting operational staff by 50 % through self‑service portals and automation, and improving elasticity to handle traffic spikes.
Deployment Reference Architecture
Keystone stores user data in MySQL and caches tokens with Memcached. All services (nova, glance, neutron) configure the keystone client to use Memcached for token caching.
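The token caching described above follows a cache‑aside pattern, which can be sketched as below. This is a minimal illustration rather than NetEase's or Keystone's actual code: `TokenCache` is a hypothetical in‑memory stand‑in for Memcached, and `ask_keystone` is a hypothetical callable representing the round trip to Keystone.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenCache:
    """Hypothetical in-memory stand-in for Memcached: token -> (expiry, data)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, token: str) -> Optional[dict]:
        entry = self._store.get(token)
        if entry is None:
            return None
        expires_at, data = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped so the caller re-validates.
            del self._store[token]
            return None
        return data

    def set(self, token: str, data: dict) -> None:
        self._store[token] = (self._clock() + self._ttl, data)


def validate_token(token: str, cache: TokenCache,
                   ask_keystone: Callable[[str], dict]) -> dict:
    """Return token data, contacting Keystone only on a cache miss."""
    cached = cache.get(token)
    if cached is not None:
        return cached
    data = ask_keystone(token)  # assumed to raise if the token is invalid
    cache.set(token, data)
    return data
```

With a cache like this in front of every API service, repeated requests carrying the same token cost one Keystone call per TTL window instead of one per request.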
The cloud is deployed across multiple data centers, providing natural geographic isolation and disaster tolerance. Both nova‑network and neutron are supported.
Multi‑Region Deployment
Each region runs an independent OpenStack deployment with its own image service and network mode (e.g., region A uses nova‑network, region B uses neutron). Keystone is shared for single sign‑on, and regions are connected via an internal network.
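A shared Keystone with per‑region services means clients pick an endpoint by service type and region. The sketch below illustrates that lookup under stated assumptions: the region names, URLs and the `endpoint_for` helper are all hypothetical, and the catalog is a plain list rather than a real Keystone response.

```python
from typing import Dict, List

# Hypothetical catalog: region names and URLs are invented for illustration.
CATALOG: List[Dict[str, str]] = [
    {"service": "compute", "region": "region-a",
     "url": "http://nova.region-a.example.internal:8774/v2"},
    {"service": "image", "region": "region-a",
     "url": "http://glance.region-a.example.internal:9292"},
    {"service": "compute", "region": "region-b",
     "url": "http://nova.region-b.example.internal:8774/v2"},
    {"service": "image", "region": "region-b",
     "url": "http://glance.region-b.example.internal:9292"},
    # Only region-b runs neutron, so only it has a network endpoint.
    {"service": "network", "region": "region-b",
     "url": "http://neutron.region-b.example.internal:9696"},
]


def endpoint_for(catalog: List[Dict[str, str]], service: str, region: str) -> str:
    """Return the URL of `service` in `region`, or raise LookupError."""
    for entry in catalog:
        if entry["service"] == service and entry["region"] == region:
            return entry["url"]
    raise LookupError(f"no {service!r} endpoint registered in {region!r}")
```

A region that uses nova‑network simply has no network endpoint in the catalog, so `endpoint_for(CATALOG, "network", "region-a")` raises `LookupError`.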
Compute and Control Nodes
Hardware is divided into compute nodes (running nova‑network, nova‑compute, nova‑api‑metadata and nova‑api‑os‑compute) and control nodes, which run the same services plus nova‑scheduler, nova‑novncproxy, nova‑consoleauth, glance‑api, glance‑registry and keystone. Stateless API services are placed behind HAProxy with Keepalived for high availability.
High‑availability services include a RabbitMQ cluster, master‑slave MySQL and a Memcached cluster.
Network Planning
NetEase uses the FlatDHCPManager+multi‑host mode of nova‑network, allocating VLANs for fixed‑IP VMs, internal floating IPs and external networks. Monitoring and alerting are handled by a custom platform similar to Nagios, focusing on log and process monitoring.
Puppet automates deployment, and StackTach assists in bug localisation.
Key OpenStack Component Configurations
Nova
Critical settings include my_ip, which is used when generating the iptables rules for the metadata API; the IP address the API services listen on; HAProxy/Keepalived in front of the noVNC proxy for high availability; and the options that start instances automatically after a host reboot.
Other tunables cover API rate limiting, the maximum API response size, the scheduler filter chain (Retry, AvailabilityZone, Ram, Core, a custom Ecu filter, ImageProperties and Json), and whether orphaned VMs are only logged or also reaped.
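A scheduler filter chain keeps only the hosts that pass every filter. The sketch below shows the idea with simplified zone, RAM and core filters; it is not Nova's implementation, and the `Host` and `Request` types, the function names and the overcommit ratios are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Host:
    name: str
    zone: str
    total_ram_mb: int
    used_ram_mb: int
    total_vcpus: int
    used_vcpus: int


@dataclass(frozen=True)
class Request:
    zone: str
    ram_mb: int
    vcpus: int


HostFilter = Callable[[Host, Request], bool]


def zone_filter(host: Host, req: Request) -> bool:
    return host.zone == req.zone


def make_ram_filter(ratio: float) -> HostFilter:
    """RAM filter with a hypothetical overcommit ratio."""
    def ram_filter(host: Host, req: Request) -> bool:
        return host.total_ram_mb * ratio - host.used_ram_mb >= req.ram_mb
    return ram_filter


def make_core_filter(ratio: float) -> HostFilter:
    """vCPU filter with a hypothetical overcommit ratio."""
    def core_filter(host: Host, req: Request) -> bool:
        return host.total_vcpus * ratio - host.used_vcpus >= req.vcpus
    return core_filter


def filter_hosts(hosts: Sequence[Host], req: Request,
                 filters: Sequence[HostFilter]) -> List[Host]:
    """Keep the hosts that satisfy every filter in the chain."""
    return [h for h in hosts if all(f(h, req) for f in filters)]
```

A custom filter such as the Ecu filter mentioned above would slot into the same chain as one more callable with the `(host, request) -> bool` shape.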
Quota synchronization thresholds and intervals are also configurable to balance accuracy and database load.
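The trade‑off between accuracy and database load can be pictured as a predicate that decides when cached usage must be recounted. The sketch below is a simplification under stated assumptions: the parameter names `until_refresh` and `max_age` are borrowed from Nova's quota options as best recalled, and the `UsageRecord` type and `needs_resync` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    in_use: int                   # cached usage for one resource
    reservations_since_sync: int  # reservations made since the last recount
    last_sync: float              # timestamp of the last recount, in seconds


def needs_resync(usage: UsageRecord, now: float,
                 until_refresh: int, max_age: float) -> bool:
    """Decide whether cached usage should be recounted from the database.

    A value of 0 disables the corresponding threshold.
    """
    if usage.in_use < 0:
        return True  # a negative count is always wrong, so recount
    if until_refresh > 0 and usage.reservations_since_sync >= until_refresh:
        return True  # too many reservations since the last recount
    if max_age > 0 and now - usage.last_sync >= max_age:
        return True  # cached value is too old
    return False
```

Lower thresholds keep quota usage closer to reality at the cost of more recount queries; higher thresholds do the opposite.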
Keystone
Keystone can store tokens in SQL databases for persistence or in Memcached for speed; the choice impacts token revocation performance.
Glance
Glance consists of glance‑api and glance‑registry. The number of glance‑api worker processes should match the host’s CPU capacity and request volume. The osapi_max_limit setting controls the maximum items returned per request.
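A server‑side maximum on returned items works by clamping whatever limit the client requests. The sketch below shows that clamping together with marker‑based paging; the `effective_limit` and `page` helpers are hypothetical and are not Glance code.

```python
from typing import List, Optional, Sequence


def effective_limit(requested: Optional[int], max_limit: int) -> int:
    """Clamp the client's requested page size to the server-side maximum."""
    if requested is None:
        return max_limit
    if requested < 0:
        raise ValueError("limit must not be negative")
    return min(requested, max_limit)


def page(items: Sequence[str], marker: Optional[str],
         requested: Optional[int], max_limit: int) -> List[str]:
    """Return the items after `marker`, capped at the effective limit.

    Raises ValueError if `marker` is not present in `items`.
    """
    limit = effective_limit(requested, max_limit)
    start = 0 if marker is None else items.index(marker) + 1
    return list(items[start:start + limit])
```

A client that asks for 100 items against a maximum of 2 receives 2 and must pass the last item back as the marker to fetch the next page.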
Underlying Software and Performance Optimizations
Virtualization
NetEase selected KVM, driven through libvirt, as the hypervisor on Debian Wheezy hosts, because KVM is tightly integrated with the Linux kernel and has broad community support.
Kernel
The Debian 3.10.40 kernel was compiled with cgroup options for CPU, memory and blkio QoS, as well as user namespaces to support LXC containers.
CPU Optimization
CPU QoS uses CFS time‑slicing and process pinning; tests showed a 30‑40 % performance difference between different pinning strategies. Reserved CPUs (0‑3) are excluded from VM scheduling.
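Excluding reserved CPUs from VM scheduling amounts to computing the complement set and expressing it as a cpuset string. The sketch below assumes a 16‑CPU host purely for illustration; the `vm_cpus` and `format_cpuset` helpers are hypothetical.

```python
from typing import Iterable, List, Set


def vm_cpus(total_cpus: int, reserved: Set[int]) -> List[int]:
    """CPUs left for guests once the reserved ones are excluded."""
    return [cpu for cpu in range(total_cpus) if cpu not in reserved]


def format_cpuset(cpus: Iterable[int]) -> str:
    """Render CPU ids as a compact cpuset string such as '4-15' or '0-1,3'."""
    ordered = sorted(set(cpus))
    parts: List[str] = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next id is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


# Assumed host size of 16 CPUs; CPUs 0-3 are reserved as in the text.
print(format_cpuset(vm_cpus(16, {0, 1, 2, 3})))  # prints 4-15
```

The resulting string is the form expected by cpuset‑style settings, so it can be fed to whatever pinning mechanism the deployment uses.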
Memory Optimization
Memory sharing is disabled and transparent huge pages are enabled, yielding roughly a 7 % CPU performance gain in SPEC CPU2006 tests.
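Whether transparent huge pages are active on a host can be checked from sysfs, where the kernel lists the available modes and brackets the current one (for example `[always] madvise never`). The sketch below parses that format; the `active_setting` and `read_thp_mode` helper names are hypothetical.

```python
import re


def active_setting(sysfs_text: str) -> str:
    """Return the bracketed (currently active) choice from a sysfs mode list."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active choice found in {sysfs_text!r}")
    return match.group(1)


def read_thp_mode(
        path: str = "/sys/kernel/mm/transparent_hugepage/enabled") -> str:
    """Read the current THP mode on a Linux host."""
    with open(path) as handle:
        return active_setting(handle.read())
```

On a host configured as described above, `read_thp_mode()` would be expected to return `always`, though the exact mode NetEase used is not stated in the text.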
I/O Optimization
Disk cache mode is set to “none”, the I/O scheduler uses CFQ (with future plans to evaluate deadline), and a blkio‑based disk I/O QoS mechanism is implemented via libvirt patches. Network I/O is accelerated by enabling vhost‑net.
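In libvirt domain XML, the disk cache mode and vhost‑net are both selected through `driver` elements. The sketch below builds such fragments with the standard library; the image path, bridge name and helper functions are hypothetical, and this is not the XML that Nova or NetEase's patches generate.

```python
import xml.etree.ElementTree as ET


def disk_xml(image_path: str, target_dev: str, image_format: str = "qcow2") -> str:
    """Build a <disk> element that bypasses the host page cache (cache='none')."""
    disk = ET.Element("disk", {"type": "file", "device": "disk"})
    ET.SubElement(disk, "driver",
                  {"name": "qemu", "type": image_format, "cache": "none"})
    ET.SubElement(disk, "source", {"file": image_path})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    return ET.tostring(disk, encoding="unicode")


def interface_xml(bridge: str) -> str:
    """Build a virtio <interface> element that uses the vhost-net backend."""
    iface = ET.Element("interface", {"type": "bridge"})
    ET.SubElement(iface, "source", {"bridge": bridge})
    ET.SubElement(iface, "model", {"type": "virtio"})
    ET.SubElement(iface, "driver", {"name": "vhost"})
    return ET.tostring(iface, encoding="unicode")


# Hypothetical path and bridge name, for illustration only.
print(disk_xml("/var/lib/nova/instances/demo/disk", "vda"))
print(interface_xml("br100"))
```

With `cache='none'` the guest's writes are not buffered in the host page cache, which avoids double caching between guest and host.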
Operational Experience
Stay current with OpenStack releases; newer versions bring stability and features.
Avoid premature “optimizations” without community consultation.
Reference large‑scale deployment patterns rather than reinventing solutions.
Validate every change in a test environment that mirrors production hardware.
Plan capacity and quota carefully to prevent resource exhaustion.
Coordinate configuration changes with developers and verify via Puppet’s noop mode.
Design network topology (fixed IPs, floating IPs, VLANs) ahead of time.
Prioritize security and isolation to protect tenant workloads.
MaGe Linux Operations
