Inside 360’s Ultron: How OpenStack Powers a Scalable Private Cloud
This article covers the evolution, architecture, deployment, monitoring, and performance optimization of Ultron, 360's internal OpenStack-based virtualization platform. It walks through the platform's three development stages, its technical stack, Ansible-based automation, advanced features such as VXLAN and Ceph, and lessons learned from large-scale operations.
360 Internal Virtualization Platform Development Stages
The platform progressed through three stages: (1) an early virtualization phase with immature technology and limited performance; (2) a mature phase in which the virtualization stack filled out and cloud-computing concepts accelerated growth; (3) an application-centric phase focused on containerization and microservices.
Ultron Overview
Ultron is the internal IaaS platform that 360's Web Platform Department has been building on OpenStack since July 2015. It forms the core of the HULK cloud platform and delivers high-performance, stable virtual machines for a range of business scenarios.
The name references Marvel’s Ultron to convey the goal of controlling thousands of machines with fast scaling and elasticity.
Technical Stack
Platform: OpenStack components Keystone, Glance, Nova, Neutron, Cinder
Storage: Combination of local storage and Ceph shared storage
Network: Mixed VLAN and VXLAN modes
Ultron is also integrated extensively with internal systems such as account management, the CMDB, and network services.
Ultron Evolution
The project kicked off in July 2015 and created its first VM four months later, in November. The platform initially used local storage with VLAN networking. Ceph shared storage and VXLAN were introduced later, followed by performance tuning and a major OpenStack version upgrade (Kilo → Mitaka). In 2017 the focus shifted toward containerization.
Current Usage
Ultron now supports over 90% of online services, spans nine data centers (Beijing, Shanghai, Guangzhou, Zhengzhou, Langfang, etc.), and runs 1,183 physical nodes hosting 5,944 virtual machines with a mixed local/Ceph storage model.
Why OpenStack?
OpenStack is open source and mature, and its community is large, second only to Linux's, which makes it the de facto standard for IaaS.
OpenStack Architecture
Control nodes run in Active/Active high-availability mode, and RabbitMQ uses a mirrored setup. VM high availability is handled by the application layer when VMs use local storage, and by Ceph when they use shared storage. Live migration relies on VXLAN and shared storage, so a VM can move between hosts almost instantly without a network interruption. Snapshots on Ceph are fast because they are performed entirely on the storage side.
Automation and Deployment
Deployment is automated with Ansible, organized into roughly 50 roles covering system initialization, control‑node setup, network‑node setup, and compute‑node setup.
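As a rough illustration of how roles might be grouped per node type, the sketch below builds a playbook-shaped structure in Python. The role names are hypothetical placeholders; the article only states that there are roughly 50 roles across these four areas, and the real playbooks are not shown in the source.

```python
import json

# Hypothetical role names, for illustration only.
COMMON_ROLES = ["sysinit-repos", "sysinit-ntp", "sysinit-kernel-params"]
GROUP_ROLES = {
    "control": ["mysql", "rabbitmq", "keystone", "glance", "nova-api"],
    "network": ["neutron-l3-agent", "neutron-dhcp-agent"],
    "compute": ["nova-compute", "neutron-ovs-agent"],
}


def build_plays(group_roles, common_roles):
    """Return one play per host group, with system-initialization roles first."""
    plays = []
    for group, roles in group_roles.items():
        plays.append({
            "hosts": group,
            "become": True,
            "roles": list(common_roles) + list(roles),
        })
    return plays


plays = build_plays(GROUP_ROLES, COMMON_ROLES)
print(json.dumps(plays, indent=2))
```

Keeping system initialization as a shared prefix means every node type starts from the same baseline before its specific roles run.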
Advanced Features
Layer 2 networking over VXLAN, combined with Ceph, enables rapid VM provisioning, live migration within seconds, snapshot creation within minutes, and high availability through Ceph's three-way replication.
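Three-way replication has a direct capacity cost, which a short calculation makes concrete. This is a minimal sketch: the function name and the 0.85 fill ceiling are assumptions for illustration, not figures from the article.

```python
def usable_capacity_tb(raw_tb, replicas=3, fill_ratio=0.85):
    """Estimate usable capacity of a replicated pool.

    raw_tb:     total raw disk capacity in TB
    replicas:   number of copies kept of each object (3 in the article)
    fill_ratio: assumed ceiling on how full the cluster is allowed to get
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 < fill_ratio <= 1:
        raise ValueError("fill_ratio must be in (0, 1]")
    return raw_tb / replicas * fill_ratio


# 300 TB raw with three copies leaves about 85 TB of plannable space.
print(round(usable_capacity_tb(300), 1))
```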
Monitoring
Three‑layer monitoring is in place: (1) OS‑level metrics collected by the internal Wonder system; (2) custom plugins monitor OpenStack component health and API functionality; (3) ELK stack aggregates logs for detailed service diagnostics.
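The second layer, API health checks, could look roughly like the probe below. It is a sketch using only the Python standard library, not the plugin used in Ultron; the function name, the injectable `opener` parameter, and the example URL are assumptions.

```python
import time
import urllib.error
import urllib.request


def check_api(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe one HTTP endpoint and return (healthy, status_code, seconds).

    `opener` is injectable so the check can be exercised without a network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code          # the server answered, but with an error code
    except (urllib.error.URLError, OSError):
        status = None              # connection refused, timeout, DNS failure
    elapsed = time.monotonic() - start
    healthy = status is not None and 200 <= status < 400
    return healthy, status, elapsed


# Example call (hypothetical endpoint):
# check_api("http://keystone.example.internal:5000/v3")
```

Returning the latency alongside the status lets the same probe feed both an up/down alert and a slow-API alert.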
Performance Testing
Rally is used for functional and concurrency performance testing before production rollout.
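A Rally task is a JSON or YAML document, so one way to parameterize a concurrency test is to generate it. The sketch below follows the layout of Rally's published sample tasks for the `NovaServers.boot_and_delete_server` scenario; the flavor name, image name, and counts are placeholders, and the layout should be checked against the Rally version in use.

```python
import json


def boot_and_delete_task(flavor, image, times, concurrency):
    """Build a Rally task that boots and deletes `times` servers,
    keeping `concurrency` operations in flight at once."""
    return {
        "NovaServers.boot_and_delete_server": [{
            "args": {
                "flavor": {"name": flavor},
                "image": {"name": image},
            },
            "runner": {
                "type": "constant",
                "times": times,
                "concurrency": concurrency,
            },
            "context": {
                "users": {"tenants": 2, "users_per_tenant": 2},
            },
        }]
    }


# Placeholder flavor and image names.
task = boot_and_delete_task("m1.small", "centos7", times=100, concurrency=10)
print(json.dumps(task, indent=2))
```

Saved to a file, the output can be passed to `rally task start`.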
Optimization
Key optimizations include overcommitting CPU only (not memory), NUMA awareness, Kernel Samepage Merging (KSM), DPDK for high-throughput packet processing, and extensive Ceph tuning (RBD cache, cache tiering, etc.).
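The effect of CPU-only overcommit on host capacity can be sketched with a little arithmetic. Nova exposes `cpu_allocation_ratio` and `ram_allocation_ratio` for this; the ratio of 4.0, the host size, the reserved memory, and the flavor below are illustrative assumptions, not values from the article.

```python
def schedulable_capacity(physical_cores, ram_gb, cpu_allocation_ratio,
                         ram_allocation_ratio=1.0, reserved_ram_gb=0.0):
    """Return (vCPUs, RAM in GB) that can be handed out on one host.

    A ram_allocation_ratio of 1.0 means memory is not overcommitted.
    """
    vcpus = int(physical_cores * cpu_allocation_ratio)
    usable_ram_gb = (ram_gb - reserved_ram_gb) * ram_allocation_ratio
    return vcpus, usable_ram_gb


def max_instances(vcpus, usable_ram_gb, flavor_vcpus, flavor_ram_gb):
    """Number of VMs of one flavor that fit; the scarcer resource wins."""
    return int(min(vcpus // flavor_vcpus, usable_ram_gb // flavor_ram_gb))


# Assumed host: 32 cores, 256 GB RAM, 8 GB reserved for the hypervisor.
vcpus, ram = schedulable_capacity(32, 256, cpu_allocation_ratio=4.0,
                                  reserved_ram_gb=8)
print(vcpus, ram, max_instances(vcpus, ram, flavor_vcpus=4, flavor_ram_gb=8))
```

With memory held at a 1.0 ratio, RAM rather than vCPU count is usually what caps VM density on a host like this.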
Version Upgrades
A cross‑version upgrade from Kilo to Mitaka was performed at the end of 2016, addressing RPC incompatibilities, kombu version mismatches, and custom patches.
Key Lessons
Plan logical resource zones for flexible scheduling.
Mitigate external constraints through software integration.
Leverage baseline services, ELK logs, and Ansible to improve operational efficiency.
Throttle VM creation and startup concurrency to avoid launch storms.
Model storage requirements based on workload performance, capacity, and scale.
Apply reasonable overcommit settings; monitor disk health to avoid fragmentation and OOM issues.
Ensure RabbitMQ and Erlang versions match to prevent excessive memory consumption.
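The lesson about throttling VM creation can be illustrated with a bounded worker pool. This is a minimal sketch: `create_throttled` and `fake_create` are hypothetical names, and `fake_create` stands in for whatever call actually boots a VM.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def create_throttled(names, create_fn, max_in_flight=5):
    """Run create_fn for every name, with at most max_in_flight calls active."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(create_fn, names))


# Stand-in for a real boot call; it records the peak number of concurrent calls.
state = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_create(name):
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)  # simulate the time a boot request takes
    with lock:
        state["active"] -= 1
    return name + "-ok"


results = create_throttled([f"vm{i}" for i in range(20)], fake_create,
                           max_in_flight=3)
print(len(results), "created; peak concurrency:", state["peak"])
```

Capping the number of in-flight requests spreads image fetches and disk I/O over time instead of hitting the hosts all at once.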
Q&A Highlights
Answers cover DPDK performance (up to ~1.26 Mpps on a 1 Gbps NIC), region design (one region per data center, all sharing a single Keystone), ELK alerting, storage choices favoring Ceph, upgrade policy (feature-driven), RabbitMQ pitfalls, a comparison of OpenStack and CloudStack, Keystone token type (UUID), container strategy (VMs first, then containers on VMs, now containers on bare metal), MySQL high-availability mode, and other operational details.
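To put the DPDK figure in context, it helps to compare it with the theoretical packet rate of the link. The article does not state the packet size used in the test, so the sketch below assumes minimum-size 64-byte Ethernet frames, each of which also occupies 20 bytes of preamble, start delimiter, and inter-frame gap on the wire.

```python
def line_rate_pps(link_bps, frame_bytes=64):
    """Theoretical maximum frames per second on an Ethernet link.

    Each frame costs an extra 20 bytes on the wire:
    7-byte preamble + 1-byte start delimiter + 12-byte inter-frame gap.
    """
    wire_bits = (frame_bytes + 20) * 8
    return link_bps / wire_bits


max_pps = line_rate_pps(1_000_000_000)   # 1 Gbps link
reported_pps = 1.26e6                     # figure quoted in the Q&A
print(f"line rate: {max_pps / 1e6:.3f} Mpps, "
      f"reported: {reported_pps / max_pps:.1%} of line rate")
```

Under that assumption, 1.26 Mpps is roughly 85% of what a 1 Gbps link can carry in small packets.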
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
