
Inside 360’s Ultron: How OpenStack Powers a Scalable Private Cloud

This article details the evolution, architecture, deployment, monitoring, and performance optimization of Ultron—360’s internal OpenStack‑based virtualization platform—covering its three development stages, technical stack, automation with Ansible, advanced features like VXLAN and Ceph, and lessons learned from large‑scale operations.

360 Zhihui Cloud Developer

360 Internal Virtualization Platform Development Stages

The platform progressed through three stages: (1) an early virtualization phase, when the technology was immature and performance was limited; (2) a mature phase, in which the virtualization stack grew richer and cloud-computing concepts drove rapid growth; (3) an application-centric phase focused on containerization and microservices.

Ultron Overview

Ultron is the internal IaaS platform that 360's Web Platform Department has been building on OpenStack since July 2015. It forms the core of the HULK cloud platform and delivers stable, high-performance virtual machines for a range of business scenarios.

The name references Marvel’s Ultron to convey the goal of controlling thousands of machines with fast scaling and elasticity.

Technical Stack

Platform: OpenStack components Keystone, Glance, Nova, Neutron, Cinder

Storage: Combination of local storage and Ceph shared storage

Network: Mixed VLAN and VXLAN modes

Ultron is also integrated extensively with internal systems such as account management, the CMDB, and network services.

Ultron Evolution

The project kicked off in July 2015, and the first VM was created four months later, in November. The platform initially used local storage with VLAN networking. Ceph shared storage and VXLAN were introduced later, followed by performance tuning and a major OpenStack version upgrade (Kilo → Mitaka). In 2017 the focus shifted toward containerization.

Current Usage

Ultron now supports over 90% of online services, spans nine data centers (Beijing, Shanghai, Guangzhou, Zhengzhou, Langfang, etc.), and runs 1,183 physical nodes hosting 5,944 virtual machines with a mixed local/Ceph storage model.

Why OpenStack?

OpenStack is open source, mature, and backed by a large community, described as second only to that of Linux, which makes it the de facto standard for IaaS.

OpenStack Architecture

Control nodes run in an Active/Active high-availability configuration, and RabbitMQ uses a mirrored setup. VM high availability is handled by the application layer for instances on local storage and by Ceph for instances on shared storage. Live migration relies on VXLAN and shared storage, so a VM can move between hosts almost instantly without a network interruption. Snapshots on Ceph are fast because they are performed entirely on the storage side.

Automation and Deployment

Deployment is automated with Ansible, organized into roughly 50 roles covering system initialization, control‑node setup, network‑node setup, and compute‑node setup.
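As a rough illustration of that four-stage layout, the sketch below builds a playbook structure in Python and prints it as JSON (which is also valid YAML). The host group names and role names are hypothetical placeholders; the article does not list the actual roles.

```python
import json

# Hypothetical host groups and role names, one entry per stage named in the article.
STAGES = [
    ("all", ["system_init"]),
    ("controller", ["keystone", "glance", "nova_api"]),
    ("network", ["neutron_agents"]),
    ("compute", ["nova_compute"]),
]


def build_playbook(stages):
    """Return a list of plays, one per stage, in deployment order."""
    return [
        {
            "name": f"Configure {hosts} hosts",
            "hosts": hosts,
            "become": True,
            "roles": roles,
        }
        for hosts, roles in stages
    ]


playbook = build_playbook(STAGES)
print(json.dumps(playbook, indent=2))
```

Keeping one play per stage means the ordering of system initialization, control nodes, network nodes, and compute nodes is explicit in the playbook itself.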

Advanced Features

VXLAN-based Layer 2 networking combined with Ceph enables rapid VM provisioning, live migration within seconds, snapshot creation within minutes, and high availability through Ceph's triple replication.
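Triple replication has a capacity cost that is worth making explicit when sizing a cluster. The sketch below is a simplified estimate that ignores Ceph's own overhead and free-space headroom; the 300 TB figure is an illustrative assumption, not a number from the article.

```python
def usable_capacity_tb(raw_tb, replicas=3):
    """Approximate usable capacity when every object is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_tb / replicas


# Illustrative figure only: 300 TB of raw disk behind a 3-replica pool.
raw_tb = 300
print(f"{raw_tb} TB raw gives about {usable_capacity_tb(raw_tb):.0f} TB usable")
```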

Monitoring

Three‑layer monitoring is in place: (1) OS‑level metrics collected by the internal Wonder system; (2) custom plugins monitor OpenStack component health and API functionality; (3) ELK stack aggregates logs for detailed service diagnostics.
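The second layer, a plugin that checks whether an OpenStack API actually works, might look roughly like the sketch below. It is not the Wonder plugin interface, which the article does not describe; the status codes follow the common Nagios-style convention, and `probe` stands in for whatever API call a real check would make.

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2


def check_api(probe, warn_seconds=2.0):
    """Run probe() and classify the result by success and latency."""
    start = time.monotonic()
    try:
        probe()
    except Exception as exc:  # any failure of the probe counts as critical
        return CRITICAL, f"probe failed: {exc}"
    elapsed = time.monotonic() - start
    if elapsed > warn_seconds:
        return WARNING, f"slow response: {elapsed:.2f}s"
    return OK, f"responded in {elapsed:.2f}s"


def failing_probe():
    """Hypothetical probe that simulates an unreachable API endpoint."""
    raise ConnectionError("connection refused")


print(check_api(lambda: None))
print(check_api(failing_probe))
```

Passing the probe in as a callable keeps the classification logic testable without a live OpenStack endpoint.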

Performance Testing

Rally is used for functional and concurrency performance testing before production rollout.
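A Rally concurrency test is described by a task file. The sketch below generates one for the `NovaServers.boot_and_delete_server` scenario in Rally's classic task format; the flavor name, image name, and load figures are assumptions chosen for illustration, not values used by the Ultron team.

```python
import json


def boot_and_delete_task(times, concurrency, flavor="m1.small", image="centos-7"):
    """Build a Rally task that boots and deletes `times` servers,
    keeping `concurrency` iterations in flight at once."""
    return {
        "NovaServers.boot_and_delete_server": [
            {
                "args": {
                    "flavor": {"name": flavor},  # assumed flavor name
                    "image": {"name": image},    # assumed image name
                },
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
                "context": {"users": {"tenants": 1, "users_per_tenant": 1}},
            }
        ]
    }


task = boot_and_delete_task(times=50, concurrency=10)
print(json.dumps(task, indent=2))
```

The printed JSON can be saved to a file and passed to Rally's task runner; raising `concurrency` step by step is one way to find where API latency starts to degrade.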

Optimization

Key optimizations include overcommitting CPU only (not memory), NUMA awareness, Kernel Samepage Merging (KSM), DPDK for high-throughput packet processing, and extensive Ceph tuning (RBD cache, cache tiering, etc.).
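To make CPU-only overcommit concrete: in Nova it corresponds to a `cpu_allocation_ratio` above 1.0 combined with a `ram_allocation_ratio` of 1.0. The sketch below shows the resulting schedulable capacity for one host. The ratio of 4.0 and the host size are illustrative assumptions; the article does not state the values Ultron uses.

```python
def schedulable_capacity(physical_cores, ram_gb, cpu_ratio=4.0, ram_ratio=1.0):
    """Capacity a scheduler may hand out on one host under the given ratios."""
    return {
        "vcpus": int(physical_cores * cpu_ratio),
        "ram_gb": ram_gb * ram_ratio,
    }


# Assumed host: 32 physical cores, 256 GB RAM, CPU overcommitted 4:1, RAM not at all.
print(schedulable_capacity(32, 256))
```

With memory left at 1.0, the host can never promise more RAM than it has, which avoids the swapping and OOM risk that memory overcommit brings.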

Version Upgrades

A cross‑version upgrade from Kilo to Mitaka was performed at the end of 2016, addressing RPC incompatibilities, kombu version mismatches, and custom patches.

Key Lessons

Plan logical resource zones for flexible scheduling.

Mitigate external constraints through software integration.

Leverage baseline services, ELK logs, and Ansible to improve operational efficiency.

Throttle VM creation and startup concurrency to avoid launch storms.

Model storage requirements based on workload performance, capacity, and scale.

Apply reasonable overcommit settings; monitor disk health to avoid fragmentation and OOM issues.

Ensure RabbitMQ and Erlang versions match to prevent excessive memory consumption.
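The lesson about throttling VM creation can be sketched with a semaphore that caps how many create calls are in flight at once. This is a minimal client-side illustration, not how Ultron implements throttling; `FakeCloud` is a hypothetical stand-in for a real API client, and the limit of 3 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledLauncher:
    """Caps the number of concurrent create_vm calls."""

    def __init__(self, create_vm, max_in_flight):
        self._create_vm = create_vm
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def launch(self, name):
        with self._slots:  # blocks while max_in_flight calls are running
            return self._create_vm(name)

    def launch_all(self, names, workers=16):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(self.launch, names))


class FakeCloud:
    """Hypothetical API client that records peak concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def create_vm(self, name):
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        time.sleep(0.01)  # simulate the time a boot request takes
        with self._lock:
            self.in_flight -= 1
        return name


cloud = FakeCloud()
launcher = ThrottledLauncher(cloud.create_vm, max_in_flight=3)
names = [f"vm-{i}" for i in range(20)]
results = launcher.launch_all(names)
print(f"created {len(results)} VMs, peak concurrency {cloud.peak}")
```

Even with 16 worker threads submitting requests, the semaphore keeps the number of simultaneous boots at or below the limit, which is what prevents a launch storm from hitting the hosts and storage all at once.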

Q&A Highlights

Answers cover DPDK performance (up to ~1.26 Mpps on a 1 Gbps NIC), region design (one region per data center, all sharing a single Keystone), ELK alerting, storage choices favoring Ceph, upgrade policy (feature-driven), RabbitMQ pitfalls, a comparison of OpenStack and CloudStack, Keystone token type (UUID), container strategy (VMs first, then containers on VMs, now containers on bare metal), the MySQL high-availability mode, and other operational details.
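To put the 1.26 Mpps figure in context, it can be compared with the theoretical packet rate of a 1 Gbps Ethernet link. The sketch below assumes minimum-size 64-byte frames, which the article does not state; each frame also occupies 20 extra bytes on the wire (8-byte preamble and start delimiter plus a 12-byte inter-frame gap).

```python
def line_rate_pps(link_bps, frame_bytes=64):
    """Maximum Ethernet frames per second at the given frame size."""
    wire_overhead_bytes = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap
    return link_bps / ((frame_bytes + wire_overhead_bytes) * 8)


max_pps = line_rate_pps(1_000_000_000)
reported_pps = 1.26e6  # figure quoted in the Q&A

print(f"line rate: {max_pps / 1e6:.3f} Mpps")          # 1.488 Mpps
print(f"reported / line rate: {reported_pps / max_pps:.1%}")
```

Under that assumption, the reported rate is roughly 85% of what the link can carry at the smallest frame size.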
