
How 360 Scales Advertising with Mesos & Docker: Lessons in Cloud‑Native Ops

An SRE engineer from 360 shares how Mesos and Docker containerization solved data‑center migration, fault recovery, OS inconsistencies, scaling, and resource inefficiencies in the company's advertising platform, detailing architecture, deployment, networking, storage, service discovery, and future plans for cloud‑native operations.

360 Zhihui Cloud Developer

Background

The talk was given by Li Dong, a senior SRE engineer at 360, on applying Mesos and Docker container technology to the 360 commercial advertising system to address various business pain points.

Why Container?

Containers provide a standard, stable, and scalable execution environment. Docker is not the only option: the Mesos containerizer and other engines (e.g., CoreOS rkt) run containers without depending on the Docker daemon, which removes one point of failure and can improve reliability.

Business Pain Points

Data‑center migration challenges and environment inconsistencies.

Fault recovery when physical servers fail.

Operating‑system version differences (CentOS 5/6/7).

Inconsistent production configurations.

Inconsistent test environments.

Low service scalability.

Poor server‑resource utilization.

Docker Benefits

Docker’s “container” metaphor standardizes deployment, enabling portable, repeatable launches (build, ship, run anywhere). Standardization comes from Dockerfiles, the docker run command, and immutable image layers, which together make every launch reproduce the same environment.

Docker Standardization Details

Dockerfile defines software versions.

docker run provides a uniform start command.

Image layers are immutable; each redeployment discards the writable layer, guaranteeing consistent environments.

Potential Issues with Containerization

Running SSH inside containers is discouraged; exposing services like rsync or puppet agents inside containers is unnecessary and insecure.

Logging and Monitoring

Container logs are shipped by Docker’s syslog logging driver over UDP to Graylog/Grafana for real‑time analysis.
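
As a rough illustration of the log shipping described above, the sketch below builds the docker run flags that select Docker's syslog logging driver with a UDP target. The hostname, port, tag, and image name are placeholders rather than values from the talk; only the flag names (--log-driver, --log-opt syslog-address, --log-opt tag) are standard Docker options.

```python
from typing import List


def syslog_log_flags(host: str, port: int, tag: str) -> List[str]:
    """Return docker run flags that send container stdout/stderr to a UDP syslog endpoint."""
    return [
        "--log-driver=syslog",
        "--log-opt", f"syslog-address=udp://{host}:{port}",
        # The tag is attached to each message so logs can be filtered per service.
        "--log-opt", f"tag={tag}",
    ]


if __name__ == "__main__":
    # Hypothetical endpoint and image, used only to print an example command line.
    flags = syslog_log_flags("graylog.example.internal", 514, "ad-web")
    print(" ".join(["docker", "run", "-d", *flags, "example/ad-web:1.0"]))
```

UDP keeps the container from blocking when the log collector is slow or unreachable, at the cost of possibly dropping messages.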

Docker Network Performance

Two Docker network modes are used: host (performance comparable to bare metal) and bridge. Calico is employed when a container needs its own external IP or when network isolation is required.
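
The difference between the two modes shows up directly in the launch command. The following sketch is illustrative only: the helper name and port numbers are invented, and it covers just the standard --network and -p flags, not the Calico setup.

```python
from typing import Dict, List, Optional


def network_flags(mode: str, port_map: Optional[Dict[int, int]] = None) -> List[str]:
    """Return docker run networking flags for the given mode.

    port_map maps host ports to container ports and only applies to bridge mode.
    """
    if mode == "host":
        # The container shares the host's network stack, so there is no NAT hop
        # and no port publishing; the service binds host ports directly.
        return ["--network", "host"]
    if mode == "bridge":
        flags = ["--network", "bridge"]
        for host_port, container_port in sorted((port_map or {}).items()):
            flags += ["-p", f"{host_port}:{container_port}"]
        return flags
    raise ValueError(f"unsupported network mode: {mode}")


if __name__ == "__main__":
    print(network_flags("host"))
    print(network_flags("bridge", {8080: 80}))
```

Host mode avoids the bridge and NAT overhead, which is why it performs close to bare metal; bridge mode trades some throughput for port isolation between containers on the same machine.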

Image Registry

Initially, a single‑node Docker registry with no authentication was used. It has since been replaced by Harbor with an S3 backend, which provides high availability, authentication, and multi‑region replication; CDN integration is planned.

Data Persistence

State is kept outside the container's writable layer, in services such as CephFS, MySQL, Kafka, Aerospike, and Redis. Kafka handles streaming data; CephFS is mounted into containers to provide persistent storage for other services.

Service Registration and Discovery

Mesos‑DNS and Marathon handle dynamic service discovery. When a Mesos‑slave fails, tasks are rescheduled on other nodes, and DNS updates reflect new IPs.
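
To make the discovery step concrete, the sketch below builds the names a client would look up. It assumes Mesos‑DNS's documented naming scheme (task.framework.domain for A records, _task._protocol.framework.domain for SRV records) with the default "mesos" domain and a flat Marathon app name; the app name "nimbus" is only an example, and the actual domain used at 360 is not stated in the talk.

```python
def a_record(task: str, framework: str = "marathon", domain: str = "mesos") -> str:
    """Hostname that resolves to the IPs of the agents currently running the task."""
    return f"{task}.{framework}.{domain}"


def srv_record(task: str, protocol: str = "tcp",
               framework: str = "marathon", domain: str = "mesos") -> str:
    """SRV name that additionally returns the dynamically assigned ports."""
    return f"_{task}._{protocol}.{framework}.{domain}"


if __name__ == "__main__":
    # Clients keep using the same name; after a task is rescheduled to another
    # node, the next lookup returns the new address.
    print(a_record("nimbus"))
    print(srv_record("nimbus"))
```

Because clients address the name rather than an IP, rescheduling after a node failure needs no client‑side configuration change, only a fresh DNS lookup.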

Why Mesos?

Mesos offers high resource utilization by dynamically allocating resources across workloads (e.g., advertising spikes vs. offline Hadoop jobs). It supports resource tagging, flexible scheduling, and modular extensibility via frameworks.

Mesos Architecture

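Master nodes (made highly available via ZooKeeper) collect the resources that agents report as free and present them to frameworks (e.g., Marathon, Flink) as resource offers. A framework accepts the offers it wants and hands back tasks, which agents then run via executors.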
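
The offer cycle can be sketched as a toy simulation. This is not the Mesos API: the class names, the first‑fit policy, and the resource figures are all invented for illustration. It only shows the division of labor, where the master tracks and offers resources while the framework decides placement.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


class Master:
    """Tracks free resources per agent and turns them into offers."""

    def __init__(self) -> None:
        self.free: Dict[str, Offer] = {}

    def register_agent(self, agent: str, cpus: float, mem_mb: int) -> None:
        self.free[agent] = Offer(agent, cpus, mem_mb)

    def make_offers(self) -> List[Offer]:
        # Offers are snapshots of what is currently free on each agent.
        return [Offer(o.agent, o.cpus, o.mem_mb) for o in self.free.values()]

    def launch(self, offer: Offer, task: Task) -> None:
        free = self.free[offer.agent]
        if task.cpus > free.cpus or task.mem_mb > free.mem_mb:
            raise ValueError(f"{task.name} does not fit on {offer.agent}")
        free.cpus -= task.cpus
        free.mem_mb -= task.mem_mb


def schedule(master: Master, pending: List[Task]) -> Dict[str, str]:
    """Framework side: place each task on the first offer that fits (first-fit)."""
    placements: Dict[str, str] = {}
    for task in pending:
        for offer in master.make_offers():
            if offer.cpus >= task.cpus and offer.mem_mb >= task.mem_mb:
                master.launch(offer, task)
                placements[task.name] = offer.agent
                break
    return placements


if __name__ == "__main__":
    m = Master()
    m.register_agent("agent-1", cpus=4, mem_mb=8192)
    m.register_agent("agent-2", cpus=2, mem_mb=4096)
    print(schedule(m, [Task("web", 3, 4096), Task("batch", 2, 2048)]))
```

A task that fits no offer simply stays unplaced until resources free up, which mirrors how a real framework declines offers and waits for the next round.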

Mesos Fault Recovery

Leader election through ZooKeeper enables master failover within ~10 seconds; frameworks and agents re‑register automatically.

Mesos Ecosystem

Key frameworks include Marathon (long‑running jobs, service discovery, health checks, web UI, resource constraints, labels, data persistence, graceful shutdown, GPU scheduling), Marathon‑lb (HAProxy‑based load balancer), and Mesos‑DNS.

Mesos vs. YARN

Mesos schedules diverse resources (CPU, memory, ports, disks) while YARN focuses on CPU and memory. Mesos delegates task scheduling to frameworks; YARN handles both resource and task scheduling.

Mesos vs. Kubernetes

Mesos is a resource‑management kernel; Kubernetes is a container‑orchestration platform. Mesos offers stable, infrequent releases and modular frameworks; Kubernetes provides a richer native feature set but updates more frequently.

360 Use Cases

360 has used Mesos for large‑scale resource pooling since 2015, running Chronos, Marathon, and custom frameworks across two data centers (1000+ nodes, 5000+ tasks). Applications include:

Service deployment and scaling via Marathon image updates.

Automatic fault recovery across slaves.

Service degradation strategies under resource pressure.

Real‑time service discovery with Mesos‑DNS and Marathon‑lb.

Storm Cluster Containerization

Supervisors run as containers on Mesos‑slaves; Mesos‑DNS resolves Nimbus IPs, enabling dynamic scaling and high availability.

Image Service Containerization

The PHP7‑based image service runs in containers and auto‑scales on load thresholds: for example, it expands from 10 to 15 instances when CPU > 10 for 10 consecutive checks, up to a cap of 36 instances. Monitoring data is sent to Graylog over UDP.
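
The scaling rule can be expressed as a small state machine. The sketch below is one reading of the example figures, not 360's implementation: it assumes the step size is 5 (inferred from "10 to 15"), that a check at or below the threshold resets the streak, and it leaves the CPU unit unspecified, as the source does. The class and parameter names are hypothetical.

```python
class ThresholdScaler:
    """Scale out by a fixed step after N consecutive checks above a CPU threshold."""

    def __init__(self, threshold: float = 10.0, checks_required: int = 10,
                 step: int = 5, max_instances: int = 36) -> None:
        self.threshold = threshold
        self.checks_required = checks_required
        self.step = step
        self.max_instances = max_instances
        self._breaches = 0  # length of the current streak of checks above threshold

    def observe(self, cpu: float, instances: int) -> int:
        """Record one check and return the target instance count."""
        if cpu > self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0  # assumed: a single normal check resets the streak
        if self._breaches >= self.checks_required:
            self._breaches = 0
            return min(instances + self.step, self.max_instances)
        return instances


if __name__ == "__main__":
    scaler = ThresholdScaler()
    n = 10
    for _ in range(10):
        n = scaler.observe(12.0, n)
    print(n)
```

Requiring a streak of breaches rather than a single one keeps short load spikes from triggering a scale‑out, and the cap bounds how much of the shared pool one service can claim.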

Other Services

Web services, Aerospike, Marathon‑lb, Kafka MirrorMaker, Redis, and CephFS are all containerized, providing high availability and resource efficiency.

CI/CD

GitLab Runner builds Docker images, pushes them to Harbor, and updates Marathon via API for automated deployments.
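
The last step of that pipeline, updating Marathon via its API, might look roughly like the sketch below. It assumes Marathon's REST endpoint PUT /v2/apps/{appId} and a Docker‑containerizer app definition; the Marathon URL, app ID, and image names are placeholders, and the request is only constructed, not sent.

```python
import copy
import json
import urllib.request
from typing import Any, Dict


def with_image(app: Dict[str, Any], image: str) -> Dict[str, Any]:
    """Return a copy of the app definition that points at a new image tag."""
    updated = copy.deepcopy(app)
    updated["container"]["docker"]["image"] = image
    return updated


def build_update_request(marathon_url: str, app: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) the PUT request that triggers a new deployment."""
    app_id = app["id"].strip("/")
    return urllib.request.Request(
        url=f"{marathon_url.rstrip('/')}/v2/apps/{app_id}",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical app definition and registry path.
    app = {
        "id": "/ad-web",
        "cpus": 0.5,
        "mem": 256,
        "instances": 2,
        "container": {
            "type": "DOCKER",
            "docker": {"image": "harbor.example.internal/ads/ad-web:1.0"},
        },
    }
    req = build_update_request(
        "http://marathon.example.internal:8080",
        with_image(app, "harbor.example.internal/ads/ad-web:1.1"),
    )
    print(req.get_method(), req.full_url)
    # In a CI job, urllib.request.urlopen(req) would submit the deployment.
```

Sending the full app definition rather than a partial body avoids relying on version‑specific partial‑update behavior in Marathon.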

Future Plans

Expand CephFS usage for high‑write workloads, explore Calico for advanced networking, and investigate real‑time machine‑learning pipelines supported by Mesos/Marathon.

Conclusion

Mesos and Docker enable 360 to achieve high resource utilization, fault tolerance, and scalable, cloud‑native operations for its advertising platform.

Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
