
Mesos and Chronos Deployment at iQIYI: Building a Private Cloud Platform and PaaS

At iQIYI, the author led a private‑cloud platform built on Mesos and Chronos that launches millions of containers per week for transcoding and analytics, then layered a feature‑rich PaaS called QAE on top of Marathon. The article contrasts IaaS with PaaS and outlines planned work on micro‑services, GPU management, and cluster scaling.

iQIYI Technical Product Team

Author: luffy – currently responsible for iQIYI’s private cloud platform (virtual machines and containers). Previously contributed to MeeGo and Tizen at Intel, and to open‑source projects such as systemd, dbus, Mesos, and Chronos.

iQIYI operates roughly 2,000 physical machines across several data centers, and a single Mesos cluster can contain up to 600 nodes. The Mesos platform launches more than 5 million containers per week, with a peak of about 20,000 concurrent containers. Resource oversubscription raises average CPU utilization above 20% (above 50% on the busiest clusters), a notable achievement for an internet company.

Mesos underpins critical services: video/audio/image transcoding, Storm, and Spark real‑time analytics, among others.

iQIYI began evaluating Mesos, which was inspired by Google Borg, in late 2013. In early 2014 Mesos 0.16.0 went into production together with a custom short‑task scheduler for transcoding. The team later migrated to Chronos, a more mature short‑task scheduler, and contributed over 30 commits to the Chronos community (the internal fork was named “Sisyphus”).
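To make the idea of a short‑task scheduler concrete, the following is a minimal sketch of what a Chronos‑style job definition looks like when built in Python. The field names (`name`, `command`, `schedule`, `epsilon`, `owner`, `cpus`, `mem`, `container`) follow the public Chronos job format; the job name, command, image, owner, and the helper `build_chronos_job` are hypothetical, and the article does not describe how iQIYI's transcoding jobs were actually defined.

```python
import json


def build_chronos_job(name, command, start, period, image, repeats=None,
                      cpus=1.0, mem_mb=1024):
    """Return a Chronos-style job definition as a plain dict.

    repeats=None means "repeat forever"; an integer n means "run n times".
    """
    # ISO 8601 repeating interval: R[n]/<start time>/<period between runs>
    repeat_part = "R" if repeats is None else f"R{repeats}"
    return {
        "name": name,
        "command": command,
        "schedule": f"{repeat_part}/{start}/{period}",
        "epsilon": "PT10M",                     # assumed tolerance for a late start
        "owner": "transcode-team@example.com",  # hypothetical owner address
        "cpus": cpus,
        "mem": mem_mb,
        "container": {"type": "DOCKER", "image": image},
    }


job = build_chronos_job(
    name="transcode-sample",                       # hypothetical job name
    command="transcode --input in.mp4 --out out",  # hypothetical command
    start="2015-06-01T00:00:00Z",
    period="PT1H",
    image="registry.example.com/transcoder:1.0",   # hypothetical image
    repeats=1,
)
payload = json.dumps(job, indent=2)
print(payload)
```

The resulting JSON is the kind of document a client would submit to the scheduler's REST API; the sketch deliberately stops short of making the HTTP call, since the endpoint and cluster address are deployment‑specific.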

By mid‑2015 Chronos fully supported the company’s transcoding workload, and Hadoop on Mesos was briefly trialed before moving to YARN. Storm was integrated into Mesos, while Spark on Mesos was not adopted due to stability concerns at the time.

In parallel, iQIYI investigated Marathon and built a proprietary PaaS called QAE (iQIYI App Engine) on top of it. QAE offers a self‑service container cloud with features such as role‑based access control, app and container monitoring, non‑intrusive custom metrics, subscription‑based alerts, auto‑scaling, gray releases, A/B testing, advanced placement strategies, health checks, log management, historical container inspection, a web console, and CI/CD automation.
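Several of the features listed above (health checks, placement strategies, rolling upgrades) correspond to fields in a Marathon app definition. The sketch below builds such a definition as a Python dict. The field names follow the classic Marathon app JSON format; the app id, image, port, and the helper `build_marathon_app` are hypothetical, and the article does not say how QAE maps its own features onto Marathon.

```python
import json


def build_marathon_app(app_id, image, instances, container_port=8080):
    """Return a Marathon-style app definition as a plain dict."""
    return {
        "id": app_id,
        "cpus": 0.5,
        "mem": 512,
        "instances": instances,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # hostPort 0 asks Marathon to pick a free host port
                "portMappings": [{"containerPort": container_port, "hostPort": 0}],
            },
        },
        # Health check: poll an HTTP endpoint and replace tasks that keep failing
        "healthChecks": [{
            "protocol": "HTTP",
            "path": "/health",          # hypothetical endpoint
            "portIndex": 0,
            "gracePeriodSeconds": 30,
            "intervalSeconds": 10,
            "timeoutSeconds": 5,
            "maxConsecutiveFailures": 3,
        }],
        # Placement: at most one instance of this app per host
        "constraints": [["hostname", "UNIQUE"]],
        # Rolling upgrade: keep at least half of the instances healthy
        "upgradeStrategy": {"minimumHealthCapacity": 0.5, "maximumOverCapacity": 0.2},
    }


app = build_marathon_app("/demo/web", "registry.example.com/web:1.0", instances=4)
app_json = json.dumps(app, indent=2)
print(app_json)
```

A PaaS layer would typically generate a document like this from higher‑level user input, so that developers never edit scheduler JSON directly.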

The author distinguishes IaaS (infrastructure‑as‑a‑service) from PaaS: IaaS provides raw compute resources with limited added value, while PaaS delivers higher‑level developer‑centric capabilities, making it a better fit for internal developers.

Looking forward, QAE is evolving toward micro‑service support. Although containers are often paired with micro‑services, the author emphasizes that containers are merely a deployment mechanism; micro‑service infrastructure (API gateways, APM, service discovery, tracing, rate limiting, etc.) is a separate, essential layer.

Future work includes:

Fine‑grained authentication for hosted compute clusters.

Dynamic detection of cluster capabilities (e.g., HOST network mode, large‑CPU containers).

Replacing the Docker daemon with the Mesos unified containerizer.

Switching from Device‑Mapper to OverlayFS for better concurrency.

Implementing dynamic Mesos oversubscription (in collaboration with Nanjing University).

Mixing offline and online workloads on Mesos.

Unified GPU resource management.

Integrating public‑cloud resources to accelerate cluster scaling.
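The oversubscription item in the list above can be illustrated with a little arithmetic. In Mesos, oversubscription relies on an agent‑side resource estimator and QoS controller; the sketch below is not that API, only a simplified illustration of the underlying idea: resources that are allocated but idle can be offered again as revocable capacity, minus some headroom. The `NodeUsage` type, the `revocable_cpus` function, and the 25% safety margin are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    total_cpus: float      # physical capacity of the node
    allocated_cpus: float  # CPUs promised to running tasks
    used_cpus: float       # CPUs those tasks actually consume


def revocable_cpus(node, safety_margin=0.25):
    """Estimate CPUs that could be re-offered as revocable capacity."""
    # Slack: capacity that is allocated to tasks but currently idle
    slack = max(0.0, node.allocated_cpus - node.used_cpus)
    # Keep a fraction of the node in reserve so bursts do not starve regular tasks
    reserve = safety_margin * node.total_cpus
    return max(0.0, slack - reserve)


busy = NodeUsage(total_cpus=32, allocated_cpus=30, used_cpus=28)
idle = NodeUsage(total_cpus=32, allocated_cpus=30, used_cpus=8)

print(revocable_cpus(busy))  # little slack, so nothing to re-offer
print(revocable_cpus(idle))  # large slack, so some CPUs become revocable
```

A dynamic scheme would re‑run an estimate like this continuously and evict revocable tasks when the regular tasks need their capacity back.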

The article concludes with a reflection on the progress made and the long road ahead, encouraging readers to share the experience.
