How iQIYI Uses Anolis OS and Koordinator to Optimize Offline‑Online Mixed Workloads

This article details iQIYI's multi‑year journey of co‑locating offline and online workloads, now built on Anolis OS and Koordinator. It covers the architectural evolution, the key factors for successful co‑location, resource‑allocation strategies, the role of Anolis OS, pilot results, and future directions.

Alibaba Cloud Native

iQIYI Offline‑Online Mixed Workload Background

Like many internet companies, iQIYI runs three main types of workload: business applications (stateful and stateless), databases and middleware, and offline tasks. Stateless services are well suited to mixed deployment, while stateful services are harder to co‑locate. Offline tasks such as nightly transcoding prioritize throughput over latency, so they are also candidates for co‑location.

Historical Exploration of Mixed Deployment

In 2013 iQIYI first attempted compute‑storage co‑location. After adopting containers, workloads including online content production, Spark, and Storm were placed together on Mesos without isolation, which led to resource contention. Later Docker‑based attempts forced services to be split by node and by cluster. That left online clusters chronically under‑utilized (night‑time utilization was often in single digits) while offline jobs ran short of resources.

In 2016, Mesos oversubscription introduced a counter‑based mechanism that classified tasks as latency‑sensitive or best‑effort, but fine‑grained isolation issues halted progress.

With the rise of Kubernetes, iQIYI leveraged its scaling capabilities to mix workloads directly, using Kata Containers to preserve service quality.

In 2022, Anolis OS and Koordinator were introduced together to build the next‑generation mixed‑deployment architecture.

Key Factors Influencing Mixed Deployment

Service quality, especially for online services; without guaranteed quality, mixing is meaningless.

Acquiring additional resources.

Task adaptation to the mixed environment.

Strategies for Acquiring Extra Resources

Two main approaches are used:

Single‑counter overselling: allocate resources at a fixed over‑commit ratio or based on empirical distribution across workload types.

Multiple‑counter schemes: either predict idle time and idle resources from historical data, or employ real‑time detection similar to Mesos oversubscription.
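To make the difference between the two approaches concrete, here is a minimal sketch. The function names, the 10% safety margin, and the use of the historical peak as the predictor are illustrative assumptions, not iQIYI's actual formulas.

```python
def fixed_ratio_extra(capacity_cores: float, overcommit_ratio: float) -> float:
    """Single-counter style: one counter sized at capacity * ratio.

    Returns the cores exposed beyond physical capacity.
    """
    if overcommit_ratio < 1.0:
        raise ValueError("over-commit ratio must be >= 1.0")
    return capacity_cores * (overcommit_ratio - 1.0)


def predicted_idle_extra(capacity_cores: float,
                         online_usage_history: list[float],
                         safety_margin: float = 0.1) -> float:
    """Multi-counter style: size a separate offline counter from history.

    Hypothetical predictor: assume online usage will not exceed its
    historical peak, and keep a fraction of capacity in reserve.
    """
    if not online_usage_history:
        return 0.0  # no history, so nothing is handed to offline tasks
    peak_online = max(online_usage_history)
    reserve = capacity_cores * safety_margin
    return max(0.0, capacity_cores - peak_online - reserve)


if __name__ == "__main__":
    # A 32-core node over-committed at 1.5x, versus the same node with
    # observed online usage samples of 8, 12, and 10 cores.
    print(fixed_ratio_extra(32, 1.5))
    print(predicted_idle_extra(32, [8, 12, 10]))
```

The fixed ratio ignores what online services actually use, while the history‑based counter shrinks to zero when online usage approaches capacity.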

Service‑Quality Strategies

Quality strategies are divided into static (fixed once deployed) and dynamic (adjusted at runtime based on offline or per‑process conditions).

Koordinator Overview

Koordinator does not fundamentally alter the distributed architecture but adds cloud‑native abstractions for workload types, making it possible to use Kubernetes as a generic distributed platform rather than a custom‑built solution.

It extends Kubernetes with an additional scheduler and a node‑level component called Koordlet that collects resource metrics and enforces task isolation.

The allocation mechanism uses counters to perform a secondary allocation of resources based on actual utilization, so the water‑line can be tuned dynamically while online service quality is preserved.
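The water‑line idea can be sketched as follows. This is a simplified model under stated assumptions, not Koordinator's actual calculation: the function names are hypothetical, and the water‑line is treated as a fraction of node capacity that total usage should stay below.

```python
def batch_allocatable(capacity: float, online_used: float,
                      system_used: float, waterline: float) -> float:
    """Resources that can be handed to offline tasks right now.

    Assumption: whatever is left under capacity * waterline after
    subtracting measured online and system usage is reclaimable.
    """
    if not 0.0 < waterline <= 1.0:
        raise ValueError("waterline must be in (0, 1]")
    return max(0.0, capacity * waterline - online_used - system_used)


def replay(capacity: float, samples: list[tuple[float, float]],
           waterline: float) -> list[float]:
    """Recompute the offline budget for each (online, system) usage sample."""
    return [batch_allocatable(capacity, online, system, waterline)
            for online, system in samples]


if __name__ == "__main__":
    # 64-core node, 50% water-line; online usage rises from 20 to 40 cores.
    print(replay(64, [(20, 2), (40, 2)], waterline=0.5))
```

As online usage rises, the offline budget shrinks and eventually reaches zero. Raising or lowering the water‑line trades offline throughput against headroom for online services.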

Task Types in Koordinator

Koordinator defines five task categories (four commonly shown) and provides tiered guarantees for online versus offline workloads.
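The article does not name the categories. Assuming they correspond to the five QoS classes described in upstream Koordinator documentation (SYSTEM, LSE, LSR, LS, and BE), and that pods declare their class through the `koordinator.sh/qosClass` label, the tiers could be modelled as below. The helper functions are hypothetical, and the label key should be checked against the Koordinator version in use.

```python
from enum import Enum


class QoSClass(Enum):
    """QoS classes as listed in upstream Koordinator docs (assumed mapping)."""
    SYSTEM = "SYSTEM"  # system daemons
    LSE = "LSE"        # latency-sensitive, exclusive CPUs
    LSR = "LSR"        # latency-sensitive, reserved CPUs
    LS = "LS"          # latency-sensitive, shared CPUs
    BE = "BE"          # best effort: offline or batch tasks


# Pod label key that carries the QoS class.
QOS_LABEL = "koordinator.sh/qosClass"


def is_offline(qos: QoSClass) -> bool:
    """Only best-effort tasks run on reclaimed resources in this sketch."""
    return qos is QoSClass.BE


def qos_labels(qos: QoSClass) -> dict[str, str]:
    """Build the metadata.labels fragment for a pod of the given class."""
    return {QOS_LABEL: qos.value}


if __name__ == "__main__":
    for qos in QoSClass:
        print(qos.name, "offline" if is_offline(qos) else "online/system",
              qos_labels(qos))
```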

Anolis OS Enhancements

iQIYI adopted Anolis OS, the operating system of the OpenAnolis community, to further improve mixed deployment. Two features are highlighted:

Group Identity: enables two separate process schedulers, one for online services (high priority) and one for offline tasks, preventing fine‑grained resource contention.

CPU Burst: lets a container bank unused CPU quota and spend it during short bursts, which reduces throttling under fair scheduling and smooths performance spikes.
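The effect of CPU Burst can be illustrated with a toy model of quota accounting. This is a deliberately simplified sketch, not the kernel implementation: it assumes quota left unused in one scheduling period is banked up to a burst cap and can be spent in later periods. All names and numbers are illustrative.

```python
def throttled_periods(demand_ms: list[float], quota_ms: float,
                      burst_ms: float = 0.0) -> int:
    """Count periods in which demand exceeds the available CPU budget.

    demand_ms: CPU time the container wants in each period.
    quota_ms:  CPU time granted per period.
    burst_ms:  cap on banked unused quota (0 disables bursting).
    """
    banked = 0.0
    throttled = 0
    for need in demand_ms:
        budget = quota_ms + banked
        if need > budget:
            throttled += 1       # container hits its limit this period
            used = budget
        else:
            used = need
        # Leftover budget carries over, but never beyond the burst cap.
        banked = min(burst_ms, budget - used)
    return throttled


if __name__ == "__main__":
    demand = [20, 20, 90, 20]    # one spike after two quiet periods
    print(throttled_periods(demand, quota_ms=50))               # no burst
    print(throttled_periods(demand, quota_ms=50, burst_ms=50))  # with burst
```

In this model the spike is throttled without bursting, but absorbed when quota saved in the quiet periods is available.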

Pilot Deployments and Results

The first pilot involved a real‑time content production service fully running on mixed resources, achieving zero‑cost reuse of idle servers and stable operation without unacceptable impact on online services.

A heavy offline workload, multiple‑pass video transcoding for bitrate reduction or quality improvement, is currently in a canary (gray‑release) validation phase and is expected to benefit from Anolis OS and Koordinator.

For large‑scale offline analytics, iQIYI continues to use Kata as the runtime and is exploring integration with Koordinator.

Performance measurements show an overall CPU utilization increase of more than 50% after the pilot, with peak‑period usage staying below the water‑line due to under‑utilized BE task requests. Further research aims to enable multi‑level (three‑fold, four‑fold, or unlimited) resource sharing.

Future Work

iQIYI plans to collaborate closely with the OpenAnolis community to push CPU utilization beyond 50%, address multi‑tenant resource allocation challenges for offline pools, and continue improving offline task quality guarantees.
