
Cougar: A General Framework for Jobs Optimization in Cloud

Cougar is a cloud‑native, multi‑objective optimization framework that unifies metadata and monitoring ingestion to improve resource efficiency and performance for large‑scale AI and big‑data jobs, demonstrating over 50% CPU‑memory savings and stable latency in production experiments.

AntTech

In recent years, rapid advances in AI, data processing, and the growing scale of datasets have highlighted the need for efficient computing systems, especially under green‑energy constraints. Ant Group’s Green Computing team therefore built a cloud‑distributed job management and optimization system called Cougar, which aims to provide high efficiency and high stability.

The work describing Cougar, titled "Cougar: A General Framework for Jobs Optimization In Cloud," has been accepted by the IEEE International Conference on Data Engineering (ICDE 2023), a top‑tier CCF‑A conference.

With the continuous development of cloud‑native technologies, the deployment of compute tasks on the cloud has become increasingly convenient and large‑scale. In 2022, Ant’s production environment hosted AI and big‑data tasks consuming two million CPU cores across hundreds of thousands of jobs. Over‑provisioned resources and runtime traffic fluctuations lead to waste, increased latency, and even failures, making resource‑aware performance optimization a critical challenge.

Existing industry solutions typically rely on elastic scaling based on historical statistics or time‑series predictions, and performance optimizations are often tightly coupled with specific compute engines, limiting reusability and causing conflicts among multiple optimization objectives.
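To make the baseline concrete, here is a minimal sketch of the kind of statistics-driven elastic scaling the paragraph describes: size a fleet so the recent peak demand lands at a target utilization. The function name, defaults, and inputs are illustrative assumptions, not part of Cougar or any specific industry system.

```python
import math

def recommend_replicas(cpu_history, target_util=0.6, cores_per_replica=4):
    """Naive elastic-scaling baseline: provision enough replicas so the
    recent peak CPU demand (in cores) sits at the target utilization."""
    peak = max(cpu_history)
    needed_cores = peak / target_util
    # Round up to whole replicas.
    return math.ceil(needed_cores / cores_per_replica)

# Hourly peak demand in cores for one job.
history = [10, 14, 22, 18, 16]
print(recommend_replicas(history))  # 10 replicas of 4 cores each
```

Because such a policy looks only at historical load, it cannot trade performance against resources or stability, which is the gap Cougar's multi-objective approach targets.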

To address these issues, the Ant Green Computing team created Cougar, a general framework that provides unified metadata and monitoring data ingestion, decoupling optimization decisions from compute engines and supporting multi‑objective optimization across performance, resource usage, and stability.

Overall Solution

The framework is designed around four main goals:

1. Unified metadata and monitoring data management, so optimization algorithms can be reused across engines.
2. Support for complex online optimization scenarios, including one-time initialization, periodic runtime tuning, coordination of multiple optimizers, and dynamic decision making based on previous results.
3. Efficient and flexible optimization control via an Event-&-Task abstraction, where each Event represents an optimization scenario and its Tasks execute distributedly on serverless modules.
4. High extensibility, allowing users to plug in custom modules for workflow control, algorithm execution, and coordination.
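The Event-&-Task control flow can be sketched as follows. This is a toy local model of the abstraction, assuming hypothetical `Event`, `Task`, and `Dispatcher` names; in the real system Tasks execute distributedly on serverless modules rather than in-process.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    """One unit of optimization work; here just a local callable that
    transforms a shared context dict."""
    name: str
    run: Callable[[dict], dict]

@dataclass
class Event:
    """One optimization scenario, e.g. job initialization or runtime tuning."""
    scenario: str
    tasks: List[Task] = field(default_factory=list)

class Dispatcher:
    """Fires an Event and runs its Tasks in order; each Task sees the
    results of earlier ones, enabling result-dependent decisions."""
    def __init__(self) -> None:
        self.events: Dict[str, Event] = {}

    def register(self, event: Event) -> None:
        self.events[event.scenario] = event

    def fire(self, scenario: str, ctx: dict) -> dict:
        for task in self.events[scenario].tasks:
            ctx = task.run(ctx)
        return ctx
```

For example, a "runtime-tune" Event might chain a profiling Task that writes measured CPU into the context and a resizing Task that reads it back to pick a replica count.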

Figure 1: Architecture of the Cougar framework

Optimization Scenarios

Cougar implements end‑to‑end optimization for both single‑task and multi‑task scheduling. In the single‑task case, the lifecycle is split into initialization and runtime phases. During initialization, historical profiling and Bayesian‑based offline bucket trials generate configuration recommendations to reduce cold‑start overhead. At runtime, a multi‑objective optimizer (e.g., NSGA‑II) combines rule‑based and predictive objective functions to coordinate multiple single‑objective algorithms, delivering a balanced solution that converges quickly. Figures 2 and 3 illustrate the optimization pipeline and the trade‑off surface between performance (Tt) and resource usage (Tr) for DAG‑type distributed jobs.

Figure 2: Single‑task optimization workflow

Figure 3: Performance‑resource trade‑off surface
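At the heart of the NSGA-II-style multi-objective step is non-dominated sorting over candidate configurations. The sketch below shows only that first step, Pareto-front extraction, over hypothetical (Tt, Tr) pairs where both latency and resource usage are minimized; it is not Cougar's actual optimizer.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one (both objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only non-dominated configurations: the first step of
    NSGA-II's non-dominated sorting."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Candidate configs as (Tt, Tr) = (latency, resource usage).
candidates = [(10, 5), (8, 7), (12, 4), (9, 9), (11, 6)]
print(pareto_front(candidates))  # [(10, 5), (8, 7), (12, 4)]
```

The surviving points trace the trade-off surface between Tt and Tr; a final rule-based or predictive objective then picks one balanced point from this front.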

In the multi‑task scheduling scenario, Cougar performs mixed‑placement decisions by jointly considering online inference and offline training workloads. Compared with conventional over‑commitment strategies, Cougar maintains SLA guarantees while improving overall resource utilization, as illustrated in Figure 4.

Figure 4: Multi‑task mixed‑placement algorithm
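A simple way to picture mixed placement is a greedy packer that places latency-sensitive online jobs first with SLA headroom reserved, then lets offline batch jobs fill leftover capacity. This is a minimal illustrative sketch under those assumptions; the function name, headroom parameter, and strategy are hypothetical, not the paper's algorithm.

```python
def mixed_place(nodes, online_jobs, offline_jobs, sla_headroom=0.2):
    """Greedy mixed placement: online jobs get capacity plus SLA headroom;
    offline jobs then fill the remaining cores instead of dedicated nodes."""
    placement = {n: [] for n in nodes}
    free = dict(nodes)  # node name -> free cores
    for name, cores in online_jobs:
        need = cores * (1 + sla_headroom)  # reserve headroom for traffic spikes
        node = max(free, key=free.get)     # pick the most-free node
        if free[node] < need:
            raise RuntimeError("insufficient capacity for online job " + name)
        free[node] -= need
        placement[node].append(name)
    for name, cores in offline_jobs:
        node = max(free, key=free.get)
        if free[node] >= cores:            # offline jobs need no headroom
            free[node] -= cores
            placement[node].append(name)
    return placement
```

The reserved headroom is what preserves the SLA under spikes, while co-locating batch work on the slack is what lifts overall utilization relative to plain over-commitment.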

Experimental Results

We evaluated Cougar on two large‑scale production workloads. In a data‑processing scenario, 3,000‑core streaming jobs were randomly selected from Ant’s online fleet. Compared with state‑of‑the‑art industry solutions, Cougar reduced CPU and memory consumption by more than 50% without degrading throughput (see Figure 5).

Figure 5: CPU/Memory reduction for Flink Join case

In an online service scenario, 20,000‑core inference tasks (both single‑model and multi‑model) were randomly chosen and tested under varying traffic loads. Cougar achieved an overall 15% reduction in replica count and CPU usage while keeping latency within normal bounds throughout a seven‑day continuous optimization period (see Figure 6).

Figure 6: Replica and CPU reduction for online inference

Conclusion & Future Plans

Across multiple AI and big‑data compute scenarios, Cougar’s multi‑objective optimization consistently delivers higher compute throughput per unit of resource compared with common industry approaches. The system is now being integrated into various Ant internal services such as inference, retrieval, Flink, and data integration, contributing to greener, more elastic computing. In 2023, Cougar and related green‑computing components will be gradually open‑sourced to help developers build efficient, carbon‑aware compute pipelines.

Tags: artificial intelligence, big data, cloud computing, resource management, multi-objective optimization, job optimization
Written by AntTech

Technology is the core driver of Ant's future creation.