How DataOps and Linear Programming Optimize MaxCompute Capacity Management

This article explains how Alibaba's MaxCompute platform tackles capacity bottlenecks by combining data‑driven insights, linear programming, and automated project migration strategies to predict resource needs, optimize cluster allocation, and quantify migration impacts for improved operational efficiency.

Efficient Ops

Aug 29, 2018

MaxCompute (formerly ODPS) is Alibaba's only self-developed big-data platform. It stores irreplaceable transaction data in a logically unified, exabyte-scale data pool that is physically spread across many large multi-region clusters, and this physical distribution leads to severe resource bottlenecks.

Clusters have multiple resource dimensions (compute, storage, file count), and inter-cluster bandwidth costs vary with distance.

Applications (projects) consume resources differently, especially compute resources, which show strong time-of-day variation.

Applications also depend on each other's data, consuming inter-cluster bandwidth; as the number and scale of clusters grow, these dependencies become increasingly complex.

Problem Statement

When any cluster reaches a resource bottleneck (compute, storage, or file count), it impacts the entire business and reduces overall efficiency, creating a capacity‑management challenge for MaxCompute. Two main solutions exist: purchasing additional machines (requiring accurate future‑resource forecasts) and optimizing application placement across clusters through global or local migration strategies.

Global placement reshuffles all online applications across clusters, while local migration targets only the clusters under pressure. Because automation is limited, migrations are slow, and business constraints apply, global placement is difficult to implement in practice.

We have made progress in both procurement budgeting and local migration optimization; the following sections detail a data-plus-algorithm approach to local migration.

1. Limitations of Manual Decision‑Making

Historically, migration decisions relied heavily on operators’ experience, which suffers from several drawbacks:

Operators must monitor many resource dimensions daily, risking missed or delayed migration needs.

Numerous projects with multi‑dimensional attributes and varying data dependencies make exhaustive manual evaluation infeasible.

Migration impacts are hard to predict; operators tend to adopt conservative plans to avoid resource shortages or excessive inter‑cluster bandwidth usage.

2. DataOps‑Driven New Thinking

While operator expertise is valuable, leveraging the abundant data generated by a big‑data platform enables more precise, timely, and quantifiable migration decisions, offering:

Early detection of migration demand.

Fine‑grained optimization of project migration strategies.

Accurate quantitative evaluation of migration outcomes.

3. Optimization Algorithm Design

3.1 Overall Solution Architecture

The solution consists of three steps:

Determine and quantify migration demand at the cluster level (which clusters need resources and how much).

Solve a linear‑programming model to select the optimal project‑to‑cluster assignments.

Quantify the impact of the migration on resource levels and inter‑cluster bandwidth, providing visual feedback for operators.

3.2 Cluster‑Level Demand Determination and Quantification

Clusters differ in server count, compute and storage capacity, and elasticity. By combining operator expertise with historical usage, we define cluster-specific "migration-required" and "receivable" watermarks for each resource dimension.

Peak-hour usage is the primary focus: comparing each cluster's peak watermark with its thresholds and total capacity yields the concrete amount to migrate out or the amount the cluster can receive.

If total demand exceeds total receiving capacity, the platform signals a need for capacity expansion; meanwhile, the algorithm generates temporary migration plans to alleviate immediate pressure.
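The watermark comparison can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the `ClusterDimension` fields, the threshold values, and the cluster names are all hypothetical, and it covers a single resource dimension.

```python
from dataclasses import dataclass


@dataclass
class ClusterDimension:
    """One resource dimension of one cluster (all fields are hypothetical)."""
    capacity: float           # total capacity, arbitrary units
    peak_usage: float         # observed peak-hour usage, same units
    migrate_watermark: float  # fraction of capacity above which load must move out
    receive_watermark: float  # fraction of capacity up to which load may be accepted


def migration_demand(dim):
    """Amount that must leave so peak usage drops to the migration watermark."""
    return max(0.0, dim.peak_usage - dim.migrate_watermark * dim.capacity)


def receiving_capacity(dim):
    """Amount the cluster can absorb before reaching its receivable watermark."""
    return max(0.0, dim.receive_watermark * dim.capacity - dim.peak_usage)


# Made-up numbers for three clusters, storage dimension only.
clusters = {
    "cluster_a": ClusterDimension(1000, 900, 0.80, 0.60),
    "cluster_b": ClusterDimension(1000, 400, 0.80, 0.60),
    "cluster_c": ClusterDimension(500, 350, 0.80, 0.60),
}

total_demand = sum(migration_demand(d) for d in clusters.values())
total_room = sum(receiving_capacity(d) for d in clusters.values())

# If demand exceeds room, migration alone cannot relieve the pressure.
needs_expansion = total_demand > total_room
```

Keeping the two watermarks separate leaves a buffer between them: a cluster that sits between its receivable and migration-required thresholds, like `cluster_c` here, neither sends nor receives projects.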

3.3 Linear‑Programming Problem Definition and Solving

After establishing cluster-level demand and capacity, we formulate a 0-1 integer linear-programming problem. Each decision variable x is binary: it equals 1 if a given project moves from its current cluster A to a candidate cluster B, and 0 otherwise. The objective minimizes distance-related bandwidth cost plus data-dependency penalties, while the constraints ensure that the source cluster releases enough resources and that no destination cluster exceeds its receiving capacity.
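One way to write such a model down is shown here. The article does not give its exact formulation, so the notation is an assumption. The sketch considers a single stressed source cluster and treats the dependency penalty as linear, folded into one cost coefficient. Here P is the set of projects on the source cluster, D the set of candidate destination clusters, R the set of resource dimensions, c_ij the combined bandwidth and dependency cost of moving project i to cluster j, u_ir the usage of project i in dimension r, d_r the amount the source must release, and q_jr the receiving capacity of cluster j.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i \in P} \sum_{j \in D} c_{ij}\, x_{ij} \\
\text{s.t.}\quad & \sum_{j \in D} x_{ij} \le 1 && \forall i \in P \\
& \sum_{i \in P} \sum_{j \in D} u_{ir}\, x_{ij} \ge d_r && \forall r \in R \\
& \sum_{i \in P} u_{ir}\, x_{ij} \le q_{jr} && \forall j \in D,\ r \in R \\
& x_{ij} \in \{0, 1\} && \forall i \in P,\ j \in D
\end{aligned}
```

The first constraint lets each project move to at most one destination, the second makes the source release at least the required amount in every dimension, and the third keeps every destination within its receiving capacity.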

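Simplex and interior-point methods solve continuous linear programs, so for a 0-1 integer program they are typically used inside branch-and-bound or branch-and-cut to solve the relaxations. Commercial solvers such as CPLEX and Gurobi implement these techniques, and tools such as MATLAB or SciPy offer mixed-integer solvers of their own.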
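To make the 0-1 model concrete, the toy instance below is solved by exhaustive enumeration using only the Python standard library. Enumeration stands in for a real integer-programming solver and is only viable for a handful of projects; a production system would hand the same objective and constraints to a dedicated solver. The project names, costs, and capacities are made up for illustration, and only one resource dimension is modeled.

```python
from itertools import product

# Hypothetical toy instance: three projects on one stressed source cluster,
# two candidate destinations ("b" and "c"), one resource dimension.
usage = {"p1": 60, "p2": 50, "p3": 40}
# Cost of moving a project to a destination
# (bandwidth cost plus dependency penalty, already combined).
cost = {
    ("p1", "b"): 5, ("p1", "c"): 9,
    ("p2", "b"): 4, ("p2", "c"): 3,
    ("p3", "b"): 6, ("p3", "c"): 2,
}
demand = 90                # amount the source cluster must release
room = {"b": 70, "c": 60}  # receiving capacity of each destination


def solve(usage, cost, demand, room):
    """Try every 0-1 assignment and keep the cheapest feasible one."""
    projects = sorted(usage)
    choices = [None] + sorted(room)  # None means the project stays put
    best_cost, best_plan = None, None
    for assignment in product(choices, repeat=len(projects)):
        plan = {p: d for p, d in zip(projects, assignment) if d is not None}
        # Source constraint: enough resources must be released.
        if sum(usage[p] for p in plan) < demand:
            continue
        # Destination constraints: no receiver may exceed its capacity.
        received = {d: 0 for d in room}
        for p, d in plan.items():
            received[d] += usage[p]
        if any(received[d] > room[d] for d in room):
            continue
        total = sum(cost[p, d] for p, d in plan.items())
        if best_cost is None or total < best_cost:
            best_cost, best_plan = total, plan
    return best_cost, best_plan


best_cost, best_plan = solve(usage, cost, demand, room)
```

In this instance the cheapest feasible plan moves `p2` to `b` and `p3` to `c`; sending both to the cheaper destination `c` would exceed its receiving capacity, which is exactly the kind of trade-off the constraints encode.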

4. Quantitative Estimation of Migration Impact

Migration impact is evaluated in three dimensions:

Storage and file‑count levels – relatively stable, thus impact is easy to estimate.

Compute resource levels – highly time‑variant; we distribute each project's CPU consumption across hourly slots to forecast 24‑hour CPU watermark changes.

Inter‑cluster traffic – data dependencies are incorporated into the objective function; after solving, we simulate the new cross‑cluster replication list and estimate bandwidth changes using historical traffic data.
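The hourly compute estimate can be illustrated with a small sketch. It assumes each project's CPU consumption is already available as a 24-slot hourly profile; the function names and all numbers are hypothetical. Moving a project subtracts its profile from the source cluster and adds it to the destination, hour by hour.

```python
HOURS = 24


def shift_project(src_hourly, dst_hourly, project_hourly):
    """Return (new_src, new_dst) hourly CPU usage after moving one project."""
    if not (len(src_hourly) == len(dst_hourly) == len(project_hourly) == HOURS):
        raise ValueError("each profile needs exactly 24 hourly slots")
    new_src = [s - p for s, p in zip(src_hourly, project_hourly)]
    new_dst = [d + p for d, p in zip(dst_hourly, project_hourly)]
    return new_src, new_dst


def peak_watermark(hourly, capacity):
    """Peak-hour usage as a fraction of cluster capacity."""
    return max(hourly) / capacity


# Made-up profiles: the source peaks at night (hours 0-5),
# the destination is busiest during the day.
src = [900] * 6 + [500] * 18
dst = [300] * 6 + [600] * 18
project = [200] * 6 + [50] * 18
capacity = 1000

new_src, new_dst = shift_project(src, dst, project)
before = (peak_watermark(src, capacity), peak_watermark(dst, capacity))
after = (peak_watermark(new_src, capacity), peak_watermark(new_dst, capacity))
```

With these numbers the source peak falls from 0.9 to 0.7, while the destination peak rises from 0.6 to 0.65. The destination's peak is set by its daytime hours, not by the night hours in which the project is heaviest, which is why a daily average would misjudge the impact.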

5. Application Results

The intelligent migration feature provides operators with precise, visualized assessments of each migration task, including:

Optimal migration plan generated by the algorithm.

Projected before‑and‑after changes in cluster compute and storage watermarks.

Estimated changes in inter‑cluster traffic.

Projected variation in cross‑region bandwidth utilization.

Conclusion

Project migration optimization demonstrates a DataOps‑driven approach to capacity management for large‑scale offline big‑data workloads. By integrating multiple resource dimensions into a linear‑programming model, the system can find optimal migration strategies under given constraints and quantify their effects.

Future work includes applying machine‑learning and time‑series forecasting to predict cluster resource consumption, enabling proactive procurement and migration planning.
