Sky Computing: A Multi‑Cloud Computing Platform for Transparent Resource Utilization
Sky Computing, presented here by the Ant Technology Research Institute, proposes a cloud‑agnostic platform that abstracts heterogeneous public and private clouds into a unified service layer. Applications can then migrate workloads across clouds, reduce costs, and avoid vendor lock‑in; the SkyML prototype applies the idea to AI training.
In 2021 Ant Group founded the Ant Technology Research Institute, which hosts six labs including databases, graph computing, privacy computing, compilers, and visual intelligence. The institute launched the A‑Talk column to share cutting‑edge research, and this article presents Professor Ion Stoica’s latest work on Sky Computing.
Sky Computing is envisioned as an "Internet for clouds" that turns the current siloed cloud ecosystem into a public computing fabric. It introduces two building blocks. The first is a compatibility set: a collection of services (e.g., Kubernetes, Spark, Snowflake) that can run on multiple clouds. The second is an intercloud broker that abstracts away individual cloud APIs, creates a two‑sided market between services and user applications, and handles service discovery, scheduling, optimization, and billing.
The broker receives a job description (often a DAG) and user preferences (cost, latency, security), consults a service catalogue, partitions the workload, and dispatches each component to the most suitable cloud (e.g., Azure Confidential Computing for SGX‑based privacy processing, Google TPU for training, AWS Inferentia for inference). In the BERT fine‑tuning example presented, this flexible placement cuts cost by roughly 60% and speeds up execution by roughly 47%.
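The placement step described above can be sketched in a few lines of Python. This is a minimal illustration, not the broker's actual implementation: the `Offer` and `Task` types, the `place` function, the service names, and every price and runtime below are hypothetical, and the sketch ignores DAG dependencies and data egress costs between clouds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One catalogue entry: a service offered on one cloud (hypothetical numbers)."""
    cloud: str
    service: str
    price_per_hour: float
    est_hours: float
    confidential: bool = False


@dataclass(frozen=True)
class Task:
    """One node of the job DAG, with the user's security preference."""
    name: str
    service: str
    needs_confidential: bool = False


def place(tasks, catalogue, cost_weight=1.0, time_weight=0.0):
    """Pick, per task, the offer that minimizes a weighted cost/time score."""
    placement = {}
    for task in tasks:
        candidates = [
            o for o in catalogue
            if o.service == task.service
            and (o.confidential or not task.needs_confidential)
        ]
        if not candidates:
            raise ValueError(f"no cloud offers {task.service!r} for {task.name!r}")
        placement[task.name] = min(
            candidates,
            key=lambda o: cost_weight * o.price_per_hour * o.est_hours
            + time_weight * o.est_hours,
        )
    return placement


# Assumed catalogue; prices and runtimes are made up for illustration.
CATALOGUE = [
    Offer("azure", "preprocess", 1.0, 2.0, confidential=True),
    Offer("aws", "preprocess", 0.5, 2.0),
    Offer("gcp", "train", 4.0, 3.0),
    Offer("aws", "train", 3.0, 5.0),
    Offer("aws", "infer", 0.8, 1.0),
    Offer("gcp", "infer", 1.2, 1.0),
]
TASKS = [
    Task("preprocess", "preprocess", needs_confidential=True),
    Task("train", "train"),
    Task("infer", "infer"),
]

placement = place(TASKS, CATALOGUE)
for name, offer in placement.items():
    print(f"{name} -> {offer.cloud}")
```

Raising `time_weight` shifts the choice toward faster offers, which is one simple way to model a user who prefers low latency over low cost.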
Sky Computing also addresses multi‑cloud challenges such as data egress costs, trust domains, and failure domains. It differentiates itself from existing multi‑cloud efforts by focusing on transparency rather than a uniform API, and by allowing services to exist on a subset of clouds without forcing a common standard.
Early prototypes include SkyML, an intercloud broker for AI training and hyper‑parameter tuning that automatically selects the cheapest cloud resources (Azure, GCP, AWS) based on user deadlines, and Skylark, a data‑movement system that uses overlay routing, hot‑potato routing, and multi‑VM parallelism to achieve up to 4.6× faster cross‑region transfers compared with native cloud tools.
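To make the overlay‑routing idea concrete, the sketch below compares a direct transfer with one‑hop relays through intermediate regions and keeps the path whose slowest link is fastest. It is an assumption‑laden illustration rather than Skylark's algorithm: the `best_path` function, the region names, and the throughput figures are hypothetical, and the sketch ignores the extra egress charges a relay hop would incur.

```python
def best_path(src, dst, gbps, relays):
    """Return (path, bottleneck_gbps) for the best direct or one-hop relay route.

    `gbps` maps (from_region, to_region) to measured throughput; missing
    links are treated as unusable (0.0).
    """
    best = ([src, dst], gbps.get((src, dst), 0.0))
    for relay in relays:
        if relay in (src, dst):
            continue
        # A relayed transfer is limited by its slower leg.
        bottleneck = min(gbps.get((src, relay), 0.0), gbps.get((relay, dst), 0.0))
        if bottleneck > best[1]:
            best = ([src, relay, dst], bottleneck)
    return best


# Assumed throughput measurements in Gbps, made up for illustration.
GBPS = {
    ("aws:us-east-1", "gcp:asia-east1"): 1.0,
    ("aws:us-east-1", "azure:westus2"): 5.0,
    ("azure:westus2", "gcp:asia-east1"): 3.0,
}

path, rate = best_path(
    "aws:us-east-1", "gcp:asia-east1", GBPS, relays=["azure:westus2"]
)
print(path, rate)
```

With these assumed numbers the relayed route wins because its bottleneck link is faster than the direct link; multi‑VM parallelism would then multiply throughput along whichever path is chosen.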
The authors argue that Sky Computing will accelerate cloud adoption, reduce vendor lock‑in, enable specialized clouds (e.g., compute‑optimized NVIDIA or storage‑optimized providers), and open new business models for third‑party service providers and emerging chip vendors. They see it as the next logical step in cloud evolution, akin to how the Internet spurred networking research.
AntTech
Technology is the core driver of Ant's future creation.