
Unlocking Alibaba’s Massive Cluster Data V2018: A Treasure Trove for Big‑Data Research

Alibaba has released Cluster Data V2018, a dataset covering eight days of operation on 4,000 servers running mixed online and offline workloads. It includes DAG information for offline jobs and lets researchers study large‑scale data‑center performance, resource utilization, and scheduling algorithms.

Alibaba Cloud Developer

Dec 20, 2018


Dataset Overview

Alibaba recently released a real‑world cluster dataset called Alibaba Cluster Data V2018. It records detailed information about the servers in a production cluster and the tasks running on them, with the aim of bridging the gap between academia and industry.

The dataset contains six files, compressed to nearly 50 GB (over 270 GB uncompressed). It covers 4,000 machines, together with their online application containers and offline compute tasks, over eight days.

What the Data Can Be Used For

Understand the characteristics of servers and task execution in a modern data‑center.

Evaluate scheduling, optimization, and other task‑management algorithms as a basis for research papers.

Perform data analysis to uncover patterns not previously observed.

Examples of research questions include:

How can overall resource utilization be improved when workload demand fluctuates between day and night?

What is the maximum dependency depth of DAGs in the cluster?

What is the typical lifetime of a container?

Do multiple instances of the same task run for the same duration?
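As a starting point for the last question, the spread of instance durations within each task can be summarized with a coefficient of variation. The sketch below is only an illustration under stated assumptions: the (task, start, end) record layout, the task names, and the `duration_spread` helper are all hypothetical and do not reflect the dataset's actual schema, so real records would first need to be mapped into this shape.

```python
from collections import defaultdict
from statistics import mean, pstdev

def duration_spread(instances):
    """Return {task: (mean_duration, coefficient_of_variation)}.

    `instances` is an iterable of (task_name, start_time, end_time) tuples
    with times in seconds. This layout is an assumption for illustration,
    not the dataset's actual schema.
    """
    durations = defaultdict(list)
    for task, start, end in instances:
        if end >= start:  # skip records with inconsistent timestamps
            durations[task].append(end - start)

    spread = {}
    for task, values in durations.items():
        avg = mean(values)
        # Coefficient of variation: 0 means every instance ran equally long.
        cv = pstdev(values) / avg if avg > 0 else 0.0
        spread[task] = (avg, cv)
    return spread

# Hypothetical sample records, not taken from the dataset.
sample = [
    ("task_a", 0, 10), ("task_a", 5, 15), ("task_a", 7, 17),
    ("task_b", 0, 10), ("task_b", 0, 30),
]
result = duration_spread(sample)
for task, (avg, cv) in sorted(result.items()):
    print(f"{task}: mean={avg}s cv={cv:.2f}")
```

A coefficient of variation near zero suggests that instances of a task behave uniformly, while larger values point to stragglers or skewed input that may be worth a closer look.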

Key Differences in V2018

The new version expands on V2017 in two major ways:

DAG information added: Offline tasks now include Directed Acyclic Graph (DAG) details, representing the largest production‑grade DAG data publicly available.

Scale increased: While V2017 covered ~1,300 machines for ~24 hours, V2018 covers 4,000 machines for eight days.

Understanding DAGs:

Offline compute jobs built on frameworks such as MapReduce, Hadoop, Spark, and Flink are expressed as DAGs. Each node is a task and each edge is a dependency, so the graph captures both which tasks can run in parallel and which must wait for others.
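To make the idea concrete, a small job can be written as a mapping from each task to the tasks it depends on, and a topological pass then yields both the parallel "levels" and the maximum dependency depth. This is a minimal sketch using Kahn's algorithm; the task names and the `dag_levels` helper are hypothetical and are not taken from the dataset.

```python
from collections import deque

def dag_levels(deps):
    """Assign each task a level using Kahn's topological sort.

    `deps` maps a task to the list of tasks it depends on. Tasks with no
    dependencies get level 1; any other task gets 1 + the highest level
    among its parents. Raises ValueError if the graph has a cycle.
    """
    tasks = set(deps)
    for parents in deps.values():
        tasks.update(parents)

    children = {t: [] for t in tasks}
    pending = {t: 0 for t in tasks}  # number of unmet dependencies
    for task, parents in deps.items():
        for parent in set(parents):
            children[parent].append(task)
            pending[task] += 1

    level = {t: 1 for t in tasks if pending[t] == 0}
    queue = deque(sorted(level))
    processed = 0
    while queue:
        current = queue.popleft()
        processed += 1
        for child in children[current]:
            level[child] = max(level.get(child, 1), level[current] + 1)
            pending[child] -= 1
            if pending[child] == 0:
                queue.append(child)

    if processed != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return level

# Hypothetical job: two map tasks feed a join, which feeds a reduce.
job = {
    "map_1": [],
    "map_2": [],
    "join_3": ["map_1", "map_2"],
    "reduce_4": ["join_3"],
}
levels = dag_levels(job)
max_depth = max(levels.values())
print(levels, max_depth)
```

Tasks that share a level have no dependency path between them and can run in parallel, while the highest level is the length of the longest dependency chain in the job.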

Impact and Community

The first release (V2017) already led to several high‑quality academic papers, including award‑winning work at OSDI and publications at BIGDATA, APSys, and SoCC. The V2018 dataset is expected to enable further breakthroughs in resource management, scheduling, and workload characterization.

Researchers interested in using the data can download it by replying “Dataset” to the official Alibaba Technology WeChat account or by following the provided download link.

