Introducing AMD and ARM Bare‑Metal Instances for iQIYI Big Data Computing: Cloud Selection, Performance Evaluation, and Heterogeneous Scheduling
To reduce costs and boost compute density, iQIYI's big data team migrated from aging private‑cloud Intel servers to public‑cloud AMD and ARM bare‑metal instances, establishing a systematic machine‑selection process, performance testing framework, and YARN‑based heterogeneous scheduling to fully leverage the new hardware.
Under a cost‑reduction goal, iQIYI's big data computing team launched public‑cloud AMD and ARM bare‑metal machines starting in 2024 to accelerate hardware iteration, increase per‑node compute density, and dramatically improve cost‑performance.
Previously the team relied on private‑cloud physical servers equipped with several‑year‑old Intel CPUs, which suffered from outdated configurations. Observing the strong performance of AMD and ARM in data‑center servers, they decided to evaluate these newer architectures.
During the machine‑selection phase, the team weighed the pros and cons of private versus public clouds and considered the future evolution of iQIYI's big data platform. Key reasons for choosing public‑cloud instances included flexible leasing periods that match business cycles, rapid hardware refresh cycles, and the ability to select configurations with higher CPU‑to‑memory ratios (moving from 1 core : 4‑5 GB to 1 core : 8 GB) and over‑subscription deployment (2 vcores = 1 core at the OS level) to raise CPU utilization.
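The over‑subscription scheme (2 vcores = 1 physical core) maps naturally onto YARN's node‑resource configuration. A minimal sketch of what this could look like in yarn-site.xml, assuming a 64‑physical‑core bare‑metal node (the core count is illustrative; the article does not publish the actual values):

```xml
<!-- yarn-site.xml on a hypothetical 64-core node:
     advertise 128 vcores so that 2 vcores map to 1 physical core -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>128</value>
</property>
```

Advertising more vcores than physical cores lets the scheduler pack more containers per node, trading some per-task latency for higher aggregate CPU utilization.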
After evaluating dozens of instance types from multiple cloud providers—covering Intel Xeon, AMD Genoa, ARM CPUs, various storage options, and both VM and bare‑metal forms—the team selected AMD and ARM bare‑metal machines. While AMD shares the x86 architecture and requires no special adaptation, ARM’s aarch64 architecture demanded substantial component and code modifications, which the article details.
The migration process comprised four stages: (1) public‑cloud machine selection, (2) big‑data component adaptation, (3) business‑code compatibility testing and refactoring, and (4) heterogeneous scheduling.
2.1 Machine Admission Thresholds – The team defined coarse‑grained filters based on current cluster metrics such as required CPU cores, disk capacity, per‑core memory, disk and network bandwidth, and total disk bandwidth. These thresholds dramatically reduced the number of candidate machines for testing.
2.2 Reproducible Standardized Performance Tests – Tests included basic CPU and disk benchmarks (integer/floating‑point workloads, dd, fio) and a representative TPC‑DS SQL workload on a 10 TB dataset. Only machines that outperformed the existing private‑cloud baseline proceeded to production‑load testing.
2.3 Production Load Performance Quantification – Nodes from multiple clouds ran in a mixed cluster for a week; Spark SQL task metrics were collected and aggregated into a per‑core‑second throughput (total input bytes / total vcore‑seconds). The observability system tagged nodes (e.g., vendor=xxx_cloud, cpuarch=aarch64) and visualized performance across instance types (see Figure 1).
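The article does not publish the aggregation code; a minimal Python sketch of the per‑core‑second metric, with illustrative record fields and tag values (not iQIYI's actual schema), might look like:

```python
from collections import defaultdict

def aggregate_per_core_second(task_records):
    """Aggregate Spark SQL task metrics into a per-instance-type score:
    total input bytes processed per (vcore * second) of compute spent."""
    input_bytes = defaultdict(int)
    vcore_seconds = defaultdict(float)
    for rec in task_records:
        # Group by the same node tags the observability system uses.
        key = (rec["vendor"], rec["cpuarch"], rec["instance_type"])
        input_bytes[key] += rec["input_bytes"]
        vcore_seconds[key] += rec["vcores"] * rec["runtime_seconds"]
    return {k: input_bytes[k] / vcore_seconds[k] for k in input_bytes}

# Two hypothetical samples: the aarch64 node processes more bytes per vcore-second.
records = [
    {"vendor": "xxx_cloud", "cpuarch": "aarch64", "instance_type": "bm.a1",
     "input_bytes": 8_000_000, "vcores": 4, "runtime_seconds": 100},
    {"vendor": "yyy_cloud", "cpuarch": "x86_64", "instance_type": "bm.x1",
     "input_bytes": 6_000_000, "vcores": 4, "runtime_seconds": 100},
]
scores = aggregate_per_core_second(records)
```

Normalizing by vcore‑seconds rather than wall‑clock time makes instance types with different core counts directly comparable.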
Figure 1: Performance Observation of Different Instance Types
Based on the performance index and price index, the team calculated a cost‑performance score for each instance and selected the most economical machines.
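The exact scoring formula is not given in the article; a natural sketch is the performance index divided by the price index, with both normalized against the private‑cloud baseline (the candidate names and index values below are illustrative):

```python
def cost_performance(perf_index: float, price_index: float) -> float:
    """Higher is better: compute delivered per unit of price.
    Both indices are normalized so the baseline scores 1.0."""
    return perf_index / price_index

# Hypothetical candidates, indexed against the existing Intel baseline.
candidates = {
    "arm_bare_metal": cost_performance(1.35, 0.90),
    "amd_bare_metal": cost_performance(1.25, 0.95),
    "intel_baseline": cost_performance(1.00, 1.00),
}
best = max(candidates, key=candidates.get)
```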
3 ARM Compatibility Adaptation
Because ARM’s architecture differs from x86, the following adaptations were required:
Hadoop 3.2.2: compiled RPMs for ARM, ensured native HDFS libraries, and upgraded leveldbjni to ARM‑compatible versions.
Flink Connectors: adapted more than 20 connectors.
Spark 3.5.0, Iceberg, Paimon: largely compatible with minimal changes.
Business tasks fall into three categories: Jar, SQL, and DAG configurations. Jar tasks often bundled ARM‑incompatible dependencies (e.g., older Netty versions, native C++ libraries). Scanning online Jars revealed a roughly 10% incompatibility rate; the team provided upgrade guidance for Netty, the Couchbase client, the HBase client, Jansi, and others, and added a pre‑execution compatibility check to the development platform.
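The core of such a Jar scan is straightforward: a jar is a zip archive, so bundled x86‑only native libraries can be spotted by their entry names. A minimal Python sketch, with illustrative marker patterns (the article does not describe iQIYI's actual detection rules):

```python
import io
import zipfile

# Entry-name patterns that suggest x86-only native code; illustrative only.
X86_NATIVE_MARKERS = ("linux-x86_64", "x86_64.so", "amd64.so")

def scan_jar_for_arm_issues(jar_bytes: bytes) -> list:
    """Return jar entries that look incompatible with aarch64:
    bundled native libraries built only for x86_64."""
    issues = []
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for name in jar.namelist():
            if name.endswith(".so") and any(m in name for m in X86_NATIVE_MARKERS):
                issues.append(name)
    return issues

# Build a tiny in-memory jar containing an x86-only Netty native lib.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("META-INF/native/libnetty_transport_native_epoll_x86_64.so", b"\x7fELF")
    jar.writestr("com/example/App.class", b"\xca\xfe\xba\xbe")
issues = scan_jar_for_arm_issues(buf.getvalue())
```

A production check would also need to recurse into nested jars and consult a deny‑list of known‑bad artifact versions, but the name‑pattern scan above is the essential mechanism.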
4 Heterogeneous Scheduling
To direct workloads to appropriate nodes, YARN Node Attributes and Placement Constraints were employed. The team extended Spark, Flink, and MapReduce to honor these constraints (see table). Example Spark configuration:
--conf "spark.yarn.schedulingRequestEnabled=true" \
--conf spark.yarn.executor.placement-constraint.spec="AND(lifecycle=reserved:staticresource=true)"

Engine adaptations and their placement‑constraint parameters:

Spark: extended YarnAllocator to support placement constraints, configurable via flags. Parameter: --conf "spark.yarn.schedulingRequestEnabled=true" --conf spark.yarn.executor.placement-constraint.spec="AND(lifecycle=reserved:staticresource=true)"

Flink: implemented YarnResourceManagerDriver and SchedulingRequestReflector for compatibility. Parameter: -Dyarn.taskmanager.placement-constraint.spec="AND(lifecycle=reserved:staticresource=true)"

MapReduce: replaced the resource request with a scheduling request in RMContainerRequestor. Parameter: -D mapreduce.placement-constraint.spec="lifecycle=reserved,staticresource=true"

Because the default YARN CapacityScheduler allocates containers round‑robin, low‑core nodes were saturated first, leaving the high‑core ARM/AMD nodes under‑utilized. The team therefore implemented a custom balancing policy: nodes are sorted by vcore usage in ascending order, high‑capacity nodes are allocated first, and low‑performance nodes are throttled once they reach a high‑usage watermark.
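The balancing policy's node ordering can be sketched in a few lines of Python. The field names, the 0.8 watermark, and the tie‑break by capacity are illustrative assumptions; the article describes the policy only at the level of "sort by vcore usage ascending, prefer high‑capacity nodes, throttle low‑performance nodes at a watermark":

```python
def pick_node(nodes, low_perf_watermark=0.8):
    """Choose the next node for allocation under the custom balancing policy:
    least-loaded first (by vcore usage ratio), preferring higher-capacity
    nodes on ties, and skipping low-performance nodes past the watermark."""
    eligible = [
        n for n in nodes
        if n["high_perf"] or n["used_vcores"] / n["total_vcores"] < low_perf_watermark
    ]
    # Sort by usage ratio ascending; break ties by capacity descending.
    eligible.sort(key=lambda n: (n["used_vcores"] / n["total_vcores"],
                                 -n["total_vcores"]))
    return eligible[0]["name"] if eligible else None

nodes = [
    {"name": "arm-bm-1",  "total_vcores": 128, "used_vcores": 32, "high_perf": True},
    {"name": "x86-old-1", "total_vcores": 48,  "used_vcores": 12, "high_perf": False},
    {"name": "x86-old-2", "total_vcores": 48,  "used_vcores": 44, "high_perf": False},
]
choice = pick_node(nodes)
```

Here arm-bm-1 and x86-old-1 are equally loaded (25%), so the higher‑capacity ARM node wins the tie, while x86-old-2 sits above the watermark and is skipped entirely.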
Figure 5: Round‑Robin Resource Allocation
Figure 6: Custom Balancing Scheduling Strategy
After optimization, high‑performance nodes are preferentially utilized during peak periods, while low‑performance nodes operate at lower allocation levels.
5 Deployment Results
ARM nodes achieve stable CPU utilization up to 90 % (versus ~70 % for x86), and Spark/Flink job runtimes improve by over 30 %. More than ten thousand ARM cores have been deployed and run stably for over six months.
6 Future Plans
The team will continue to increase the share of high‑cost‑performance AMD/ARM resources, forming a hybrid‑cloud tiered compute pool with fixed (X %), elastic (Y %) and long‑term elastic (Z %) segments (X + Y + Z = 100 %). They are also developing multi‑cloud unified scheduling and caching mechanisms to keep data traffic within the same cloud and reduce cross‑cloud bandwidth consumption.
Additional references: iQIYI Big Data Offline‑Online Mixed Deployment, iQIYI Multi‑AZ Unified Scheduling Architecture, and Alluxio Practice at iQIYI.
iQIYI Technical Product Team