Analysis of Arm's 2023 Cortex‑X4, A720, and A520 Microarchitectures

Arm’s 2023 processor lineup (Cortex‑X4, A720, and A520) brings a 15% performance uplift for the X4, 20‑22% efficiency gains for the A720 and A520, a 64‑bit‑only Armv9.2 ISA with QARMA3 PAC, larger caches, expanded decode and execution resources, and a DSU120 module supporting up to 14 cores and 32 MiB of L3.

OPPO Kernel Craftsman

Jul 7, 2023

In May 2023 Arm released its next‑generation processor lineup: the high‑performance Cortex‑X4, the efficiency‑focused A720, and the small‑core A520. This article reviews the architectural changes of these cores, highlights the new Armv9.2 ISA, and discusses the updated DSU120 system‑level module.

The three cores target different goals. Cortex‑X4 aims for a 15% performance uplift over Cortex‑X3, while A720 and A520 focus on 20% and 22% energy‑efficiency improvements respectively, all on the same TSMC 4 nm process.

Arm also introduced the Armv9.2 instruction set, adding the QARMA3 PAC algorithm, expanded floating‑point capabilities, and PMU enhancements. Notably, all three new cores drop 32‑bit support.
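Because the new cores drop 32‑bit support, one practical check is whether a software stack still ships AArch32 binaries. The following is a minimal sketch, not an official tool: it reads the standard ELF header fields (EI_CLASS at byte 4, EI_DATA at byte 5, e_machine at byte 18) to classify a binary. The function names `classify_elf` and `file_needs_aarch32` are illustrative and not from the article.

```python
import struct

# Standard values from the ELF specification.
ELFCLASS32, ELFCLASS64 = 1, 2
EM_ARM, EM_AARCH64 = 40, 183


def classify_elf(header: bytes) -> str:
    """Classify the first bytes of a file as 'aarch32', 'aarch64', or 'other'."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "other"  # not an ELF file, or header too short to inspect
    ei_class = header[4]                     # 1 = 32-bit objects, 2 = 64-bit objects
    endian = ">" if header[5] == 2 else "<"  # EI_DATA: 2 = big-endian
    (e_machine,) = struct.unpack_from(endian + "H", header, 18)
    if ei_class == ELFCLASS32 and e_machine == EM_ARM:
        return "aarch32"
    if ei_class == ELFCLASS64 and e_machine == EM_AARCH64:
        return "aarch64"
    return "other"


def file_needs_aarch32(path: str) -> bool:
    """True if the file at `path` is a 32-bit Arm ELF binary."""
    with open(path, "rb") as f:
        return classify_elf(f.read(20)) == "aarch32"
```

Running `file_needs_aarch32` over the shared libraries and executables in a package would flag anything that cannot execute on a 64‑bit‑only core.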

The DSU120 module now supports up to 14 cores and up to 32 MiB of L3 cache, improving inter‑core data management.

Cortex‑X4 Microarchitecture

Code‑named Hunter‑ELP, Cortex‑X4 reworks the front‑end: it removes the L0 MOP cache, increases the number of decoders from 6 to 10, and makes the pipeline uniformly 10‑wide. Pipeline depth after the L1 cache fetch drops from 11 to 10 stages.

Back‑end changes include an extra branch unit (2 → 3), two additional ALUs (6 → 8), a second full‑width MAC ALU, and a 20% larger reorder buffer (ROB), up from 320 to 384 entries.

The AGU configuration changes to 1 LS AGU, 2 LD AGUs, and 1 ST AGU (4 AGUs in total). The L1 d‑TLB doubles from 48 to 96 entries. L2 cache capacity doubles from 1 MiB to 2 MiB, which reduces refill and write‑back rates per thousand instructions.
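To put the d‑TLB change in concrete terms, TLB reach (the amount of memory covered without a TLB miss) can be estimated as entries × page size. The sketch below assumes each entry maps exactly one page of the base granule and ignores larger mappings; the helper name `tlb_reach_kib` is illustrative.

```python
def tlb_reach_kib(entries: int, page_size_kib: int) -> int:
    """Memory covered by a fully populated TLB, in KiB (one page per entry)."""
    return entries * page_size_kib


# Compare the 48-entry and 96-entry L1 d-TLB for common Arm page granules.
for page_kib in (4, 16, 64):
    old = tlb_reach_kib(48, page_kib)
    new = tlb_reach_kib(96, page_kib)
    print(f"{page_kib:>2} KiB pages: {old} KiB -> {new} KiB")
```

With 4 KiB pages the reach grows from 192 KiB to 384 KiB, so under this model a working set of a few hundred KiB is more likely to stay within the first‑level TLB.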

Performance figures show a double‑digit increase in SPECint2K7 (≈13‑14%), more modest gains of 6‑8% in Geekbench, and a more noticeable uplift in the L2‑sensitive Speedometer 2 benchmark.

Key Cortex‑X4 changes:

Removal of L0 MOP cache

Decoders increased to 10

Pipeline width unified to 10; depth reduced from 11 to 10 stages

Branch units: 2 → 3

ALU units: 6 → 8

One additional AGU (4 in total)

ROB size: 320 → 384

L1 d‑TLB: 48 → 96 entries

L2 cache: 1 MiB → 2 MiB

No 32‑bit support
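The resource changes listed above can be expressed as relative growth, which makes it easier to see where the design budget went. This is plain arithmetic on the figures quoted in this article; the dictionary and function names are illustrative.

```python
# (Cortex-X3 value, Cortex-X4 value), as quoted in this article.
X3_TO_X4 = {
    "decoders": (6, 10),
    "branch units": (2, 3),
    "ALUs": (6, 8),
    "ROB entries": (320, 384),
    "L1 d-TLB entries": (48, 96),
    "L2 cache (MiB)": (1, 2),
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for name, (old, new) in X3_TO_X4.items():
    print(f"{name:<18} {old:>4} -> {new:<4} {percent_change(old, new):+.0f}%")
```

The decode width and the memory‑side structures (d‑TLB, L2) grow the most, while the ROB grows by a comparatively modest 20%.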

A720 Microarchitecture

Code‑named Hunter, A720 targets a 20% efficiency gain over A715 while keeping power consumption similar. Front‑end improvements focus on branch‑prediction latency (recovery cycles reduced from 12 to 11) and power‑optimized unconditional/conditional prediction.
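A rough way to reason about the one‑cycle recovery improvement is a simple additive model: stall cycles per thousand instructions ≈ mispredictions per thousand instructions (MPKI) × recovery cycles. The MPKI values below are hypothetical, chosen only for illustration, and the model ignores overlap with other stalls.

```python
def mispredict_stall_cycles(mpki: float, recovery_cycles: int) -> float:
    """Stall cycles per 1000 instructions under a simple additive model."""
    return mpki * recovery_cycles


# Hypothetical misprediction rates; real values depend on the workload.
for mpki in (1.0, 5.0, 10.0):
    before = mispredict_stall_cycles(mpki, 12)  # 12-cycle recovery
    after = mispredict_stall_cycles(mpki, 11)   # 11-cycle recovery
    print(f"MPKI {mpki:>4}: {before:.0f} -> {after:.0f} cycles per 1000 instructions")
```

Under this model the saving scales linearly with the misprediction rate, so branch‑heavy code benefits most from the shorter recovery.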

Back‑end adds pipelined FDIV/FSQRT units, optimizes data movement between integer and floating‑point units, and refines the issue queue and AGU pathways.

L2 cache latency drops from 10 to 9 cycles, and the maximum L2 size remains 512 KiB.

A new “A720min” variant offers a smaller die comparable to Cortex‑A78, delivering ~10% higher performance than A78 while maintaining similar power characteristics.

Key A720 changes:

Branch‑prediction recovery: 12 → 11 cycles

L2 latency: 10 → 9 cycles

Introduction of A720min (A78‑sized core with ~10% better performance)

A520 Microarchitecture

Code‑named Hayes, A520 is a 64‑bit only efficiency core derived from the A510 design. It removes one ALU (3 → 2) and adds the QARMA3 PAC algorithm to keep PAC overhead below 1%.

Arm claims a 22% power reduction at equal performance, or an 8% performance boost at equal power.
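The two claims describe different operating points, and each can be converted to a performance‑per‑watt ratio for comparison. The sketch below is plain arithmetic on the quoted figures, with performance and power expressed relative to the A510 baseline; the function name is illustrative.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Fractional perf/W improvement, given performance and power ratios vs. baseline."""
    return perf_ratio / power_ratio - 1.0


iso_perf = perf_per_watt_gain(1.00, 0.78)   # same performance, 22% less power
iso_power = perf_per_watt_gain(1.08, 1.00)  # same power, 8% more performance

print(f"iso-performance point: {iso_perf:.1%} better perf/W")
print(f"iso-power point:       {iso_power:.1%} better perf/W")
```

At equal performance the gain works out to roughly 28% in performance per watt, against 8% at equal power. One plausible reading is that the core's advantage is largest when it runs at lower power rather than being pushed for extra speed.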

Key A520 changes:

ALU count reduced from 3 to 2

QARMA3 PAC algorithm introduced

64‑bit only, no 32‑bit support

Significant energy‑efficiency improvements

DSU120 Module

The updated DSU120 can manage up to 14 cores and up to 32 MiB of L3 cache within a single cluster. It also provides an L3 power‑gating feature to reduce static leakage when large caches are not needed.
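The limits above can be captured in a small sanity check for candidate cluster layouts. This is only a sketch: the `ClusterConfig` type and the example core mix are hypothetical, and the only constraints encoded are the two stated here (14 cores, 32 MiB of L3).

```python
from dataclasses import dataclass

MAX_CORES = 14    # per-cluster core limit quoted in this article
MAX_L3_MIB = 32   # per-cluster L3 limit quoted in this article


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical description of one cluster."""
    x4: int
    a720: int
    a520: int
    l3_mib: int

    @property
    def cores(self) -> int:
        return self.x4 + self.a720 + self.a520


def validate(cfg: ClusterConfig) -> list[str]:
    """Return a list of limit violations (empty if the layout fits)."""
    problems = []
    if cfg.cores < 1:
        problems.append("cluster needs at least one core")
    if cfg.cores > MAX_CORES:
        problems.append(f"{cfg.cores} cores exceeds the {MAX_CORES}-core limit")
    if cfg.l3_mib > MAX_L3_MIB:
        problems.append(f"{cfg.l3_mib} MiB L3 exceeds the {MAX_L3_MIB} MiB limit")
    return problems


print(validate(ClusterConfig(x4=1, a720=5, a520=2, l3_mib=8)))   # illustrative layout
print(validate(ClusterConfig(x4=4, a720=8, a520=4, l3_mib=64)))  # deliberately too large
```

A real configuration has further constraints that this check does not model.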

Overall Summary

Arm’s 2023 releases demonstrate a clear trend toward larger, higher‑performance cores (Cortex‑X4) combined with efficiency‑focused cores (A720, A520) that improve performance‑per‑watt while dropping legacy 32‑bit support. The architectural refinements (more decoders, a larger ROB, an expanded AGU set, and bigger L2 caches) translate into measurable performance gains, especially in SPECint2K7 and L2‑sensitive workloads. Developers and system designers should consider these changes when optimizing software stacks and power‑management strategies for next‑generation mobile and embedded devices.
