Comparison of ARM Cortex‑A76, A77 and A78 Microarchitectures
The article surveys ARM's Cortex‑A76 design, detailing its DSU, cache hierarchy, branch predictor, ROB and execution units. It then contrasts the A77's macro‑op cache, larger ROB and wider execution pipelines, and the A78's 5 nm‑based performance and power gains, enhanced branch prediction, added MUL unit and expanded store bandwidth.
With the rapid development of smartphones, ARM updates its CPU core designs almost every year. From 2018 to 2020 it released three generations (Cortex‑A76, A77 and A78), each bringing notable micro‑architectural changes. This article reviews the key blocks of the A76 design and then compares the A77 and A78 against it.
1. ARM Cortex‑A76 Microarchitecture
The A76 micro‑architecture can be explored on wikichip (en.wikichip.org). Important blocks include:
DSU (DynamIQ Shared Unit) – the cluster‑level management unit that lets heterogeneous cores share an L3 cache within a cluster, reducing the overhead of moving data between cores.
Performance‑Power Optimization – the A76 is built on a 7 nm process; compared with the 10 nm A75 it can deliver up to 40 % higher performance or 50 % lower power at the same frequency.
Cache hierarchy – L1 consists of 64 KB instruction cache and 64 KB data cache per core; L2 can be configured as 256 KB or 512 KB; L3 is a shared cache of 2 MB or 4 MB inside the DSU.
Branch Prediction Unit (BPU) – works in parallel with the fetch unit to predict the most likely path and pre‑fetch instructions, reducing branch‑prediction latency.
Front‑end – the A76 provides a 4‑way decoder (one more decoder than A75) and a 4‑way instruction fetch front‑end.
ROB (Re‑Order Buffer) – 128 entries, enabling extensive out‑of‑order execution and pipeline filling.
Execution Engine – 120 entries divided into integer, floating‑point and load/store units (1 branch unit, 2 simple ALUs, 1 complex ALU, 2 SIMD units, 2 AGUs).
Load‑Store Unit (LSU) – connects to two AGUs, 64 KB L1 data cache, provides two 16 B/cycle load ports and one 32 B/cycle store port.
Summary of A76 – The article walks through fetch, decode, dispatch, execution and memory access, giving a concise picture of the A76 pipeline.
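The LSU port figures quoted above can be turned into a rough, port‑limited ceiling on L1 bandwidth. The sketch below is illustrative only: the `LsuPorts` class, the `peak_gb_per_s` helper and the 3.0 GHz clock are assumptions introduced here, not values from the source, and sustained bandwidth on real hardware will be lower than this ceiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LsuPorts:
    """Hypothetical container for the L1 data-port figures quoted in the text."""
    load_ports: int
    load_bytes_per_port: int
    store_ports: int
    store_bytes_per_port: int

    def load_bytes_per_cycle(self) -> int:
        return self.load_ports * self.load_bytes_per_port

    def store_bytes_per_cycle(self) -> int:
        return self.store_ports * self.store_bytes_per_port


def peak_gb_per_s(bytes_per_cycle: int, freq_ghz: float) -> float:
    """Port-limited ceiling: (bytes/cycle) * (1e9 cycles/s per GHz) = GB/s."""
    return bytes_per_cycle * freq_ghz


# A76 figures as listed above: two 16 B/cycle load ports, one 32 B/cycle store port.
A76_LSU = LsuPorts(load_ports=2, load_bytes_per_port=16,
                   store_ports=1, store_bytes_per_port=32)

ASSUMED_FREQ_GHZ = 3.0  # assumption for illustration, not an A76 specification

load_peak = peak_gb_per_s(A76_LSU.load_bytes_per_cycle(), ASSUMED_FREQ_GHZ)
store_peak = peak_gb_per_s(A76_LSU.store_bytes_per_cycle(), ASSUMED_FREQ_GHZ)
print(f"load ceiling: {load_peak} GB/s, store ceiling: {store_peak} GB/s")
```

With these inputs both directions top out at 32 B/cycle, so the load and store ceilings are the same at any assumed clock.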
2. Cortex‑A77 vs. A76
Performance uplift – at 3 GHz on a 7 nm process, the A77 delivers ~20 % higher single‑thread performance than the A76.
L0 (MOP) Cache – A77 introduces a Macro‑Operation cache (L0) that stores decoded MOPs. When a hit occurs the decode stage can be bypassed, delivering up to 6 MOPs per cycle (vs. 4 MOPs without a hit). Reported hit rate is ~85 %.
Front‑end – still a 4‑way decoder, but the MOP cache allows up to 6 MOPs per cycle to be fed to the pipeline.
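One rough way to read the ~85 % hit rate is as a weighted average of the two delivery widths. The sketch below assumes every cycle is either a full‑width MOP‑cache hit (6 MOPs) or a full‑width decode (4 MOPs) and ignores stalls, partial fetch groups and back‑end limits, so it is an optimistic upper bound rather than a measured figure. The function name `expected_mops_per_cycle` is introduced here for illustration.

```python
def expected_mops_per_cycle(hit_rate: float,
                            hit_width: int = 6,
                            miss_width: int = 4) -> float:
    """Weighted-average front-end width under a simple hit/miss model.

    hit_rate   -- fraction of cycles served from the MOP cache (0.0 to 1.0)
    hit_width  -- MOPs delivered per cycle on a MOP-cache hit
    miss_width -- MOPs delivered per cycle through the decoders
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    return hit_rate * hit_width + (1.0 - hit_rate) * miss_width


# Using the figures quoted in the text: ~85 % hit rate, 6 vs. 4 MOPs per cycle.
estimate = expected_mops_per_cycle(0.85)
print(f"estimated front-end width: {estimate:.2f} MOPs/cycle")
```

Under this model the quoted hit rate gives roughly 5.7 MOPs per cycle, which shows why a 4‑way decoder can still feed a back end wider than four.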
ROB – size increased by 25 % to 160 entries.
Execution Engine – adds a second branch unit (doubling branch execution throughput) and a third simple integer ALU, raising the number of integer execution units (branch units plus ALUs) from 4 to 6 (a 50 % increase). Issue queues are unified into three categories (integer, floating‑point, load/store).
LSU – retains two AGUs but adds two additional store ports, doubling store bandwidth. Load and store buffers are deeper (85 load entries and 90 store entries, for 175 in‑flight memory operations in total, 25 % more than the A76).
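The percentages quoted for the A77 can be cross‑checked with a few lines of arithmetic. The sketch below uses only numbers stated in this article; the helper name `percent_increase` is introduced here, and the A76 in‑flight depth is derived from the "25 % deeper" claim rather than taken from a datasheet.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


# ROB: 128 entries (A76) -> 160 entries (A77)
rob_growth = percent_increase(128, 160)

# Integer execution units: 4 (A76) -> 6 (A77)
int_unit_growth = percent_increase(4, 6)

# In-flight memory operations on A77: load entries + store entries
a77_in_flight = 85 + 90

# If A77 is 25 % deeper than A76, the implied A76 depth is:
implied_a76_in_flight = a77_in_flight / 1.25

print(f"ROB growth: {rob_growth} %")
print(f"integer unit growth: {int_unit_growth} %")
print(f"A77 in-flight memory ops: {a77_in_flight}")
print(f"implied A76 in-flight memory ops: {implied_a76_in_flight}")
```

The figures are self‑consistent: the ROB grows by 25 %, the integer units by 50 %, and the 175‑entry total implies about 140 in‑flight memory operations on the A76.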
3. Cortex‑A78 vs. A77
Performance‑Power – A78 (code‑named “Hercules”) moves to a 5 nm process, offering ~20 % performance gain and ~50 % power reduction at comparable frequencies.
Key architectural changes:
L1 cache options: 32 KB or 64 KB per core.
Branch predictor bandwidth doubled again.
Execution engine adds a dedicated MUL unit (allowing two integer multiplies per cycle); the LSU gains a third AGU, and store bandwidth doubles from 16 B/cycle to 32 B/cycle.
The A78 is the last ARMv8‑based generation in this Cortex‑A7x line: it launched alongside the first Cortex‑X core (the X1), and its successors moved to ARMv9.
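To see what the doubled multiply and store throughput mean in practice, a throughput‑bound estimate is enough. The sketch below assumes fully independent operations with no latency, dependency or cache stalls, so it gives a lower bound on cycles. The helper `min_cycles` and the workload sizes are hypothetical, and the "one multiply per cycle" baseline is inferred from the text's statement that the A78 now allows two.

```python
def min_cycles(work_items: int, per_cycle: int) -> int:
    """Lower bound on cycles when throughput is the only limit (ceiling division)."""
    if work_items < 0 or per_cycle <= 0:
        raise ValueError("work_items must be >= 0 and per_cycle must be > 0")
    return (work_items + per_cycle - 1) // per_cycle


# Hypothetical workloads, chosen only for illustration.
INDEPENDENT_MULTIPLIES = 1000
BYTES_TO_STORE = 4096

# Integer multiplies: inferred baseline of 1 per cycle vs. 2 per cycle on A78.
mul_before = min_cycles(INDEPENDENT_MULTIPLIES, 1)
mul_after = min_cycles(INDEPENDENT_MULTIPLIES, 2)

# Stores: 16 B/cycle before vs. 32 B/cycle on A78, as quoted in the text.
store_before = min_cycles(BYTES_TO_STORE, 16)
store_after = min_cycles(BYTES_TO_STORE, 32)

print(f"multiplies: {mul_before} -> {mul_after} cycles")
print(f"stores:     {store_before} -> {store_after} cycles")
```

In both cases the throughput‑bound cycle count halves; real code sees a smaller gain whenever dependencies or memory latency dominate.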
Final Summary
The article provides a detailed comparative table of A76, A77 and A78, highlighting how each generation improves performance, power efficiency and architectural features. It also points to future directions such as the Cortex‑X series and ARMv9.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials
