Comparison of ARM Cortex‑A76, A77 and A78 Microarchitectures

The article surveys ARM’s Cortex‑A76 design—detailing its DSU, cache hierarchy, branch predictor, larger ROB and execution units—then contrasts the A77’s macro‑op cache, larger ROB and wider execution pipelines, and the A78’s 5 nm‑based performance‑power gains, enhanced branch prediction, added MUL unit and expanded store bandwidth.

OPPO Kernel Craftsman

Jul 22, 2022

With smartphones evolving rapidly, ARM updates its CPU core designs almost every year. From 2018 to 2020 it released three generations, Cortex‑A76, A77 and A78, each bringing notable micro‑architectural changes. This article reviews the key blocks of the A76 design, then compares the A77 against the A76 and the A78 against the A77.

1. ARM Cortex‑A76 Microarchitecture

The A76 micro‑architecture can be explored on wikichip (en.wikichip.org). Important blocks include:

DSU (DynamIQ Shared Unit) – the cluster‑level multi‑core management unit that lets heterogeneous cores share an L3 cache within a cluster, reducing the overhead of moving data between cores.

Performance‑Power Optimization – the A76 is built on a 7 nm process; compared with the 10 nm A75 it can deliver up to 40 % higher performance or 50 % lower power at the same frequency.

Cache hierarchy – L1 consists of 64 KB instruction cache and 64 KB data cache per core; L2 can be configured as 256 KB or 512 KB; L3 is a shared cache of 2 MB or 4 MB inside the DSU.
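To make the cache sizes above concrete, the following minimal sketch adds up the cache capacity of one cluster. The four-core cluster, the class name ClusterCacheConfig and the choice of the larger L2 and L3 options are assumptions for illustration; only the individual sizes come from the text.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class ClusterCacheConfig:
    """Hypothetical description of one cluster's caches (sizes in bytes)."""
    cores: int
    l1i_per_core: int = 64 * KIB   # 64 KB instruction cache per core
    l1d_per_core: int = 64 * KIB   # 64 KB data cache per core
    l2_per_core: int = 512 * KIB   # the text allows 256 KB or 512 KB
    l3_shared: int = 4 * MIB       # the text allows 2 MB or 4 MB, shared in the DSU

    def private_bytes(self) -> int:
        """Cache capacity private to the cores (L1I + L1D + L2, summed over cores)."""
        return self.cores * (self.l1i_per_core + self.l1d_per_core + self.l2_per_core)

    def total_bytes(self) -> int:
        """Private capacity plus the shared L3."""
        return self.private_bytes() + self.l3_shared


# Assumed example: a four-core cluster with the largest options.
cfg = ClusterCacheConfig(cores=4)
print(cfg.private_bytes() // KIB, "KiB private")   # 2560 KiB private
print(cfg.total_bytes() / MIB, "MiB total")        # 6.5 MiB total
```

Swapping in l2_per_core=256 * KIB or l3_shared=2 * MIB gives the totals for the smaller configurations.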

Branch Prediction Unit (BPU) – works in parallel with the fetch unit, predicting the most likely path and pre‑fetching instructions along it, which hides much of the delay a branch would otherwise add to instruction fetch.

Front‑end – the A76 provides a 4‑way decoder (one more decoder than A75) and a 4‑way instruction fetch front‑end.

ROB (Re‑Order Buffer) – 128 entries, enabling extensive out‑of‑order execution and pipeline filling.
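As a rough illustration of what a re-order buffer does, the toy model below allocates entries in program order, lets them complete in any order, and retires them strictly in order. It is a teaching sketch, not a description of the A76 implementation; the class and method names are hypothetical, and only the default capacity of 128 comes from the text.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: in-order allocate, out-of-order complete, in-order retire."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.entries = deque()  # oldest instruction sits on the left

    def allocate(self, tag):
        """Reserve an entry at dispatch; returns False when the ROB is full."""
        if len(self.entries) >= self.capacity:
            return False  # dispatch would stall here
        self.entries.append({"tag": tag, "done": False})
        return True

    def complete(self, tag):
        """Mark an instruction as executed; this may happen in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Retire finished instructions from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["tag"])
        return retired


# Small capacity so the "full" case is easy to see.
rob = ReorderBuffer(capacity=4)
for tag in ["i0", "i1", "i2", "i3"]:
    rob.allocate(tag)
full = not rob.allocate("i4")      # True: no free entry, dispatch stalls

rob.complete("i2")
r1 = rob.retire()                  # []: i2 is done, but i0 is still the head
rob.complete("i0")
r2 = rob.retire()                  # ["i0"]
rob.complete("i1")
r3 = rob.retire()                  # ["i1", "i2"]
```

The larger the buffer, the further execution can run ahead of a slow instruction at the head before dispatch stalls.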

Execution Engine – 120 scheduler entries spread across the integer, floating‑point/SIMD and load/store pipes (1 branch unit, 2 simple ALUs, 1 complex ALU, 2 SIMD units, 2 AGUs).

Load‑Store Unit (LSU) – connects to two AGUs, 64 KB L1 data cache, provides two 16 B/cycle load ports and one 32 B/cycle store port.
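The port widths above translate into peak L1 bandwidth as follows. This is a back-of-the-envelope sketch: the 3 GHz clock is an assumption for illustration, and real sustained bandwidth depends on misses, alignment and the rest of the pipeline.

```python
# Port figures as quoted in the text for the A76 LSU.
LOAD_PORTS = 2
LOAD_BYTES_PER_PORT = 16
STORE_PORTS = 1
STORE_BYTES_PER_PORT = 32

# Assumed clock, not taken from the A76 description.
ASSUMED_CLOCK_HZ = 3.0e9


def peak_bytes_per_cycle(ports, bytes_per_port):
    """Upper bound on bytes moved per cycle if every port is busy."""
    return ports * bytes_per_port


def peak_gb_per_s(bytes_per_cycle, clock_hz):
    """Convert a per-cycle figure to GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_cycle * clock_hz / 1e9


load_bpc = peak_bytes_per_cycle(LOAD_PORTS, LOAD_BYTES_PER_PORT)     # 32
store_bpc = peak_bytes_per_cycle(STORE_PORTS, STORE_BYTES_PER_PORT)  # 32

print(load_bpc, "B/cycle load,", store_bpc, "B/cycle store")
print(peak_gb_per_s(load_bpc, ASSUMED_CLOCK_HZ), "GB/s peak load at the assumed clock")
```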

Summary of A76 – The article walks through fetch, decode, dispatch, execution and memory access, giving a concise picture of the A76 pipeline.

2. Cortex‑A77 vs. A76

Performance uplift – at 3 GHz on a 7 nm process, the A77 delivers ~20 % higher single‑thread performance than the A76.

L0 (MOP) Cache – A77 introduces a Macro‑Operation cache (L0) that stores decoded MOPs. When a hit occurs the decode stage can be bypassed, delivering up to 6 MOPs per cycle (vs. 4 MOPs without a hit). Reported hit rate is ~85 %.
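A simple way to read these numbers is as a weighted average of the two delivery paths. The sketch below assumes every cycle is either a MOP-cache hit (6 MOPs) or a miss handled by the decoders (4 MOPs) and ignores every other front-end limit, so it gives an upper bound rather than a measured figure. The function name is hypothetical.

```python
def expected_mops_per_cycle(hit_rate, hit_width=6, miss_width=4):
    """Average front-end delivery rate under a two-path model.

    hit_rate   -- fraction of cycles served from the MOP cache (0.0 to 1.0)
    hit_width  -- MOPs per cycle on a MOP-cache hit
    miss_width -- MOPs per cycle when the decoders are used instead
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_width + (1.0 - hit_rate) * miss_width


avg = expected_mops_per_cycle(0.85)
print(f"{avg:.2f} MOPs/cycle")  # 5.70 MOPs/cycle
```

With the reported ~85 % hit rate the model gives about 5.7 MOPs per cycle, compared with 4 when every cycle goes through the decoders.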

Front‑end – still a 4‑way decoder, but the MOP cache allows up to 6 MOPs per cycle to be fed to the pipeline.

ROB – size increased by 25 % to 160 entries.

Execution Engine – adds a second branch unit (doubling branch execution throughput) and a fourth integer ALU (three simple ALUs plus the complex ALU), raising the number of integer execution pipes, branch units included, from 4 to 6 (a 50 % increase). Issue queues are unified into three categories (integer, floating‑point, load/store).

LSU – retains two AGUs but adds two additional store ports, doubling store bandwidth. Load/store buffers are deeper: 85 load entries and 90 store entries, for a total of 175 memory operations in flight, 25 % more than the A76.
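The percentages quoted for the A77 can be checked with a few lines of arithmetic. The sketch below uses only figures from the text; the A76 in-flight total is derived from the stated 25 % increase rather than taken from a datasheet, and the helper name is hypothetical.

```python
def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


# ROB: 128 entries on the A76, 160 on the A77.
rob_gain = percent_increase(128, 160)        # 25.0

# Integer execution pipes: 4 on the A76, 6 on the A77.
pipe_gain = percent_increase(4, 6)           # 50.0

# Load/store buffers: 85 load + 90 store entries on the A77.
a77_in_flight = 85 + 90                      # 175

# If 175 is 25 % more than the A76, the A76 total this implies is:
implied_a76_in_flight = a77_in_flight / 1.25  # 140.0

print(rob_gain, pipe_gain, a77_in_flight, implied_a76_in_flight)
```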

3. Cortex‑A78 vs. A77

Performance‑Power – the A78 (code‑named “Hercules”) moves to a 5 nm process, offering ~20 % higher sustained performance at the same power, or ~50 % lower power at the same performance, compared with the A77.

Key architectural changes:

L1 cache options: 32 KB or 64 KB per core.

Branch predictor bandwidth doubled again.

Execution engine adds a second integer multiply unit (allowing two integer multiplies per cycle). The LSU gains a third AGU, dedicated to loads, and store bandwidth doubles from 16 B/cycle to 32 B/cycle.

The A78 is the last Armv8 generation in this Cortex‑A big‑core line, serving as a bridge to the Cortex‑X series, which debuted alongside it with the Cortex‑X1, and to Armv9.

Final Summary

The article provides a detailed comparative table of A76, A77 and A78, highlighting how each generation improves performance, power efficiency and architectural features. It also points to future directions such as the Cortex‑X series and ARMv9.
