Inside Loongson 3A5000: Architecture, Cache Hierarchy, and Display Modes Explained
This article gives a technical overview of the Loongson 3A5000 processor: its multi‑core architecture, cache‑coherence protocol, AXI interconnect, performance figures, and four CPU‑GPU‑display communication modes.
Loongson 3A5000 Overview
Loongson 3A5000 is Loongson's first general‑purpose multi‑core processor based on the LoongArch instruction set, targeting desktop computers and servers. It integrates four 64‑bit LA464 cores, a 16 MB shared L3 cache split into four banks, two DDR4‑3200 memory controllers, two 16‑bit HyperTransport (HT) controllers, I²C, UART, and SPI interfaces, and 16 GPIO pins. An AXI‑based interconnect links the cores to the L3 banks, so the L3 is physically distributed but logically shared by all four cores.
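In a distributed shared L3, each physical address has one "home" bank, and every core sends requests for that address to the same bank. The sketch below shows the idea for the four shared‑cache modules. The article does not describe the 3A5000's actual bank‑selection function, so the mapping used here (low‑order cache‑line‑address bits) and the 64‑byte line size are assumptions, and `l3_bank` is a hypothetical helper.

```python
# Hypothetical sketch: choosing the home L3 bank for a physical address in a
# distributed shared L3. The real 3A5000 bank-selection function is not given
# in this article; the low-order line-address bits used here are an ASSUMPTION.

LINE_SIZE = 64   # assumed cache-line size in bytes
NUM_BANKS = 4    # four shared-cache modules


def l3_bank(phys_addr: int) -> int:
    """Return the (hypothetical) home L3 bank of a physical address."""
    line_addr = phys_addr // LINE_SIZE   # drop the byte offset within the line
    return line_addr % NUM_BANKS         # interleave whole lines across banks


if __name__ == "__main__":
    # Consecutive lines rotate through banks 0..3, so a sequential stream from
    # any core spreads its requests over all four cache modules.
    for addr in range(0, 6 * LINE_SIZE, LINE_SIZE):
        print(f"address {addr:#06x} -> bank {l3_bank(addr)}")
```

Interleaving at cache‑line granularity keeps any single bank from becoming a hotspot during streaming access. A production design would typically hash more address bits, but that detail is outside what the article covers.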
Cache Coherence and Scalability
The processor keeps its caches consistent with a directory‑based cache‑coherence protocol. It also supports multi‑chip scaling: connecting the HT buses of several chips directly forms a larger shared‑memory system of up to 16 chips.
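The article does not detail the states or message types of Loongson's directory protocol. To show what "directory‑based" means, here is a generic, textbook MSI‑style directory written as a minimal sketch. `Directory`, `read`, `write`, and the message strings are hypothetical names. The directory records, for each cache line, a state (Uncached, Shared, Modified) and the set of cores holding a copy. This lets it send invalidations only to actual sharers instead of broadcasting to every core.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNCACHED = "U"   # no core holds the line
    SHARED = "S"     # one or more read-only copies exist
    MODIFIED = "M"   # exactly one core holds a writable, possibly dirty copy


@dataclass
class DirEntry:
    state: State = State.UNCACHED
    sharers: set[int] = field(default_factory=set)  # IDs of cores holding a copy


class Directory:
    """Generic MSI-style directory (illustrative, not Loongson's actual protocol)."""

    def __init__(self) -> None:
        self.entries: dict[int, DirEntry] = {}
        self.messages: list[str] = []  # coherence messages the directory sends

    def _entry(self, line: int) -> DirEntry:
        return self.entries.setdefault(line, DirEntry())

    def read(self, core: int, line: int) -> None:
        e = self._entry(line)
        if e.state is State.MODIFIED:
            if core in e.sharers:
                return  # the owner already holds a writable copy
            (owner,) = e.sharers
            # Owner writes back its dirty data and keeps a read-only copy.
            self.messages.append(f"downgrade core{owner} line {line:#x}")
        e.sharers.add(core)
        e.state = State.SHARED

    def write(self, core: int, line: int) -> None:
        e = self._entry(line)
        # Only the recorded sharers are contacted: no broadcast is needed.
        for other in sorted(e.sharers - {core}):
            self.messages.append(f"invalidate core{other} line {line:#x}")
        e.sharers = {core}
        e.state = State.MODIFIED


if __name__ == "__main__":
    d = Directory()
    d.read(0, 0x40)
    d.read(1, 0x40)    # cores 0 and 1 share the line
    d.write(2, 0x40)   # core 2 writes: cores 0 and 1 are invalidated
    d.read(0, 0x40)    # core 0 reads again: core 2 is downgraded
    print("\n".join(d.messages))
```

The same bookkeeping is what makes multi‑chip scaling practical: when chips are linked over HT, a sharer set can point at remote caches, and traffic grows with the number of actual sharers rather than with the total number of cores.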
LA464 Core Details
Each LA464 core is a four‑issue, out‑of‑order, 64‑bit processor with a 256‑bit vector unit. It uses register renaming, dynamic scheduling, and branch prediction, and executes both integer and vector instructions. The L1 instruction and data caches are 64 KB each and 4‑way set associative, and a 256 KB, 16‑way victim cache serves as the private L2. The core also supports non‑blocking memory accesses and load speculation, and provides a standard JTAG debug interface.
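The cache sizes and associativities above determine how an address is split into tag, set index, and byte offset. The arithmetic below is a minimal sketch that assumes a 64‑byte cache line, which the text does not state. `geometry` and `split_address` are illustrative helpers. The sketch models only the set‑associative layout, not the victim‑cache fill policy.

```python
# Set-associative cache geometry for the LA464 sizes quoted above.
# ASSUMPTION: a 64-byte cache line (the text does not state the line size).

LINE_SIZE = 64


def geometry(size_bytes: int, ways: int, line_size: int = LINE_SIZE) -> dict:
    """Number of sets and the offset/index bit widths of a set-associative cache."""
    sets = size_bytes // (ways * line_size)
    return {
        "sets": sets,
        "offset_bits": line_size.bit_length() - 1,  # log2(line_size), power of two
        "index_bits": sets.bit_length() - 1,        # log2(sets), power of two
    }


def split_address(addr: int, sets: int, line_size: int = LINE_SIZE) -> tuple:
    """Split an address into (tag, set index, byte offset)."""
    offset = addr % line_size
    index = (addr // line_size) % sets
    tag = addr // (line_size * sets)
    return tag, index, offset


if __name__ == "__main__":
    l1 = geometry(64 * 1024, ways=4)     # 64 KB, 4-way L1 (I or D)
    l2 = geometry(256 * 1024, ways=16)   # 256 KB, 16-way victim L2
    print("L1:", l1)
    print("L2:", l2)
    print("0x12345 in L1 ->", split_address(0x12345, l1["sets"]))
```

Under the 64‑byte assumption, both caches have 256 sets (8 index bits). The L2 gets its four times larger capacity entirely from having four times as many ways.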
Chip‑Level Interconnect
The first‑level interconnect is a 5×5 crossbar. Its five masters are the four LA464 cores and the I/O port to the IO‑RING; its five slaves are the four shared‑cache modules and the same I/O port, which occupies one master and one slave position. The second‑level interconnect is a 5×3 crossbar whose masters are the four shared‑cache modules and the I/O port, and whose slaves are the two DDR3/4 memory controllers and the I/O port.
The IO‑RING contains two HT controllers, a MISC module, and an SE module. The two HT controllers share one 16‑bit HT bus, which can run as two independent 8‑bit buses or be taken over entirely by one controller as a single 16‑bit bus. An integrated DMA controller handles I/O DMA transfers and maintains coherence between chips. All interconnect data paths are 128 bits wide and run at the core clock, except the path from the cores to the shared cache, which is widened to 256 bits for higher bandwidth.
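Because the interconnect runs at the core clock, each path's theoretical peak bandwidth is its width in bytes multiplied by the clock frequency. The sketch below works through that arithmetic at the 2.5 GHz peak clock quoted for this processor. It assumes one full‑width transfer per cycle and ignores protocol overhead, so it gives upper bounds, not measured figures. `peak_bandwidth_gbs` is a hypothetical helper.

```python
def peak_bandwidth_gbs(width_bits: int, freq_ghz: float) -> float:
    """Theoretical peak of one data path in GB/s (1 GB = 10**9 bytes),
    ASSUMING one full-width transfer per clock cycle."""
    return width_bits / 8 * freq_ghz


CORE_FREQ_GHZ = 2.5  # peak clock quoted for the 3A5000

if __name__ == "__main__":
    for name, width in [("128-bit interconnect path", 128),
                        ("256-bit core-to-shared-cache path", 256)]:
        print(f"{name}: {peak_bandwidth_gbs(width, CORE_FREQ_GHZ):.0f} GB/s")
```

Under these assumptions, widening the core‑to‑cache path from 128 to 256 bits doubles its per‑path peak from 40 GB/s to 80 GB/s.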
The processor runs at up to 2.5 GHz and delivers a peak floating‑point performance of 160 GFLOPS.
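Dividing the peak figure by clock rate and core count shows how much work each core must finish per cycle. The sketch below does that division and then checks two possible ways a 256‑bit vector datapath could supply the result. The article does not state the precision behind the 160 GFLOPS figure or how many vector pipelines each core has, so both decompositions are hypothetical. The helper names are illustrative, and a fused multiply‑add (FMA) is counted as two FLOPs.

```python
def flops_per_cycle_per_core(peak_gflops: float, freq_ghz: float, cores: int) -> float:
    """FLOPs each core must retire per cycle to reach the quoted peak."""
    return peak_gflops / (freq_ghz * cores)


def vector_flops_per_cycle(vector_bits: int, element_bits: int,
                           fma: bool = True, pipes: int = 1) -> int:
    """FLOPs per cycle from `pipes` vector units, counting an FMA as 2 FLOPs."""
    lanes = vector_bits // element_bits
    return lanes * (2 if fma else 1) * pipes


if __name__ == "__main__":
    needed = flops_per_cycle_per_core(160, 2.5, 4)
    print(f"required: {needed:.0f} FLOPs/cycle/core")
    # Two hypothetical ways a 256-bit vector datapath could supply that rate:
    print("FP64, FMA, 2 pipes:", vector_flops_per_cycle(256, 64, pipes=2))
    print("FP32, FMA, 1 pipe: ", vector_flops_per_cycle(256, 32, pipes=1))
```

Both scenarios produce exactly the 16 FLOPs per cycle per core that the headline number implies. The peak figure alone therefore cannot tell them apart.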
CPU‑GPU‑DC Communication Example
The following example uses a Loongson 3A3000 processor with a 7A1000 bridge chip. Here the CPU communicates over a HyperTransport bus with the GPU and display controller (DC) integrated in the 7A1000. Four display modes are possible:
Mode 1: No GPU; CPU and DC share main memory as a framebuffer.
Mode 2: No GPU; DC uses dedicated video memory as a framebuffer.
Mode 3: CPU and GPU share main memory; the GPU renders into a framebuffer in main memory.
Mode 4: GPU uses dedicated video memory as a framebuffer; DC reads from it.
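The four modes differ in only two respects: who renders into the framebuffer, and where that framebuffer lives. The sketch below encodes the modes as data to make that explicit. `DisplayMode`, `MODES`, and `scanout_path` are hypothetical names. The renderer for modes 1 and 2 is inferred: with no GPU present, the CPU must draw the image itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisplayMode:
    number: int
    has_gpu: bool
    renderer: str     # who draws into the framebuffer (CPU when there is no GPU)
    framebuffer: str  # where the framebuffer that the DC scans out lives


MODES = [
    DisplayMode(1, False, "CPU", "main memory"),
    DisplayMode(2, False, "CPU", "dedicated video memory"),
    DisplayMode(3, True, "GPU", "main memory"),
    DisplayMode(4, True, "GPU", "dedicated video memory"),
]


def scanout_path(mode: DisplayMode) -> str:
    """Describe the data flow from renderer to display controller."""
    return f"{mode.renderer} -> framebuffer in {mode.framebuffer} -> DC"


if __name__ == "__main__":
    for m in MODES:
        print(f"Mode {m.number}: {scanout_path(m)}")
    # Modes whose DC scan-out reads compete with the CPU for main memory:
    print([m.number for m in MODES if m.framebuffer == "main memory"])
```

Viewed this way, modes 1 and 3 place the framebuffer in main memory. In those modes, display scan‑out presumably competes with the CPU for memory bandwidth. Modes 2 and 4 keep that traffic in dedicated video memory.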
Architects' Tech Alliance
