Inside Intel GPU Render Engine: How 3D Rendering Works at the Hardware Level
This article explains the architecture and workflow of Intel's GPU render engine, covering the 3D pipeline, command streamer, fixed‑function units, execution units, URB handling, thread dispatch, shader stages, sampler state, and the Mesa driver implementation that translates OpenGL commands into hardware instructions.
Preface
A GPU (Graphics Processing Unit) is a microprocessor specialized for graphics computation in PCs, workstations, consoles, and mobile devices. With the rise of AI, GPUs also run parallel training and inference workloads, so many servers now ship with GPUs.
Terminology
3D Pipeline : The sequence of stages that processes 3D commands, combining fixed‑function (FF) units with programmable Execution Units (EUs).
FFID : Unique identifier for a fixed‑function unit.
CS (Command Streamer) : Parses commands written by the driver into a ring buffer and forwards them to the next stage of the 3D pipeline.
VF (Vertex Fetcher) : The first FF unit that reads vertex data from memory and passes it to the Vertex Shader (VS) stage.
TD (Thread Dispatcher) : Arbitrates thread start requests from FF units and instantiates threads on EUs.
Render Engine Overview
Intel's render engine hosts more than one pipeline: the 3D pipeline for rendering and the media pipeline for codec work, with GPGPU (compute) selected through the same mechanism. The driver chooses the active pipeline with the PIPELINE_SELECT command.
<code>// In Mesa code for compute (GPGPU)
emit_pipeline_select(batch, GPGPU);
// In render mode
emit_pipeline_select(batch, _3D);</code>
Both user‑space and kernel‑space drivers write commands into a ring buffer, which the hardware then executes according to the selected pipeline.
Hardware Introduction and Analysis
Command Streamer
The user‑space driver writes commands into a batch buffer, and the kernel driver queues that batch for the hardware through a ring buffer. The Command Streamer reads these commands and dispatches them to the appropriate hardware blocks.
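The producer/consumer relationship between the driver and the Command Streamer can be sketched as a small ring with head and tail indices. This is a toy model, not the hardware's register interface: the struct, the function names, and the 8-dword size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_DWORDS 8u /* hypothetical size; must be a power of two */

/* Toy model: the driver advances `tail`, the command streamer advances `head`. */
struct ring {
    uint32_t dwords[RING_DWORDS];
    uint32_t head; /* index of the next dword the consumer will read */
    uint32_t tail; /* index of the next dword the producer will write */
};

/* Number of dwords written but not yet consumed. */
static uint32_t ring_used(const struct ring *r)
{
    return (r->tail - r->head) & (RING_DWORDS - 1);
}

/* Producer side: returns false when the ring is full.
 * One slot stays empty so that head == tail always means "empty". */
static bool ring_emit(struct ring *r, uint32_t dw)
{
    assert(r != NULL);
    if (ring_used(r) == RING_DWORDS - 1)
        return false;
    r->dwords[r->tail] = dw;
    r->tail = (r->tail + 1) & (RING_DWORDS - 1);
    return true;
}

/* Consumer side: returns false when there is nothing left to parse. */
static bool ring_fetch(struct ring *r, uint32_t *dw)
{
    assert(r != NULL && dw != NULL);
    if (r->head == r->tail)
        return false;
    *dw = r->dwords[r->head];
    r->head = (r->head + 1) & (RING_DWORDS - 1);
    return true;
}
```

Keeping one slot unused is a common way to tell a full ring from an empty one without a separate counter; real hardware rings may use a different convention.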
GPU render commands are broadly classified as:
Memory interface commands – operate on memory.
3D state commands – set up the 3D pipeline state (e.g., vertex surface state).
Pipe Control commands – configure synchronization and parallel execution.
3D Primitive commands – submit primitives (topology and vertex ranges) to the pipeline for drawing.
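As a rough illustration of how a parser might sort command headers into these four classes, the sketch below decodes the top bits of a header dword. The bit positions (31:29 command type, 28:27 subtype, 26:24 opcode) and the opcode values reflect one reading of the public PRMs and should be treated as assumptions to verify against the documentation for a given generation; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum cmd_class {
    CMD_MEMORY_INTERFACE, /* MI_* commands */
    CMD_3D_STATE,         /* 3DSTATE_* commands */
    CMD_PIPE_CONTROL,
    CMD_3D_PRIMITIVE,
    CMD_OTHER             /* anything this sketch does not model */
};

/* Assumed header layout: bits 31:29 type, bits 28:27 subtype, bits 26:24 opcode. */
static enum cmd_class classify_header(uint32_t header)
{
    uint32_t type = header >> 29;
    uint32_t subtype = (header >> 27) & 0x3;
    uint32_t opcode = (header >> 24) & 0x7;

    if (type == 0)
        return CMD_MEMORY_INTERFACE;
    if (type != 3 || subtype != 3)
        return CMD_OTHER;

    switch (opcode) {
    case 0:
    case 1:
        return CMD_3D_STATE;
    case 2:
        return CMD_PIPE_CONTROL;
    case 3:
        return CMD_3D_PRIMITIVE;
    default:
        return CMD_OTHER;
    }
}
```

Under these assumptions, a header of 0x78100000 would land in the 3D state class and 0x7A000000 in the Pipe Control class, while PIPELINE_SELECT uses a different subtype and falls through to CMD_OTHER.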
Fixed‑Function Units (FF)
FF units coordinate most of the vertex and pixel processing that runs on EU threads. They request thread dispatch, manage URB entries, and carry control information between stages.
Each EU is one multi‑threaded processor in the GPU's array of processors, with its own instruction fetch/decode logic, register files, and SIMD ALUs.
Execution Unit (EU)
EUs are programmable cores that execute shader and kernel code. Each EU has a General Register File (GRF) and an Architecture Register File (ARF). In recent generations (Gen11/Gen12), each EU runs seven hardware threads that issue SIMD instructions.
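The assembly line that follows adds two four-component registers with per-source swizzles and a destination write mask. A minimal C sketch of that behavior, reading the instruction as a per-channel add, might look like this; the types and function name are hypothetical and the sketch ignores SIMD width, predication, and data types other than float.

```c
#include <assert.h>
#include <stdint.h>

enum chan { X = 0, Y = 1, Z = 2, W = 3 };

struct vec4 { float c[4]; };

/* dst.<mask> = src0.<swz0> + src1.<swz1>.
 * Bit i of write_mask enables channel i; masked-off channels keep their value. */
static void add_swizzled(struct vec4 *dst, uint8_t write_mask,
                         const struct vec4 *src0, const enum chan swz0[4],
                         const struct vec4 *src1, const enum chan swz1[4])
{
    struct vec4 tmp;

    /* Compute all channels first so dst may alias a source safely. */
    for (int i = 0; i < 4; i++)
        tmp.c[i] = src0->c[swz0[i]] + src1->c[swz1[i]];

    for (int i = 0; i < 4; i++) {
        if (write_mask & (1u << i))
            dst->c[i] = tmp.c[i];
    }
}
```

With a write mask of 0x7 (xyz), swizzle yxzw on the first source, and zwxy on the second, channel x of the destination receives src0.y + src1.z, and channel w is left untouched.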
<code>add dst.xyz src0.yxzw src1.zwxy</code>
Unified Return Buffer (URB)
The URB is on‑chip memory that FF units and EU threads share to pass data between pipeline stages. Threads read and write URB entries by sending messages.
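One way to picture the URB is as a pool of fixed-size entries addressed by handles. The sketch below is a software stand-in under that assumption: the pool size, entry size, and all function names are hypothetical, and the read/write functions merely imitate what the hardware does in response to URB messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define URB_ENTRIES 4       /* hypothetical pool size */
#define URB_ENTRY_DWORDS 8  /* hypothetical entry size */
#define URB_INVALID_HANDLE (-1)

struct urb {
    uint32_t data[URB_ENTRIES][URB_ENTRY_DWORDS];
    bool in_use[URB_ENTRIES];
};

/* An FF unit allocates an entry and receives a handle (here, the entry index). */
static int urb_alloc(struct urb *u)
{
    for (int h = 0; h < URB_ENTRIES; h++) {
        if (!u->in_use[h]) {
            u->in_use[h] = true;
            return h;
        }
    }
    return URB_INVALID_HANDLE;
}

/* Stand-in for a URB write message: copy `count` dwords into the entry. */
static void urb_write(struct urb *u, int handle, const uint32_t *src, int count)
{
    assert(handle >= 0 && handle < URB_ENTRIES && u->in_use[handle]);
    assert(count >= 0 && count <= URB_ENTRY_DWORDS);
    memcpy(u->data[handle], src, (size_t)count * sizeof(uint32_t));
}

/* Stand-in for a URB read message: expose the entry's contents. */
static const uint32_t *urb_read(const struct urb *u, int handle)
{
    assert(handle >= 0 && handle < URB_ENTRIES && u->in_use[handle]);
    return u->data[handle];
}

/* Return the entry to the pool once downstream stages are done with it. */
static void urb_free(struct urb *u, int handle)
{
    assert(handle >= 0 && handle < URB_ENTRIES && u->in_use[handle]);
    u->in_use[handle] = false;
}
```

In this picture, a producer stage allocates a handle, writes its output, and passes only the handle downstream; the consumer reads through the handle and the entry is recycled afterwards.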
Thread Dispatching
When a pipeline stage requests a thread, the Thread Dispatcher allocates register space on an EU, loads control information from URB, and starts execution.
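The allocation step can be sketched as a search for a free hardware thread slot across the EUs. This is a simplified model: the first-fit policy, the EU count, and every name below are assumptions, and the real dispatcher also has to account for register space and other resources that the sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_EUS 2        /* hypothetical EU count for the sketch */
#define THREADS_PER_EU 7 /* matches the seven threads per EU mentioned for Gen11/Gen12 */

struct eu {
    bool busy[THREADS_PER_EU];
};

struct thread_slot {
    int eu;
    int thread;
};

/* Claim the first free hardware thread; returns false if every slot is busy. */
static bool dispatch_thread(struct eu eus[NUM_EUS], struct thread_slot *out)
{
    for (int e = 0; e < NUM_EUS; e++) {
        for (int t = 0; t < THREADS_PER_EU; t++) {
            if (!eus[e].busy[t]) {
                eus[e].busy[t] = true;
                out->eu = e;
                out->thread = t;
                return true;
            }
        }
    }
    return false;
}

/* Release a slot when its thread terminates. */
static void retire_thread(struct eu eus[NUM_EUS], struct thread_slot slot)
{
    assert(eus[slot.eu].busy[slot.thread]);
    eus[slot.eu].busy[slot.thread] = false;
}
```

When `dispatch_thread` returns false, the requesting stage in this model would simply have to wait until some thread retires.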
Shader Stages
Vertex Shader (VS)
After VF writes vertex data to URB, the VS reads it via URB handles, launches EU threads, and runs the compiled shader. The driver emits 3DSTATE_VS with kernel start pointer, binding table size, scratch space, etc.
<code>/* Fields common to the per-stage thread-dispatch packets. Expects `shader`,
 * `prog_data`, and `vue_prog_data` in the caller's scope; `stage` is unused
 * in this trimmed excerpt. */
#define INIT_THREAD_DISPATCH_FIELDS(pkt, prefix, stage) \
   pkt.KernelStartPointer = KSP(shader); \
   pkt.BindingTableEntryCount = shader->bt.size_bytes / 4; \
   pkt.FloatingPointMode = prog_data->use_alt_mode; \
   pkt.DispatchGRFStartRegisterForURBData = prog_data->dispatch_grf_start_reg; \
   pkt.prefix##URBEntryReadLength = vue_prog_data->urb_read_length; \
   pkt.prefix##URBEntryReadOffset = 0; \
   pkt.StatisticsEnable = true; \
   pkt.Enable = true;

static void iris_store_vs_state(struct iris_context *ice,
                                const struct gen_device_info *devinfo,
                                struct iris_compiled_shader *shader)
{
   struct brw_stage_prog_data *prog_data = shader->prog_data;
   struct brw_vue_prog_data *vue_prog_data = (void *) prog_data;

   /* Pack 3DSTATE_VS once and store it with the compiled shader. */
   iris_pack_command(GENX(3DSTATE_VS), shader->derived_data, vs) {
      INIT_THREAD_DISPATCH_FIELDS(vs, Vertex, MESA_SHADER_VERTEX);
      /* The field is programmed with the thread count minus one. */
      vs.MaximumNumberofThreads = devinfo->max_vs_threads - 1;
      vs.SIMD8DispatchEnable = true;
      vs.UserClipDistanceCullTestEnableBitmask = vue_prog_data->cull_distance_mask;
   }
}</code>
Sampler State
The sampler provides filtered texture values to the EU. Sampler state objects are created via OpenGL calls (e.g., glGenSamplers) and translated by Mesa into hardware SAMPLER_STATE entries.
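The packing code that follows converts the API's maximum anisotropy into a small hardware ratio code with the expression MIN2((max_anisotropy - 2) / 2, RATIO161). The standalone sketch below checks that arithmetic in isolation. It assumes RATIO21 encodes as 0 and RATIO161 as 7, which is consistent with the formula but should be confirmed against the generated headers; the function name is hypothetical.

```c
#include <assert.h>

/* Assumed encodings: 0 means 2:1, and each step adds 2 up to 7 for 16:1. */
enum { RATIO21 = 0, RATIO161 = 7 };

#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/* Map an API-level max anisotropy (1, 2, 4, 8, 16, ...) to the ratio code. */
static unsigned encode_max_anisotropy(unsigned max_anisotropy)
{
    /* Below 2 the driver leaves the default of RATIO21 in place. */
    if (max_anisotropy < 2)
        return RATIO21;
    return MIN2((max_anisotropy - 2) / 2, (unsigned)RATIO161);
}
```

Under these assumptions, requests above 16x clamp to the 16:1 code rather than overflowing the field.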
<code>iris_pack_state(GENX(SAMPLER_STATE), cso->sampler_state, samp) {
   samp.TCXAddressControlMode = wrap_s;
   samp.TCYAddressControlMode = wrap_t;
   samp.TCZAddressControlMode = wrap_r;
   samp.CubeSurfaceControlMode = state->seamless_cube_map;
   samp.NonnormalizedCoordinateEnable = !state->normalized_coords;
   samp.MinModeFilter = state->min_img_filter;
   samp.MagModeFilter = mag_img_filter;
   samp.MipModeFilter = translate_mip_filter(state->min_mip_filter);
   samp.MaximumAnisotropy = RATIO21;

   if (state->max_anisotropy >= 2) {
      if (state->min_img_filter == PIPE_TEX_FILTER_LINEAR) {
         samp.MinModeFilter = MAPFILTER_ANISOTROPIC;
         samp.AnisotropicAlgorithm = EWAApproximation;
      }

      if (state->mag_img_filter == PIPE_TEX_FILTER_LINEAR)
         samp.MagModeFilter = MAPFILTER_ANISOTROPIC;

      samp.MaximumAnisotropy = MIN2((state->max_anisotropy - 2) / 2, RATIO161);
   }
}</code>
Mesa 3D Driver Implementation
Mesa translates OpenGL state into GPU commands. For example, vertex buffers are emitted with 3DSTATE_VERTEX_BUFFERS, index buffers with 3DSTATE_INDEX_BUFFER, and binding tables with the per‑stage 3DSTATE_BINDING_TABLE_POINTERS_* commands. The driver allocates buffers in specific memory zones (e.g., IRIS_MEMZONE_BINDER) and writes the virtual addresses into the batch buffer.
<code>/**
 * Memory zones. When allocating a buffer, you can request a specific
 * region of the virtual address space (PPGTT).
 */
enum iris_memory_zone {
   IRIS_MEMZONE_SHADER,
   IRIS_MEMZONE_BINDER,
   IRIS_MEMZONE_SCRATCH,
   IRIS_MEMZONE_SURFACE,
   IRIS_MEMZONE_DYNAMIC,
   IRIS_MEMZONE_OTHER,
   IRIS_MEMZONE_BORDER_COLOR_POOL,
};

static void iris_set_vertex_buffers(struct pipe_context *ctx,
                                    unsigned start_slot, unsigned count,
                                    unsigned unbind_num_trailing_slots,
                                    bool take_ownership,
                                    const struct pipe_vertex_buffer *buffers)
{
   // ... pack VERTEX_BUFFER_STATE ...
   iris_pack_state(GENX(VERTEX_BUFFER_STATE), vb) {
      vb.VertexBufferIndex = start_slot + i;
      vb.AddressModifyEnable = true;
      vb.BufferPitch = buffer->stride;
      if (res) {
         vb.BufferSize = res->base.b.width0 - (int)buffer->buffer_offset;
         vb.BufferStartingAddress = ro_bo(NULL, res->bo->address + (int)buffer->buffer_offset);
         vb.MOCS = iris_mocs(res->bo, &screen->isl_dev, ISL_SURF_USAGE_VERTEX_BUFFER_BIT);
      } else {
         vb.NullVertexBuffer = true;
         vb.MOCS = iris_mocs(NULL, &screen->isl_dev, ISL_SURF_USAGE_VERTEX_BUFFER_BIT);
      }
   }
}
</code>
The driver also manages state binding tables, sampler tables, and surface state tables, emitting the corresponding 3DSTATE_* commands to bind them to the hardware.
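A binding table can be pictured as a flat array of offsets, one per surface a shader references. The sketch below assumes each entry is a 32-bit byte offset to a surface state, which matches the `shader->bt.size_bytes / 4` entry count in the 3DSTATE_VS code earlier in this article. The 64-byte surface state size, the table capacity, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SURFACE_STATE_SIZE 64u /* hypothetical size and alignment of one surface state */
#define MAX_BT_ENTRIES 8u      /* hypothetical table capacity */

struct binding_table {
    uint32_t entries[MAX_BT_ENTRIES]; /* each entry: byte offset of a surface state */
    unsigned count;
};

/* Append a surface state offset; returns its binding table index or -1 if full. */
static int bt_add_surface(struct binding_table *bt, uint32_t surface_offset)
{
    assert(surface_offset % SURFACE_STATE_SIZE == 0);
    if (bt->count == MAX_BT_ENTRIES)
        return -1;
    bt->entries[bt->count] = surface_offset;
    return (int)bt->count++;
}

/* Size in bytes; dividing by 4 gives the entry count a packet field would carry. */
static unsigned bt_size_bytes(const struct binding_table *bt)
{
    return bt->count * (unsigned)sizeof(bt->entries[0]);
}
```

The shader then refers to a surface by its index in this table, and the hardware follows the stored offset to find the actual surface state.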
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.