Unlock Mobile GPU Power: A Hands‑On Guide to OpenCL Programming on Android

This article introduces the fundamentals of heterogeneous computing on mobile GPUs, explains OpenCL concepts and its programming model, and provides a step‑by‑step example of adding two arrays with complete OpenCL code for Android devices.

Baidu Geek Talk

Introduction

Steady gains in mobile chip performance now make graphics‑intensive workloads and deep‑learning inference practical on smartphones. GPUs offer high floating‑point throughput and are programmable through widely supported APIs, which makes them an important heterogeneous compute unit on Android. Qualcomm Adreno and Arm Mali are the most common GPU families on Android devices, and both support OpenCL.

Fundamental Concepts

Heterogeneous Computing

Systems that combine processing units with different instruction sets and architectures—such as CPUs, GPUs, DSPs, ASICs, and FPGAs—are called heterogeneous computers.

GPU

A Graphics Processing Unit is a specialized microprocessor for parallel graphics and compute workloads. Compared with CPUs, GPUs have many more cores, higher parallelism, and superior performance‑per‑area and performance‑per‑watt ratios, making them well‑suited for mobile workloads.

OpenCL Overview

OpenCL (Open Computing Language), managed by the Khronos Group, is an open standard for heterogeneous programming. It defines a C99‑based kernel language and a set of APIs to discover platforms, create contexts, compile programs, and launch kernels on CPUs, GPUs, DSPs, FPGAs, and other accelerators.

OpenCL Programming Model

Platform Model

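In the OpenCL platform model, a host connects to one or more OpenCL devices. On Android the host is the CPU and the device of interest is the GPU. Each device contains multiple compute units (roughly comparable to the streaming multiprocessors of a desktop GPU), and each compute unit consists of many processing elements, the ALU lanes that execute individual work‑items.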

Figure: Platform Model

Execution Model

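A kernel that has already been built is launched with clEnqueueNDRangeKernel, which runs one instance of the kernel (a work‑item) for every point of an N‑dimensional index space, where N is 1, 2, or 3. For a 2‑D image, for example, each pixel can be mapped to its own work‑item. Work‑items are grouped into work‑groups whose size is set by the local_work_size argument of clEnqueueNDRangeKernel; passing NULL lets the runtime choose. Work‑items in the same work‑group share local memory, can synchronize with barrier(), and can use work‑group built‑ins such as async_work_group_copy.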
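To make the index space concrete, the following is a minimal CPU-side sketch of how a 1-D NDRange decomposes into work-groups and work-items. It does not call OpenCL; the struct and both helper functions are hypothetical names introduced only for illustration, and the sketch assumes no global offset and the OpenCL 1.x rule that the global size is a multiple of the work-group size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Identifiers one work-item would see inside a 1-D NDRange.
struct WorkItemIds {
    std::size_t global_id;  // what get_global_id(0) would return
    std::size_t group_id;   // what get_group_id(0) would return
    std::size_t local_id;   // what get_local_id(0) would return
};

// Round n up to the next multiple of local_size so that the global size
// divides evenly into work-groups (hypothetical helper).
std::size_t round_up_global_size(std::size_t n, std::size_t local_size) {
    assert(local_size > 0);
    return (n + local_size - 1) / local_size * local_size;
}

// Enumerate every work-item of a 1-D NDRange on the CPU (hypothetical helper).
std::vector<WorkItemIds> enumerate_work_items(std::size_t global_size,
                                              std::size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    std::vector<WorkItemIds> ids;
    ids.reserve(global_size);
    const std::size_t num_groups = global_size / local_size;
    for (std::size_t group = 0; group < num_groups; ++group) {
        for (std::size_t local = 0; local < local_size; ++local) {
            // With no global offset: global_id = group_id * local_size + local_id.
            ids.push_back({group * local_size + local, group, local});
        }
    }
    return ids;
}
```

If the global size is padded this way, the kernel typically needs a bounds check (for example `if (i < n)`) so the extra work-items do not touch memory past the end of the data.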

Figure: Execution Model

Memory Model

The OpenCL memory hierarchy consists of:

Host Memory – memory directly accessible by the host CPU.

Global/Constant Memory – visible to all work‑items on every compute unit of the device; constant memory is read‑only for kernels.

Local Memory – shared among the work‑items of one work‑group, which run on a single compute unit.

Private Memory – exclusive to a single work‑item.
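The address spaces in this hierarchy show up directly as qualifiers in kernel code. The sketch below holds the OpenCL C source of a hypothetical kernel, `group_sum`, in a C++ string, and pairs it with a CPU emulation of the same per-work-group sum so the data flow can be followed without a device. The kernel name, its arguments, and the emulation helper are all assumptions made for illustration; they are not part of the article's example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// OpenCL C source for a hypothetical kernel; each argument lives in a
// different address space of the memory model.
const char* const kGroupSumSource = R"CLC(
__kernel void group_sum(__global const float* in,   // global memory
                        __constant float* scale,    // constant memory (read-only)
                        __local float* scratch,     // local memory, one slot per work-item
                        __global float* out) {      // global memory
    const size_t lid = get_local_id(0);             // private variable
    scratch[lid] = in[get_global_id(0)] * scale[0];
    barrier(CLK_LOCAL_MEM_FENCE);                   // make scratch visible to the whole group
    if (lid == 0) {
        float sum = 0.0f;                           // private variable
        for (size_t i = 0; i < get_local_size(0); ++i) sum += scratch[i];
        out[get_group_id(0)] = sum;
    }
}
)CLC";

// CPU emulation of the kernel above: one output value per work-group.
std::vector<float> emulate_group_sum(const std::vector<float>& in, float scale,
                                     std::size_t local_size) {
    assert(local_size > 0 && in.size() % local_size == 0);
    std::vector<float> out(in.size() / local_size);
    for (std::size_t group = 0; group < out.size(); ++group) {
        std::vector<float> scratch(local_size);  // stands in for __local memory
        for (std::size_t lid = 0; lid < local_size; ++lid) {
            scratch[lid] = in[group * local_size + lid] * scale;
        }
        // The barrier is implicit here: the loop above has already finished.
        float sum = 0.0f;
        for (float v : scratch) sum += v;
        out[group] = sum;
    }
    return out;
}
```

On the host, a `__local` argument such as `scratch` is sized with clSetKernelArg by passing the byte size and a NULL pointer, since local memory is allocated per work-group on the device rather than copied from the host.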

Figure: Memory Model

OpenCL Programming Example – Array Add

The example adds two arrays and is available at https://github.com/xiebaiyuan/opencl_cook/blob/master/array_add/array_add.cpp.

Step 1 – Initialize OpenCL environment

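cl_int status;
// init device (init_device() is a helper from the example; its body is not shown here)
runtime.device = init_device();
// create a context for that single device, then a command queue on it;
// status should be compared against CL_SUCCESS after each call
runtime.context = clCreateContext(nullptr, 1, &runtime.device, nullptr, nullptr, &status);
runtime.queue = clCreateCommandQueue(runtime.context, runtime.device, 0, &status);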
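The snippets in this example store a `status` code from each call but never inspect it. One possible way to surface failures is sketched below. It is written without the OpenCL headers so it can be compiled anywhere: `ClStatus` is a stand-in for `cl_int`, `status_name` and `check_status` are hypothetical helpers, and the numeric values are the standard error codes from `CL/cl.h` (only a subset is listed).

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Stand-in for cl_int so the sketch compiles without the OpenCL headers.
using ClStatus = std::int32_t;

// Map a few common OpenCL status codes to their symbolic names.
std::string status_name(ClStatus status) {
    switch (status) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -34: return "CL_INVALID_CONTEXT";
        case -38: return "CL_INVALID_MEM_OBJECT";
        case -46: return "CL_INVALID_KERNEL_NAME";
        case -52: return "CL_INVALID_KERNEL_ARGS";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        default:  return "unlisted OpenCL status " + std::to_string(status);
    }
}

// Throw with a readable message if an OpenCL call did not succeed.
void check_status(ClStatus status, const std::string& what) {
    if (status != 0) {
        throw std::runtime_error(what + " failed: " + status_name(status));
    }
}
```

In real host code this would be called right after each API call, for example `check_status(status, "clCreateContext");`, so a failure is reported at the call that caused it.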

Step 2 – Build program and create kernel

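cl_int status;
// build_program() is a helper from the example that builds the program from PROGRAM_FILE
runtime.program = build_program(runtime.context, runtime.device, PROGRAM_FILE);
// look up the kernel entry point named KERNEL_FUNC in the built program
runtime.kernel = clCreateKernel(runtime.program, KERNEL_FUNC, &status);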
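The kernel source itself is not reproduced in this article. As a rough sketch of what an element-wise add kernel could look like, the block below holds a hypothetical OpenCL C kernel in a C++ string, with its three arguments in the same order as the clSetKernelArg indices used in Step 3, next to a plain CPU reference of the same computation. The kernel name `array_add` and the helper `array_add_reference` are assumptions; the actual contents of PROGRAM_FILE and KERNEL_FUNC may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical OpenCL C source for an element-wise add kernel.
const char* const kArrayAddSource = R"CLC(
__kernel void array_add(__global const float* input,
                        __global const float* bias,
                        __global float* output) {
    const size_t i = get_global_id(0);  // one work-item per element
    output[i] = input[i] + bias[i];
}
)CLC";

// CPU reference: what every work-item computes, run as an ordinary loop.
std::vector<float> array_add_reference(const std::vector<float>& input,
                                       const std::vector<float>& bias) {
    assert(input.size() == bias.size());
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = input[i] + bias[i];
    }
    return output;
}
```

The loop body and the kernel body are the same expression; the difference is that the GPU runs one work-item per index instead of iterating.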

Step 3 – Prepare buffers and set kernel arguments

// init data
float input_data[ARRAY_SIZE];
float bias_data[ARRAY_SIZE];
float output_data[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
    input_data[i] = 1.f * (float)i;
    bias_data[i] = 10000.f;
}
// create buffers
runtime.input_buffer = clCreateBuffer(runtime.context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    ARRAY_SIZE * sizeof(float), input_data, &status);
runtime.bias_buffer = clCreateBuffer(runtime.context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    ARRAY_SIZE * sizeof(float), bias_data, &status);
// the kernel writes to this buffer, so it is write-only and needs no initial host data
runtime.output_buffer = clCreateBuffer(runtime.context, CL_MEM_WRITE_ONLY,
    ARRAY_SIZE * sizeof(float), nullptr, &status);
// set kernel arguments
status = clSetKernelArg(runtime.kernel, 0, sizeof(cl_mem), &runtime.input_buffer);
status |= clSetKernelArg(runtime.kernel, 1, sizeof(cl_mem), &runtime.bias_buffer);
status |= clSetKernelArg(runtime.kernel, 2, sizeof(cl_mem), &runtime.output_buffer);

Step 4 – Launch kernel and read results

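// enqueue kernel: one work-item per array element; the global size must be passed
// as a pointer to size_t, and a null local size lets the runtime pick the work-group size
size_t global_size = ARRAY_SIZE;
status = clEnqueueNDRangeKernel(runtime.queue, runtime.kernel, 1, nullptr, &global_size,
    nullptr, 0, nullptr, nullptr);
// read output buffer (CL_TRUE makes the read blocking, so output_data is valid on return)
status = clEnqueueReadBuffer(runtime.queue, runtime.output_buffer, CL_TRUE, 0,
    sizeof(output_data), output_data, 0, nullptr, nullptr);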
// process output_data ...
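One simple way to process the output is to verify it against the expected sums on the CPU. The helper below is a hypothetical sketch, not part of the original example: it counts the elements whose value differs from `input[i] + bias[i]` by more than a relative tolerance, which is an assumed default chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Count elements of `output` that differ from input[i] + bias[i] by more than
// a relative tolerance (hypothetical helper; the tolerance is an assumption).
std::size_t count_mismatches(const std::vector<float>& output,
                             const std::vector<float>& input,
                             const std::vector<float>& bias,
                             float tolerance = 1e-5f) {
    assert(output.size() == input.size() && input.size() == bias.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < output.size(); ++i) {
        const float expected = input[i] + bias[i];
        // Scale the tolerance by the magnitude of the expected value.
        const float limit = tolerance * std::max(1.0f, std::fabs(expected));
        if (std::fabs(output[i] - expected) > limit) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With the data from Step 3 (input values 0, 1, 2, ... and a bias of 10000), a correct run should report zero mismatches.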

Conclusion

As CPU scaling approaches physical limits, programming GPUs and other accelerators is becoming a key direction for mobile computing. This article covered the basics of the OpenCL programming model and walked through a concrete array‑addition example.

