How GPU Frequency, Power Consumption, and FLOPS Interrelate

The article explains the theoretical and practical relationships between GPU clock frequencies, power consumption, and FLOPS, describes key hardware metrics such as SM, memory, and video clocks, shows how to query and set these values with nvidia‑smi, and presents experiments on a Tesla P4 that reveal the non‑linear trade‑offs between performance, power, and temperature.

Infra Learning Club

1. Introduction

When studying GPU DVFS (dynamic voltage and frequency scaling), it is essential to understand how frequency, power, and performance (measured in FLOPS) are related.

2. Basic Concepts

FLOPS

FLOPS (Floating‑point Operations Per Second) indicates the theoretical peak compute capability of a GPU. It can be broken down into FP16, FP32, and FP64, representing the number of 16‑bit, 32‑bit, and 64‑bit floating‑point operations performed each second.

SM Clock

SM Clock (also called Core, Engine, or Graphics Clock) is expressed in MHz and reflects the speed of each CUDA core. Higher SM Clock generally means higher per‑core compute capability, so a higher clock yields better FLOPS.

Memory Clock

Memory Clock describes the frequency of the GPU’s memory subsystem, also in MHz, and determines memory bandwidth.

3. Hardware Metrics that Influence FLOPS

Because power consumption follows the relation P = C·V²·F (where P is power, C capacitance, V voltage, and F frequency), increasing the clock raises power and temperature, which can cause throttling or hardware failure. To reduce power, voltage or frequency must be lowered, but lowering frequency also reduces performance.

Clock Frequency

The GPU clock frequency directly affects CUDA kernel performance. Users can view current clocks with nvidia-smi -q -d CLOCK and set them using commands such as

nvidia-smi -i [GPU_ID] --lock-gpu-clocks=<core_clock>

or nvidia-smi -i [GPU_ID] -ac <memory_clock>,<sm_clock>. However, these locks are not hard constraints; adaptive clock scaling may still override them under heavy load.

Power

Power limits can be queried with nvidia-smi -q -d POWER and set with nvidia-smi -i <index> -pl 60 (here, a 60 W limit). The query output lists the current, requested, default, minimum, and maximum power limits, as well as recent power samples.

Temperature

Temperature is obtained with nvidia-smi -q -d TEMPERATURE. Exceeding the slowdown temperature triggers thermal throttling, which automatically reduces clocks and voltage; exceeding the shutdown temperature powers the GPU off to prevent damage.

4. Experiment Design

Using a Tesla P4, the author measured sustained FP16 GEMM performance for square matrices (M=N=K) across a range of sizes. The test ran 500 iterations for each of the three operand layouts (NN, NT, TN) under four API configurations: cublasHgemm, cublasGemmEx with the default algorithm, cublasGemmEx with a tuned algorithm, and cublasLt with a tuned algorithm. The following code implements the benchmark:

#include <iostream>
#include <vector>
#include <chrono>
#include <cuda_fp16.h>
#include <cublas_v2.h>
#include <cublasLt.h>
#include <cmath>
#include <string>

#define CHECK_CUDA(call) { cudaError_t err = call; if (err != cudaSuccess) { std::cerr << "CUDA error: " << cudaGetErrorString(err) << " at line " << __LINE__ << std::endl; exit(1); } }
#define CHECK_CUBLAS(call) { cublasStatus_t stat = call; if (stat != CUBLAS_STATUS_SUCCESS) { std::cerr << "cuBLAS error: " << stat << " at line " << __LINE__ << std::endl; exit(1); } }

double calc_gflops(int M, int N, int K, double ms) {
    // A GEMM performs 2*M*N*K floating-point operations (one multiply and one add per term).
    double flops = 2.0 * M * N * K;
    return (flops / (ms / 1000.0)) / 1e9;
}

void run_test(cublasHandle_t h, cublasLtHandle_t lt, int M, int N, int K,
              const std::string& api, int runs = 500) {
    /* allocation, initialization, GEMM loops omitted for brevity */
}

int main() {
    cublasHandle_t h;
    cublasLtHandle_t lt;
    CHECK_CUBLAS(cublasCreate(&h));
    CHECK_CUBLAS(cublasLtCreate(&lt));
    std::vector<int> sizes = {128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280,
                              1408, 1536, 1664, 1792, 1920, 2048, 2560, 3072, 3584,
                              4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192,
                              12288, 16384, 20480, 24576, 28672, 32768, 36864, 40960};
    std::vector<std::string> apis = {"cublasHgemm", "cublasGemmEx_Default",
                                     "cublasGemmEx_Tuned", "cublasLt_Tuned"};
    for (int s : sizes) {
        std::cout << "\nTesting Matrix Size: " << s << "x" << s << "x" << s << std::endl;
        for (const auto& api : apis) run_test(h, lt, s, s, s, api);
    }
    CHECK_CUBLAS(cublasDestroy(h));
    CHECK_CUBLAS(cublasLtDestroy(lt));
    return 0;
}

Experimental Observations

When the SM and memory clocks were set to their maximum values (nvidia-smi -i 0 -ac 3003,1531), the Tesla P4 achieved about 97 GFLOPS on a 1024x1024x1024 FP16 GEMM, close to the advertised FP16 peak. Power draw at this setting was roughly half of the 60 W limit (≈30 W). The GPU's default clocks drew slightly less power (≈27 W) but delivered noticeably lower GFLOPS.

Repeated runs showed that the relationship among frequency, power, and FLOPS is not linear; optimal performance‑power trade‑offs require extensive measurement and fitting.

5. Conclusion

GPU performance cannot be maximized simply by running at the highest clock; power, voltage, and temperature constraints create non‑linear trade‑offs. Understanding and tuning SM, memory, and application clocks with tools like nvidia‑smi, combined with empirical benchmarking, is necessary to locate the optimal operating point for a given workload.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Infra Learning Club

Infra Learning Club shares study notes, cutting-edge technology, and career discussions.
