Why Single‑Thread CPU Loads Crash While Multi‑Threaded Loads Stay Stable – A Debugging Case Study
The article investigates intermittent crashes on an i9‑13900K when a single thread runs at full load, analyzes CPU voltage fluctuations, documents the debugging process using DirectXShaderCompiler, and demonstrates how converting a PSO warm‑up to multi‑threaded execution eliminates crashes and dramatically improves launch speed.
Origin
If you have used an i9‑13900K in a development project, you may have seen a program run fine and then suddenly crash. The root cause has been traced to voltage fluctuations in the CPU power supply. A development team cannot fix a hardware defect, but the application still has to run on the affected machines.
We decided to approach the problem from the code side, analyzing and debugging to see if we could reduce the failure rate. This article records that attempt.
Troubleshooting Process
After checking the details of the affected machine, our first suspicion was a faulty CPU, since the i9‑13900K is known for a high failure rate. However, other games and diagnostic tools such as AIDA64 and IPDT (Intel Processor Diagnostic Tool) ran on the same machine without revealing any issues, so we initially set that explanation aside.
We then began troubleshooting. The crashes consistently occurred in dxilconv‑related code, which led us to suspect that incorrect data was being passed to dxilconv. We fetched the dxilconv source from GitHub, built it ourselves, and debugged it extensively. Eventually we captured crucial evidence that shifted the focus back to the CPU:
The value stored at address rbx+8 was non‑zero, but the value loaded from it into rax was zero, a clear sign of a CPU fault rather than bad input data. The same day the machine also hit a blue screen, which matched the degradation failure this CPU is notorious for (the so‑called “shrink‑cylinder” fault).
Exploring Programmatic Ways to Reduce CPU Faults
Why did other games and benchmarks not encounter the problem? Could there be a method to lower CPU failure rates through software?
Through multiple comparative tests we found differences between PSO warm‑up and AIDA64:
AIDA64 runs all threads at full load.
PSO warm‑up runs a single thread at full load while other threads remain idle.
We modified the PSO warm‑up to be multi‑threaded, fully loading all cores. The results were impressive:
Before the modification, a crash occurred within 2–3 launches. After the modification, the program started more than 20 times without crashing, and launch time dropped from over 20 seconds to under 3 seconds, more than six times faster.
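The multi‑threaded warm‑up described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `PsoDesc`, `warmUpParallel`, and the `compileOne` callback are hypothetical names introduced here, and the callback stands in for whatever per‑PSO compilation the real warm‑up performs. It is assumed to be safe to call from several threads at once and not to throw.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one pipeline state description. A real warm-up
// would carry shader bytecode and pipeline state here.
struct PsoDesc {
    int id;
};

// Runs compileOne over every entry of descs using workerCount threads.
// Each worker pulls the next index from a shared atomic counter, so all
// threads stay busy until the list is exhausted instead of one thread doing
// all the work. Returns the number of entries processed.
// Assumption: compileOne is thread-safe and does not throw.
std::size_t warmUpParallel(
    const std::vector<PsoDesc>& descs,
    const std::function<void(const PsoDesc&)>& compileOne,
    std::size_t workerCount = std::thread::hardware_concurrency()) {
    assert(static_cast<bool>(compileOne));
    if (workerCount == 0) workerCount = 1;  // hardware_concurrency() may return 0
    if (workerCount > descs.size()) workerCount = descs.size();

    std::atomic<std::size_t> next{0};  // next index to hand out
    std::atomic<std::size_t> done{0};  // entries finished so far

    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= descs.size()) break;  // no work left
            compileOne(descs[i]);
            done.fetch_add(1);
        }
    };

    std::vector<std::thread> threads;
    threads.reserve(workerCount);
    for (std::size_t t = 0; t < workerCount; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();  // warm-up is complete once all workers exit
    return done.load();
}
```

With the default `workerCount`, the function starts one worker per hardware thread reported by the standard library, which corresponds to the "fully load all cores" approach. A shared counter is used rather than fixed slices so that a few slow entries do not leave the other workers idle.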
Post‑Mortem
Intel has previously researched similar issues, attributing them to aggressive motherboard power‑delivery strategies that exacerbate CPU faults.
Potential reasons for the observed behavior (full‑load on a single core vs. all cores) include:
Motherboards typically use multi‑phase PWM power delivery with fewer phases than cores, so several cores share one phase. When only one of those cores is fully loaded, its current draw is far higher than that of its idle neighbors, which increases the voltage drop on the shared line and can leave that core undervolted.
Increasing the PWM duty cycle to raise the voltage would overvolt the idle cores on the same phase, risking accelerated aging and breakdown.
Keeping the voltage constant leaves the fully loaded core undervolted, to the point where logic high and low levels can become indistinguishable.
When all cores are fully loaded, power distribution is balanced, reducing the likelihood of faults.
These conclusions are speculative without hardware testing; further validation from experts is welcome.
End
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.