Shadow APIs vs Official LLMs: Up to 47% Performance Gap Revealed in New Study

A recent arXiv paper audits 17 widely used shadow APIs, showing that their outputs can deviate from official large language model APIs by as much as 47.21%, with accuracy on the MedQA benchmark dropping from 83.82% to around 37%, raising serious reliability concerns.

DeepHub IMBA

Accessing cutting‑edge large language models such as GPT‑5 or Gemini‑2.5 often involves high costs, payment barriers, or geographic restrictions. This has spurred the emergence of third‑party “shadow APIs” that claim to provide cheaper, unrestricted access to the same models.

The authors of a March 2026 arXiv paper (arXiv:2603.01919v1) investigated whether shadow APIs truly deliver the same results as official APIs. They identified 17 shadow API services cited in 187 academic papers, noting that the most popular of these projects on GitHub has nearly 60,000 stars and over 5,900 citations.

"I used the official Claude API without issues; the problem with OpenRouter might just be a routing error," the author noted in a community discussion.

To assess these services, the study evaluated them across three dimensions: Utility, Safety, and Model Verification. Experiments compared the performance of official APIs against their shadow counterparts on several benchmarks.

One striking result was a performance discrepancy of up to 47.21%. For example, on the high‑risk medical benchmark MedQA, the official Gemini‑2.5‑flash model achieved an accuracy of 83.82%, whereas the shadow APIs tested fell to roughly 37%.
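The gap the paper reports is simply the difference in benchmark accuracy between the two services. As an illustrative sketch (not the paper's own code, and with toy data rather than real MedQA responses), the comparison boils down to scoring each service's answers against the gold labels and taking the difference in percentage points:

```python
def accuracy(predictions, gold):
    """Fraction of items answered correctly."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def performance_gap(official_preds, shadow_preds, gold):
    """Absolute accuracy difference between two services, in percentage points."""
    return abs(accuracy(official_preds, gold) - accuracy(shadow_preds, gold)) * 100

# Toy MedQA-style data: multiple-choice answers as option letters.
gold     = ["A", "C", "B", "D", "A", "B", "C", "D"]
official = ["A", "C", "B", "D", "A", "B", "C", "A"]  # 7/8 correct
shadow   = ["A", "B", "B", "A", "C", "B", "D", "A"]  # 3/8 correct

print(accuracy(official, gold))                 # 0.875
print(accuracy(shadow, gold))                   # 0.375
print(performance_gap(official, shadow, gold))  # 50.0
```

In the study's actual setup the predictions come from live API calls to each endpoint; the sketch above only shows the scoring step that produces figures like the 83.82% vs. ~37% MedQA result.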

The paper includes detailed tables illustrating the gaps across various metrics; the full data are omitted here for brevity.

The authors conclude that the widespread use of shadow APIs—especially in research—poses a significant threat to downstream application reliability and scientific validity, as many results may be based on falsified or degraded model outputs.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

large language models · performance evaluation · AI safety · model verification · shadow APIs
Written by

DeepHub IMBA

A public account sharing practical AI insights: internet + machine learning + big data + architecture = IMBA.
