Sneaky Tricks to Inflate Deep Learning Model Scores (And Why They’re Misleading)

The article enumerates a series of dubious techniques—from inflating batch sizes and hidden compute to hyper‑parameter tricks and fabricated evaluation methods—designed to artificially boost deep‑learning model scores on benchmarks, exposing how easy it is to game performance metrics.

Baobao Algorithm Notes

Nov 11, 2024

Author: 黄哲威 (hzwer)
Link: https://www.zhihu.com/question/347847220/answer/26536819499

Brute-Force Compute

Increase the batch size while keeping the reported number of iterations unchanged, effectively raising the total number of processed samples.

Train for more epochs but report training length in terms of iterations (or vice‑versa) to conceal the mismatch.

Keep the epoch count fixed but reuse each training sample multiple times, thereby increasing data exposure without changing the epoch number.
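The three tricks above all hide the same quantity: the total number of samples the model actually processed. A minimal sketch of that bookkeeping follows; the function names and all numbers are illustrative assumptions, not taken from any particular paper.

```python
def samples_seen(batch_size: int, iterations: int) -> int:
    """Total training samples processed, counting every repeat."""
    return batch_size * iterations


def samples_from_epochs(dataset_size: int, epochs: int, repeats: int = 1) -> int:
    """Samples processed when each sample is drawn `repeats` times per epoch."""
    return dataset_size * epochs * repeats


# Hypothetical numbers: identical reported iteration count, 4x the batch size.
baseline = samples_seen(batch_size=256, iterations=100_000)
inflated = samples_seen(batch_size=1024, iterations=100_000)
print(inflated / baseline)  # 4.0

# Hypothetical numbers: identical reported epoch count, 3 repeats per sample.
repeated = samples_from_epochs(50_000, epochs=100, repeats=3)
plain = samples_from_epochs(50_000, epochs=100)
print(repeated / plain)  # 3.0
```

Reporting samples seen alongside iterations or epochs makes this kind of mismatch visible at a glance.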

Reduce the number of down‑sampling operations in the architecture, which raises FLOPs dramatically while still comparing models only by parameter count.
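To see why parameter count alone hides this, consider a rough sketch of the cost of a single convolution. The layer sizes are hypothetical, and the count is in multiply-accumulates (MACs), which differ from FLOPs only by a constant factor of about two.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def conv2d_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulates: every weight is applied once per output position."""
    return conv2d_params(c_in, c_out, k) * h_out * w_out


# Hypothetical layer: 3x3 convolution, 64 -> 64 channels.
params = conv2d_params(64, 64, 3)
macs_downsampled = conv2d_macs(64, 64, 3, 56, 56)   # after a stride-2 stage
macs_full_res = conv2d_macs(64, 64, 3, 112, 112)    # with that stage removed
print(params)                              # 36864
print(macs_full_res // macs_downsampled)   # 4
```

Dropping one stride-2 stage leaves the parameter count untouched but quadruples the compute of every layer that follows it.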

Ignore constraints on compute and parameters and simply add more GPU/TPU resources.

Describe high‑compute components briefly and perform efficiency analysis only on the remaining parts.

Apply re‑parameterization tricks that enlarge the model and slow training, while claiming inference‑time savings.

Use exponential moving average (EMA) or multi‑model ensembling for modest gains; add self‑distillation when possible.
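For readers unfamiliar with weight EMA, here is a minimal sketch in plain Python. The decay value and the toy oscillating weight are assumptions chosen for illustration; a real setup would apply the same update to every tensor in the model.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy "training run": a single weight that oscillates around 1.0.
raw_trajectory = [[1.0 + (0.5 if step % 2 == 0 else -0.5)] for step in range(200)]

ema = list(raw_trajectory[0])
for weights in raw_trajectory[1:]:
    ema = ema_update(ema, weights, decay=0.9)

print(raw_trajectory[-1][0])  # last raw weight, 0.5 away from 1.0
print(ema[0])                 # averaged weight, much closer to 1.0
```

The averaged weights smooth out step-to-step noise, which is why evaluating the EMA copy often scores slightly higher than the raw weights. A comparison is only fair if the baseline gets the same treatment.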

Pick an extremely small training set so the model can simply over‑fit it, inflating apparent performance.

Hyper‑parameters

Swap cosine learning‑rate decay for a fixed learning rate (or the opposite) to obtain desired performance curves; note that the final phase of cosine decay often yields a sharp performance jump.
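As a reference point, the standard cosine annealing formula can be sketched as below. The step counts and peak learning rate are hypothetical; the function is a plain restatement of the formula, not any framework's scheduler API.

```python
import math


def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` out of `total_steps`."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


total = 100
for step in (0, 50, 90, 100):
    print(step, round(cosine_lr(step, total, lr_max=0.1), 5))
```

Under these assumed settings the rate at 90% of training is already below 3% of its peak. A cosine run and a fixed-rate run therefore sit in very different regimes near the end, so a comparison between them says more about the schedule than about the model.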

Slightly raise the base learning rate for your own model while lowering it for the baseline model.

Hide many hyper‑parameters inside the code as undocumented “magic numbers”.

Exploit the numerous tunable hyper‑parameters of optimizers (e.g., momentum, weight decay, epsilon) to fine‑tune results.

Choose random seeds strategically to obtain favorable stochastic outcomes.
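The gap that seed selection can open up is easy to illustrate. The accuracies below are made-up numbers for five hypothetical runs that differ only in their seed.

```python
import statistics

# Hypothetical top-1 accuracies from five runs that differ only in the seed.
accuracy_by_seed = {0: 76.1, 1: 75.8, 2: 76.4, 3: 75.9, 4: 76.9}
scores = list(accuracy_by_seed.values())

best_seed = max(accuracy_by_seed, key=accuracy_by_seed.get)
best = accuracy_by_seed[best_seed]
mean = statistics.mean(scores)
std = statistics.stdev(scores)

print(f"cherry-picked: {best:.2f} (seed {best_seed})")
print(f"honest report: {mean:.2f} +/- {std:.2f}")
```

Reporting the mean and spread over several seeds, for both the proposed method and the baseline, removes most of the room for this trick.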

Minor Tweaks

Replace ReLU activations with Swish, Leaky ReLU, or PReLU to gain small accuracy improvements.

Insert Squeeze‑Excitation (SE) layers or inexpensive attention connections throughout the network.

Swap parameter‑free operations such as plain pooling or nearest‑neighbor resize for learnable equivalents (e.g., strided convolution in place of pooling, transposed convolution in place of nearest‑neighbor resize).

Concatenate additional intermediate features between modules without a principled justification.

Add or remove batch normalization; alternatively replace it with GroupNorm, InstanceNorm, LayerNorm, or WeightNorm where appropriate.

Augment the training data to better match the distribution of the test set, thereby reducing domain shift.

Incremental Design

Introduce exotic loss terms (e.g., GAN loss, consistency loss) even when their utility is unclear, and fill the paper with accompanying formulas.

Take a single sentence from another work, expand it into a detailed component, and add “magic” mathematical transformations to occupy space.

When adding a new component x, create a learnable scaling factor beta initialized to 0 and compute beta * x; if beta remains 0 the model is unchanged, providing a safety net.

Generalize the previous idea by stacking many such learnable‑parameter components.
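A minimal sketch of the zero-initialized gate, and of stacking several of them, is shown below. It assumes a residual form y = x + beta * component(x). The class and function names are hypothetical, and beta is a plain float here rather than a trainable parameter.

```python
class GatedBranch:
    """Wraps a new component so its output is scaled by a gate `beta`."""

    def __init__(self, component):
        self.component = component
        self.beta = 0.0  # would be a learnable parameter in a real framework

    def __call__(self, x):
        # Assumed residual form: y = x + beta * component(x).
        return [xi + self.beta * ci for xi, ci in zip(x, self.component(x))]


def fancy_module(x):
    """Stand-in for any new component (hypothetical)."""
    return [3.0 * xi + 1.0 for xi in x]


def apply_stack(blocks, x):
    """Apply the blocks one after another."""
    for block in blocks:
        x = block(x)
    return x


stack = [GatedBranch(fancy_module) for _ in range(4)]
x = [0.5, -1.0, 2.0]
print(apply_stack(stack, x) == x)  # True: with every beta at 0 the stack is an identity
```

Because the stack starts as an identity, the additions cannot make the initial model worse, whatever they contain. Any later gain may just come from the extra parameters and compute.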

Insert a Neural Architecture Search (NAS) module to claim automated design benefits.

Borrow pretrained weights from other models to raise the initial performance ceiling, effectively adding external data and labels.

Design complex curriculum‑learning schedules or sophisticated distillation schemes (feature‑level, cross‑modal) and attribute failures of competing methods to insufficient hyper‑parameter tuning.

Wrap the entire training pipeline in a reinforcement‑learning framework regardless of actual benefit.

Testing Methods

Measure a large set of metrics (e.g., ten) but report only the three that show improvement.

Run experiments on many datasets and discard the ones that do not demonstrate gains.
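A short calculation shows why the two selection tricks above are so effective. It rests on a deliberately simple assumption: the new method is no better than the baseline, and each of the 10 metrics independently moves up or down with equal probability from noise alone.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: 10 independent metrics, each equally likely to go up or down by noise.
print(prob_at_least(3, 10))  # 0.9453125
```

Under that assumption, a method with no real advantage still shows "three improved metrics" about 95% of the time. The same reasoning applies to picking favorable datasets.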

Deliberately misalign test conditions with prior work (e.g., invert RGB channels) to create an unfair baseline for competitors.

Invent new evaluation metrics or modify existing ones (e.g., compute PSNR on the Y channel while others use RGB) to inflate reported performance.
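The PSNR example can be reproduced on synthetic data. The sketch below assumes full-range BT.601 luma weights and independent Gaussian noise on each channel. Real pipelines may use a different RGB-to-Y conversion, so treat the exact gap as illustrative.

```python
import math
import random


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat sequences of values."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak**2 / mse)


def luma(pixel):
    """Full-range BT.601 luma from an (R, G, B) pixel."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


rng = random.Random(0)
# Synthetic "image": 10,000 random pixels plus independent Gaussian noise per channel.
clean = [tuple(rng.uniform(32, 224) for _ in range(3)) for _ in range(10_000)]
noisy = [tuple(c + rng.gauss(0.0, 5.0) for c in px) for px in clean]

psnr_rgb = psnr([c for px in clean for c in px], [c for px in noisy for c in px])
psnr_y = psnr([luma(px) for px in clean], [luma(px) for px in noisy])
print(f"RGB: {psnr_rgb:.2f} dB, Y: {psnr_y:.2f} dB")
```

The weighted sum that forms Y averages out part of the per-channel noise, so the Y-channel figure comes out a few dB higher for exactly the same pair of images. Numbers computed under the two conventions are not comparable.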

Identify trivial scenarios overlooked by others and claim large improvements in those niche cases.

Compare a large model against a smaller baseline without reporting the larger baseline’s results, or compare a model trained specifically for a metric against one that was never optimized for it.

Benchmark on different hardware platforms and present all results together without normalizing for compute differences.

For language models, subtly add prompts or few‑shot examples in the test phase to boost scores.

Leak test data or random seeds, or fold test samples into upstream pre‑training, to artificially improve results.

Augment test sets with real‑world or out‑of‑distribution samples that degrade the baseline, then apply additional augmentation or dropout to recover points and attribute the gain to other components.

Evaluate on a private test set scored by human judges and claim visibly large improvements.

When objective comparisons fail, rely on subjective, cherry‑picked results.

Ultimate Methods

Copy an existing method and rename it without substantive novelty.

Claim high performance while the public repository contains only a README file.

Write a paper without conducting experiments, asserting a marginal state‑of‑the‑art improvement.
