How OIBench & CoreCodeBench Expose the Real Coding Limits of LLMs
The Meituan‑M17 team and Shanghai Jiao Tong University introduced two new benchmarks, OIBench and CoreCodeBench, to more accurately evaluate large language models' algorithmic and engineering coding abilities, revealing a substantial gap between claimed performance and actual capability across a range of tasks and models.
OIBench: Background – Deep Analysis of Benchmark Limitations
Despite impressive claims of “competition‑level” performance by models such as GPT‑4o, AlphaCode, and Claude 3.5, their success rates drop dramatically on high‑difficulty informatics‑Olympiad problems, revealing a large gap between advertised and actual capabilities.
Benchmark saturation and low discriminative power: traditional suites (HumanEval, MBPP) now see >90% pass rates and can no longer separate top models.
Data-leakage risk: high-difficulty problems may already appear in pre-training data, inflating scores.
Limited human-machine comparison: Elo-based scoring suffers from long evaluation cycles and poor reproducibility.
Coarse efficiency metrics: runtime and memory are often aggregated, hiding per-task differences.
Construction and Innovation of OIBench
OIBench contains 212 private, high-difficulty informatics-Olympiad problems (IOI level) sourced from ACM-ICPC teams and university coaches. Four strict criteria ensure quality:
Originality & privacy: all problems are verified to be absent from public platforms.
Difficulty grading: problems are labeled with contest difficulty and admitted only if at most one of four strong LLMs can solve them.
Robust test cases & reference solutions: each problem includes extensive test suites and a verified C++ reference implementation.
Bilingual support: both Chinese and English versions are provided.
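The difficulty-grading admission rule can be sketched as a simple filter. The model names and the `solves` predicate below are hypothetical placeholders for illustration, not the actual OIBench pipeline:

```python
from typing import Callable

# Hypothetical stand-ins for the four strong reference LLMs
# used in the admission check (the real model list is not
# specified here).
REFERENCE_MODELS = ["model_a", "model_b", "model_c", "model_d"]

def is_admissible(problem_id: str,
                  solves: Callable[[str, str], bool]) -> bool:
    """Admit a problem only if at most one reference model solves it.

    `solves(model, problem_id)` is an assumed predicate that runs the
    model on the problem and checks its solution against the tests.
    """
    solved = sum(1 for m in REFERENCE_MODELS if solves(m, problem_id))
    return solved <= 1
```

Counting solvers rather than requiring zero keeps some headroom: a problem one frontier model can crack is still hard enough to discriminate among the rest.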
The dataset is hosted on HuggingFace and GitHub, and has been used to evaluate 18 mainstream LLMs in C++, Python, Java, and JavaScript.
OIBench Evaluation Results
Reasoning models excel: models such as o4-mini-high achieve an average score of 21.4%, versus ~3.6% for instruction-tuned models.
Closed-source advantage: closed-source models average 14.5% versus 6.3% for open-source ones.
Base model determines the ceiling: performance correlates strongly with the underlying pre-trained base model.
DeepSeek-V3-0324 stands out among non-reasoning models, likely due to chain-of-thought distillation.
Language bias: JavaScript and Python scores are ~10% lower than C++/Java; the Chinese and English versions perform similarly.
Providing pseudocode hints improves all models, especially strong reasoning models, indicating that the benchmark isolates reasoning difficulty from code synthesis.
Analysis of token consumption shows o4‑mini‑high achieves the best inference efficiency.
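One way to make such an efficiency comparison concrete is accuracy earned per thousand generated tokens. This is a hypothetical metric for illustration, not necessarily the formula used in the OIBench analysis:

```python
def efficiency(pass_rate: float, avg_tokens: float) -> float:
    """Pass rate (0-1) earned per 1,000 output tokens.

    A model that scores 20% while emitting 2,000 tokens per problem
    is, under this metric, as efficient as one scoring 10% with
    1,000 tokens.
    """
    return pass_rate / (avg_tokens / 1000.0)
```

Normalizing by token count matters because reasoning models trade longer chains of thought for accuracy; a per-token view shows which models buy that accuracy cheaply.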
Human vs. Model Comparison
By inviting ACM-ICPC team members from top Chinese universities to solve a subset of OIBench problems, a more reproducible human-machine comparison was obtained. Rankings show that the best reasoning model surpasses roughly 42% of the human participants.
OIBench Summary and Outlook
OIBench demonstrates a substantial gap between advertised and actual LLM performance on high-difficulty algorithmic tasks and provides a new, highly discriminative benchmark for future research.
CoreCodeBench: Background – Challenges of Engineering‑Level Code Evaluation
Existing engineering benchmarks (FullStackBench, SWE-bench) focus on single-function generation and lack coverage of multi-file collaboration, bug fixing, test-driven development, and multi-function coordination.
Construction of CoreCodeBench
CoreCodeBench uses an automated pipeline called CorePipe to extract core code from high‑activity, well‑tested open‑source projects on GitHub. The pipeline:
Selects real projects with high activity and test coverage.
Identifies core functions via dynamic/static tracing and AST analysis.
Simulates three realistic scenarios: Development, BugFix, and Test-Driven Development (TDD).
Generates multi‑function tasks by composing functions according to call graphs.
Each task includes comprehensive test suites and reference implementations. The dataset is bilingual and hosted on AGI‑Eval.
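As a rough illustration of the core-function identification and call-graph steps above, here is a minimal static call-graph extractor built on Python's `ast` module. This is an assumption about what such a step could look like, not a reproduction of CorePipe's implementation:

```python
import ast
from collections import defaultdict

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each function defined in `source` to the names it calls.

    Only direct calls to plain names (`g(x)`, not `obj.m(x)`) are
    captured; a production pipeline would also resolve methods,
    imports, and cross-file references.
    """
    tree = ast.parse(source)
    graph: dict[str, set[str]] = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for call in ast.walk(node):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    graph[node.name].add(call.func.id)
    return dict(graph)
```

Composing tasks along these edges is what turns isolated functions into multi-function problems: a task built from `f` naturally pulls in every function `f` depends on.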
CoreCodeBench Evaluation Results
Newer models (Claude 3.7, o4‑mini‑high) show clear progress, yet all models struggle with BugFix tasks, especially on single‑function bugs.
Multi‑function tasks expose a major bottleneck: models perform significantly worse than on single‑function tasks.
Models tend to follow the order of functions in the prompt rather than planning based on dependency, indicating limited planning ability.
Information-gain scoring and expert review ensure high quality (78.55% pass rate). Radar charts show that no single model dominates all six core scenarios (the three task types in both single- and multi-function settings), confirming the benchmark's comprehensive coverage.
CoreCodeBench Summary and Future Directions
CoreCodeBench uncovers three universal shortcomings of current LLMs: difficulty fixing logical bugs, weak multi‑function coordination, and lack of flexible planning. Continued iteration of the benchmark aims to push LLMs toward true “virtual software engineers.”
One More Thing – From LLMs to Code Agents: Shifting the Evaluation Paradigm
Existing Code‑Agent benchmarks (e.g., SWE‑bench) exclude human developers, limiting insight into real‑world collaboration. The Meituan‑M17 team proposes a human‑LLM collaborative programming competition that records intent understanding, clarification effectiveness, interaction rounds, decision efficiency, and final task quality.
The competition will produce the first human‑machine collaboration leaderboard for Code Agents, offering deep insights for the next generation of development tools.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.