How ORGEval Revealed DeepSeek‑V3’s Surprising Modeling Strength
The paper introduces ORGEval, a graph‑theoretic evaluation framework that replaces costly solvers with bipartite‑graph isomorphism checks, proves a sufficient condition for WL‑test correctness, and shows on the Bench4Opt benchmark that DeepSeek‑V3 outperforms leading inference models in speed, consistency, and overall modeling accuracy.
01 Traditional Evaluation Pitfalls
When using large language models (LLMs) to generate optimization models, the classic workflow is to feed the generated model into a solver and compare the optimal value with a reference answer. This approach suffers from three major failure modes:
Accidental correctness: two different models may happen to produce the same optimal value under a specific parameter set, masking errors.
Infeasible instances: some parameter configurations render the problem unsolvable, causing the evaluation pipeline to stall.
Solver bottleneck: mixed‑integer linear programming (MILP) problems can require hours or days to solve, making rapid feedback loops such as RLHF impossible.
In short, the old method "let two models take an exam and compare answers" is fundamentally unfair.
02 ORGEval’s Dimensionality Reduction
ORGEval proposes a counter‑intuitive shift: instead of running a solver, compare the generated model directly with the reference model by checking how similar their structures look.
The paper represents each instantiated optimization model as a bipartite graph where variables and constraints form the two node sets and edges encode coefficient relationships.
Model equivalence thus becomes a pure graph‑theoretic problem: are the two graphs isomorphic? Lemma C.1 formally establishes that two models are equivalent if and only if their corresponding bipartite graphs are isomorphic, regardless of variable names or constraint ordering.
This reduction yields four immediate benefits:
No need to run a solver.
Results are immune to parameter tricks.
Infeasible problems can still be compared.
Performance is independent of problem size.
03 Graph Isomorphism Is Hard
Unfortunately, graph isomorphism is a classic computationally hard problem with no known polynomial‑time algorithm in the general case.
A widely used heuristic is the Weisfeiler‑Lehman test (WL‑test), which repeatedly recolors nodes based on neighbor colors. If after several rounds the color distributions differ, the graphs are certainly non‑isomorphic.
The WL‑test is fast but can produce false positives: two non‑isomorphic graphs may be deemed isomorphic, which is unacceptable for rigorous evaluation.
04 Core Contributions
The authors prove a sufficient condition under which the WL‑test is guaranteed to be 100 % accurate. When a graph belongs to the class of Symmetric Decomposable (SD) graphs, the WL‑test never yields a false positive.
Definition 3.1 (SD graphs) provides a strict formal definition of this class.
Theorem 3.1 shows that for SD graphs, identical WL‑test color distributions are equivalent to model equivalence.
Based on this insight the paper introduces two practical components:
An automatic detection algorithm that checks whether a given optimization model’s graph satisfies the SD condition.
A pipeline that applies the standard WL‑test only to SD graphs, guaranteeing correct isomorphism decisions.
By restricting the WL‑test to its safe region, the method eliminates the original WL‑test’s unreliability while preserving its speed.
05 Bench4Opt Benchmark
Speed: orders of magnitude faster
The authors built the Bench4Opt dataset (model‑data separation) from MIPLIB and handcrafted instances, then evaluated eight mainstream LLMs.
Easy : solver ≈ 1 hour, ORGEval = 0.21 s.
Hard : solver > 1 hour, ORGEval = 3.83 s.
Open : solver fails to finish, ORGEval = 32.07 s.
This speedup not only accelerates evaluation but also makes previously infeasible instances tractable.
Consistency: 100 % vs 35.62 %
Each model was tested with five random parameter sets. Solver feasibility consistency averaged only 35.62 %, meaning more than 60 % of cases disagreed on feasibility, whereas ORGEval achieved 100 % consistency.
Surprising Findings
When the new “ruler” was applied, the benchmark revealed two key observations:
Optimization modeling remains challenging for all LLMs; even the strongest model scored just above 50 %.
Inference‑optimized models (o1, o3, DeepSeek‑R1) performed worse than their base counterparts, with gaps up to 7.1 % (R1 vs V3).
Inference models’ multi‑step reasoning tends to amplify early hallucinations: a mistaken constraint in the first step propagates and magnifies through subsequent steps.
06 Takeaways
The work demonstrates that the evaluation metric itself can be the bottleneck; by redesigning the metric with ORGEval’s graph‑theoretic approach, the authors expose hidden weaknesses in popular LLMs and provide a fast, reliable, and scalable benchmark for optimization modeling.
Paper URL: https://arxiv.org/pdf/2510.27610
Authors’ affiliations: The Chinese University of Hong Kong (Shenzhen), Shenzhen Big Data Research Institute, Shenzhen River‑Loop College, Shenzhen International Industrial & Applied Mathematics Center
Conference: ICML 2026 CTB Workshop https://sites.google.com/view/icml-ctb/homeSigned-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
