Introducing FullStack Bench: Multi‑Language Code LLM Benchmark & SandboxFusion
The article presents FullStack Bench, a newly open‑sourced multi‑language code‑LLM evaluation dataset spanning more than 11 real‑world programming scenarios and 16 programming languages, together with the SandboxFusion execution environment, and reports benchmark results showing that closed‑source models generally outperform their open‑source counterparts.
FullStack Bench Overview
On December 5, ByteDance's Doubao large‑model team released FullStack Bench, an open‑source code evaluation dataset focused on full‑stack programming and multilingual code generation. The benchmark contains 3,374 problems across more than 11 real‑world application domains and 16 programming languages, providing problem statements, reference solutions, unit tests, and tags.
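For readers who want to poke at the data, here is a minimal sketch of loading the benchmark from Hugging Face; the split name and field names are assumptions, so check the dataset card linked below for the actual schema.

```python
# Minimal sketch: load FullStack Bench from Hugging Face and inspect one problem.
# The split name and field names are assumptions; see the dataset card at
# https://huggingface.co/datasets/ByteDance/FullStackBench for the real schema.
from datasets import load_dataset

bench = load_dataset("ByteDance/FullStackBench", split="train")
example = bench[0]
print(example)  # expected keys: problem statement, reference solution, unit tests, tags
```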
The dataset fills a gap in existing code‑LLM benchmarks, which typically cover only basic or advanced programming tasks in a limited set of languages. FullStack Bench enables a more realistic assessment of large language models (LLMs) in diverse development scenarios such as front‑end, back‑end, and machine‑learning tasks.
SandboxFusion Execution Engine
To support the multi‑language evaluation needs of FullStack Bench, the team also open‑sourced SandboxFusion, a sandbox execution tool that supports 23 common programming languages. It provides dataset modules and a sandbox module that safely runs generated code, captures output, and computes pass‑rate metrics.
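As an illustration of how generated code might be submitted for execution, the sketch below posts a snippet to a locally running SandboxFusion service; the endpoint path, port, and payload fields are assumptions rather than confirmed API details, so verify them against the project documentation.

```python
# Illustrative sketch only: the endpoint path, port, and payload fields below
# are assumptions about SandboxFusion's HTTP API, not confirmed details.
import requests

payload = {
    "code": "print(sum(range(10)))",  # model-generated code under test
    "language": "python",             # one of the 23 supported languages
}
resp = requests.post("http://localhost:8080/run_code", json=payload, timeout=30)
print(resp.json())  # expected to carry compile/run status and captured output
```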
The evaluation workflow includes prompt generation, model inference (performed externally), code extraction, test synthesis, code execution, result judgment, and metric calculation.
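The metric-calculation step in pipelines like this commonly reduces to a pass rate over unit tests. Below is the standard unbiased pass@k estimator from the HumanEval line of work; whether FullStack Bench computes its pass rates exactly this way is an assumption.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021). n = samples generated
# per problem, c = samples that passed all unit tests. Whether FullStack Bench
# uses this exact estimator is an assumption; the paper reports pass-rate metrics.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples (out of n) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3
```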
Benchmark Results
Using FullStack Bench, the researchers evaluated over twenty code‑LLMs, including open‑source models such as Qwen2.5‑Coder, DeepSeek‑Coder‑v2, and CodeLlama, as well as closed‑source models like GPT‑4o, OpenAI‑o1, and Doubao‑Coder‑Preview. Results show that closed‑source models generally outperform open‑source ones, especially on difficult tasks and in the mathematical programming domain.
Performance varies across languages; models excel in Bash but show large gaps in C++, C, and Ruby, indicating language‑specific training biases. The sandbox’s compiler feedback reveals a positive correlation between compilation success and test pass rates.
Impact of Feedback Strategies
Experiments comparing a “Reflection” strategy (iterative refinement driven by SandboxFusion execution feedback) against a “BoN” (Best‑of‑N) strategy (sampling multiple candidates and keeping the best, without execution feedback) demonstrate that feedback‑driven refinement significantly improves model accuracy. A sketch of such a reflection loop follows.
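To make the Reflection strategy concrete, here is a hypothetical sketch of such a loop; `generate` and `run_in_sandbox` are placeholder stubs standing in for model inference and SandboxFusion execution, not APIs from the paper.

```python
# Hypothetical sketch of a Reflection-style loop: regenerate code using sandbox
# feedback until the unit tests pass or the attempt budget runs out.
from dataclasses import dataclass

@dataclass
class Feedback:
    passed: bool  # did the code compile, run, and pass all unit tests?
    log: str      # compiler/runtime/test output to feed back to the model

def generate(prompt: str) -> str:
    """Placeholder for model inference; returns candidate code."""
    raise NotImplementedError

def run_in_sandbox(code: str, tests: str) -> Feedback:
    """Placeholder for SandboxFusion execution of code against unit tests."""
    raise NotImplementedError

def reflect_and_refine(problem: str, tests: str, max_rounds: int = 3) -> str:
    code = generate(problem)  # initial attempt
    for _ in range(max_rounds):
        feedback = run_in_sandbox(code, tests)
        if feedback.passed:
            break
        # Fold the sandbox's error output back into the prompt and retry.
        code = generate(
            f"{problem}\n\nPrevious attempt:\n{code}\n\n"
            f"Sandbox feedback:\n{feedback.log}\n\nPlease fix the code."
        )
    return code
```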
Resources
Paper: https://arxiv.org/abs/2412.00535
Dataset: https://huggingface.co/datasets/ByteDance/FullStackBench
SandboxFusion code: https://github.com/bytedance/SandboxFusion
Sandbox playground: https://bytedance.github.io/SandboxFusion/playground/datasets
FullStack Bench and SandboxFusion aim to provide a fast, comprehensive evaluation reference for AI performance in real programming scenarios, thereby accelerating the development of code‑focused large language models.