Introducing FullStack Bench: Multi‑Language Code LLM Benchmark & SandboxFusion
The article presents FullStack Bench, a newly open‑sourced multi‑language code‑LLM evaluation dataset spanning more than 11 real‑world programming scenarios and 16 programming languages, together with the SandboxFusion execution environment, and reports benchmark results showing that closed‑source models generally outperform their open‑source counterparts.
FullStack Bench Overview
On December 5, ByteDance's Doubao large‑model team released FullStack Bench, an open‑source code evaluation dataset focused on full‑stack programming and multilingual code generation. The benchmark contains 3,374 problems across more than 11 real‑world application domains and 16 programming languages, providing problem statements, reference solutions, unit tests, and tags.
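For readers who want to poke at the data, here is a minimal sketch of loading the benchmark from Hugging Face; the split name and field names are assumptions, so check the dataset card linked below for the actual schema.

```python
# Minimal sketch: load FullStack Bench from Hugging Face and inspect one problem.
# The split name and field names are assumptions; see the dataset card at
# https://huggingface.co/datasets/ByteDance/FullStackBench for the real schema.
from datasets import load_dataset

bench = load_dataset("ByteDance/FullStackBench", split="train")
example = bench[0]
print(example)  # expected keys: problem statement, reference solution, unit tests, tags
```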
The dataset fills a gap in existing code‑LLM benchmarks, which typically cover only basic or advanced programming tasks in a limited set of languages. FullStack Bench enables a more realistic assessment of large language models (LLMs) in diverse development scenarios such as front‑end, back‑end, and machine‑learning tasks.
SandboxFusion Execution Engine
To support the multi‑language evaluation needs of FullStack Bench, the team also open‑sourced SandboxFusion, a sandbox execution tool that supports 23 common programming languages. It provides dataset modules and a sandbox module that safely runs generated code, captures output, and computes pass‑rate metrics.
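As an illustration of how generated code might be submitted for execution, the sketch below posts a snippet to a locally running SandboxFusion service; the endpoint path, port, and payload fields are assumptions rather than confirmed API details, so verify them against the project documentation.

```python
# Illustrative sketch only: the endpoint path, port, and payload fields below
# are assumptions about SandboxFusion's HTTP API, not confirmed details.
import requests

payload = {
    "code": "print(sum(range(10)))",  # model-generated code under test
    "language": "python",             # one of the 23 supported languages
}
resp = requests.post("http://localhost:8080/run_code", json=payload, timeout=30)
print(resp.json())  # expected to carry compile/run status and captured output
```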
The evaluation workflow includes prompt generation, model inference (performed externally), code extraction, test synthesis, code execution, result judgment, and metric calculation.
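The metric-calculation step in pipelines like this commonly reduces to a pass rate over unit tests. Below is the standard unbiased pass@k estimator from the HumanEval line of work; whether FullStack Bench computes its pass rates exactly this way is an assumption.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021). n = samples generated
# per problem, c = samples that passed all unit tests. Whether FullStack Bench
# uses this exact estimator is an assumption; the paper reports pass-rate metrics.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples (out of n) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3
```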
Benchmark Results
Using FullStack Bench, the researchers evaluated over twenty code‑LLMs, including open‑source models such as Qwen2.5‑Coder, DeepSeek‑Coder‑v2, and CodeLlama, as well as closed‑source models like GPT‑4o, OpenAI‑o1, and Doubao‑Coder‑Preview. Results show that closed‑source models generally outperform open‑source ones, especially on difficult tasks and in the mathematical programming domain.
Performance varies across languages; models excel in Bash but show large gaps in C++, C, and Ruby, indicating language‑specific training biases. The sandbox’s compiler feedback reveals a positive correlation between compilation success and test pass rates.
Impact of Feedback Strategies
Experiments comparing a “Reflection” strategy (iterative refinement driven by SandboxFusion execution feedback) against a “BoN” (Best‑of‑N) strategy (sampling multiple candidates and keeping the best, without execution feedback) demonstrate that feedback‑driven refinement significantly improves model accuracy. A sketch of such a reflection loop follows.
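To make the Reflection strategy concrete, here is a hypothetical sketch of such a loop; `generate` and `run_in_sandbox` are placeholder stubs standing in for model inference and SandboxFusion execution, not APIs from the paper.

```python
# Hypothetical sketch of a Reflection-style loop: regenerate code using sandbox
# feedback until the unit tests pass or the attempt budget runs out.
from dataclasses import dataclass

@dataclass
class Feedback:
    passed: bool  # did the code compile, run, and pass all unit tests?
    log: str      # compiler/runtime/test output to feed back to the model

def generate(prompt: str) -> str:
    """Placeholder for model inference; returns candidate code."""
    raise NotImplementedError

def run_in_sandbox(code: str, tests: str) -> Feedback:
    """Placeholder for SandboxFusion execution of code against unit tests."""
    raise NotImplementedError

def reflect_and_refine(problem: str, tests: str, max_rounds: int = 3) -> str:
    code = generate(problem)  # initial attempt
    for _ in range(max_rounds):
        feedback = run_in_sandbox(code, tests)
        if feedback.passed:
            break
        # Fold the sandbox's error output back into the prompt and retry.
        code = generate(
            f"{problem}\n\nPrevious attempt:\n{code}\n\n"
            f"Sandbox feedback:\n{feedback.log}\n\nPlease fix the code."
        )
    return code
```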
Resources
Paper: https://arxiv.org/abs/2412.00535
Dataset: https://huggingface.co/datasets/ByteDance/FullStackBench
SandboxFusion code: https://github.com/bytedance/SandboxFusion
Sandbox playground: https://bytedance.github.io/SandboxFusion/playground/datasets
FullStack Bench and SandboxFusion aim to provide a fast, comprehensive evaluation reference for AI performance in real programming scenarios, thereby accelerating the development of code‑focused large language models.