Introducing Multi‑SWE‑bench: The First Multilingual Code‑Fix Benchmark for LLMs

ByteDance’s Doubao model team has open‑sourced Multi‑SWE‑bench, a multilingual benchmark covering seven major programming languages with 1,632 real‑world bug‑fix tasks, complete Docker environments, difficulty grading, and strict human validation, aiming to evaluate and advance large‑language‑model code‑repair capabilities beyond Python.


Multi‑SWE‑bench Overview

ByteDance’s Doubao model team has open‑sourced Multi‑SWE‑bench, the first multilingual benchmark for software‑engineering (SWE) tasks. It evaluates large language models’ ability to automatically locate and fix bugs across seven mainstream programming languages.

Motivation

Existing benchmarks such as SWE‑bench focus solely on Python, which limits assessment of a model’s cross‑language generalisation. As LLMs are increasingly used to solve real GitHub issues, a broader, more challenging dataset is needed.

Key Features

Language coverage: Java, Go, Rust, C, C++, TypeScript, JavaScript.

Scale: 1,632 real‑world bug‑fix tasks sourced from GitHub issues.

Difficulty grading: Tasks are classified as Easy, Medium, or Hard, ranging from single‑line patches to multi‑file, multi‑step fixes.

Executable environments: Each task includes a reproducible Docker container that mirrors the original project’s build and test setup.

Strict human validation: Double‑blind annotation by 68 professional reviewers, followed by internal QA, ensures high data quality.
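The difficulty grading above can be pictured as a simple classifier over a patch's footprint. This is a hypothetical sketch only: the actual Multi‑SWE‑bench grading criteria are not spelled out in this summary, and the thresholds below are illustrative, chosen to echo the "single‑line patch" to "multi‑file, multi‑step fix" range.

```python
# Hypothetical difficulty grader based on patch footprint.
# The real Multi-SWE-bench criteria may differ; thresholds are illustrative.
def grade(files_changed: int, lines_changed: int) -> str:
    """Map a bug-fix patch to Easy / Medium / Hard by its size."""
    if files_changed == 1 and lines_changed <= 5:
        return "Easy"       # e.g. a single-line or few-line patch
    if files_changed <= 3 and lines_changed <= 50:
        return "Medium"     # a contained, single-module fix
    return "Hard"           # multi-file, multi-step repair

print(grade(1, 2), grade(2, 30), grade(6, 400))  # Easy Medium Hard
```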

Data Construction Pipeline

Repository selection: Open‑source projects with >500 stars, active maintenance, CI/CD support, and reproducible build processes are chosen.

Pull‑request crawling: PRs linked to issues, containing test changes, and merged into the main branch are collected.

Docker environment creation: Dependencies are extracted to generate Dockerfiles; failing builds are manually fixed.

PR filtering and dataset generation: Three test phases (original, test‑only, test + fix) are run to verify that patches turn failing tests into passing ones.

Human verification: Two independent annotators label each sample, with cross‑review and final QA checks.
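The three‑phase filtering step above boils down to one invariant: a pull request becomes a benchmark instance only if its new tests fail without the fix patch and pass with it. A minimal sketch of that acceptance check, with illustrative field names (the real pipeline runs these phases inside the per‑task Docker containers):

```python
# Sketch of the three-phase PR filter described above. Field names are
# illustrative; in the real pipeline each flag comes from a test run
# inside the task's Docker environment.
from dataclasses import dataclass

@dataclass
class PullRequest:
    passes_original: bool      # phase 1: repo tests pass before any patch
    passes_test_only: bool     # phase 2: new tests applied, fix absent
    passes_test_and_fix: bool  # phase 3: new tests + fix patch applied

def is_valid_instance(pr: PullRequest) -> bool:
    """Keep a PR only if the fix flips the new tests from fail to pass."""
    return (pr.passes_original
            and not pr.passes_test_only
            and pr.passes_test_and_fix)

good = PullRequest(True, False, True)   # well-formed benchmark instance
bad = PullRequest(True, True, True)     # new tests pass even without the fix
print(is_valid_instance(good), is_valid_instance(bad))  # True False
```

The middle check is the important one: if the new tests already pass without the fix, the patch cannot be credited for the repair, so the sample is discarded.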

Findings

Experiments show that while many LLMs achieve high repair rates on Python, their success on other languages drops below 10%. Performance further declines as task difficulty increases, highlighting multilingual code repair as a critical bottleneck.

Multi‑SWE‑RL and Community Involvement

To foster reinforcement‑learning research for code, the team also released Multi‑SWE‑RL, providing standardized RL training data and Docker environments. Over 4,700 instances are available, and the project invites contributions through detailed tutorials and incentive mechanisms.

Resources

Paper: https://arxiv.org/abs/2504.02605

Dataset: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench

Code: https://github.com/multi-swe-bench/multi-swe-bench

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: software engineering, reinforcement learning, dataset, multilingual, code repair, LLM benchmark
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
