Securing LLM Code Interpreter: Sandbox Strategies and Real‑World Pitfalls

This article examines why RAG systems need a Code Interpreter, explains the dangers of executing LLM‑generated code with exec(), and presents three sandbox designs—restricted exec, Docker containers, and E2B cloud sandboxes—along with whitelist/blacklist rules, an eight‑step execution flow, and practical lessons learned from production deployment.


Why RAG Systems Need a Code Interpreter

Standard RAG (Retrieval‑Augmented Generation) can answer knowledge queries but cannot perform calculations, process structured data, or generate charts. When a user uploads an Excel file with 100 rows of financial data, the system must read the file, generate analysis code, execute it, and return visual results. The Code Interpreter enables this transition from pure knowledge retrieval to data‑driven computation.

Direct exec() Is Extremely Dangerous

Using exec() to run LLM‑generated Python code without protection opens three real attack vectors:

Deletion of server files (e.g., os.system("rm -rf /data/production")).

Reading sensitive system files (e.g., opening /etc/passwd).

Exfiltrating data over the network (e.g., sending a DataFrame as JSON via requests.post).

These threats are especially critical in banking contexts where data leakage incurs compliance penalties.
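
The vulnerable baseline looks like this. A minimal sketch, in which llm_generated_code is a hypothetical stand‑in for the model's raw output:

# The naive pattern this section warns against: nothing constrains
# what the generated code may do to the host.
llm_generated_code = "df.describe()"  # raw Python string returned by the LLM (hypothetical)
exec(llm_generated_code)              # runs with the full privileges of the server process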

Three Sandbox Design Options

We evaluated three sandbox solutions, in order of increasing security, cost, and complexity.

Option 1: Restricted Python exec Environment (Low Cost, Medium Security)

Limit exec() by providing a controlled globals dictionary that only exposes safe built‑ins and pre‑loaded modules. Perform a static scan for dangerous patterns before execution.

import builtins
import importlib

ALLOWED_MODULES = {'pandas', 'numpy', 'matplotlib', 'json', 'math', 'statistics'}
FORBIDDEN_BUILTINS = {'exec', 'eval', 'open', '__import__', 'compile'}

class SecurityError(Exception):
    pass

def safe_exec(code: str, user_data: dict = None) -> dict:
    # Expose only built-ins that are not blacklisted, plus pre-imported modules.
    # vars(builtins) is used because __builtins__ may be a module, not a dict.
    safe_globals = {
        '__builtins__': {k: v for k, v in vars(builtins).items()
                         if k not in FORBIDDEN_BUILTINS},
        'pd': importlib.import_module('pandas'),
        'np': importlib.import_module('numpy'),
        'plt': importlib.import_module('matplotlib.pyplot'),
    }
    if user_data:
        safe_globals.update(user_data)
    # Quick substring scan for obviously dangerous constructs before executing.
    forbidden_patterns = ['import os', 'import sys', 'subprocess',
                          '__class__', '__bases__', 'open(']
    for pattern in forbidden_patterns:
        if pattern in code:
            raise SecurityError(f"Forbidden usage: {pattern}")
    local_vars = {}
    exec(code, safe_globals, local_vars)
    return local_vars

This approach is suitable for internal tools but can be bypassed via Python’s object‑inheritance tricks (e.g., [].__class__.__bases__[0].__subclasses__()).
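
To make the escape concrete, here is the first step of such a chain; even with __builtins__ stripped, introspection on a plain tuple reaches every class loaded in the interpreter:

# From an empty tuple, walk up to `object` and enumerate all of its
# subclasses -- the starting point of most restricted-exec escape payloads.
payload = "print(().__class__.__bases__[0].__subclasses__())"
# safe_exec(payload) is blocked only because the substring scan matches
# '__class__'; without that pattern, the probe runs unhindered.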

Option 2: Docker Container Isolation (High Security, Startup Latency)

Each code execution runs inside a fresh Docker container with network disabled, read‑only filesystem, memory and CPU limits, and automatic removal after completion.

import docker, tempfile, os

def execute_in_docker(code: str, user_data_path: str) -> dict:
    client = docker.from_env()
    with tempfile.TemporaryDirectory() as tmpdir:
        code_file = os.path.join(tmpdir, 'script.py')
        with open(code_file, 'w') as f:
            f.write(code)
        container = client.containers.run(
            image='python-sandbox:latest',
            command='python /sandbox/script.py',
            volumes={
                tmpdir: {'bind': '/sandbox', 'mode': 'ro'},
                user_data_path: {'bind': '/data', 'mode': 'ro'},
            },
            network_disabled=True,       # no network egress
            mem_limit='256m',            # memory cap
            cpu_quota=50000,             # roughly half a CPU core
            read_only=True,              # read-only root filesystem
            tmpfs={'/tmp': 'size=64m'},  # small writable scratch space
            detach=True,                 # detach so we can enforce a timeout
        )
        try:
            container.wait(timeout=10)   # hard wall-clock limit
            output = container.logs().decode('utf-8')
        finally:
            container.remove(force=True)
        return parse_output(output)      # parse_output defined elsewhere

The trade‑off is a 0.5‑2 s container startup delay, mitigated by maintaining a warm container pool, as sketched below.
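
A minimal sketch of such a pool, assuming the same python-sandbox:latest image and single‑use containers (the class name and sizing are illustrative):

import queue
import docker

class WarmContainerPool:
    # Pre-start idle sandbox containers so requests skip the cold start.
    def __init__(self, size=4, image='python-sandbox:latest'):
        self.client = docker.from_env()
        self.image = image
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(self._spawn())

    def _spawn(self):
        # Keep the container alive and idle until a script is assigned.
        return self.client.containers.run(
            self.image, command='sleep infinity',
            network_disabled=True, mem_limit='256m',
            read_only=True, tmpfs={'/tmp': 'size=64m'},
            detach=True,
        )

    def acquire(self):
        return self.pool.get()   # blocks if the pool is exhausted

    def release(self, container):
        # Containers are single-use: destroy and replace with a fresh one.
        container.remove(force=True)
        self.pool.put(self._spawn())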

Option 3: E2B Cloud Sandbox (Recommended for Production)

E2B (e2b.dev) offers a hosted micro‑VM sandbox with high isolation, fast (≈100 ms) startup, file upload/download, and chart output support.

from e2b_code_interpreter import Sandbox

def execute_with_e2b(code: str, csv_data: str) -> dict:
    # Each call gets a fresh micro-VM; the context manager tears it down.
    with Sandbox() as sandbox:
        sandbox.files.write('/home/user/data.csv', csv_data)  # expose data to the code
        execution = sandbox.run_code(code)
        if execution.error:
            return {'success': False, 'error': execution.error.value}
        # Inline charts come back as base64-encoded PNGs.
        charts = [result.png for result in execution.results if result.png]
        return {'success': True, 'output': execution.text, 'charts': charts}

E2B removes the operational burden of maintaining sandbox infrastructure but incurs per‑execution costs.
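
A typical call site might look like this (the file name and generated code are illustrative):

csv_text = open('sales.csv', encoding='utf-8').read()   # hypothetical user upload
generated = 'import pandas as pd\ndf = pd.read_csv("/home/user/data.csv")\nprint(df.describe())'
result = execute_with_e2b(generated, csv_text)
if result['success']:
    print(result['output'])                             # text output
    print(f"{len(result['charts'])} chart(s) returned")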

[Figure: code sandbox whitelist/blacklist design]

Systematic Whitelist/Blacklist Design

We adopt a least‑privilege principle: only modules required for data analysis are allowed; everything else is denied.

Allowed modules (whitelist):

ALLOWED_MODULES = {
    'pandas',        # DataFrames and basic stats
    'numpy',         # Numerical computation
    'matplotlib',    # Plotting
    'seaborn',       # Advanced visualisation
    'json',          # JSON handling
    'math',          # Basic math functions
    'statistics',    # Mean, variance, etc.
    'datetime',      # Date‑time handling
    'collections',   # Counter, defaultdict
}

Forbidden modules and built‑ins (blacklist):

FORBIDDEN_MODULES = {
    'os', 'sys', 'subprocess', 'socket', 'requests', 'urllib', 'http',
    'ftplib', 'smtplib', 'pickle', 'shelve', 'ctypes', 'cffi',
    'importlib', 'builtins'
}
FORBIDDEN_BUILTINS = {
    'exec', 'eval', 'compile', 'open', '__import__', 'vars',
    'globals', 'locals', 'dir', 'getattr', 'setattr', 'delattr'
}
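
To enforce the whitelist at runtime as well as statically, the sandbox can swap in a guarded __import__. A minimal sketch, assuming the safe_globals dictionary from Option 1:

import builtins

_real_import = builtins.__import__

def guarded_import(name, *args, **kwargs):
    # Veto every import as it happens, complementing the static scans below.
    if name.split('.')[0] not in ALLOWED_MODULES:
        raise ImportError(f"Module not allowed in sandbox: {name}")
    return _real_import(name, *args, **kwargs)

# Installed in place of the real __import__ inside the sandbox globals:
# safe_globals['__builtins__']['__import__'] = guarded_import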

Static regex scanning complements the module list to catch patterns such as __subclasses__, obfuscation tricks like chr(111)+chr(115) (which assembles the string "os"), and direct exec calls.

import re
DANGEROUS_PATTERNS = [
    r'__class__', r'__bases__', r'__subclasses__', r'__globals__',
    r'__builtins__', r'__code__', r'__dict__', r'chr\(\s*\d+\s*\)',
    r'\\x[0-9a-fA-F]{2}', r'exec\s*\(', r'eval\s*\(', r'open\s*\(',
    r'import\s+os', r'import\s+sys', r'import\s+subprocess'
]

def static_security_check(code: str) -> tuple[bool, str]:
    for pattern in DANGEROUS_PATTERNS:
        if re.search(pattern, code):
            return False, f"Detected dangerous pattern: {pattern}"
    return True, ""

Eight‑Step Secure Execution Flow

We combine all safeguards into a pipeline.

1. Receive request and data file: load the Excel file into a pandas.DataFrame, truncating to a maximum of 10,000 rows.

2. LLM generates code with a constrained prompt that enforces module and time limits.

3. Static security check using the regex list above; reject unsafe code.

4. Enter sandbox: choose the appropriate sandbox (restricted exec for dev, Docker/E2B for prod).

5. Timeout monitoring with signal (Unix) or threading to abort long‑running code (see the sketch after this list).

6. Capture output and charts by redirecting stdout and reading any generated PNG files.

7. Cleanup: delete temporary files and, for Docker, destroy the container.

8. Return results as text plus base64‑encoded images for the front‑end.
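
A minimal sketch of the step‑5 timeout guard, assuming a Unix host (the function and exception names are illustrative):

import signal

class ExecutionTimeout(Exception):
    pass

def _on_timeout(signum, frame):
    raise ExecutionTimeout("code execution exceeded the time limit")

def run_with_timeout(func, seconds=10):
    # SIGALRM interrupts func after `seconds`; Unix-only, main thread only.
    old_handler = signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)                              # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)   # restore previous handler

# Example: run_with_timeout(lambda: safe_exec(code), seconds=10)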

[Figure: Code Interpreter secure execution eight-step process]

Practical Pitfalls and Solutions

Matplotlib headless error: set a non‑interactive backend before importing pyplot.

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

Large data causing timeout: truncate or sample the DataFrame to a safe row limit and embed a reminder in the prompt.

# First layer: limit rows
MAX_ROWS = 10000
if len(df) > MAX_ROWS:
    df = df.sample(n=MAX_ROWS, random_state=42)
# Second layer: prompt reminder
prompt_suffix = f"Note: dataset has {len(df)} rows. For heavy algorithms, sample to ≤1000 rows to stay within 10 s."

Low‑resolution charts: increase DPI and set a Chinese‑compatible font.

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.dpi'] = 150
plt.rcParams['figure.figsize'] = (10, 6)

Syntax errors in generated code: run ast.parse() before execution; also use AST to verify imports against the whitelist.

import ast
def syntax_check(code: str) -> tuple[bool, str]:
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as e:
        return False, f"Syntax error: {e}"

def check_imports(code: str) -> tuple[bool, str]:
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split('.')[0] not in ALLOWED_MODULES:
                    return False, f"Forbidden import: {alias.name}"
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split('.')[0] not in ALLOWED_MODULES:
                return False, f"Forbidden import: {node.module}"
    return True, ""
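
These checks chain naturally into a single gate that runs before any sandbox is touched. A small sketch (the wrapper name is illustrative):

def validate_code(code: str) -> tuple[bool, str]:
    # Syntax first, so the AST-based import check never sees broken code.
    for check in (syntax_check, static_security_check, check_imports):
        ok, msg = check(code)
        if not ok:
            return False, msg
    return True, ""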

How to Answer the Interview Question

Structure the response in three layers:

Describe the threat model (untrusted LLM output and user input, three attack vectors).

Explain the three sandbox options and the trade‑offs (security level, latency, operational cost).

Detail systematic whitelist/blacklist design, static analysis with AST, and the multi‑layered execution pipeline.

Emphasize that a pure whitelist is safer than a blacklist and that language‑level restrictions alone cannot fully prevent sandbox escape, so OS‑level isolation (Docker/E2B) is recommended for production.

Conclusion

The Code Interpreter transforms a RAG system into a powerful data‑analysis assistant, but the expanded capability introduces serious security responsibilities. The recommended defense‑in‑depth approach combines prompt engineering, static code analysis, sandbox execution, and timeout monitoring. In our production deployment, this solution intercepted dozens of unsafe code attempts over three months without any security incidents, proving its effectiveness.
