Introducing A.S.E: The First Project‑Level AI Code Generation Security Evaluation Framework

The A.S.E (AI Code Generation Security Evaluation) framework provides a comprehensive, project‑level benchmark for assessing the safety, quality, and stability of AI‑generated code across multiple languages and vulnerability types, helping developers and researchers evaluate and improve large language model coding assistants.

Tencent Technical Engineering

Overview

Recent rapid growth in AI‑assisted programming has led to a structural shift in the code ecosystem. According to GitHub’s 2024 developer report, 76% of developers use AI coding tools daily, generating up to 95 billion lines of code per month—equivalent to a decade of human coding.

With AI coding tools becoming ubiquitous, ensuring the security and reliability of AI‑generated code is a critical challenge for every developer.

A.S.E Core Advantages

Project‑level code generation scenario: Uses real GitHub projects and extracts the relevant code context, so that generated code stays consistent with the existing project structure (a retrieval sketch follows this list).

Security‑sensitive scenario design: Builds tasks from real‑world CVEs to test whether regenerated code reintroduces the original security issues.

Data privacy & fairness: Applies dual code mutation (structural and semantic) so that models are evaluated on code they have not seen during training, keeping the comparison fair (a mutation sketch also follows this list).

Expert‑level customization & accuracy: Security experts write custom SAST rules for each CVE, so that vulnerabilities reintroduced in generated code are reliably detected by the scanners.

Multi‑language, multi‑vulnerability coverage: Supports Java, Python, Go, JavaScript, and PHP, and evaluates command injection, SQL injection, XSS, path traversal, and other vulnerability types.
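To make the project‑level context extraction concrete: the Future Plans below mention improving the extraction "beyond BM25", which suggests a BM25‑style lexical retriever currently selects the project files most relevant to the code being regenerated. The following is only a minimal, self‑contained sketch of that idea; the function name, file names, and query are our own illustrative assumptions, not part of A.S.E.

```python
import math
from collections import Counter

def bm25_rank(query_tokens, documents, k1=1.5, b=0.75):
    """Rank tokenized documents against a tokenized query with BM25.

    documents: a list of token lists, e.g. one entry per project file.
    Returns document indices sorted from most to least relevant.
    """
    n = len(documents)
    avg_len = sum(len(d) for d in documents) / n
    # Document frequency of each query token.
    df = {t: sum(1 for d in documents if t in d) for t in set(query_tokens)}
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Hypothetical usage: find the project files most relevant to the code being regenerated.
project_files = {
    "db/query.py":  "cursor execute sql user_input concat".split(),
    "web/views.py": "render template request response".split(),
    "db/models.py": "table column schema migration".split(),
}
query = "execute sql user_input".split()
ranking = bm25_rank(query, list(project_files.values()))
print([list(project_files)[i] for i in ranking])  # most relevant file first
```

A retriever like this would hand the top‑ranked files to the model as context, which is what keeps the generated code consistent with the surrounding project.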
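The dual code mutation can also be made concrete. A structural mutation might rename identifiers and reorder declarations without changing behavior, while a semantic mutation rewrites constructs into equivalent forms. The snippet below shows only the identifier‑renaming half using Python's ast module; the class name, sample source, and renaming scheme are our own illustration, not A.S.E's actual mutation pipeline.

```python
import ast

class RenameLocals(ast.NodeTransformer):
    """Structurally mutate code by renaming parameters and local variables to v0, v1, ..."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        # Rename assignments unconditionally, and reads only for names we have already mapped,
        # so builtins and imports are left untouched.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._fresh(node.id)
        return node

source = """
def build_query(user_input):
    table = "accounts"
    return "SELECT * FROM " + table + " WHERE name = '" + user_input + "'"
"""

mutated = RenameLocals().visit(ast.parse(source))
print(ast.unparse(mutated))
```

A mutation along these lines changes the surface form the model sees while leaving the vulnerability (here, string‑concatenated SQL) intact, which is what allows the benchmark to test models on data they have not memorized.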

Evaluation Methodology

A.S.E evaluates AI‑generated code along three dimensions: code security (60%), code quality (30%), and generation stability (10%). Security is measured with custom SAST rules that detect common vulnerability classes; quality is judged by whether the generated code integrates into the project and passes syntax checks; stability is assessed by the consistency of results across three generation rounds.
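As a back‑of‑the‑envelope illustration, the three dimensions could be combined as a weighted sum. The 0.6/0.3/0.1 weights come from the article; the per‑round scoring rules and field names below are simplified assumptions of ours, not the framework's exact formulas.

```python
from dataclasses import dataclass

@dataclass
class RoundResult:
    sast_findings: int   # vulnerabilities flagged by the CVE-specific SAST rules
    integrates: bool     # generated code merges back into the project
    syntax_ok: bool      # generated code parses / compiles

def score_round(r: RoundResult) -> tuple[float, float]:
    security = 1.0 if r.sast_findings == 0 else 0.0            # simplified: any finding fails
    quality = (0.5 if r.integrates else 0.0) + (0.5 if r.syntax_ok else 0.0)
    return security, quality

def ase_style_score(rounds: list[RoundResult]) -> float:
    """Combine security (60%), quality (30%), and stability (10%) over the generation rounds."""
    per_round = [score_round(r) for r in rounds]
    security = sum(s for s, _ in per_round) / len(per_round)
    quality = sum(q for _, q in per_round) / len(per_round)
    # Stability: do all rounds agree on the security outcome?
    stability = 1.0 if len({s for s, _ in per_round}) == 1 else 0.0
    return 0.6 * security + 0.3 * quality + 0.1 * stability

rounds = [
    RoundResult(sast_findings=0, integrates=True, syntax_ok=True),
    RoundResult(sast_findings=1, integrates=True, syntax_ok=True),
    RoundResult(sast_findings=0, integrates=True, syntax_ok=False),
]
print(f"overall score: {ase_style_score(rounds):.2f}")
```

The point of the weighting is that a model which produces clean‑compiling but exploitable code is penalized far more heavily than one that occasionally fails a syntax check.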

Dataset

The benchmark includes 40 real‑world GitHub projects with associated CVE vulnerabilities, along with 80 mutated variants, covering a broad range of languages and vulnerability types.

Future Plans

Expand the dataset with more vulnerability types (e.g., the OWASP Top 10) and more languages.

Improve context extraction algorithms beyond BM25.

Introduce dynamic PoC‑based security validation for higher precision.

Community Involvement

Developers and AI researchers are invited to contribute models, data, or suggestions via GitHub issues, pull requests, and a short feedback questionnaire.

Resources

Official website: https://aicgseceval.tencent.com/home
GitHub repository: https://github.com/Tencent/AICGSecEval

A.S.E overview animation