How UnitGen Generates High‑Quality Code Datasets for Private AI Models
UnitGen, a dataset generation framework derived from UnitEval, combines unified prompts, quality pipelines, and extensible thresholds with language‑specific context strategies and ArchGuard checks to produce both documentation and test datasets for private AI code‑generation models, leveraging the open‑source Chapi AST engine.
Overview
UnitGen is a code‑dataset generation framework derived from UnitEval. It creates high‑quality coding and test datasets for private deployments of open‑source AI models and can be combined with the AutoDev plugin to incorporate an organization’s existing fine‑tuning data.
Design Principles
Unified Prompt – a single prompt format is used across code generation, fine‑tuning, and evaluation.
Code Quality Pipeline – automatic checks for code complexity, code smells, test smells, and API design issues.
Extensible Quality Thresholds – custom rules, thresholds, and quality categories can be added.
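To make the "extensible thresholds" idea concrete, here is a minimal sketch of a pluggable quality rule. The names QualityRule, QualityIssue, and MaxLinesRule are illustrative assumptions, not UnitGen's actual API:

```kotlin
// Hypothetical sketch of an extensible threshold rule; QualityRule and
// QualityIssue are illustrative names, not UnitGen's real types.
data class QualityIssue(val rule: String, val message: String)

fun interface QualityRule {
    fun check(code: String): QualityIssue?
}

// A custom threshold: flag methods longer than a configurable line count.
class MaxLinesRule(private val maxLines: Int = 30) : QualityRule {
    override fun check(code: String): QualityIssue? {
        val lines = code.lines().count { it.isNotBlank() }
        return if (lines > maxLines) {
            QualityIssue("max-lines", "method has $lines lines, threshold is $maxLines")
        } else null
    }
}

fun main() {
    val rules: List<QualityRule> = listOf(MaxLinesRule(maxLines = 2))
    val snippet = "fun a() {\n    println(1)\n    println(2)\n}"
    val issues = rules.mapNotNull { it.check(snippet) }
    println(issues.map { it.rule })  // prints [max-lines]
}
```

New rules slot in by implementing the same interface, so thresholds and quality categories can grow without touching the pipeline itself.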
Architecture
The system consists of two core components:
LanguageWorker‑based context strategy – handles language‑specific syntax differences (e.g., documentation locations, data‑behavior mappings).
ArchGuard quality inspection – performs length checks, minification checks (for JavaScript), and serves as a threshold engine for code, test code, and MVC code quality.
At the heart of UnitGen is the open‑source Chapi parser, which converts source files of many languages into a unified hierarchical abstract syntax tree (AST). Because ArchGuard also relies on Chapi, the two tools share a compatible analysis engine.
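The hierarchical model can be pictured with simplified stand-ins for Chapi's node types. The class names below mirror the terms used in this article (Container, DataStructure, Function); Chapi's real classes carry many more fields:

```kotlin
// Simplified stand-ins for Chapi's hierarchical model; the real Chapi
// classes are richer, but the containment shape is the same.
data class Function(val Name: String, val Content: String = "")
data class DataStructure(val NodeName: String, val Functions: List<Function>)
data class Container(val PackageName: String, val DataStructures: List<DataStructure>)

fun main() {
    val container = Container(
        PackageName = "com.example",
        DataStructures = listOf(
            DataStructure("UserService", listOf(Function("findUser"), Function("saveUser")))
        )
    )
    // Walk the unified tree the same way regardless of source language.
    container.DataStructures.forEach { ds ->
        ds.Functions.forEach { fn ->
            println("${container.PackageName}.${ds.NodeName}#${fn.Name}")
        }
    }
}
```

Because every supported language parses into this same shape, downstream steps (documentation extraction, test matching, quality checks) are written once against the unified tree.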
Document Dataset Generation
Documentation entries are generated by locating comment blocks at class and method levels and attaching them to the corresponding code structures. In Chapi a source file is represented as a Container that holds multiple DataStructure objects, each containing a list of Function objects. The following Kotlin-style snippet shows the comment-building loop:
container.DataStructures.forEach { dataStruct ->
    // build class comment
    val methodCommentIns = dataStruct.Functions
        .filter { it.Name != "PrimaryConstructor" }
        .map { function ->
            // build method comment
        }
    // return comments
}
After constructing the comment blocks, basic quality checks are applied before linking the comments to their code elements.
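The loop above can be fleshed out into a runnable sketch. The Comment fields and the DocEntry output type are hypothetical stand-ins; UnitGen's real record format differs:

```kotlin
// Runnable elaboration of the comment-building loop. Comment and DocEntry
// are hypothetical fields/types used only for illustration.
data class Function(val Name: String, val Comment: String)
data class DataStructure(val NodeName: String, val Comment: String, val Functions: List<Function>)
data class DocEntry(val path: String, val doc: String)

fun buildDocEntries(structs: List<DataStructure>): List<DocEntry> =
    structs.flatMap { ds ->
        val classEntry = DocEntry(ds.NodeName, ds.Comment)
        val methodEntries = ds.Functions
            .filter { it.Name != "PrimaryConstructor" }  // skip synthetic constructors
            .filter { it.Comment.isNotBlank() }          // basic quality check: drop undocumented methods
            .map { DocEntry("${ds.NodeName}#${it.Name}", it.Comment) }
        listOf(classEntry) + methodEntries
    }

fun main() {
    val structs = listOf(
        DataStructure("Cart", "/** Shopping cart. */", listOf(
            Function("PrimaryConstructor", ""),
            Function("total", "/** Sum of item prices. */"),
            Function("clear", "")
        ))
    )
    println(buildDocEntries(structs).map { it.path })  // prints [Cart, Cart#total]
}
```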
Test Dataset Generation
Test generation requires awareness of the project’s test framework and its dependencies. UnitGen performs Software Composition Analysis (SCA) on dependency manifests (e.g., package.json, build.gradle) to infer the primary framework and test runner.
For example, the SCA step might report:
Detected React with Jest from package.json.
Detected Spring Boot with spring-boot-starter-test from build.gradle.
Two internal structures manage this information:
ProjectContext – stores overall project metadata and detected frameworks.
TestStack – holds test-framework-specific configuration.
The SCA analyzer (analyser_sca) in ArchGuard provides language-specific mappings for popular frameworks.
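The detection step can be sketched as a lookup from well-known dependency names to a test stack. The mapping table and the TestStack shape here are illustrative assumptions; ArchGuard's analyser_sca is far more complete:

```kotlin
// Minimal sketch of SCA-style detection: map well-known dependency names
// (as they appear in package.json or build.gradle) to a test stack.
// The mapping and the TestStack shape are illustrative assumptions.
data class TestStack(val framework: String, val testRunner: String)

val knownStacks = mapOf(
    "jest" to TestStack("React", "Jest"),
    "spring-boot-starter-test" to TestStack("Spring Boot", "JUnit")
)

fun detectStack(dependencies: List<String>): TestStack? =
    dependencies.firstNotNullOfOrNull { knownStacks[it] }

fun main() {
    println(detectStack(listOf("react", "jest")))        // prints TestStack(framework=React, testRunner=Jest)
    println(detectStack(listOf("spring-boot-starter-test")))
}
```

Once the stack is known, test generation can emit framework-appropriate scaffolding (Jest describe/it blocks versus JUnit annotations, for instance).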
Function‑Level Test Generation
For method‑level test creation, UnitGen builds a call graph (CG) using Chapi’s static analysis. By traversing each function’s FunctionCalls, it matches test methods to the functions they exercise. The core logic is illustrated below:
dataStruct.Functions.mapIndexed { _, function ->
    val canonicalName = function.Package + "." + function.NodeName + ":" + function.FunctionName
    if (function.NodeName != underTestFile.NodeName) return@mapIndexed
    val originalContent = underTestFunctionMap[canonicalName] ?: return@mapIndexed
    if (originalContent.isBlank() || function.Content.isBlank()) return@mapIndexed
    // generate test instruction
}
Because naming conventions for test files vary, UnitGen also validates the naming style:
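The matching step above can be made concrete as follows. This is a runnable sketch under assumed shapes: the field names follow the snippet, and underTestFunctionMap is taken to map canonical names to the source of the function under test:

```kotlin
// Runnable sketch of call-graph matching: test functions are linked to the
// functions they exercise via canonical names built from FunctionCalls.
// Field names follow the article's snippet; the types are simplified stand-ins.
data class FunctionCall(val Package: String, val NodeName: String, val FunctionName: String)
data class Function(val Name: String, val Content: String, val FunctionCalls: List<FunctionCall>)

fun canonical(call: FunctionCall) = "${call.Package}.${call.NodeName}:${call.FunctionName}"

fun matchTests(
    testFunctions: List<Function>,
    underTestFunctionMap: Map<String, String>
): Map<String, String> =
    testFunctions.flatMap { test ->
        test.FunctionCalls.mapNotNull { call ->
            // Keep only calls that resolve to a known function under test.
            underTestFunctionMap[canonical(call)]?.let { source -> test.Name to source }
        }
    }.toMap()

fun main() {
    val tests = listOf(
        Function("shouldComputeTotal", "...", listOf(
            FunctionCall("com.example", "Cart", "total")
        ))
    )
    val underTest = mapOf("com.example.Cart:total" to "fun total(): Int { ... }")
    println(matchTests(tests, underTest))  // prints {shouldComputeTotal=fun total(): Int { ... }}
}
```

Each matched pair (test source, function-under-test source) then becomes one instruction/response sample in the test dataset.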
val namingStyle = dataStruct.checkNamingStyle()
External Dependencies
UnitGen relies on the open-source Chapi project (https://github.com/phodal/chapi) for language-agnostic AST handling. Chapi supports all major languages and provides detailed FunctionCall information for each of them.
Further documentation is available at https://chapi.phodal.com/.
Additional Notes
External test generation leverages open‑source examples from ThoughtWorks, Spring Data, and ArchUnit to provide realistic test scaffolding. During the creation of the second AutoDev Coder dataset, a subset of generated code was manually reviewed and refined with OpenAI assistance to improve overall data quality.
phodal
A prolific open-source contributor who constantly starts new projects. Passionate about sharing software development insights to help developers improve their KPIs. Currently active in IDEs, graphics engines, and compiler technologies.