Can ChatGPT Boost Fuzz Testing? A Practical Exploration of AI‑Generated Test Cases
This article examines how ChatGPT can be combined with fuzz testing to automatically generate test cases, detailing the research process, design, implementation, performance metrics, discovered bugs, limitations, and future plans for improving AI‑driven automated testing in a real‑world e‑commerce environment.
Introduction
Youzan's QA team has accumulated testing practices through internal platforms such as Insight, Horizons, traffic replay, page comparison, and Data Factory. With the rapid progress of AI, the author fed business scenarios to ChatGPT to generate textual test cases; after minor adjustments, the cases were usable for functional testing. The author then explored combining ChatGPT‑generated cases with fuzz testing to increase coverage and uncover hidden bugs.
Research Process
What is Fuzzing?
Fuzzing automatically creates large numbers of random inputs from seed cases to verify program reliability, especially for edge‑case and exception scenarios. The main difficulty is generating effective random data, which usually requires modeling and algorithm design.
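To make the idea concrete, the sketch below shows the simplest possible seed mutator: it takes a seed value and either swaps in a boundary value or flips a random character. All names are hypothetical; this only illustrates the "random inputs from seed cases" idea, not any particular fuzzer.

```java
import java.util.Random;

/** Minimal illustration of seed-based mutation (all names hypothetical). */
public class NaiveMutator {
    private static final Random RND = new Random();
    // Boundary values that often trigger edge-case bugs.
    private static final String[] BOUNDARY = {"", "0", "-1", "null", "' OR '1'='1"};

    /** Produce one random variant of a seed input. */
    public static String mutate(String seed) {
        if (seed.isEmpty() || RND.nextInt(4) == 0) {
            // Occasionally replace the whole value with a boundary case.
            return BOUNDARY[RND.nextInt(BOUNDARY.length)];
        }
        // Otherwise flip one character at a random position.
        char[] chars = seed.toCharArray();
        chars[RND.nextInt(chars.length)] = (char) (RND.nextInt(94) + 33);
        return new String(chars);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(mutate("kdtId=160"));
        }
    }
}
```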
Feasibility of ChatGPT‑Generated Test Cases
Using a de‑identified PRD excerpt about member pricing, the author prompted ChatGPT to produce test cases. With sufficient context, ChatGPT generated reasonable cases that required only minor tweaks before functional execution.
Combining Fuzzing and ChatGPT
The proposed workflow treats fuzzing as the core engine and ChatGPT as a rule‑mutator that produces large numbers of test cases. Coverage metrics are collected after each execution to decide whether to continue generating cases, aiming to discover bugs and improve automation efficiency.
Design and Implementation
Overall Idea
The pipeline leverages existing Youzan services (Insight, Zan‑Hunter) to obtain seed templates, builds prompts for ChatGPT, receives JSON‑formatted test cases, executes them via Java APIs, gathers line‑coverage data, and iterates until coverage plateaus (≈70‑80%). The high‑level steps are:
Insight/Zan‑Hunter provides seed case templates.
Generate prompts for ChatGPT.
ChatGPT returns test cases in JSON.
Execute cases against the target service.
Collect line‑coverage metrics and decide whether to continue.
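A hedged sketch of how these steps could be wired into a loop is shown below; `generateCases`, `executeCases`, and `collectLineCoverage` are hypothetical stand-ins for the ChatGPT call, the Insight execution step, and the coverage service described above.

```java
import java.util.List;

/** Sketch of the generate-execute-measure loop (all helpers hypothetical). */
public class FuzzLoop {
    private static final double TARGET_COVERAGE = 0.75; // iterate until ~70-80%
    private static final int MAX_ROUNDS = 10;

    public void run(String seedTemplate) {
        double coverage = 0.0;
        for (int round = 0; round < MAX_ROUNDS && coverage < TARGET_COVERAGE; round++) {
            // 1. Build a prompt from the seed template and ask ChatGPT for cases.
            List<String> cases = generateCases(seedTemplate, 1000);
            // 2. Execute the generated cases against the target service.
            executeCases(cases);
            // 3. Collect line coverage and decide whether to keep going.
            coverage = collectLineCoverage();
            System.out.printf("round %d: line coverage %.1f%%%n", round, coverage * 100);
        }
    }

    // Stand-ins for the real integrations described in the article.
    private List<String> generateCases(String template, int n) { return List.of(); }
    private void executeCases(List<String> cases) { }
    private double collectLineCoverage() { return 0.0; }
}
```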
Functional Design
Phase 1: Front‑end for displaying generated recommendations; back‑end CRUD for test cases.
Phase 1 (core): Back‑end logic that builds prompts and calls GPT‑3.5‑turbo to generate random input parameters.
Phase 1 (core): Bug mining after case execution.
Phase 2 (planned): Automatic coverage collection, comparison, effective‑case filtering, and iterative generation.
Implementation Status
Phase 1 is completed. The service can:
Generate recommended cases via ChatGPT.
Create and run test suites in Insight.
Write back assertions based on the first execution result.
Export results to Insight and Feishu documents for manual review.
Built on ChatGPT, the recommendation service provides:
- Front‑end page accepts generation rules and shows parameter templates.
- Back‑end uses GPT‑3.5 to generate accurate random input parameters.
- Integrates with Insight to create and run test suites.

Current Usage Results
Generation Speed
Generating and executing 1,000 cases (each with fewer than 10 input fields) takes about 8 minutes, far faster than manual case creation.
Generation Quality (Bug Discovery)
Testing 25 interfaces (~3,400 generated cases) uncovered four valid issues: two SQL injection errors and two NullPointerException handling problems. The remaining interfaces correctly rejected malformed inputs, indicating robust validation.
Conclusions
The tool acts as a fuzzing assistant that creates random‑parameter test suites, supports regression verification across code versions, and complements existing functional tests. It cannot fully replace manual or scenario‑driven testing because it lacks rich business‑semantic data generation.
Tool Capabilities
Create and run random‑parameter fuzz tests.
Batch‑generate non‑semantic test cases.
Use generated suites for regression verification across code versions.
Limitations
Cannot replace manual or scenario‑based testing that requires business‑semantic inputs.
Relation to Existing Automation
Existing automated tests rarely cover abnormal input scenarios; this service fills that gap.
It is an auxiliary aid rather than a full replacement.
Future Plans
Planned improvements include:
Prompt tuning and private model fine‑tuning to generate business‑semantic parameters.
Engineering optimizations for stability and latency.
Phase 2 integration of code‑coverage feedback to iteratively refine case generation.
Problems and Solutions
Parameter Recommendation Accuracy
Incorrect or non‑semantic data reduces test effectiveness. Two mitigation strategies are proposed:
Fine‑tune a private model on domain data.
Enforce deterministic generation rules using MVEL templates (e.g., {"kdtId":"MVEL(1 || 55 || 160)"}).
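The article does not spell out the template semantics, so the sketch below assumes that `MVEL(a || b || c)` means "pick one of the listed values"; the parsing is deliberately naive and the class name is hypothetical.

```java
import java.util.Random;

/** Naive resolver for MVEL-style value templates (interpretation assumed). */
public class TemplateResolver {
    private static final Random RND = new Random();

    /** Resolve "MVEL(1 || 55 || 160)" to one of 1, 55, or 160; pass other values through. */
    public static String resolve(String value) {
        if (value.startsWith("MVEL(") && value.endsWith(")")) {
            String[] options = value.substring(5, value.length() - 1).split("\\|\\|");
            return options[RND.nextInt(options.length)].trim();
        }
        return value; // not a template: leave as-is for ChatGPT to randomize
    }

    public static void main(String[] args) {
        System.out.println(resolve("MVEL(1 || 55 || 160)")); // e.g. "55"
    }
}
```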
Prompt Design
Chinese Prompt
Initial concise Chinese prompts suffered from token limits and occasional misinterpretation. Switching to JSON‑structured rules improved accuracy and reduced token consumption.
English Prompt
English prompts yielded faster responses but occasional format errors. Adding example cases, function‑calling hints, and strict JSON output rules mitigated instability. The author uses the gpt-3.5-turbo-16k-0613 model with function calling to enforce minified JSON output.
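For illustration only, a Java call against that model with a function schema that forces minified JSON output might look like the sketch below; the function name `save_generated_params` and its schema are hypothetical, and error handling is omitted.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hedged sketch of a chat-completions call with function calling (schema hypothetical). */
public class CaseGenerator {
    public static void main(String[] args) throws Exception {
        String body = """
            {
              "model": "gpt-3.5-turbo-16k-0613",
              "messages": [{"role": "user",
                "content": "Base on fuzzing test, generate 10 random cases by given {\\"kdtId\\":\\"MVEL(1 || 55 || 160)\\"}"}],
              "functions": [{
                "name": "save_generated_params",
                "parameters": {
                  "type": "object",
                  "properties": {"generatedParams": {"type": "array", "items": {"type": "object"}}},
                  "required": ["generatedParams"]
                }
              }],
              "function_call": {"name": "save_generated_params"}
            }""";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://api.openai.com/v1/chat/completions"))
            .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // The generated cases arrive as minified JSON in the function_call arguments.
        System.out.println(response.body());
    }
}
```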
English Prompt Example
"""
Base on fuzzing test and {{Mvel Expression}}, generate {{generateNum}} random {{generateParam}} by given {{paramTemplate}}.
[OutputRule]:
- only minify json.
- set results in {{generatedParams}} array.
"""Service Stability
Four typical failure categories were identified and mitigated:
Network instability – retry logic with a maximum retry count and alerting (see the sketch after this list).
Malformed JSON – post‑processing to repair or discard and retry.
Content quality (duplicates, omissions) – pre‑filter duplicate cases and regenerate omitted data.
Generation latency – throttle request rate and monitor account‑level rate limits.
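As a concrete example of the first mitigation, a minimal retry wrapper with a maximum retry count and an alert on final failure might look like this; `callModel` and `alert` are hypothetical placeholders for the real HTTP call and alerting hook.

```java
/** Hedged sketch of the retry-with-alerting mitigation (helper names hypothetical). */
public class RetryingClient {
    private static final int MAX_RETRIES = 3;

    public String callWithRetry(String prompt) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return callModel(prompt); // the actual HTTP call to the model
            } catch (RuntimeException e) {
                last = e;
                // Simple linear backoff before the next attempt.
                try { Thread.sleep(1000L * attempt); } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        alert("ChatGPT call failed after " + MAX_RETRIES + " retries");
        throw (last != null ? last : new IllegalStateException("interrupted"));
    }

    private String callModel(String prompt) { throw new RuntimeException("network error"); }
    private void alert(String message) { System.err.println("[ALERT] " + message); }
}
```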
Assertion Strategy
Initially, a simple 200‑status‑code check was insufficient. The final approach records the first execution result as the baseline assertion for each case. Subsequent runs compare against this baseline; deviations are flagged as potential bugs. A quick 200‑code filter is still used to produce an initial report for manual review.
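A minimal sketch of that baseline strategy, assuming responses are compared as raw strings and using an in-memory map where Insight would actually persist baselines:

```java
import java.util.HashMap;
import java.util.Map;

/** Hedged sketch of first-run-as-baseline assertions (storage simplified to a map). */
public class BaselineAsserter {
    private final Map<String, String> baselines = new HashMap<>();

    /** Returns true if the result matches the recorded baseline (first run always passes). */
    public boolean check(String caseId, String result) {
        String baseline = baselines.putIfAbsent(caseId, result);
        if (baseline == null) {
            return true; // first execution: record result as the baseline assertion
        }
        // Later runs: any deviation from the baseline is flagged as a potential bug.
        return baseline.equals(result);
    }
}
```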
Reference: TCP‑Fuzz: Detecting Memory and Semantic Bugs in TCP Stacks with Fuzzing