Prioritizing Code Redline Scan Bugs Using Regression Models and Feature Engineering

This article presents a regression‑based approach that leverages historical bug‑fix data, feature selection, data cleaning, and model tuning to reorder redline scan warnings so developers can focus on the most critical fixes, improving overall repair efficiency.

360 Quality & Efficiency

In the code redline scanning workflow, numerous bugs and warnings are reported, but developers often ignore the default priority order (red, safe, block, serious, risk, warning, style, suggestion, normal), leading to inefficient triage.

To help developers quickly locate the bugs that must be fixed, the authors propose using past fix histories to predict the likelihood of a bug being repaired and then re‑rank new scan results accordingly.

Goal and Evaluation

The objective is to predict the repair probability of each scanned bug (a regression problem) and sort the list by this probability. Common regression metrics such as explained_variance_score, mean_absolute_error, mean_squared_error, and r2_score are used for evaluation, with special attention to the error between predicted and actual 0/1 labels.
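As a rough illustration of this evaluation, the sketch below computes the four named scikit-learn metrics on a handful of made-up labels and predictions. The per-class "average error" is assumed here to mean the mean absolute error over positive and negative samples separately; the article does not define it precisely. The function name `evaluate` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)


def evaluate(y_true, y_pred):
    """Return the article's regression metrics plus per-class average error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    return {
        "explained_variance_score": explained_variance_score(y_true, y_pred),
        "mean_absolute_error": mean_absolute_error(y_true, y_pred),
        "mean_squared_error": mean_squared_error(y_true, y_pred),
        "r2_score": r2_score(y_true, y_pred),
        # Assumption: "average error" = mean absolute error within each class.
        "avg_error_positive": float(abs_err[y_true == 1].mean()),
        "avg_error_negative": float(abs_err[y_true == 0].mean()),
    }


# Made-up labels (1 = fixed, 0 = unfixed) and predicted repair probabilities.
toy_true = [1, 0, 1, 1, 0]
toy_pred = [0.9, 0.2, 0.8, 1.0, 0.1]
report = evaluate(toy_true, toy_pred)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Reporting the error separately per class matters when the classes are imbalanced, because a small overall error can hide a much larger error on the minority class.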

Feature Selection

Based on observed data, the following features are chosen: svn_path (repository path), svn_file (file path), error_id (bug title), msg (description), cat (bug type). The target label is status (0 = unfixed, 1 = fixed).

Data Cleaning

Version changes and business logic leave dirty data in the fix history. Cleaning addresses four problems: duplicate records with inconsistent labels, severe class imbalance, temporal sample drift (e.g., early C++/SVN data vs. later Java/Android/Git data), and mandatory‑fix categories (red, serious, safe, block). Each of these is normalized during cleaning.
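The article does not say how each problem is resolved, so the following is only one plausible sketch. It assumes that duplicates are resolved by keeping the most recent label, that mandatory-fix records are left out of the training set, and that drift is reduced by dropping records older than a cutoff. The `priority` and `scanned_at` fields and the `clean` function are hypothetical and do not come from the source. Class imbalance is not handled here.

```python
from dataclasses import dataclass

# Mandatory-fix categories as listed in the article.
MANDATORY_CATEGORIES = {"red", "serious", "safe", "block"}


@dataclass(frozen=True)
class ScanRecord:
    svn_path: str
    svn_file: str
    error_id: str
    msg: str
    cat: str
    priority: str    # hypothetical field: priority level such as "red" or "warning"
    status: int      # 0 = unfixed, 1 = fixed
    scanned_at: int  # hypothetical field: scan time as an integer timestamp


def clean(records, min_scanned_at=0):
    """Drop mandatory-fix and stale records, then keep the latest label per bug."""
    latest = {}
    for rec in records:
        if rec.priority in MANDATORY_CATEGORIES:
            continue  # assumption: always fixed, so not informative for ranking
        if rec.scanned_at < min_scanned_at:
            continue  # assumption: a time cutoff limits temporal drift
        key = (rec.svn_path, rec.svn_file, rec.error_id, rec.msg, rec.cat)
        kept = latest.get(key)
        if kept is None or rec.scanned_at >= kept.scanned_at:
            latest[key] = rec  # assumption: the newest label wins
    return list(latest.values())


records = [
    ScanRecord("proj", "a.java", "E1", "m", "null", "warning", 0, 1),
    ScanRecord("proj", "a.java", "E1", "m", "null", "warning", 1, 2),
    ScanRecord("proj", "b.java", "E2", "m", "leak", "red", 1, 2),
    ScanRecord("old", "c.cpp", "E3", "m", "style", "style", 0, 0),
]
cleaned = clean(records, min_scanned_at=1)
print(cleaned)
```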

Feature Quantization

Categorical fields with limited vocabularies (error type, project) are vectorized using a dictionary and CountVectorizer. High‑cardinality path features are encoded via hashing and SimHash after segment‑wise accumulation (e.g., "/A/B/C" → ["A", "AB", "ABC"]).
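The path encoding can be sketched as follows. Only the prefix accumulation ("/A/B/C" → ["A", "AB", "ABC"]) is taken from the article; the MD5-based hash, the 32-bucket vector, and the 64-bit SimHash width are illustrative assumptions, and all function names are hypothetical.

```python
import hashlib


def path_prefixes(path):
    """Accumulate path segments: "/A/B/C" -> ["A", "AB", "ABC"]."""
    prefixes, acc = [], ""
    for segment in path.strip("/").split("/"):
        if segment:
            acc += segment
            prefixes.append(acc)
    return prefixes


def stable_hash(token, bits=64):
    """Hash that is stable across runs (the built-in hash() of str is salted)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def hashed_vector(tokens, n_buckets=32):
    """Hashing trick: count tokens into a fixed number of buckets."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[stable_hash(tok) % n_buckets] += 1
    return vec


def simhash(tokens, bits=64):
    """SimHash: each token votes +1 or -1 per bit; positive totals set the bit."""
    weights = [0] * bits
    for tok in tokens:
        h = stable_hash(tok, bits)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


tokens = path_prefixes("/A/B/C")
print(tokens)
print(hashed_vector(tokens))
print(hex(simhash(tokens)))
```

Because paths that share leading directories also share prefix tokens, their hashed vectors and SimHash fingerprints overlap, which is presumably why the segments are accumulated before hashing.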

Model Selection

A gradient‑boosted decision tree (GBDT) model from scikit‑learn is selected for regression.
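A minimal sketch of this step, assuming scikit-learn's `GradientBoostingRegressor` is the GBDT in question (the article does not name the class). The error-type strings and labels are made up, and only a single `CountVectorizer` feature is used to keep the example short.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import CountVectorizer

# Made-up history: one error type per record and whether it was fixed.
error_types = [
    "null_deref", "style_naming", "null_deref", "unused_var",
    "style_naming", "unused_var", "null_deref", "style_naming",
]
status = [1, 0, 1, 0, 0, 1, 1, 0]

# Low-cardinality categorical field -> count vector, one column per value.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(error_types).toarray()
y = np.array(status, dtype=float)

model = GradientBoostingRegressor(random_state=0)
model.fit(X, y)

# Regression output is read as a repair probability, clipped to [0, 1].
scores = np.clip(model.predict(X), 0.0, 1.0)

# Re-rank: highest predicted repair probability first.
order = np.argsort(-scores)
for idx in order:
    print(f"{error_types[idx]:<14} {scores[idx]:.3f}")
```

In this toy data the error type that was always fixed ranks first and the one that was never fixed ranks last, which is the re-ranking behavior the approach aims for.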

Hyperparameter Tuning

Parameters are explored using a coarse‑to‑fine grid search with cross‑validation (cross_val_score) to find the optimal configuration.
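The coarse-to-fine idea can be sketched as below. The article does not state which parameters were tuned or over which ranges, so `n_estimators`, the grids, the synthetic data, the scoring choice, and the helper `best_value` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score


def best_value(X, y, name, candidates, fixed=None, cv=3):
    """Score each candidate value of one parameter by cross-validation."""
    results = {}
    for value in candidates:
        params = {**(fixed or {}), name: value}
        model = GradientBoostingRegressor(random_state=0, **params)
        scores = cross_val_score(
            model, X, y, cv=cv, scoring="neg_mean_squared_error"
        )
        results[value] = scores.mean()
    return max(results, key=results.get), results


# Synthetic 0/1 data standing in for the encoded bug features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(120, 6)).astype(float)
y = X[:, 0] * X[:, 1]
flip = rng.random(120) < 0.1          # 10% label noise
y = np.where(flip, 1.0 - y, y)

# Coarse pass: widely spaced values.
coarse_grid = [20, 60, 100]
best_coarse, coarse_scores = best_value(X, y, "n_estimators", coarse_grid)

# Fine pass: narrower steps around the coarse winner.
fine_grid = [v for v in (best_coarse - 20, best_coarse - 10, best_coarse,
                         best_coarse + 10, best_coarse + 20) if v > 0]
best_fine, fine_scores = best_value(X, y, "n_estimators", fine_grid)

print("coarse best:", best_coarse, "fine best:", best_fine)
```

The same two-pass pattern can be repeated for other parameters, holding already tuned values constant through the `fixed` argument.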

Model Evaluation Results

explained_variance_score: 0.9689
mean_absolute_error: 0.00207
mean_squared_error: 0.00167
r2_score: 0.9689
Average error for positive samples: 0.00103
Average error for negative samples: 0.01927

These metrics demonstrate that the model can reliably estimate the repair probability of each bug, enabling an effective re‑ranking of redline scan outputs.

Written by

360 Quality & Efficiency

360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
