Applying Big Data and Naive Bayes to Optimize Defect Prioritization in Software Testing
This article explores how software testers can leverage big data and data‑mining techniques, particularly a Naive Bayes classifier, to objectively prioritize defect fixes, improve testing processes, and address uncertainties inherent in traditional testing workflows.
Big data has become a hot topic, and with the industry's shift toward data-driven development, testers inevitably have to engage with it. This article shares a newcomer's reflections on how data mining can be applied in the testing field, in the hope of raising awareness of the importance and sensitivity of data.
The author’s team is responsible for quality assurance of 360’s commercial advertising platform and big‑data products such as 360 Analysis, 360 ShopEasy, and 360 DMP. Testing these products requires big‑data processing capabilities and specific test strategies, and the testing process itself generates valuable defect data that can be mined for insights. Big data serves as a remedy for uncertainty in testing workflows, offering new ideas and challenges.
Testing faces many uncertainties—project duration estimation, defect‑fix priority assignment, module defect proneness, and optimal manpower allocation. Traditionally these decisions rely on personal experience, which can be inconsistent or even wrong. A more scientific, objective approach is needed, and big data provides one such solution. Testers have access to abundant data: years of defect records, case‑construction metrics, code complexity, product research data, and more.
To illustrate the potential, the article describes a concrete use case: using data mining to determine defect‑fix priority (P1‑P5) with a Naive Bayes classifier. The problem is framed as a classification task, and Naive Bayes is chosen for its simplicity and practicality.
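To make the classification framing concrete, the decision rule behind Naive Bayes can be sketched in a few lines: each priority class is scored as its prior probability times the product of each observed feature's conditional probability, and the highest-scoring class wins. The probabilities below are made-up toy numbers for illustration, not real defect statistics.

```python
# Minimal sketch of the Naive Bayes decision rule for defect priority.
# All probabilities here are toy numbers, not real defect statistics.

# Prior probability of each priority class, P(priority).
priors = {"P1": 0.1, "P2": 0.3, "P3": 0.6}

# Conditional probability of each observed feature value given the class,
# P(feature=value | priority). Features are assumed independent (the
# "naive" assumption), so their probabilities simply multiply.
likelihoods = {
    "P1": {"type=crash": 0.6, "module=backend": 0.7},
    "P2": {"type=crash": 0.3, "module=backend": 0.5},
    "P3": {"type=crash": 0.1, "module=backend": 0.3},
}

def classify(observed_features):
    """Return the priority maximizing P(priority) * prod P(feature | priority)."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for f in observed_features:
            score *= likelihoods[cls][f]
        scores[cls] = score
    return max(scores, key=scores.get)

print(classify(["type=crash", "module=backend"]))  # highest-scoring class wins
```

Note that with these toy numbers the high prior of P3 can outweigh a strong crash signal, which is exactly why the priors and conditionals must be estimated carefully from real defect history rather than set by intuition.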
The solution workflow is divided into three stages:
Step 1 – Preparation: Select relevant features and gather training samples. Features are attributes stored in defect‑management tools (e.g., Bugzilla) such as title, version, operating system, type, discovery phase, and description. High‑impact features receive higher initial weights; low‑impact ones are discarded or given minimal weight.
After feature selection, training samples are defined by pairing feature values with manually labeled priority outcomes. Sample diversity, uniformity, and representativeness are considered—for example, front‑end UI bugs are usually quick to fix, whereas back‑end logic bugs are costlier and require earlier attention.
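The sample-preparation step above can be sketched as turning raw defect records into (feature vector, label) pairs. The field names and values here are hypothetical, standing in for the attributes a tool like Bugzilla stores for each defect; the priority label is the manually assigned outcome.

```python
# Sketch of building training samples from raw defect records.
# Field names and values are hypothetical placeholders for the
# attributes a defect-management tool such as Bugzilla stores.

raw_defects = [
    {"type": "crash", "phase": "integration", "module": "backend", "priority": "P1"},
    {"type": "ui",    "phase": "system",      "module": "frontend", "priority": "P4"},
    {"type": "logic", "phase": "unit",        "module": "backend",  "priority": "P2"},
]

# Only the selected high-impact features are kept; low-impact attributes
# (e.g. free-text description) are discarded at this stage.
SELECTED_FEATURES = ("type", "phase", "module")

def to_sample(defect):
    """Pair the selected feature values with the manually labeled priority."""
    features = tuple(defect[f] for f in SELECTED_FEATURES)
    label = defect["priority"]
    return features, label

training_set = [to_sample(d) for d in raw_defects]
print(training_set[0])  # (('crash', 'integration', 'backend'), 'P1')
```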
Step 2 – Training: Iteratively train the model while adhering to three principles: (1) Feature selection is crucial; repeated training helps eliminate noisy features and isolate dominant ones. (2) Parameter settings must be rigorous, including initial weights for features and class categories. (3) Sample updates must be timely, reflecting new defect data from ongoing iterations to maintain diversity and balance.
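For a categorical Naive Bayes model, the training step itself reduces to counting: class priors come from label frequencies, and conditional probabilities from per-class feature-value counts, with Laplace (add-one) smoothing to avoid zero probabilities for combinations never seen in the history. The samples below are hypothetical; in practice they would come from the team's historical defect records and be refreshed as new iterations produce data.

```python
from collections import Counter, defaultdict

# Sketch of the training step: for categorical features, Naive Bayes
# "training" reduces to counting. Samples are hypothetical
# (feature tuple, priority) pairs standing in for real defect history.
samples = [
    (("crash", "backend"), "P1"),
    (("crash", "backend"), "P1"),
    (("ui", "frontend"), "P4"),
    (("logic", "backend"), "P2"),
]

def train(samples, alpha=1.0):
    """Estimate class priors and per-feature conditionals with Laplace smoothing."""
    class_counts = Counter(label for _, label in samples)
    # feature_counts[class][position][value] = number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values_per_pos = defaultdict(set)
    for features, label in samples:
        for pos, value in enumerate(features):
            feature_counts[label][pos][value] += 1
            values_per_pos[pos].add(value)

    total = len(samples)
    priors = {cls: n / total for cls, n in class_counts.items()}

    def cond_prob(cls, pos, value):
        # Add-alpha smoothing keeps unseen (class, value) pairs from
        # zeroing out the whole product at prediction time.
        num = feature_counts[cls][pos][value] + alpha
        den = class_counts[cls] + alpha * len(values_per_pos[pos])
        return num / den

    return priors, cond_prob

priors, cond_prob = train(samples)
print(priors["P1"])  # 2 of 4 samples are P1 -> 0.5
```

Retraining after each iteration is then just re-running this counting pass over the updated sample set, which makes the "timely sample updates" principle cheap to follow.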
Step 3 – Application: Deploy the trained model into the defect‑management system. When a tester fills in all attribute values for a new defect and submits it, the system automatically suggests an appropriate priority, reducing subjectivity and improving lifecycle management.
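The deployment step can be sketched as a hook in the defect-management system: when a tester submits a new defect with all attributute values filled in, the system scores every priority class under the trained model and returns the best one as the suggestion. The model parameters below are illustrative placeholders, not a real trained model; log-probabilities are used to avoid floating-point underflow when many features multiply together.

```python
import math

# Sketch of the deployed model as a defect-management hook: score every
# priority class for a newly submitted defect and suggest the argmax.
# The parameters below are illustrative placeholders, not a trained model.
MODEL = {
    "priors": {"P1": 0.2, "P3": 0.8},
    "cond": {  # cond[class][field][value] = P(field=value | class)
        "P1": {"type": {"crash": 0.7, "ui": 0.3},
               "module": {"backend": 0.8, "frontend": 0.2}},
        "P3": {"type": {"crash": 0.2, "ui": 0.8},
               "module": {"backend": 0.4, "frontend": 0.6}},
    },
}

def suggest_priority(defect, model=MODEL):
    """Score log P(class) + sum of log P(field=value | class); return the argmax."""
    best_cls, best_score = None, -math.inf
    for cls, prior in model["priors"].items():
        score = math.log(prior)  # log-space avoids underflow on many features
        for field, value in defect.items():
            score += math.log(model["cond"][cls][field][value])
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

new_defect = {"type": "crash", "module": "backend"}
print(suggest_priority(new_defect))
```

The tester remains free to override the suggestion, so the model reduces subjectivity without removing human judgment from the loop.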
The article concludes that big data can solve many more testing problems beyond priority prediction. Testers should shift their mindset from exhaustive input‑output testing to data‑driven analysis, enhance their ability to acquire and analyze data, and become familiar with common analytics methods, predictive algorithms, big‑data frameworks, and data‑processing languages. Fortunately, open‑source implementations of most of these algorithms and models are available, enabling teams to adopt them selectively and optimize them continuously.
Looking ahead, the author encourages testing professionals to explore their own new world in the era of big data.
The Way of Qtest (Qtest之道)
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.