
Automate Jira Defect Cleaning: Python Script to Generate Smart Excel Reports

Learn how to set up a Python workflow that reads Jira CSV exports, cleans and enriches defect data, infers root causes, generates test suggestions, and writes a formatted Excel workbook with data validation, dropdowns, and sensitive-data masking, all driven by YAML and JSON configuration.

Test Development Learning Exchange

1. Preparation

Install the required Python packages:

pip install pandas openpyxl jira python-dateutil

The project follows this directory layout. The script uses paths relative to the project root, so run it from there:

defect_cleaner/
├── config/
│   └── module_tree.yaml   # module tree configuration
├── templates/
│   └── test_suggestions.json   # test suggestion templates
├── scripts/
│   └── clean_defects.py   # main cleaning script
├── jira_export.csv   # raw Jira CSV export (input)
└── output/
    └── cleaned_defects.xlsx   # generated Excel file
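The layout above can be created in one step. A minimal sketch, assuming you run it from the directory that should contain the project:

```python
import os

# Create the defect_cleaner project skeleton; exist_ok makes this idempotent,
# so re-running it on an existing project is harmless.
for d in ["config", "templates", "scripts", "output"]:
    os.makedirs(os.path.join("defect_cleaner", d), exist_ok=True)
```
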

2. Configuration Files

module_tree.yaml defines a two-level hierarchy of business modules, mapping each category to its submodules:

模块分类:            # module categories
  用户中心:          # User Center
    - 注册/登录      # registration / login
    - 个人资料       # profile
    - 安全设置       # security settings
  支付中心:          # Payment Center
    - 微信支付       # WeChat Pay
    - 支付宝         # Alipay
    - 退款           # refunds
  订单系统:          # Order System
    - 创建订单       # order creation
    - 订单查询       # order lookup
    - 状态机         # state machine
  其他:              # Other
    - 基础设施       # infrastructure
    - 第三方集成     # third-party integrations

Note that each category is a mapping key, not a list item: the loading code calls `.items()` on the top level, which only works when categories map directly to their submodule lists.
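The cleaning script flattens this two-level mapping into "category > submodule" strings for the Excel dropdown. A minimal sketch of that flattening, using an inline dict in place of the parsed YAML (only two categories shown):

```python
# Stand-in for yaml.safe_load(...)["模块分类"]
module_config = {
    "用户中心": ["注册/登录", "个人资料"],
    "支付中心": ["微信支付", "退款"],
}

# Each (category, submodule) pair becomes one dropdown entry
flat_modules = [f"{cat} > {sub}" for cat, subs in module_config.items() for sub in subs]
print(flat_modules[0])  # 用户中心 > 注册/登录
```
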

test_suggestions.json maps common defect-pattern keywords (超时 timeout, 空值 null value, 并发 concurrency, 边界 boundary, 幂等 idempotency) to concrete test suggestions:

{
  "超时": "模拟网络延迟(3-10秒),验证超时降级和重试逻辑(最多3次)",
  "空值": "注入 null/空字符串/特殊字符参数,验证防御性编程和错误提示",
  "并发": "使用 JMeter 进行并发压测(≥1000 TPS),验证数据一致性和锁机制",
  "边界": "使用边界值分析法,测试字段的 min-1, min, max, max+1 场景",
  "幂等": "重复提交相同请求,验证系统是否返回相同结果且无副作用"
}
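Lookups against these templates are substring matches on the inferred root cause, so a cause like "第三方服务超时,无重试机制" hits the 超时 (timeout) entry. A minimal sketch with two of the entries above:

```python
test_templates = {
    "超时": "模拟网络延迟(3-10秒),验证超时降级和重试逻辑(最多3次)",
    "空值": "注入 null/空字符串/特殊字符参数,验证防御性编程和错误提示",
}

def generate_test_suggestion(cause: str) -> str:
    # First template whose keyword appears in the cause string wins
    for kw, suggestion in test_templates.items():
        if kw in cause:
            return suggestion
    return "需人工补充"  # "needs manual follow-up"

print(generate_test_suggestion("第三方服务超时,无重试机制")[:6])  # 模拟网络延迟
```
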

3. Core Cleaning Script (scripts/clean_defects.py)

The script performs the following steps:

Load the module tree and test suggestion templates.

Read the raw Jira CSV export into a pandas DataFrame.

Map raw fields to a normalized schema (defect_id, title, description, etc.).

Combine description‑related columns into a single cleaned description.

Infer a module based on keyword matching in the summary.

Detect root causes using simple keyword heuristics.

Generate a test suggestion by looking up the inferred cause.

Mask sensitive information (phone numbers, email addresses).

Normalize risk level values and parse creation dates.

Key functions (shown in full for reference):

import pandas as pd
import yaml, json, re, os
from datetime import datetime
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font, Alignment
from openpyxl.worksheet.datavalidation import DataValidation
from openpyxl.utils import get_column_letter
from dateutil import parser

# ---------- Configuration loading ----------
def load_config():
    with open("config/module_tree.yaml", "r", encoding="utf-8") as f:
        module_config = yaml.safe_load(f)
    with open("templates/test_suggestions.json", "r", encoding="utf-8") as f:
        test_templates = json.load(f)
    flat_modules = []
    for category, submodules in module_config["模块分类"].items():
        for submodule in submodules:
            flat_modules.append(f"{category} > {submodule}")
    return flat_modules, test_templates

# ---------- Data cleaning ----------
def clean_jira_data(df: pd.DataFrame, test_templates: dict) -> pd.DataFrame:
    cleaned = pd.DataFrame()
    cleaned["defect_id"] = df["Issue key"]
    cleaned["title"] = df["Summary"].str[:50].fillna("无标题")
    # Merge description-related fields into one newline-separated block
    desc_cols = [c for c in ["Description", "Steps to Reproduce", "Expected Result", "Actual Result"]
                 if c in df.columns]
    if desc_cols:
        cleaned["description"] = (df[desc_cols].fillna("").astype(str)
                                  .agg("\n".join, axis=1)
                                  .str.replace(r"\n{2,}", "\n", regex=True)
                                  .str.strip())
    else:
        cleaned["description"] = ""
    # Simple module mapping
    def map_module(summary):
        s = str(summary).lower()
        if any(k in s for k in ["支付", "pay", "微信", "支付宝"]):
            return "支付中心 > 微信支付"
        if any(k in s for k in ["登录", "注册", "login", "signup"]):
            return "用户中心 > 注册/登录"
        if "订单" in s or "order" in s:
            return "订单系统 > 创建订单"
        return "其他 > 基础设施"
    cleaned["module"] = df["Summary"].apply(map_module)
    # Root‑cause inference
    def infer_root_cause(desc):
        d = str(desc).lower()
        if "timeout" in d or "超时" in d or "timed out" in d:
            return "第三方服务超时,无重试机制"
        if "null" in d or "空指针" in d or "npe" in d:
            return "未校验空值,导致空指针异常"
        if "concurrent" in d or "并发" in d or "race" in d:
            return "并发操作未加锁,导致数据不一致"
        if "boundary" in d or "边界" in d or "max" in d:
            return "未处理边界条件,导致溢出或异常"
        if "idempotent" in d or "幂等" in d:
            return "接口未实现幂等性,重复请求产生副作用"
        return "需人工补充"
    cleaned["root_cause"] = cleaned["description"].apply(infer_root_cause)
    # Test suggestion generation
    def generate_test_suggestion(cause):
        for kw, sug in test_templates.items():
            if kw in cause:
                return sug
        return "需人工补充"
    cleaned["test_suggestion"] = cleaned["root_cause"].apply(generate_test_suggestion)
    # Other fields
    cleaned["solution"] = df.get("Resolution", pd.Series(["需人工补充"] * len(df))).fillna("需人工补充")
    cleaned["risk_level"] = df.get("Priority", pd.Series(["中"] * len(df))).map({
        "Highest": "高", "High": "高", "Medium": "中", "Low": "低", "Lowest": "低"
    }).fillna("中")
    # Date parsing
    def parse_date(date_str):
        if pd.isna(date_str):
            return datetime.now().date()
        try:
            return parser.parse(str(date_str)).date()
        except (ValueError, OverflowError):
            return datetime.now().date()
    cleaned["created_date"] = df["Created"].apply(parse_date)
    # Sensitive data masking
    def mask_sensitive(text):
        if pd.isna(text):
            return ""
        text = str(text)
        text = re.sub(r"1[3-9]\d{9}", "1****0000", text)  # mainland mobile numbers
        text = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "user@***.com", text)  # email addresses
        return text
    cleaned["description"] = cleaned["description"].apply(mask_sensitive)
    return cleaned

# ---------- Excel formatting ----------
def apply_excel_validation(output_path: str, module_list: list):
    wb = load_workbook(output_path)
    ws = wb.active
    # Column widths
    column_widths = {"A":15, "B":30, "C":50, "D":25, "E":30, "F":40, "G":15, "H":50, "I":15}
    for col, width in column_widths.items():
        ws.column_dimensions[col].width = width
    # Header style
    header_fill = PatternFill(start_color="4F81BD", end_color="4F81BD", fill_type="solid")
    header_font = Font(color="FFFFFF", bold=True)
    for cell in ws[1]:
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal="center", vertical="center")
    max_row = ws.max_row
    # Module dropdown
    # NOTE: Excel limits an inline list formula to 255 characters;
    # for long module lists, reference a range on a hidden sheet instead.
    module_validation = DataValidation(type="list", formula1='"' + ",".join(module_list) + '"', allow_blank=False)
    module_validation.error = "请选择有效模块!"
    module_validation.errorTitle = "无效输入"
    module_validation.prompt = "请选择业务模块"
    module_validation.promptTitle = "模块选择"
    ws.add_data_validation(module_validation)
    for row in range(2, max_row + 1):
        module_validation.add(f"D{row}")
    # Risk level dropdown (column H = risk_level, given the DataFrame's column order)
    risk_validation = DataValidation(type="list", formula1='"高,中,低"', allow_blank=False)
    risk_validation.error = "风险等级只能是:高、中、低"
    ws.add_data_validation(risk_validation)
    for row in range(2, max_row + 1):
        risk_validation.add(f"H{row}")
    # Highlight required fields when empty or placeholder
    red_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
    required_cols = ["B", "D", "E", "H"]  # title, module, root_cause, risk_level
    for col in required_cols:
        for row in range(2, max_row + 1):
            cell = ws[f"{col}{row}"]
            if not cell.value or str(cell.value).strip() == "需人工补充":
                cell.fill = red_fill
    wb.save(output_path)
    print(f"✅ Excel data validation applied! File: {output_path}")

# ---------- Main entry ----------
def main():
    module_list, test_templates = load_config()
    input_file = "jira_export.csv"
    if not os.path.exists(input_file):
        print(f"❌ Input file not found: {input_file}")
        return
    df = pd.read_csv(input_file, encoding="utf-8-sig")
    print(f"📥 Read {len(df)} raw defect records")
    cleaned_df = clean_jira_data(df, test_templates)
    print(f"🧹 Cleaning done! {len(cleaned_df[cleaned_df['root_cause'] == '需人工补充'])} rows need manual review")
    output_path = "output/cleaned_defects.xlsx"
    os.makedirs("output", exist_ok=True)
    cleaned_df.to_excel(output_path, index=False, sheet_name="Defects")
    apply_excel_validation(output_path, module_list)
    print("\n🎉 Defect knowledge base cleaning completed!")
    print("🔍 Next steps:")
    print("   1. Open cleaned_defects.xlsx")
    print("   2. Check red‑highlighted cells (manual fill needed)")
    print("   3. Use dropdowns to correct module fields")
    print("   4. Import into ChromaDB to build an AI agent")

if __name__ == "__main__":
    main()

4. Excel Output Features

Module column provides a dropdown populated from the configured module tree.

Risk level column offers a fixed list: 高 (high), 中 (medium), 低 (low).

Required fields (title, module, root cause, risk level) are highlighted in red when empty or left as the placeholder "需人工补充" (needs manual follow-up).

Hover prompts guide users (e.g., "请选择业务模块").

Invalid entries trigger error dialogs (e.g., "请选择有效模块!").
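The red-highlight rule boils down to one predicate per cell. A minimal sketch of that check, independent of openpyxl:

```python
PLACEHOLDER = "需人工补充"  # "needs manual follow-up"

def needs_review(value) -> bool:
    # Highlight when the cell is empty, whitespace-only, or still the placeholder
    return value is None or str(value).strip() in ("", PLACEHOLDER)

print([needs_review(v) for v in ["支付中心 > 退款", "", None, " 需人工补充 "]])
# [False, True, True, True]
```
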

5. Security and Extensibility Suggestions

The script already masks phone numbers and email addresses. To strengthen privacy, you can also mask 18-digit Chinese ID numbers, keeping the first six and last four digits:

text = re.sub(r"(?<!\d)\d{17}[\dXx](?!\d)", lambda m: m.group(0)[:6] + "********" + m.group(0)[-4:], text)
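Taken together, the masking rules behave as follows. A sketch using the same patterns as mask_sensitive, plus ID masking; IDs are masked first so the phone rule cannot consume digits inside an ID number:

```python
import re

def mask(text: str) -> str:
    # Order matters: 18-digit IDs first, then phones, then emails
    text = re.sub(r"(?<!\d)\d{17}[\dXx](?!\d)",
                  lambda m: m.group(0)[:6] + "********" + m.group(0)[-4:], text)
    text = re.sub(r"1[3-9]\d{9}", "1****0000", text)  # mainland mobile numbers
    text = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "user@***.com", text)
    return text

print(mask("联系 13812345678 / a.b@test.cn"))  # 联系 1****0000 / user@***.com
```
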

For a more direct data source, replace the CSV export with the Jira REST API:

from jira import JIRA
jira = JIRA(server="https://your-jira.com", basic_auth=("user", "token"))
issues = jira.search_issues("project = PAY AND created >= -30d")
# Process `issues` similarly to the CSV DataFrame
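Issues returned by search_issues can be normalized into the same column names the CSV path expects, so clean_jira_data works unchanged. A sketch of that conversion; issues_to_frame is a hypothetical helper, and SimpleNamespace stands in for real issue objects here:

```python
from types import SimpleNamespace

import pandas as pd

def issues_to_frame(issues) -> pd.DataFrame:
    # Mirror the CSV export's column names so the cleaning code works unchanged
    rows = [{
        "Issue key": issue.key,
        "Summary": issue.fields.summary,
        "Description": issue.fields.description or "",
        "Priority": issue.fields.priority.name if issue.fields.priority else None,
        "Created": issue.fields.created,
    } for issue in issues]
    return pd.DataFrame(rows)

# Stand-in issue object (the real ones come from jira.search_issues)
demo = SimpleNamespace(key="PAY-101", fields=SimpleNamespace(
    summary="支付超时", description=None,
    priority=SimpleNamespace(name="High"), created="2024-05-01T10:00:00+0800"))
print(issues_to_frame([demo])["Issue key"].iloc[0])  # PAY-101
```
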

To improve module classification beyond simple keyword matching, you can integrate a Sentence‑BERT model:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Compute similarity between issue summary and module descriptions
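Once summaries and module names are embedded, classification reduces to a nearest-neighbor search over cosine similarity. A dependency-free sketch of that scoring step, using tiny hand-made vectors in place of real S-BERT embeddings (which would come from model.encode):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 3-d "embeddings"; real ones would be 384-d S-BERT vectors
module_vecs = {
    "支付中心 > 微信支付": [0.9, 0.1, 0.0],
    "用户中心 > 注册/登录": [0.1, 0.9, 0.1],
}
summary_vec = [0.8, 0.2, 0.1]  # embedding of a payment-related summary

best = max(module_vecs, key=lambda m: cosine(summary_vec, module_vecs[m]))
print(best)  # 支付中心 > 微信支付
```
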

6. Conclusion

By spending roughly one hour configuring this script, a team can replace dozens of manual hours spent cleaning Jira defect data, producing a high‑quality, validated Excel knowledge base that serves as a solid foundation for downstream AI agents and analytics.

Tags: Python · Automation · Data Cleaning · Data Validation · Excel Automation · Jira