RJUA‑QA: A Comprehensive Urology QA Dataset for Large Language Model Evaluation
RJUA‑QA is a newly released urology question‑answering dataset of 2,132 QA pairs with rich clinical context. Built from virtual patient records grounded in real clinical experience, it is designed to benchmark and improve large language models' medical reasoning, diagnosis, and treatment‑recommendation capabilities.
Technical Report
This document introduces RJUA‑QA, a specialized QA reasoning dataset for urology, jointly developed by Ant Group and the Department of Urology at Renji Hospital, Shanghai Jiao Tong University. The dataset is derived from virtual patient cases rewritten from real clinical experience and contains no private patient data.
Dataset Construction and Characteristics
The data source spans five years (2019–2023) of diverse clinical scenarios, covering outpatient, emergency, inpatient surgery, and educational resources. It includes ten sub‑specialties, such as urological tumors, stones, prostate hyperplasia, and kidney transplantation, which together account for 97.6% of urology visits.
Key characteristics:
Real clinical background with virtual patient data.
High diversity across organs, sub‑specialties, and diseases.
Explainability through detailed specialist evidence and reasoning.
Precision and scientific rigor aligned with clinical practice.
Dataset Overview
The dataset comprises 2,132 QA pairs and over 25,000 pieces of diagnostic evidence. It covers 67 common urological diseases; over 80% of patients carry multiple diagnoses, demanding complex multi‑step reasoning.
Evaluation Scheme
The benchmark assesses large language models on two aspects: (1) accuracy of diagnostic and treatment suggestions, measured by F1 scores, with diagnosis F1 and advice F1 weighted 2:1; and (2) overall response quality, measured by Rouge‑L. Detailed formulas for precision, recall, F1, and Rouge‑L are given in the full technical report.
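To make the scoring concrete, here is a minimal sketch of how the F1 components might be combined. The 2:1 weighting of diagnosis F1 over advice F1 comes from the text; the set‑based F1 definition and the exact aggregation formula are assumptions for illustration, not the benchmark's official implementation.

```python
def f1(predicted: set, gold: set) -> float:
    """Standard set-based F1 between predicted and gold label sets."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # true positives: labels in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def weighted_score(diag_f1: float, advice_f1: float) -> float:
    """Combine diagnosis and advice F1 with the 2:1 weighting from the text
    (normalized weighted mean; the exact aggregation is an assumption)."""
    return (2 * diag_f1 + advice_f1) / 3

# Toy example with hypothetical labels (not from the dataset):
diag = f1({"kidney stone", "hydronephrosis"}, {"kidney stone"})
adv = f1({"ureteroscopy"}, {"ureteroscopy", "hydration"})
print(round(weighted_score(diag, adv), 3))  # → 0.667
```

A partially correct prediction is still rewarded: predicting one extra diagnosis halves precision but leaves recall intact, so the F1 degrades gracefully rather than dropping to zero.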
Industry Model Evaluation
Five prominent models (Huatuo, GPT‑3.5, Baichuan, ChatGLM‑3, Tongyi Qianwen) were evaluated on RJUA‑QA. GPT‑3.5 achieved the highest Rouge‑L score, while ChatGLM‑3 and Tongyi Qianwen led on diagnosis and advice F1.
Conclusion and Future Work
The dataset aims to enhance LLMs’ logical reasoning in medical contexts and serve as a rigorous evaluation benchmark. Future plans include expanding disease coverage, adding rare diseases, enriching multi‑turn dialogue scenarios, and continuously improving the dataset for AI‑driven medical assistance.
Acknowledgements
Thanks to the Renji Hospital urology team, AntGroup medical team, and annotation contributors for their dedicated effort.
Citation
@misc{lyu2023rjuaqa,
title={RJUA-QA: A Comprehensive QA Dataset for Urology},
author={Shiwei Lyu and Chenfei Chi and Hongbo Cai and Lei Shi and Xiaoyan Yang and Lei Liu and Xiang Chen and Deng Zhao and Zhiqiang Zhang and Xianguo Lyu and Ming Zhang and Fangzhou Li and Xiaowei Ma and Yue Shen and Jinjie Gu and Wei Xue and Yiran Huang},
year={2023},
eprint={2312.09785},
archivePrefix={arXiv},
primaryClass={cs.CL}
}