RJUA‑QA: A Comprehensive Urology QA Dataset for Large Language Model Evaluation
RJUA‑QA is a newly released urology question‑answering dataset of 2,132 QA pairs with rich clinical context. Built from virtual patient records grounded in real clinical experience, it is designed to benchmark and improve large language models' medical reasoning, diagnosis, and treatment‑recommendation capabilities.
Technical Report
This document introduces RJUA‑QA, a specialized QA reasoning dataset for urology, jointly developed by Ant Group and the Department of Urology at Renji Hospital, Shanghai Jiao Tong University. The dataset is derived from virtual patient cases rewritten from real clinical experience and contains no private patient data.
Dataset Construction and Characteristics
The data source spans five years (2019–2023) of diverse clinical scenarios, covering outpatient, emergency, inpatient surgery, and educational resources. It includes ten sub‑specialties, such as urological tumors, stones, prostate hyperplasia, and kidney transplantation, which together account for 97.6% of urology visits.
Key characteristics:
Real clinical background with virtual patient data.
High diversity across organs, sub‑specialties, and diseases.
Explainability through detailed specialist evidence and reasoning.
Precision and scientific rigor aligned with clinical practice.
Dataset Overview
The dataset comprises 2,132 QA pairs and over 25,000 pieces of diagnostic evidence. It covers 67 common urological diseases; over 80% of patients carry multiple diagnoses, demanding complex multi‑step reasoning.
Evaluation Scheme
The benchmark assesses large language models on two aspects: (1) accuracy of diagnostic and treatment suggestions, measured by F1 scores, with diagnosis F1 and advice F1 weighted 2:1; and (2) overall response quality, measured by Rouge‑L. Detailed formulas for precision, recall, F1, and Rouge‑L are given in the full technical report.
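To make the scoring concrete, here is a minimal sketch of how the F1 components might be combined. The 2:1 weighting of diagnosis F1 over advice F1 comes from the text; the set‑based F1 definition and the exact aggregation formula are assumptions for illustration, not the benchmark's official implementation.

```python
def f1(predicted: set, gold: set) -> float:
    """Standard set-based F1 between predicted and gold label sets."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # true positives: labels in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def weighted_score(diag_f1: float, advice_f1: float) -> float:
    """Combine diagnosis and advice F1 with the 2:1 weighting from the text
    (normalized weighted mean; the exact aggregation is an assumption)."""
    return (2 * diag_f1 + advice_f1) / 3

# Toy example with hypothetical labels (not from the dataset):
diag = f1({"kidney stone", "hydronephrosis"}, {"kidney stone"})
adv = f1({"ureteroscopy"}, {"ureteroscopy", "hydration"})
print(round(weighted_score(diag, adv), 3))  # → 0.667
```

A partially correct prediction is still rewarded: predicting one extra diagnosis halves precision but leaves recall intact, so the F1 degrades gracefully rather than dropping to zero.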
Industry Model Evaluation
Five prominent models (Huatuo, GPT‑3.5, Baichuan, ChatGLM‑3, Tongyi Qianwen) were evaluated on RJUA‑QA. GPT‑3.5 achieved the highest Rouge‑L score, while ChatGLM‑3 and Tongyi Qianwen led on diagnosis and advice F1.
Conclusion and Future Work
The dataset aims to enhance LLMs’ logical reasoning in medical contexts and serve as a rigorous evaluation benchmark. Future plans include expanding disease coverage, adding rare diseases, enriching multi‑turn dialogue scenarios, and continuously improving the dataset for AI‑driven medical assistance.
Acknowledgements
Thanks to the Renji Hospital urology team, AntGroup medical team, and annotation contributors for their dedicated effort.
Citation
@misc{lyu2023rjuaqa,
title={RJUA-QA: A Comprehensive QA Dataset for Urology},
author={Shiwei Lyu and Chenfei Chi and Hongbo Cai and Lei Shi and Xiaoyan Yang and Lei Liu and Xiang Chen and Deng Zhao and Zhiqiang Zhang and Xianguo Lyu and Ming Zhang and Fangzhou Li and Xiaowei Ma and Yue Shen and Jinjie Gu and Wei Xue and Yiran Huang},
year={2023},
eprint={2312.09785},
archivePrefix={arXiv},
primaryClass={cs.CL}
}