ReXrank is an open-source leaderboard for AI-powered radiology report generation from chest X-ray images. We're setting a new standard in healthcare AI by providing a comprehensive, objective evaluation framework for cutting-edge models. Our mission is to accelerate progress in this critical field by fostering healthy competition and collaboration among researchers, clinicians, and AI enthusiasts. Using diverse datasets such as MIMIC-CXR, IU X-ray, and CheXpert Plus, ReXrank offers a robust benchmarking system that evolves with clinical needs and technological advancements. Our leaderboard showcases top-performing models, driving innovation that could transform patient care and streamline medical workflows.
Join us in shaping the future of AI-assisted radiology. Develop your models, submit your results, and see how you stack up against the best in the field. Together, we can push the boundaries of what's possible in medical imaging and report generation.
To help you evaluate your models, we have made available the evaluation script used for official evaluation, along with a sample prediction file that the script takes as input.
To run the evaluation, use `python evaluate.py <path_to_data> <path_to_predictions>`.
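For repeated runs it can be convenient to call the script from Python instead of the shell. The following is a minimal sketch, assuming only what the command above shows: `evaluate.py` takes the data path and the predictions path as two positional arguments. The helper names `build_eval_command` and `run_evaluation` are hypothetical, as are any paths you pass in, and the sketch makes no assumption about what the script prints.

```python
import subprocess
import sys
from pathlib import Path


def build_eval_command(data_path, predictions_path, script="evaluate.py"):
    """Return the argv list matching: python evaluate.py <data> <predictions>."""
    return [sys.executable, str(script), str(data_path), str(predictions_path)]


def run_evaluation(data_path, predictions_path, script="evaluate.py"):
    """Run the evaluation script and return whatever it wrote to stdout."""
    # Fail early with a clear message if any input is missing.
    for path in (script, data_path, predictions_path):
        if not Path(path).exists():
            raise FileNotFoundError(f"missing input: {path}")
    result = subprocess.run(
        build_eval_command(data_path, predictions_path, script),
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Example call (hypothetical paths, substitute your own):
# print(run_evaluation("path/to/data", "path/to/predictions"))
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.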
Once you have built a model that meets your expectations on the MIMIC-CXR test set, you can submit it to get official scores on our private test set. Here's a tutorial on the submission process to help the evaluation go smoothly.
Submission Tutorial
Please cite ReXrank if you find our leaderboard helpful.
To keep up to date with major changes to the leaderboard and datasets, please subscribe here.
The leaderboards include top models for each dataset. * denotes a model trained on the dataset in question.
ReXGradient is a large-scale private test dataset containing 10,000 studies collected from different medical centers in the US.
Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXagent (Stanford) | 0.674 | 0.093 | 0.305 | 0.366 | 0.08 | 0.428 | 0.241 | 0.456 |
2 (2024) | CheXpertPlus_MIMIC (Stanford) | 0.777 | 0.154 | 0.341 | 0.442 | 0.13 | 0.501 | 0.52 | 0.473 |
3 (2024) | CheXpertPlus_CheX (Stanford) | 0.787 | 0.143 | 0.361 | 0.431 | 0.124 | 0.476 | 0.411 | 0.414 |
4 (2024) | CheXpertPlus_CheX_MIMIC (Stanford) | 0.83 | 0.169 | 0.372 | 0.442 | 0.154 | 0.517 | 0.489 | 0.465 |
5 (2023) | Cvt2distilgpt2_MIMIC (CSIRO) | 0.866 | 0.186 | 0.374 | 0.46 | 0.176 | 0.524 | 0.514 | 0.47 |
6 (2023) | Cvt2distilgpt2_IU (CSIRO) | 0.842 | 0.178 | 0.395 | 0.405 | 0.167 | 0.52 | 0.47 | 0.457 |
7 (2024) | MedVersa (Harvard) | 1.008 | 0.21 | 0.431 | 0.498 | 0.202 | 0.527 | 0.532 | 0.475 |
8 (2023) | RadFM (SJTU) | 0.775 | 0.157 | 0.365 | 0.392 | 0.135 | 0.504 | 0.406 | 0.438 |
9 (2023) | RaDialog (TUM) | 0.876 | 0.188 | 0.402 | 0.45 | 0.158 | 0.522 | 0.435 | 0.456 |
10 (2023) | RGRG (TUM) | 0.888 | 0.19 | 0.391 | 0.47 | 0.169 | 0.54 | 0.487 | 0.46 |
11 (2023) | VLCI_MIMIC (SYSU) | 0.721 | 0.157 | 0.31 | 0.402 | 0.122 | 0.488 | 0.477 | 0.455 |
12 (2023) | VLCI_IU (SYSU) | 0.897 | 0.214 | 0.365 | 0.467 | 0.215 | 0.573 | 0.536 | 0.452 |
13 (2024) | LLM-CXR (KAIST) | 0.507 | 0.043 | 0.182 | 0.142 | 0.029 | 0.317 | 0.044 | 0.326 |
14 (2024) | GPT4V (OpenAI) | 0.629 | 0.075 | 0.214 | 0.337 | 0.138 | 0.47 | 0.497 | 0.43 |
15 (2024) | BiomedGPT_IU (Lehigh University) | 0.771 | 0.099 | 0.317 | 0.437 | 0.157 | 0.472 | 0.388 | 0.451 |
16 (2024) | MAIRA-2 (Microsoft) | 0.963 | 0.205 | 0.436 | 0.462 | 0.187 | 0.559 | 0.531 | 0.475 |

Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXpertPlus_MIMIC (Stanford) | 0.791 | 0.177 | 0.364 | 0.431 | 0.139 | 0.481 | 0.523 | 0.465 |
2 (2024) | CheXpertPlus_CheX (Stanford) | 0.748 | 0.165 | 0.333 | 0.395 | 0.148 | 0.502 | 0.468 | 0.425 |
3 (2024) | CheXpertPlus_CheX_MIMIC (Stanford) | 0.838 | 0.196 | 0.389 | 0.429 | 0.166 | 0.5 | 0.508 | 0.466 |
4 (2024) | MedVersa (Harvard) | 1.248 | 0.172 | 0.438 | 0.48 | 0.188 | 0.527 | 0.524 | 0.467 |
5 (2023) | RadFM (SJTU) | 0.737 | 0.132 | 0.338 | 0.375 | 0.131 | 0.466 | 0.405 | 0.429 |
6 (2024) | GPT4V (OpenAI) | 0.605 | 0.072 | 0.214 | 0.364 | 0.175 | 0.456 | 0.356 | 0.423 |
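Two features of the tables are easy to miss. First, the Rank column appears to be an entry index rather than a position sorted by score, so ordering by a metric has to be done by the reader. Second, the columns named 1/RadCliQ-v1 and 1/FineRadScore report reciprocals, presumably so that every column reads higher-is-better (an assumption drawn from the column names and the ↑ markers, not stated in the text). The sketch below illustrates both points using four rows copied from the first ReXGradient table; the helper names `parse_rows`, `rank_by`, and `undo_reciprocal` are hypothetical.

```python
# Four of the sixteen rows from the first ReXGradient table, copied verbatim.
ROWS = """
CheXagent | 0.674 | 0.093 | 0.305 | 0.366 | 0.08 | 0.428 | 0.241 | 0.456
MedVersa | 1.008 | 0.21 | 0.431 | 0.498 | 0.202 | 0.527 | 0.532 | 0.475
VLCI_IU | 0.897 | 0.214 | 0.365 | 0.467 | 0.215 | 0.573 | 0.536 | 0.452
MAIRA-2 | 0.963 | 0.205 | 0.436 | 0.462 | 0.187 | 0.559 | 0.531 | 0.475
"""

# Column names in the order used by the table header.
METRICS = ["1/RadCliQ-v1", "BLEU", "BertScore", "SembScore",
           "RadGraph", "RaTEScore", "GREEN", "1/FineRadScore"]


def parse_rows(text):
    """Turn pipe-delimited rows into {model: {metric: value}}."""
    table = {}
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        table[cells[0]] = dict(zip(METRICS, map(float, cells[1:])))
    return table


def rank_by(table, metric):
    """Return model names best-first, treating the column as higher-is-better."""
    return sorted(table, key=lambda model: table[model][metric], reverse=True)


def undo_reciprocal(value):
    """Recover the underlying score x from a column that reports 1/x."""
    return 1.0 / value


table = parse_rows(ROWS)
print(rank_by(table, "1/RadCliQ-v1"))
# ['MedVersa', 'MAIRA-2', 'VLCI_IU', 'CheXagent']
print(rank_by(table, "BLEU"))
# ['VLCI_IU', 'MedVersa', 'MAIRA-2', 'CheXagent']
print(round(undo_reciprocal(table["MedVersa"]["1/RadCliQ-v1"]), 3))
# 0.992
```

The two orderings differ, which shows why a single rank number cannot summarize all eight columns.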
MIMIC-CXR contains 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA. We follow the official split of MIMIC-CXR in the following experiments. * denotes the model was trained on this dataset.
Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXagent* (Stanford) | 0.741 | 0.113 | 0.346 | 0.347 | 0.148 | 0.474 | 0.257 | 0.355 |
2 (2024) | CheXpertPlus_MIMIC* (Stanford) | 0.788 | 0.145 | 0.361 | 0.375 | 0.17 | 0.485 | 0.311 | 0.363 |
3 (2024) | CheXpertPlus_CheX (Stanford) | 0.698 | 0.077 | 0.314 | 0.325 | 0.142 | 0.469 | 0.225 | 0.351 |
4 (2024) | CheXpertPlus_CheX_MIMIC* (Stanford) | 0.805 | 0.142 | 0.367 | 0.379 | 0.181 | 0.49 | 0.305 | 0.363 |
5 (2023) | Cvt2distilgpt2_MIMIC* (CSIRO) | 0.719 | 0.126 | 0.331 | 0.329 | 0.149 | 0.432 | 0.268 | 0.362 |
6 (2023) | Cvt2distilgpt2_IU (CSIRO) | 0.613 | 0.055 | 0.303 | 0.191 | 0.103 | 0.448 | 0.164 | 0.347 |
7 (2024) | MedVersa* (Harvard) | 1.103 | 0.209 | 0.448 | 0.466 | 0.273 | 0.55 | 0.374 | 0.365 |
8 (2023) | RadFM* (SJTU) | 0.65 | 0.087 | 0.313 | 0.259 | 0.109 | 0.45 | 0.185 | 0.351 |
9 (2023) | RaDialog* (TUM) | 0.799 | 0.127 | 0.363 | 0.387 | 0.172 | 0.485 | 0.273 | 0.359 |
10 (2023) | RGRG* (TUM) | 0.755 | 0.13 | 0.348 | 0.344 | 0.168 | 0.491 | 0.273 | 0.352 |
11 (2023) | VLCI_MIMIC* (SYSU) | 0.68 | 0.136 | 0.304 | 0.305 | 0.14 | 0.45 | 0.256 | 0.357 |
12 (2023) | VLCI_IU (SYSU) | 0.599 | 0.075 | 0.263 | 0.212 | 0.109 | 0.449 | 0.21 | 0.347 |
13 (2024) | LLM-CXR* (KAIST) | 0.516 | 0.037 | 0.181 | 0.156 | 0.046 | 0.341 | 0.043 | 0.307 |
14 (2024) | GPT4V (OpenAI) | 0.558 | 0.068 | 0.207 | 0.214 | 0.084 | 0.423 | 0.161 | 0.343 |
15 (2024) | BiomedGPT_IU (Lehigh University) | 0.544 | 0.02 | 0.192 | 0.224 | 0.059 | 0.36 | 0.123 | 0.341 |
16 (2024) | MAIRA-2* (Microsoft) | 0.694 | 0.088 | 0.308 | 0.339 | 0.131 | 0.517 | 0.224 | 0.359 |

Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXpertPlus_MIMIC* (Stanford) | 0.802 | 0.165 | 0.353 | 0.382 | 0.193 | 0.511 | 0.377 | 0.365 |
2 (2024) | CheXpertPlus_CheX (Stanford) | 0.715 | 0.127 | 0.3 | 0.342 | 0.173 | 0.51 | 0.302 | 0.355 |
3 (2024) | CheXpertPlus_CheX_MIMIC* (Stanford) | 0.825 | 0.166 | 0.362 | 0.391 | 0.203 | 0.52 | 0.367 | 0.365 |
4 (2024) | MedVersa* (Harvard) | 1.063 | 0.193 | 0.43 | 0.315 | 0.273 | 0.554 | 0.421 | 0.361 |
5 (2023) | RadFM* (SJTU) | 0.625 | 0.081 | 0.281 | 0.245 | 0.111 | 0.448 | 0.214 | 0.346 |
6 (2024) | GPT4V (OpenAI) | 0.549 | 0.065 | 0.204 | 0.19 | 0.085 | 0.429 | 0.127 | 0.331 |
IU X-ray contains 7,470 pairs of radiology reports and chest X-rays from Indiana University. We follow the split given by R2Gen. * denotes the model was trained on IU X-ray.
Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXagent (Stanford) | 0.827 | 0.116 | 0.353 | 0.488 | 0.139 | 0.503 | 0.389 | 0.574 |
2 (2024) | CheXpertPlus_MIMIC (Stanford) | 0.988 | 0.178 | 0.386 | 0.593 | 0.169 | 0.585 | 0.661 | 0.622 |
3 (2024) | CheXpertPlus_CheX (Stanford) | 0.92 | 0.157 | 0.413 | 0.495 | 0.153 | 0.534 | 0.541 | 0.548 |
4 (2024) | CheXpertPlus_CheX_MIMIC (Stanford) | 1.179 | 0.198 | 0.453 | 0.593 | 0.211 | 0.618 | 0.648 | 0.576 |
5 (2023) | Cvt2distilgpt2_MIMIC (CSIRO) | 1.126 | 0.199 | 0.422 | 0.609 | 0.209 | 0.606 | 0.682 | 0.608 |
6 (2023) | Cvt2distilgpt2_IU* (CSIRO) | 1.283 | 0.244 | 0.482 | 0.548 | 0.265 | 0.62 | 0.686 | 0.563 |
7 (2024) | MedVersa (Harvard) | 1.46 | 0.206 | 0.527 | 0.606 | 0.235 | 0.65 | 0.631 | 0.569 |
8 (2023) | RadFM (SJTU) | 1.187 | 0.2 | 0.459 | 0.566 | 0.23 | 0.627 | 0.615 | 0.572 |
9 (2023) | RaDialog (TUM) | 1.086 | 0.201 | 0.444 | 0.544 | 0.205 | 0.586 | 0.586 | 0.543 |
10 (2023) | RGRG (TUM) | 1.174 | 0.216 | 0.437 | 0.602 | 0.223 | 0.62 | 0.665 | 0.596 |
11 (2023) | VLCI_MIMIC (SYSU) | 0.913 | 0.139 | 0.364 | 0.483 | 0.22 | 0.578 | 0.474 | 0.488 |
12 (2023) | VLCI_IU* (SYSU) | 1.381 | 0.268 | 0.455 | 0.619 | 0.288 | 0.679 | 0.698 | 0.551 |
13 (2024) | LLM-CXR (KAIST) | 0.486 | 0.033 | 0.186 | 0.057 | 0.023 | 0.28 | 0.025 | 0.302 |
14 (2024) | GPT4V (OpenAI) | 0.708 | 0.076 | 0.274 | 0.405 | 0.146 | 0.517 | 0.651 | 0.55 |
15 (2024) | BiomedGPT_IU* (Lehigh University) | 0.956 | 0.142 | 0.375 | 0.522 | 0.213 | 0.543 | 0.523 | 0.543 |
16 (2024) | MAIRA-2 (Microsoft) | 1.298 | 0.219 | 0.477 | 0.604 | 0.233 | 0.627 | 0.194 | 0.599 |

Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXpertPlus_MIMIC (Stanford) | 1.111 | 0.227 | 0.449 | 0.594 | 0.187 | 0.57 | 0.681 | 0.615 |
2 (2024) | CheXpertPlus_CheX (Stanford) | 0.995 | 0.198 | 0.394 | 0.55 | 0.211 | 0.604 | 0.706 | 0.568 |
3 (2024) | CheXpertPlus_CheX_MIMIC (Stanford) | 1.249 | 0.244 | 0.476 | 0.598 | 0.232 | 0.606 | 0.694 | 0.588 |
4 (2024) | MedVersa (Harvard) | 1.452 | 0.195 | 0.518 | 0.601 | 0.244 | 0.628 | 0.658 | 0.583 |
5 (2023) | RadFM (SJTU) | 1.22 | 0.196 | 0.479 | 0.556 | 0.234 | 0.596 | 0.644 | 0.551 |
6 (2024) | GPT4V (OpenAI) | 0.683 | 0.079 | 0.235 | 0.403 | 0.16 | 0.519 | 0.399 | 0.528 |
CheXpert Plus contains 223,228 unique pairs of radiology reports and chest X-rays from 187,711 studies and 64,725 patients. We follow the official split of CheXpert Plus in the following experiments and use the validation set for evaluation. * denotes the model was trained on CheXpert Plus.
Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXagent (Stanford) | 0.638 | 0.123 | 0.278 | 0.269 | 0.125 | 0.434 | 0.183 | 0.341 |
2 (2024) | CheXpertPlus_MIMIC (Stanford) | 0.663 | 0.14 | 0.292 | 0.294 | 0.134 | 0.43 | 0.238 | 0.344 |
3 (2024) | CheXpertPlus_CheX* (Stanford) | 0.786 | 0.15 | 0.342 | 0.377 | 0.191 | 0.487 | 0.237 | 0.343 |
4 (2024) | CheXpertPlus_CheX_MIMIC* (Stanford) | 0.808 | 0.153 | 0.335 | 0.404 | 0.207 | 0.497 | 0.274 | 0.348 |
5 (2023) | Cvt2distilgpt2_MIMIC (CSIRO) | 0.626 | 0.124 | 0.267 | 0.266 | 0.119 | 0.42 | 0.215 | 0.346 |
6 (2023) | Cvt2distilgpt2_IU (CSIRO) | 0.577 | 0.084 | 0.267 | 0.155 | 0.098 | 0.382 | 0.147 | 0.332 |
7 (2024) | MedVersa (Harvard) | 0.719 | 0.129 | 0.323 | 0.344 | 0.147 | 0.47 | 0.243 | 0.343 |
8 (2023) | RadFM (SJTU) | 0.572 | 0.081 | 0.235 | 0.216 | 0.08 | 0.396 | 0.096 | 0.333 |
9 (2023) | RaDialog (TUM) | 0.709 | 0.131 | 0.312 | 0.353 | 0.138 | 0.445 | 0.211 | 0.333 |
10 (2023) | RGRG (TUM) | 0.674 | 0.154 | 0.315 | 0.274 | 0.14 | 0.453 | 0.216 | 0.337 |
11 (2023) | VLCI_MIMIC (SYSU) | 0.589 | 0.12 | 0.229 | 0.251 | 0.101 | 0.384 | 0.165 | 0.33 |
12 (2023) | VLCI_IU (SYSU) | 0.555 | 0.106 | 0.22 | 0.17 | 0.094 | 0.418 | 0.194 | 0.339 |
13 (2024) | LLM-CXR (KAIST) | 0.519 | 0.041 | 0.162 | 0.211 | 0.037 | 0.321 | 0.022 | 0.291 |
14 (2024) | GPT4V (OpenAI) | 0.568 | 0.081 | 0.215 | 0.234 | 0.082 | 0.415 | 0.152 | 0.339 |
15 (2024) | BiomedGPT_IU (Lehigh University) | 0.552 | 0.022 | 0.2 | 0.241 | 0.056 | 0.351 | 0.118 | 0.32 |
16 (2024) | MAIRA-2 (Microsoft) | 0.788 | 0.163 | 0.359 | 0.355 | 0.189 | 0.485 | 0.273 | 0.352 |

Rank | Model | 1/RadCliQ-v1 ↑ | BLEU ↑ | BertScore ↑ | SembScore ↑ | RadGraph ↑ | RaTEScore ↑ | GREEN ↑ | 1/FineRadScore ↑ |
---|---|---|---|---|---|---|---|---|---|
1 (2024) | CheXpertPlus_MIMIC (Stanford) | 0.482 | 0.103 | 0.002 | 0.318 | 0.049 | 0.429 | 0.293 | 0.347 |
2 (2024) | CheXpertPlus_CheX* (Stanford) | 0.512 | 0.142 | 0.02 | 0.38 | 0.07 | 0.492 | 0.363 | 0.353 |
3 (2024) | CheXpertPlus_CheX_MIMIC* (Stanford) | 0.511 | 0.14 | 0.011 | 0.388 | 0.071 | 0.503 | 0.382 | 0.36 |
4 (2024) | MedVersa (Harvard) | 0.493 | 0.09 | 0.013 | 0.337 | 0.05 | 0.452 | 0.334 | 0.354 |
5 (2023) | RadFM (SJTU) | 0.443 | 0.067 | -0.038 | 0.229 | 0.027 | 0.39 | 0.137 | 0.34 |
6 (2024) | GPT4V (OpenAI) | 0.431 | 0.055 | -0.065 | 0.208 | 0.028 | 0.393 | 0.182 | 0.329 |