ReXrank

Open-Source Radiology Report Generation Leaderboard

What is ReXrank?

ReXrank is an open-source leaderboard for AI-powered radiology report generation from chest X-ray images. We're setting a new standard in healthcare AI by providing a comprehensive, objective evaluation framework for cutting-edge models. Our mission is to accelerate progress in this critical field by fostering healthy competition and collaboration among researchers, clinicians, and AI enthusiasts. Using diverse datasets such as MIMIC-CXR, IU-Xray, and CheXpert Plus, ReXrank offers a robust benchmarking system that evolves with clinical needs and technological advancements. Our leaderboard showcases top-performing models, driving innovation that could transform patient care and streamline medical workflows.

Join us in shaping the future of AI-assisted radiology. Develop your models, submit your results, and see how you stack up against the best in the field. Together, we can push the boundaries of what's possible in medical imaging and report generation.

Getting Started

To evaluate your models, we have made available the evaluation script we will use for official evaluation, along with a sample prediction file in the format the script takes as input. To run the evaluation, use `python evaluate.py <path_to_data> <path_to_predictions>`.
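The exact input schema is defined by the sample prediction file, so consult it before submitting. As a rough, unofficial sketch, the snippet below assumes predictions are stored as a JSON object mapping study IDs to generated report text (an assumption for illustration, not the official format; the study IDs and paths are placeholders) and then invokes the evaluation script:

```python
import json
import subprocess

# Hypothetical sketch -- the official schema is defined by the sample
# prediction file; check it before running the real evaluation. Here we
# assume a JSON object mapping each study ID to its generated report.
predictions = {
    "study_00001": "Heart size is normal. Lungs are clear. No pleural effusion.",
    "study_00002": "Mild cardiomegaly. No focal consolidation or pneumothorax.",
}

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)

# Run the official evaluation script; both paths are placeholders.
subprocess.run(
    ["python", "evaluate.py", "data/test_set.json", "predictions.json"],
    check=True,
)
```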

Once you have built a model that performs to your expectations on the MIMIC-CXR test set, you can submit it to receive official scores on our private test set. Follow the submission tutorial below for a smooth evaluation process.

Submission Tutorial

Please cite us if you find our leaderboard helpful.

To keep up to date with major changes to the leaderboard and dataset, please subscribe here!

Leaderboard Overview

The table below shows the top models for each dataset, ranked by 1/RadCliQ-v1. * denotes a model trained on the corresponding dataset.

| Rank | ReXGradient | MIMIC-CXR | IU-Xray | CheXpert Plus |
|---|---|---|---|---|
| 1 | MedVersa (Harvard) | MedVersa* (Harvard) | MedVersa (Harvard) | CheXpertPlus_CheX_MIMIC* (Stanford) |
| 2 | MAIRA-2 (Microsoft) | CheXpertPlus_CheX_MIMIC* (Stanford) | VLCI_IU* (SYSU) | MAIRA-2 (Microsoft) |
| 3 | VLCI_IU (SYSU) | RaDialog* (TUM) | MAIRA-2 (Microsoft) | CheXpertPlus_CheX* (Stanford) |
| 4 | RGRG (TUM) | CheXpertPlus_MIMIC* (Stanford) | Cvt2distilgpt2_IU* (CSIRO) | MedVersa (Harvard) |
| 5 | RaDialog (TUM) | RGRG* (TUM) | RadFM (SJTU) | RaDialog (TUM) |
| 6 | Cvt2distilgpt2_MIMIC (CSIRO) | CheXagent* (Stanford) | CheXpertPlus_CheX_MIMIC (Stanford) | RGRG (TUM) |
| 7 | Cvt2distilgpt2_IU (CSIRO) | Cvt2distilgpt2_MIMIC* (CSIRO) | RGRG (TUM) | CheXpertPlus_MIMIC (Stanford) |
| 8 | CheXpertPlus_CheX_MIMIC (Stanford) | CheXpertPlus_CheX (Stanford) | Cvt2distilgpt2_MIMIC (CSIRO) | CheXagent (Stanford) |
| 9 | CheXpertPlus_CheX (Stanford) | MAIRA-2* (Microsoft) | RaDialog (TUM) | Cvt2distilgpt2_MIMIC (CSIRO) |
| 10 | CheXpertPlus_MIMIC (Stanford) | VLCI_MIMIC* (SYSU) | CheXpertPlus_MIMIC (Stanford) | VLCI_MIMIC (SYSU) |
| 11 | RadFM (SJTU) | RadFM* (SJTU) | BiomedGPT_IU* (Lehigh University) | Cvt2distilgpt2_IU (CSIRO) |
| 12 | BiomedGPT_IU (Lehigh University) | Cvt2distilgpt2_IU (CSIRO) | CheXpertPlus_CheX (Stanford) | RadFM (SJTU) |
| 13 | VLCI_MIMIC (SYSU) | VLCI_IU (SYSU) | VLCI_MIMIC (SYSU) | GPT4V (OpenAI) |
| 14 | CheXagent (Stanford) | GPT4V (OpenAI) | CheXagent (Stanford) | VLCI_IU (SYSU) |
| 15 | GPT4V (OpenAI) | BiomedGPT_IU (Lehigh University) | GPT4V (OpenAI) | BiomedGPT_IU (Lehigh University) |
| 16 | LLM-CXR (KAIST) | LLM-CXR* (KAIST) | LLM-CXR (KAIST) | LLM-CXR (KAIST) |

Leaderboard on ReXGradient

ReXGradient is a large-scale private test dataset containing 10,000 studies collected from multiple medical centers in the US. RadCliQ-v1 and FineRadScore are reported as reciprocals (1/RadCliQ-v1 and 1/FineRadScore) so that higher is better in every column.
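For intuition about the BLEU column in the tables below, the snippet computes a smoothed sentence-level BLEU score for a generated report against a reference using nltk. This is illustrative only: the official scores come from the ReXrank evaluation script, and both report texts here are invented examples.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative only: official leaderboard metrics come from the ReXrank
# evaluation script. Both reports below are invented examples.
reference = (
    "The heart size is normal. The lungs are clear. "
    "There is no pleural effusion or pneumothorax."
)
generated = "Heart size is normal. Lungs are clear. No pleural effusion."

# BLEU measures n-gram overlap between tokenized texts; smoothing keeps
# the score nonzero when higher-order n-grams have no matches.
score = sentence_bleu(
    [reference.lower().split()],
    generated.lower().split(),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```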

| Rank | Year | Model | Institution | 1/RadCliQ-v1 | BLEU | BertScore | SembScore | RadGraph | RaTEScore | GREEN | 1/FineRadScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2024 | CheXagent | Stanford | 0.674 | 0.093 | 0.305 | 0.366 | 0.08 | 0.428 | 0.241 | 0.456 |
| 2 | 2024 | CheXpertPlus_MIMIC | Stanford | 0.777 | 0.154 | 0.341 | 0.442 | 0.13 | 0.501 | 0.52 | 0.473 |
| 3 | 2024 | CheXpertPlus_CheX | Stanford | 0.787 | 0.143 | 0.361 | 0.431 | 0.124 | 0.476 | 0.411 | 0.414 |
| 4 | 2024 | CheXpertPlus_CheX_MIMIC | Stanford | 0.83 | 0.169 | 0.372 | 0.442 | 0.154 | 0.517 | 0.489 | 0.465 |
| 5 | 2023 | Cvt2distilgpt2_MIMIC | CSIRO | 0.866 | 0.186 | 0.374 | 0.46 | 0.176 | 0.524 | 0.514 | 0.47 |
| 6 | 2023 | Cvt2distilgpt2_IU | CSIRO | 0.842 | 0.178 | 0.395 | 0.405 | 0.167 | 0.52 | 0.47 | 0.457 |
| 7 | 2024 | MedVersa | Harvard | 1.008 | 0.21 | 0.431 | 0.498 | 0.202 | 0.527 | 0.532 | 0.475 |
| 8 | 2023 | RadFM | SJTU | 0.775 | 0.157 | 0.365 | 0.392 | 0.135 | 0.504 | 0.406 | 0.438 |
| 9 | 2023 | RaDialog | TUM | 0.876 | 0.188 | 0.402 | 0.45 | 0.158 | 0.522 | 0.435 | 0.456 |
| 10 | 2023 | RGRG | TUM | 0.888 | 0.19 | 0.391 | 0.47 | 0.169 | 0.54 | 0.487 | 0.46 |
| 11 | 2023 | VLCI_MIMIC | SYSU | 0.721 | 0.157 | 0.31 | 0.402 | 0.122 | 0.488 | 0.477 | 0.455 |
| 12 | 2023 | VLCI_IU | SYSU | 0.897 | 0.214 | 0.365 | 0.467 | 0.215 | 0.573 | 0.536 | 0.452 |
| 13 | 2024 | LLM-CXR | KAIST | 0.507 | 0.043 | 0.182 | 0.142 | 0.029 | 0.317 | 0.044 | 0.326 |
| 14 | 2024 | GPT4V | OpenAI | 0.629 | 0.075 | 0.214 | 0.337 | 0.138 | 0.47 | 0.497 | 0.43 |
| 15 | 2024 | BiomedGPT_IU | Lehigh University | 0.771 | 0.099 | 0.317 | 0.437 | 0.157 | 0.472 | 0.388 | 0.451 |
| 16 | 2024 | MAIRA-2 | Microsoft | 0.963 | 0.205 | 0.436 | 0.462 | 0.187 | 0.559 | 0.531 | 0.475 |

Leaderboard on MIMIC-CXR Dataset

MIMIC-CXR contains 377,110 images corresponding to 227,835 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA. We follow the official split of MIMIC-CXR in the following experiments. * denotes the model was trained on this dataset.
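As a concrete illustration of following the official split, the sketch below selects the test studies from the PhysioNet split file. The filename and column names are assumptions based on the public MIMIC-CXR 2.0.0 release; verify them against your local copy.

```python
import pandas as pd

# Sketch: select the official MIMIC-CXR test split. The filename and
# columns (dicom_id, study_id, subject_id, split) follow the public
# MIMIC-CXR 2.0.0 release on PhysioNet -- verify against your copy.
splits = pd.read_csv("mimic-cxr-2.0.0-split.csv.gz")
test_studies = splits.loc[splits["split"] == "test", "study_id"].unique()
print(f"{len(test_studies)} studies in the official test split")
```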

| Rank | Year | Model | Institution | 1/RadCliQ-v1 | BLEU | BertScore | SembScore | RadGraph | RaTEScore | GREEN | 1/FineRadScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2024 | CheXagent* | Stanford | 0.741 | 0.113 | 0.346 | 0.347 | 0.148 | 0.474 | 0.257 | 0.355 |
| 2 | 2024 | CheXpertPlus_MIMIC* | Stanford | 0.788 | 0.145 | 0.361 | 0.375 | 0.17 | 0.485 | 0.311 | 0.363 |
| 3 | 2024 | CheXpertPlus_CheX | Stanford | 0.698 | 0.077 | 0.314 | 0.325 | 0.142 | 0.469 | 0.225 | 0.351 |
| 4 | 2024 | CheXpertPlus_CheX_MIMIC* | Stanford | 0.805 | 0.142 | 0.367 | 0.379 | 0.181 | 0.49 | 0.305 | 0.363 |
| 5 | 2023 | Cvt2distilgpt2_MIMIC* | CSIRO | 0.719 | 0.126 | 0.331 | 0.329 | 0.149 | 0.432 | 0.268 | 0.362 |
| 6 | 2023 | Cvt2distilgpt2_IU | CSIRO | 0.613 | 0.055 | 0.303 | 0.191 | 0.103 | 0.448 | 0.164 | 0.347 |
| 7 | 2024 | MedVersa* | Harvard | 1.103 | 0.209 | 0.448 | 0.466 | 0.273 | 0.55 | 0.374 | 0.365 |
| 8 | 2023 | RadFM* | SJTU | 0.65 | 0.087 | 0.313 | 0.259 | 0.109 | 0.45 | 0.185 | 0.351 |
| 9 | 2023 | RaDialog* | TUM | 0.799 | 0.127 | 0.363 | 0.387 | 0.172 | 0.485 | 0.273 | 0.359 |
| 10 | 2023 | RGRG* | TUM | 0.755 | 0.13 | 0.348 | 0.344 | 0.168 | 0.491 | 0.273 | 0.352 |
| 11 | 2023 | VLCI_MIMIC* | SYSU | 0.68 | 0.136 | 0.304 | 0.305 | 0.14 | 0.45 | 0.256 | 0.357 |
| 12 | 2023 | VLCI_IU | SYSU | 0.599 | 0.075 | 0.263 | 0.212 | 0.109 | 0.449 | 0.21 | 0.347 |
| 13 | 2024 | LLM-CXR* | KAIST | 0.516 | 0.037 | 0.181 | 0.156 | 0.046 | 0.341 | 0.043 | 0.307 |
| 14 | 2024 | GPT4V | OpenAI | 0.558 | 0.068 | 0.207 | 0.214 | 0.084 | 0.423 | 0.161 | 0.343 |
| 15 | 2024 | BiomedGPT_IU | Lehigh University | 0.544 | 0.02 | 0.192 | 0.224 | 0.059 | 0.36 | 0.123 | 0.341 |
| 16 | 2024 | MAIRA-2* | Microsoft | 0.694 | 0.088 | 0.308 | 0.339 | 0.131 | 0.517 | 0.224 | 0.359 |

Leaderboard on IU Xray Dataset

IU Xray contains 7,470 pairs of radiology reports and chest X-rays from Indiana University. We follow the split given by R2Gen. * denotes the model was trained on IU X-ray.

| Rank | Year | Model | Institution | 1/RadCliQ-v1 | BLEU | BertScore | SembScore | RadGraph | RaTEScore | GREEN | 1/FineRadScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2024 | CheXagent | Stanford | 0.827 | 0.116 | 0.353 | 0.488 | 0.139 | 0.503 | 0.389 | 0.574 |
| 2 | 2024 | CheXpertPlus_MIMIC | Stanford | 0.988 | 0.178 | 0.386 | 0.593 | 0.169 | 0.585 | 0.661 | 0.622 |
| 3 | 2024 | CheXpertPlus_CheX | Stanford | 0.92 | 0.157 | 0.413 | 0.495 | 0.153 | 0.534 | 0.541 | 0.548 |
| 4 | 2024 | CheXpertPlus_CheX_MIMIC | Stanford | 1.179 | 0.198 | 0.453 | 0.593 | 0.211 | 0.618 | 0.648 | 0.576 |
| 5 | 2023 | Cvt2distilgpt2_MIMIC | CSIRO | 1.126 | 0.199 | 0.422 | 0.609 | 0.209 | 0.606 | 0.682 | 0.608 |
| 6 | 2023 | Cvt2distilgpt2_IU* | CSIRO | 1.283 | 0.244 | 0.482 | 0.548 | 0.265 | 0.62 | 0.686 | 0.563 |
| 7 | 2024 | MedVersa | Harvard | 1.46 | 0.206 | 0.527 | 0.606 | 0.235 | 0.65 | 0.631 | 0.569 |
| 8 | 2023 | RadFM | SJTU | 1.187 | 0.2 | 0.459 | 0.566 | 0.23 | 0.627 | 0.615 | 0.572 |
| 9 | 2023 | RaDialog | TUM | 1.086 | 0.201 | 0.444 | 0.544 | 0.205 | 0.586 | 0.586 | 0.543 |
| 10 | 2023 | RGRG | TUM | 1.174 | 0.216 | 0.437 | 0.602 | 0.223 | 0.62 | 0.665 | 0.596 |
| 11 | 2023 | VLCI_MIMIC | SYSU | 0.913 | 0.139 | 0.364 | 0.483 | 0.22 | 0.578 | 0.474 | 0.488 |
| 12 | 2023 | VLCI_IU* | SYSU | 1.381 | 0.268 | 0.455 | 0.619 | 0.288 | 0.679 | 0.698 | 0.551 |
| 13 | 2024 | LLM-CXR | KAIST | 0.486 | 0.033 | 0.186 | 0.057 | 0.023 | 0.28 | 0.025 | 0.302 |
| 14 | 2024 | GPT4V | OpenAI | 0.708 | 0.076 | 0.274 | 0.405 | 0.146 | 0.517 | 0.651 | 0.55 |
| 15 | 2024 | BiomedGPT_IU* | Lehigh University | 0.956 | 0.142 | 0.375 | 0.522 | 0.213 | 0.543 | 0.523 | 0.543 |
| 16 | 2024 | MAIRA-2 | Microsoft | 1.298 | 0.219 | 0.477 | 0.604 | 0.233 | 0.627 | 0.194 | 0.599 |

Leaderboard on CheXpert Plus Dataset

CheXpert Plus contains 223,228 unique pairs of radiology reports and chest X-rays from 187,711 studies and 64,725 patients. We follow the official split of CheXpert Plus in the following experiments and use the validation set for evaluation. * denotes the model was trained on CheXpert Plus.

| Rank | Year | Model | Institution | 1/RadCliQ-v1 | BLEU | BertScore | SembScore | RadGraph | RaTEScore | GREEN | 1/FineRadScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2024 | CheXagent | Stanford | 0.638 | 0.123 | 0.278 | 0.269 | 0.125 | 0.434 | 0.183 | 0.341 |
| 2 | 2024 | CheXpertPlus_MIMIC | Stanford | 0.663 | 0.14 | 0.292 | 0.294 | 0.134 | 0.43 | 0.238 | 0.344 |
| 3 | 2024 | CheXpertPlus_CheX* | Stanford | 0.786 | 0.15 | 0.342 | 0.377 | 0.191 | 0.487 | 0.237 | 0.343 |
| 4 | 2024 | CheXpertPlus_CheX_MIMIC* | Stanford | 0.808 | 0.153 | 0.335 | 0.404 | 0.207 | 0.497 | 0.274 | 0.348 |
| 5 | 2023 | Cvt2distilgpt2_MIMIC | CSIRO | 0.626 | 0.124 | 0.267 | 0.266 | 0.119 | 0.42 | 0.215 | 0.346 |
| 6 | 2023 | Cvt2distilgpt2_IU | CSIRO | 0.577 | 0.084 | 0.267 | 0.155 | 0.098 | 0.382 | 0.147 | 0.332 |
| 7 | 2024 | MedVersa | Harvard | 0.719 | 0.129 | 0.323 | 0.344 | 0.147 | 0.47 | 0.243 | 0.343 |
| 8 | 2023 | RadFM | SJTU | 0.572 | 0.081 | 0.235 | 0.216 | 0.08 | 0.396 | 0.096 | 0.333 |
| 9 | 2023 | RaDialog | TUM | 0.709 | 0.131 | 0.312 | 0.353 | 0.138 | 0.445 | 0.211 | 0.333 |
| 10 | 2023 | RGRG | TUM | 0.674 | 0.154 | 0.315 | 0.274 | 0.14 | 0.453 | 0.216 | 0.337 |
| 11 | 2023 | VLCI_MIMIC | SYSU | 0.589 | 0.12 | 0.229 | 0.251 | 0.101 | 0.384 | 0.165 | 0.33 |
| 12 | 2023 | VLCI_IU | SYSU | 0.555 | 0.106 | 0.22 | 0.17 | 0.094 | 0.418 | 0.194 | 0.339 |
| 13 | 2024 | LLM-CXR | KAIST | 0.519 | 0.041 | 0.162 | 0.211 | 0.037 | 0.321 | 0.022 | 0.291 |
| 14 | 2024 | GPT4V | OpenAI | 0.568 | 0.081 | 0.215 | 0.234 | 0.082 | 0.415 | 0.152 | 0.339 |
| 15 | 2024 | BiomedGPT_IU | Lehigh University | 0.552 | 0.022 | 0.2 | 0.241 | 0.056 | 0.351 | 0.118 | 0.32 |
| 16 | 2024 | MAIRA-2 | Microsoft | 0.788 | 0.163 | 0.359 | 0.355 | 0.189 | 0.485 | 0.273 | 0.352 |