ReX-MLE

What is ReX-MLE?

ReX-MLE is a benchmark of 20 tasks drawn from high-impact medical imaging competitions spanning segmentation, detection, classification, image quality assessment, and generative enhancement. Each challenge pairs standardized data-preparation scripts and official evaluation code with end-to-end agent runs, enabling reproducible measurement of autonomous ML systems under realistic constraints.

Please refer to the Github repository for more details.

ReX-MLE Leaderboard

Rank	Model	ReX-MLE
1	R&D-Agent Microsoft	12.15%
2	AIDE Weco AI	9.05%
3	ML-Master SJTU	4.53%

Performance Overview

Average performance across task categories (segmentation, detection, classification, and image quality/enhancement). Values correspond to the primary metric for each challenge (failures shown as zero).

Example Overview

Browse example AIDE agent runs from the ISLES'22 task.

ISLES'22 - GPT-5 ISLES'22 - Gemini ISLES'22 - Claude

Leaderboard

Agent performance across the ReX-MLE suite with primary metric values and percentile ranks (Competition Rank) separated.

Challenge	Metric (↑)	AIDE		ML-Master		R&D-Agent		Human
		Value	Rank	Value	Rank	Value	Rank	Value
Segmentation Tasks
ISLES'22	Dice	0.04	0%	0.00	0%	0.02	0%	0.79
NeurIPS-CellSeg	F1	0.04	0%	0.04	0%	0.36	0%	0.88
PANTHER-T1	Dice	0.33	16%	0.13	8%	0.16	8%	0.73
PANTHER-T2	Dice	0.09	10%	0.05	8%	0.28	58%	0.53
PUMA-T1-Seg	Dice	FAIL	--	0.00	0%	0.00	0%	0.78
PUMA-T2-Seg	Dice	0.00	0%	0.00	0%	0.00	0%	0.78
SEG.A	Dice	0.02	0%	0.02	0%	0.00	0%	0.92
TopBrain-CTA	Mean Dice	0.03	2%	0.26	3%	0.08	2%	0.79
TopBrain-MRA	Mean Dice	0.01	10%	0.26	0%	0.50	0%	0.81
TopCoW-CTA-Seg	Mean Dice	0.09	0%	0.25	3%	0.49	2%	0.87
TopCoW-MRA-Seg	Mean Dice	0.11	0%	0.48	0%	0.73	3%	0.88
Detection Tasks
DENTEX	AP	0.09	0%	0.08	0%	0.09	0%	0.40
PUMA-T1-Det	F1	0.02	0%	0.08	0%	0.06	0%	0.66
PUMA-T2-Det	F1	FAIL	--	0.00	0%	0.01	0%	0.27
TopCoW-CTA-Det	IoU	0.67	38%	0.65	25%	0.70	56%	0.79
TopCoW-MRA-Det	IoU	0.66	14%	0.69	14%	0.19	14%	0.85
Classification Tasks
TopCoW-CTA-Cls	Accuracy	0.33	33%	0.10	0%	0.28	50%	0.73
TopCoW-MRA-Cls	Accuracy	0.33	25%	0.33	25%	0.09	0%	0.89
Image Quality & Enhancement Tasks
LDCT-IQA	Score	2.62	33%	2.50	0%	2.66	50%	2.74
USenhance	LNCC	0.11	0%	0.13	0%	FAIL	--	0.91
Overall
Overall Mean Percentile	--	--	9.05%	--	4.53%	--	12.15%	--

ReX-MLE

Autonomous Agent Benchmark for Medical Imaging Challenges

Read the Paper

What is ReX-MLE?

ReX-MLE Leaderboard

Performance Overview

Example Overview

Leaderboard

Challenge Descriptions