Theals · Methods · 2 min read
Clinical ML reporting checklist that survives reviewer pressure
A practical checklist for clinical prediction and imaging ML so results are interpretable, calibrated, and defensible.
Many clinical ML papers fail for predictable reasons: the model can be strong and still be clinically unusable.
Use this checklist before submission.
1. The problem definition is operational
State:
- population
- prediction target
- when the prediction is made
- what decision would change if the prediction is correct
If the decision point is unclear, clinical value is unclear.
2. Data provenance and labeling are explicit
A reviewer needs to know:
- how cases were identified
- what exclusions were applied
- what time windows were used
- how the label was constructed
- how label leakage was prevented (a timing check is sketched below)
If leakage is plausible, reviewers will not trust the reported performance.
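One concrete guard is a timing check: assert that no feature used by the model was observed after the moment the prediction would be made. A minimal sketch in pandas, assuming a long-format feature table with per-observation timestamps (all column and file names here are hypothetical):

```python
import pandas as pd

# Hypothetical long-format table: one row per (encounter, feature) observation,
# with the time the value was recorded and the time the prediction would be made.
features = pd.read_parquet("features_long.parquet")

# Any feature observed after prediction time would leak future information.
leaked = features[features["feature_time"] > features["prediction_time"]]

if not leaked.empty:
    raise ValueError(
        f"{len(leaked)} feature observations occur after prediction time"
    )
```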
3. Splits prevent leakage and reflect deployment reality
Minimum expectations:
- patient-level split, not row-level split (sketched below)
- temporal split when practice patterns change
- site split when multi-site generalization is claimed
If external validation is not possible, state it, then define the next best test.
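A minimal sketch of a patient-level split using scikit-learn's GroupShuffleSplit, so that all encounters from one patient land on the same side of the split (the cohort file and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical encounter-level table with a patient identifier column.
df = pd.read_parquet("cohort.parquet")

# Group by patient so no patient contributes rows to both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: the patient sets are disjoint.
assert set(train["patient_id"]).isdisjoint(set(test["patient_id"]))
```

For a temporal split, the same frame can instead be cut at a date boundary so that the test period postdates all training data.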
4. Baselines are honest
Include:
- a simple baseline (see the example after this list)
- a conventional comparator from the clinical literature when it exists
- ablations that show what actually drives performance
Weak baselines signal cherry-picking.
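As an illustration of what a simple baseline can look like: a regularized logistic regression on a few routinely charted variables, continuing from the hypothetical `train`/`test` frames in the split sketch above (the feature list and the `label` column are purely illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature set: variables a clinician would already look at.
baseline_features = ["age", "heart_rate", "creatinine", "lactate"]

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(train[baseline_features], train["label"])

probs = baseline.predict_proba(test[baseline_features])[:, 1]
print("Baseline AUROC:", roc_auc_score(test["label"], probs))
```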
5. Performance reporting is clinically legible
Report:
- discrimination appropriate to the task
- calibration, not only AUC (see the sketch below)
- threshold behavior tied to sensitivity, PPV, and the intended use case
High AUC with poor calibration is common and clinically dangerous.
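A sketch of how discrimination, calibration, and threshold behavior can be reported together with scikit-learn, reusing the hypothetical `test` frame and predicted probabilities `probs` from the sketches above; the threshold is an illustrative value and should come from the intended use case, not from post hoc tuning:

```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, confusion_matrix, roc_auc_score

y_true = test["label"].to_numpy()

# Discrimination and an overall calibration summary.
print("AUROC:", roc_auc_score(y_true, probs))
print("Brier score:", brier_score_loss(y_true, probs))

# Calibration curve: observed event rate per bin of predicted risk.
observed, predicted = calibration_curve(y_true, probs, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"predicted risk {p:.2f} -> observed rate {o:.2f}")

# Threshold behavior at a clinically motivated operating point (illustrative value).
threshold = 0.2
y_pred = (probs >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sensitivity:", tp / (tp + fn))
print("PPV:", tp / (tp + fp))
```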
6. Subgroups and failure modes are not optional
Provide:
- performance by clinically relevant subgroups when sample size allows (example below)
- concrete failure examples
- sensitivity analyses for label noise and missingness when relevant
This is where clinical reviewers decide whether the work is clinically grounded.
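A minimal per-subgroup summary, again reusing the hypothetical `test` frame and `probs`; the subgroup column and the minimum sample size are assumptions to adapt to the study:

```python
from sklearn.metrics import roc_auc_score

MIN_N = 100  # illustrative floor below which a subgroup estimate is not reported

# Hypothetical subgroup column; sex, site, or age band are common choices.
for name, grp in test.assign(prob=probs).groupby("sex"):
    if len(grp) < MIN_N or grp["label"].nunique() < 2:
        print(f"{name}: n={len(grp)} (not reported: too small or single-class)")
        continue
    auc = roc_auc_score(grp["label"], grp["prob"])
    print(f"{name}: n={len(grp)}, prevalence={grp['label'].mean():.2f}, AUROC={auc:.2f}")
```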
7. Reproducibility is part of the result
At minimum:
- exact split definition and random seed handling (see the manifest sketch below)
- data derivation logic for labels and features
- code used to train and evaluate
- a short model card describing intended use and limits
If a competent reader cannot reproduce the evaluation logic, reviewers assume the result is brittle.
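A sketch of how the split definition and seed handling can be pinned down and shipped alongside the code; the file name, manifest fields, and the choice to store de-identified patient IDs are assumptions, not a prescribed format:

```python
import json
import random

import numpy as np

SEED = 42  # one documented seed, reused wherever randomness enters splitting or training
random.seed(SEED)
np.random.seed(SEED)

# Persist the exact split so the evaluation can be re-run on the same patients.
# Store de-identified study IDs only, never raw identifiers.
split_manifest = {
    "seed": SEED,
    "split": "GroupShuffleSplit by patient_id, 80/20",
    "train_patients": sorted(train["patient_id"].unique().tolist()),
    "test_patients": sorted(test["patient_id"].unique().tolist()),
}
with open("split_manifest.json", "w") as f:
    json.dump(split_manifest, f, indent=2)
```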
Final gut-check
If the manuscript reads like a benchmark report, clinical reviewers will treat it as non-credible. Write it like a clinical methods paper.