Replicability of machine learning models is vital for clinical research

--

In which we provide best practices for efficiently making your research replicable

Written by Paul Hiemstra, PhD

Introduction

It is impressive to see the number of interesting research models produced in hospitals across the world. An example from our own context is a model that screens patients before an invasive coronary angiography, a procedure that checks whether an important artery close to the heart is blocked. Only if the model predicts a high probability of a blocked artery is the full procedure ordered. Such a model would reduce the number of unnecessary angiographies, which are certainly not without risk.

This model, trained on historical data, showed so much promise that it was published and the hospital wanted to integrate it into clinical practice. Yet the hospital had some doubts about the generalizability and reproducibility of the model, so we set up a project to check the quality of the model and to further improve the development cycle. As a first step in this implementation process, we tried to replicate the exact model from the published article. This turned out to be very challenging, if not impossible. Even though the original data was available and the software used, including exact version numbers, was mentioned in the article, some key elements were missing. Where things mainly went awry was the model configuration.

We were provided with the final Python scripts used to train the original model, but important details, such as which options were tried during hyperparameter tuning, were not available. Although the scripts provided some information, we could not replicate the model and obtain the same performance numbers as stated in the article. Not being able to replicate the model prevented it from having any clinical impact. In this blog post we explore what replication exactly is and share the lessons learned from this case, so that hospitals developing AI algorithms can ensure accurate replication.

Photo by Alex Knight on Unsplash

What is replication in this context?

The rise of digital tools in research led Claerbout and Karrenbach (1992) to spearhead the reproducible research movement. The discussion around reproducibility is active and ongoing; Raff (2019) even refers to it as the “AI/ML reproducibility crisis”. Note that the terms replication and reproduction are often used interchangeably in this discussion. In this post we focus on replication: recreating a specific model that was part of a research project. This is not the same as reproduction, which means independently reproducing the results in a different setting (Drummond, 2009). The latter is used to scientifically show that a model generalizes outside the original research context; the former is used to recreate a specific model, for example to bring it to production.

To achieve replicability, the following aspects need to be documented:

  • Data. Training or validating a model on different sets of data yields different results. In addition, choices regarding the pre-processing of data also influence the outcome of a model. Ideally, we would have access to the raw initial data and all the pre-processing steps needed to produce the final training data.
  • Software environment. The particular set of, mainly digital, tools used to create the model must be known: which Python version, which version of the analysis library, which preprocessing packages were used, etc. Any change to any of the packages used to build the model could yield differences that make exact replication impossible.
  • Model configuration. Which types of models were tried: trees, regression, neural networks, etc.? For each of these options, which hyperparameters were used: the learning rate, the number of neurons in the case of neural networks, the maximum number of trees in the case of gradient boosting? These settings impact the model, and thus impact replication.

To achieve what we would call limited replicability, we only need to replicate the final model that was used in the article. For full replicability, the data, tooling, and model configuration of the entire research process should be clear: how did we get to the final model, what did we try, why did we make certain decisions, and what did we try that did not work?

How do we achieve limited replication?

To achieve limited replicability, each of the aspects above needs to be covered. For each aspect we provide our best practices.

Data

The absolute minimum needed for replication is the original data that was used as input for the model. Even better would be access to the raw data that was gathered, together with each of the subsequent pre-processing steps required to generate the training data for the model. For example, which data points were regarded as outliers and what was done with those values? Were there missing values, and how were they imputed? Ideally, the pre-processing steps take the form of a (Python) script that automates this process.
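As an illustration, such a script could look like the sketch below. The file names, column names, and thresholds are hypothetical; the point is that every pre-processing decision is written down as executable code rather than described loosely in prose.

```python
import pandas as pd

RAW_DATA = "raw_measurements.csv"      # hypothetical raw export
TRAINING_DATA = "training_data.csv"    # data the model is trained on


def preprocess(raw_path: str = RAW_DATA, out_path: str = TRAINING_DATA) -> pd.DataFrame:
    df = pd.read_csv(raw_path)

    # Outlier handling: document the exact rule, here dropping
    # physiologically implausible heart rates (hypothetical thresholds).
    df = df[(df["heart_rate"] > 20) & (df["heart_rate"] < 250)]

    # Missing values: record which columns are imputed and how.
    df["cholesterol"] = df["cholesterol"].fillna(df["cholesterol"].median())

    # Derived features also belong here, so they are replicable.
    df["age_decade"] = df["age"] // 10

    df.to_csv(out_path, index=False)
    return df


if __name__ == "__main__":
    preprocess()
```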

Software environment

To achieve replicability, we need a software environment identical to the one used to generate the models. A number of options exist to make this happen:

  • A basic option would be to provide a full list of the packages that were installed at the moment of fitting, in addition to the exact version of Python. This enables someone to check whether or not their environment matches that of the authors.
  • A better option would be to build a virtual environment in which the model is trained. Virtual environments allow the user to build the environment from configuration files, which makes it a lot easier for other users to take these configuration files and replicate the environment. In addition, it separates the analysis environment into its own virtual environment, which does not interfere with software installed for other projects. A downside of this approach is that it does not fully automate generating the environment; for example, the Python distribution or the operating system used cannot be captured this way.
  • The best option in our opinion is to use (Docker) containers in which to run the analyses. This automates all aspects of replicating an identical software environment, including the Python version and all Python packages needed. In the case of Docker, the configuration of the software environment is stored in a so-called Dockerfile (see the sketch after this list). A lot of base images, for example for Anaconda and TensorFlow, are available, often requiring only a small number of additional packages to be installed. Especially for TensorFlow workflows, a Docker container takes a lot of headaches away in lining up all the version numbers of the required packages. In addition, it is easy to transfer this exact environment to a cloud environment. A downside of this approach, and to some extent of the virtual environment, is that users need to acquire the skills to manage these additional tools. In our opinion, this is a time investment worth making.
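A minimal Dockerfile could look like the sketch below. The base image tag and the file names are assumptions on our part; in practice they should be pinned to exactly the versions and scripts used in the original study.

```dockerfile
# Pin the base image to an exact tag so the Python version is fixed.
FROM python:3.10.12-slim

WORKDIR /app

# requirements.txt pins exact package versions, for example as produced by `pip freeze`.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pre-processing and training scripts into the image.
COPY preprocess.py train_model.py ./

CMD ["python", "train_model.py"]
```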

Model configuration

With our data and software environment taken care of, the last piece of the puzzle is how to build and configure the model. This includes which exact function is used to generate the model and which settings were used. In addition, if the article mentions hyperparameter tuning, the options considered for each parameter should be given.
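A search space documented in code leaves no ambiguity about which options were considered. The sketch below uses scikit-learn's GridSearchCV with a hypothetical gradient boosting grid; the actual estimator and parameter values should of course match those of the original study.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space: every option that was tried is listed explicitly.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 3, 4],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",
)
# search.fit(X_train, y_train) would then rerun the exact tuning procedure.
```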

All this extra information could be provided as additional text, tables, and figures, but we feel this would make an article quite cluttered. A better approach would be to provide this information in code as supplementary material. This could be in the form of a (Python) script that takes the input data and generates the final model presented in the article. Running the script in the provided software environment with the appropriate data would allow a third party to easily replicate the exact model.
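Continuing with the hypothetical file and column names used above, such a script could look like the following sketch. Note the fixed random seed, which is needed to obtain exactly the same fitted model on every run.

```python
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

SEED = 42  # fix the source of randomness for exact replication


def train_final_model(data_path: str = "training_data.csv") -> GradientBoostingClassifier:
    df = pd.read_csv(data_path)
    X = df.drop(columns=["blocked_artery"])   # hypothetical target column
    y = df["blocked_artery"]

    # Hyperparameters of the final model (hypothetical values; use those
    # reported in the article).
    model = GradientBoostingClassifier(
        learning_rate=0.05,
        n_estimators=300,
        max_depth=3,
        random_state=SEED,
    )
    model.fit(X, y)

    joblib.dump(model, "final_model.joblib")  # store the fitted model as an artifact
    return model


if __name__ == "__main__":
    train_final_model()
```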

An even better option would be to use an interactive analysis notebook such as a Jupyter Notebook. This kind of notebook allows mixing text and (Python) code. Essentially, we could write an article as it exists today and mix in the code needed to actually perform the analysis. This means that, given the software environment, we could simply ‘press play’ inside a notebook containing an article to replicate the work. The interactive Jupyter Notebook could be provided as supplementary online material for the article.

Wrapping up

So, in our view an optimal solution would be to write the paper in an interactive notebook style and include the container definition file and the data, which would make it easy to replicate a model. Both the notebook and the container definition are text files that can be stored in a version management system such as Git. Essentially, a new user would simply have to clone the Git repository, acquire the data (if not provided in the repository), start the container, and press play inside the notebook. An editor such as Visual Studio Code makes this process quite easy. To show how all our recommendations pan out in practice, we created an example project. The following Git repository demonstrates many of the concepts we discussed in this section of the blog post.

Extending to full replicability

Deeply understanding how a model was built goes beyond the final model used to generate the results in an article. The path that researchers took to get to the final model also provides a lot of value and insight. For example, when you want to extend the work, it is worthwhile to know what has been tried before (and might have failed). The path can also shed light on the thoroughness of the research and the choices that were made.

One way to document the path taken would be to keep a lab notebook akin to the ones used in chemistry; Claerbout and Karrenbach also refer to this practice. Essentially, the researcher writes an entry in the lab notebook for each day that they work on the model. They document what they did, what their results were, and how they interpret them. This can be done very nicely using a Jupyter notebook, see this link as an example. Having this record provides a number of advantages in our opinion:

  • It helps researchers to constantly reflect on their work. Especially for inexperienced researchers, it can be very easy to get lost in details and spend months doing research without reflecting back on the grander scale: what problem are we trying to solve? Swapping between these abstraction levels (detailed code, abstract discussions of what the results mean) is a key skill for researchers.
  • It helps researchers get up to speed with their own work again. Especially when you spend a week away, say for a conference, it can be hard to pick up the project again. With a lab notebook in place you can simply read the last few entries you wrote and get up to speed again on what you were trying to accomplish.
  • The project can more easily be transferred to or shared with someone else. Not only can your successor see the final product, but also all the steps that were taken in between. This prevents them from having to reinvent the wheel all over again.
  • When there are questions about the validity of the research, e.g. from a reviewer or external auditor, a lab notebook provides more insight into what was actually done. Either this satisfies the external party, or it greatly helps in identifying what went wrong. This is a win either way as far as we are concerned.

As with all the advice we gave, implementing such a lab notebook takes time and effort. However, we strongly believe this effort will be well worth it in the long run.

Acknowledgements

This work is part of the Medical Data Analytics Center project, a collaboration between the Isala hospital, the software company AppBakkers, and Windesheim University of Applied Sciences, made possible through a grant from TechForFuture. Special thanks go out to Guido Versteeg, Joris van Dijk, and Gido Hakvoort for providing feedback on drafts of this article.

Finally, we would like to thank the students who did most of the work in trying to replicate the model mentioned at the start of this article. Thanks to Mees Doeleman, Bram van der Mars, Mischa Luijken, Nick de Bruin, and Niels Aberson.

Written by Research Group for IT-innovations in Healthcare

This Research Group is part of Windesheim University. Our focus: practical research on sustainable embedding of promising IT innovations in care practice.
