How is multiple imputation used?

Data is often incomplete. In clinical trials, patients may drop out, respondents may skip survey questions, and schools and governments may withhold certain results. When data is missing, standard statistical methods such as calculating averages become ineffective.

“We can’t handle missing data any more than we can divide by zero,” explains Stef van Buuren, a professor at Utrecht University.

Imagine that you are testing a drug to lower blood pressure. You collect data weekly, but some participants drop out of the study, frustrated by the lack of improvement. You could exclude these participants and analyze only those who completed the study, an approach called complete case analysis. But this introduces bias: dropping the dissatisfied patients skews the results, making the treatment appear more effective than it actually is.

It is difficult to avoid such bias, and the workarounds researchers used in the past had serious limitations. In the 1970s, the statistician Donald Rubin proposed a computationally demanding alternative: make several plausible guesses about the missing values and carry each of them through the analysis. The idea met resistance at first, but it has since become a mainstream way of handling incomplete data, and it continues to grow in popularity thanks to modern machine learning techniques.

Outside of statistics, “imputation” means assigning responsibility. In statistics, it means assigning values to missing data. For example, if a person does not report their height, they might be assigned the average height for their gender. This technique, known as single imputation, emerged in the 1930s and had become the preferred approach by the 1960s. Rubin saw a flaw in it: the single filled-in guess is treated as if it were a real observation, so the analysis leans too heavily on guesswork.
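
To make that concrete, here is a minimal sketch of single imputation in Python; the tiny survey table and column names are invented for illustration. Filling the gap with the mean of the observed values makes the data set look complete, but the spread of the filled column shrinks, which is exactly the overconfidence Rubin objected to.

```python
import pandas as pd

# Hypothetical survey with one missing height (None becomes NaN).
survey = pd.DataFrame({
    "person": ["A", "B", "C", "D"],
    "height_cm": [172.0, 165.0, None, 180.0],
})

# Single imputation: replace the gap with the mean of the observed heights.
mean_height = survey["height_cm"].mean()               # NaN is ignored
survey["height_cm_filled"] = survey["height_cm"].fillna(mean_height)

# The filled column looks complete, but its spread has shrunk: the guess
# is treated as if it were a real observation.
print(survey["height_cm"].std(), survey["height_cm_filled"].std())
```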

In the 1970s, Donald Rubin invented and popularized a new statistical method for dealing with missing data. It was controversial at first, but is now used in many scientific fields.

While at Harvard, Rubin switched from psychology to computer science and became interested in the problem of missing data. He observed that single imputation creates overconfidence and underestimates uncertainty. Statisticians could correct this, but the solutions were complex and specialized for each situation. Rubin sought to create a universal and accurate method.

After defending his dissertation in 1971, Rubin went to work at Princeton. When he was asked to analyze a survey with missing data, he proposed multiple imputation: make several copies of the data set and fill the missing entries in each copy with a different randomly drawn plausible value. The spread across the copies would let the uncertainty show up in the final estimates.

Multiple imputation involves creating several versions of a data set and filling the gaps in each one with values drawn at random from a plausible model of the missing data. Each version is then analyzed separately, producing slightly different estimates, and combining those estimates with a special set of pooling rules yields both a more reliable answer and a measure of its uncertainty. The method has become important to regulatory agencies such as the FDA.
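
As a rough illustration, the sketch below runs multiple imputation on a small synthetic blood-pressure data set using scikit-learn's IterativeImputer. The data, the number of imputations and the quantity being estimated (the Week 3 mean) are assumptions made for this example; the pooling step follows Rubin's combining rules.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic weekly blood-pressure readings for 40 patients; about a quarter
# drop out before Week 3, leaving NaNs in the last column.
rng = np.random.default_rng(0)
n = 40
week1 = rng.normal(138.0, 3.0, n)
week2 = week1 - rng.normal(2.0, 1.0, n)
week3 = week2 - rng.normal(2.0, 1.0, n)
data = np.column_stack([week1, week2, week3])
data[rng.random(n) < 0.25, 2] = np.nan

m = 20                                     # number of imputed copies
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True draws a different plausible value each time,
    # so every copy of the data set is completed differently.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(data)
    estimates.append(completed[:, 2].mean())               # analysis of one copy
    variances.append(completed[:, 2].var(ddof=1) / n)      # its sampling variance

# Rubin's combining rules: pool the m analyses into one answer plus an
# uncertainty that reflects both within- and between-copy variation.
q_bar = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1.0 + 1.0 / m) * between
print(f"pooled Week 3 mean: {q_bar:.1f}, standard error: {np.sqrt(total_var):.2f}")
```

The pooled standard error comes out larger than a single filled-in data set would suggest, which is precisely the extra uncertainty the method is designed to capture.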

An example with tabular data

You are testing a new drug to lower blood pressure. Every week you measure the blood pressure (BP) of patients, but some stop showing up for testing. What will you do?

PATIENT    WEEK 1    WEEK 2    WEEK 3
A          137.7     135.3     134.1
B          136.4     134.2     132.0
C          138.9     138.7     Left the study


Option 1 – Exclude Patient C from the study completely

PATIENT    WEEK 1    WEEK 2    WEEK 3
A          137.7     135.3     134.1
B          136.4     134.2     132.0
C          138.9     138.7     Left the study


Option 2 – Assume Patient C's blood pressure remains constant

PATIENT    WEEK 1    WEEK 2    WEEK 3
A          137.7     135.3     134.1
B          136.4     134.2     132.0
C          138.9     138.7     138.7


Option 3 – Assume that Patient C's blood pressure is similar to Patient A's

PATIENT    WEEK 1    WEEK 2    WEEK 3
A          137.7     135.3     134.1
B          136.4     134.2     132.0
C          138.9     138.7     134.1

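For comparison, here is a small pandas sketch of the three ad-hoc options above: complete case analysis, carrying the last observation forward, and copying the value of a similar patient. The numbers come from the example table; the column names are just illustrative.

```python
import numpy as np
import pandas as pd

# Weekly blood-pressure readings; Patient C's Week 3 value is missing.
bp = pd.DataFrame(
    {"week1": [137.7, 136.4, 138.9],
     "week2": [135.3, 134.2, 138.7],
     "week3": [134.1, 132.0, np.nan]},
    index=["A", "B", "C"],
)

# Option 1: drop Patient C entirely (complete case analysis).
option1 = bp.dropna()

# Option 2: assume the pressure stays at the last measured value
# (carry Week 2 forward into Week 3).
option2 = bp.ffill(axis=1)

# Option 3: assume Patient C ends up like the similar Patient A.
option3 = bp.copy()
option3.loc["C", "week3"] = option3.loc["A", "week3"]

# Each choice yields a different Week 3 average, and none of them
# reports how uncertain that single filled-in guess really is.
print(option1["week3"].mean(), option2["week3"].mean(), option3["week3"].mean())
```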

In the early 1970s, multiple imputation was met with skepticism. Scientists wondered why they should choose anything other than their best guess. In addition, multiple imputation required a lot of computing power, which was problematic in the punch card era.

However, Rubin continued to promote his idea. He consulted for government agencies that could afford to store and process large amounts of data. By the 1990s, technology had advanced, and multiple imputation became accessible to a wider range of researchers. One of them was van Buuren, who released software for applying the method, including the widely used mice package for the R language.

In 2010, the FDA recommended multiple imputation in medical research, making it a standard in the field. Other methods have appeared since, but multiple imputation remains the most versatile and broadly applicable.

Modern machine learning-based programs have expanded the capabilities of multiple imputation, allowing us to work with more complex data. However, some scientists still doubt the rigor of these new methods.

Even so, Rubin's approach remains a primary tool for analyzing missing data across many fields, helping researchers interpret results more accurately and avoid misleading conclusions.
