Data Imputation Technique

From PHUSE Wiki

Introduction

In clinical trials, patient data is assessed over a period that can span months or years. Despite efforts to collect complete data for all patients at all time points, it is very common for some patients to have missing values for some of the assessed variables. For example, they might drop out of the trial or simply miss certain visits.

Missing data is a particular problem for the intent-to-treat population, where every randomized patient is expected to contribute to the analysis, and therefore missing values are often imputed.

All imputations should be clearly stated in the statistical analysis plan and documented in the define file (define.xml) for ADaM. Which technique or rule to choose depends on the situation; most often the worst-case or most conservative approach is preferred. In SDTM you are allowed to derive some variables (some are even required by the FDA), but you are not allowed to impute any values. Imputations should be done in the ADaM datasets. [1]

Typical derivations that are expected in SDTM are:

  • Age calculations (between BRTHDTC and RFSTDTC)
  • Conversion to standard units (--ORRESU to --STRESU)[1]
  • Baseline flags (--BLFL)[1]
  • Study Day (--DY, --STDY and --ENDY)[1]
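
As an illustration of two of these derivations, the following is a minimal Python sketch, not a validated implementation. It assumes complete dates, uses the common SDTM convention that there is no study day 0, and computes age as completed years, which is an assumption here since the exact age algorithm is sponsor-defined. The function names `study_day` and `age_in_years` are hypothetical.

```python
from datetime import date


def study_day(event: date, reference_start: date) -> int:
    """Study day relative to the reference start date (RFSTDTC); no day 0."""
    delta = (event - reference_start).days
    # On or after the reference date the count starts at 1;
    # before the reference date it ends at -1.
    return delta + 1 if delta >= 0 else delta


def age_in_years(birth: date, reference_start: date) -> int:
    """Completed years between birth date (BRTHDTC) and reference start date.

    Assumption: age is counted in whole years; other rules are possible.
    """
    years = reference_start.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_start.month, reference_start.day) < (birth.month, birth.day):
        years -= 1
    return years
```

Both functions need complete dates; a partial date would have to stay underived in SDTM, because filling it in would be an imputation.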

Examples of imputations done in ADaM datasets:

  • Imputing partial dates: e.g. the start date of a medication or an adverse event when the date is missing or incomplete
  • Substitute a missing baseline value with the screening value
  • Substitute a missing baseline value with the median of the other subjects at baseline
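
The partial date case can be sketched in Python as follows. This is a sketch under stated assumptions, not a prescribed rule: the helper name `impute_partial_date` is hypothetical, only right-truncated ISO 8601 dates such as `2020-03` or `2020` are handled, and the choice between the earliest and the latest possible date is left to a parameter because the appropriate rule must come from the statistical analysis plan. The returned flag follows the ADaM date imputation flag convention ("D" when the day was imputed, "M" when month and day were imputed).

```python
import calendar
from datetime import date
from typing import Optional, Tuple


def impute_partial_date(
    dtc: str, use_latest: bool = False
) -> Tuple[Optional[date], Optional[str]]:
    """Return (imputed date, imputation flag) for an ISO 8601 date string.

    The flag is None when nothing was imputed. A completely missing date
    is left missing in this sketch.
    """
    text = dtc.strip()[:10]  # ignore any time part of a complete date-time
    if not text:
        return None, None
    parts = text.split("-")
    # Dates with a gap in the middle (e.g. "2003---15") are not supported here.
    if len(parts) > 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unsupported partial date: {dtc!r}")
    year = int(parts[0])
    if len(parts) == 1:
        # Month and day missing: use January 1 or December 31.
        month, day, flag = (12, 31, "M") if use_latest else (1, 1, "M")
    elif len(parts) == 2:
        # Day missing: use the first or the last day of the month.
        month = int(parts[1])
        day = calendar.monthrange(year, month)[1] if use_latest else 1
        flag = "D"
    else:
        return date(year, int(parts[1]), int(parts[2])), None
    return date(year, month, day), flag
```

For example, `impute_partial_date("2020-02")` yields 1 February 2020 with flag "D", while `impute_partial_date("2020-02", use_latest=True)` yields 29 February 2020. Keeping the flag alongside the imputed value makes it possible to trace which dates were imputed.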

Imputation techniques

Rules defined by ADaM Controlled Terminology [2]

Short Name  CDISC Definition
BC          Best Case: A data imputation technique which populates missing values with the best possible outcome.
BLOCF       Baseline Observation Carried Forward: A data imputation technique which populates missing values with the subject's nonmissing baseline observation.
BOCF        Best Observation Carried Forward: A data imputation technique which populates missing values with the subject's best-case nonmissing value.
ENDPOINT    Endpoint: A data derivation technique which calculates a subject's analysis end point value.
INTERP      Interpolation: A method of imputation involving a missing value that is between known values and is estimated by a function of those known values.
LOCF        Last Observation Carried Forward: A data imputation technique which populates missing values with the subject's previous nonmissing value.
MAXIMUM     Maximum: A data derivation technique which calculates a subject's maximum value over a defined set of records.
MINIMUM     Minimum: A data derivation technique which calculates a subject's minimum value over a defined set of records.
ML          Maximum Likelihood: A data imputation technique which populates missing values with estimates that maximize the probability of observing what has in fact been observed.
MOTH        Mean of Other Group: A data imputation technique which populates missing values with the mean value from a comparator or reference group.
MOV         Mean Observed Value in a Group: A data imputation technique which populates missing values with the mean value observed in a group of subjects.
POCF        Penultimate Observation Carried Forward: A data imputation technique which populates missing values with the subject's next-to-last nonmissing value.
SOCF        Screening Observation Carried Forward: A data imputation technique which populates missing values with the subject's nonmissing screening observation.
WC          Worst Case: A data imputation technique which populates missing values with the worst possible outcome.
WOCF        Worst Observation Carried Forward: A data imputation technique which populates missing values with the subject's worst-case nonmissing value.
WOV         Worst Observed Value in a Group: A data imputation technique which populates missing values with the worst value observed in a group of subjects.
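
To make three of the carried-forward definitions concrete, here is a minimal Python sketch for a single subject. It rests on several assumptions that are not part of the CDISC definitions: values are held in a list ordered by visit with `None` marking a missing value, the baseline is the first element unless stated otherwise, and for WOCF the "worst" value is taken over all of the subject's observed values, with the direction (higher or lower is worse) supplied by the caller. The function names are hypothetical.

```python
from typing import List, Optional

Values = List[Optional[float]]


def locf(values: Values) -> Values:
    """LOCF: fill each gap with the subject's previous nonmissing value."""
    filled: Values = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        # Gaps before the first observation stay missing (last is still None).
        filled.append(last)
    return filled


def blocf(values: Values, baseline_index: int = 0) -> Values:
    """BLOCF: fill each gap with the subject's baseline value."""
    baseline = values[baseline_index]  # if baseline is missing, gaps stay missing
    return [baseline if v is None else v for v in values]


def wocf(values: Values, higher_is_worse: bool = True) -> Values:
    """WOCF: fill each gap with the subject's worst nonmissing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, nothing to carry
    worst = max(observed) if higher_is_worse else min(observed)
    return [worst if v is None else v for v in values]
```

For a subject with values `[10.0, None, 12.0, None, None]`, LOCF gives `[10.0, 10.0, 12.0, 12.0, 12.0]`, BLOCF gives `[10.0, 10.0, 12.0, 10.0, 10.0]`, and WOCF with higher values being worse gives `[10.0, 12.0, 12.0, 12.0, 12.0]`. In a real ADaM dataset the imputed records would also need to be identifiable as imputed, which this sketch does not cover.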

Other commonly used techniques in clinical trials

Guidelines on missing data

References