Day 42

MATH 313: Survey Design and Sampling

Imputation Techniques in Data Analysis

Missing data are common in surveys and experiments. Imputation fills in missing cells with plausible values so that the full data set can be analyzed. Common methods:

  • Random selection (from the sample or subgroup).
  • Replacement by group mean.
  • Sequential or nearest-neighbor imputation.
  • Regression-based predictions.

Hot Deck Imputation Techniques

Random Selection

  • Entire sample: Select a random value from the observed values in the whole sample.
  • Subgroup: Select a random value from a homogeneous group (e.g., respondents of the same gender).
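The two random-selection variants can be sketched in base R. This is a minimal illustration, not the procedure used for the table later in these notes: the `students` data frame, the `hot_deck()` helper, and the seed are all hypothetical.

```r
set.seed(313)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data (not the study-hours data from the notes)
students <- data.frame(
  hours  = c(12, 10, 8, NA, 14, NA, 9, 11),
  gender = c("M", "F", "M", "F", "F", "M", "F", "M")
)

# Draw one donor value at random for each missing entry in x
hot_deck <- function(x) {
  missing <- is.na(x)
  donors  <- x[!missing]
  # Sample donor positions so a single donor value is not misread by sample()
  picks <- sample.int(length(donors), sum(missing), replace = TRUE)
  x[missing] <- donors[picks]
  x
}

# Entire sample: donors are all observed values
students$hours_all <- hot_deck(students$hours)

# Subgroup: donors come only from respondents of the same gender
students$hours_grp <- ave(students$hours, students$gender, FUN = hot_deck)

students
```

The helper assumes every group has at least one observed value; a group with no donors would need a fallback, such as drawing from the entire sample.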

Mean Replacement

  • Replace each missing value with the mean of its group.
  • Drawback: Reduces sample variability and therefore underestimates standard errors.
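The drawback can be seen on a small example. The sketch below uses hypothetical toy data and a hypothetical `mean_fill()` helper; it fills each gap with the group mean and compares the naive standard deviation and standard error before and after.

```r
# Hypothetical toy data (not the study-hours data from the notes)
students <- data.frame(
  hours  = c(12, 10, 8, NA, 14, NA, 9, 11),
  gender = c("M", "F", "M", "F", "F", "M", "F", "M")
)

# Replace each missing value with the mean of the observed values in its group
mean_fill <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
students$hours_imp <- ave(students$hours, students$gender, FUN = mean_fill)

sd_observed <- sd(students$hours, na.rm = TRUE)
sd_imputed  <- sd(students$hours_imp)

# Naive standard errors: the imputed data look more precise than they are
se_observed <- sd_observed / sqrt(sum(!is.na(students$hours)))
se_imputed  <- sd_imputed / sqrt(nrow(students))

c(sd_observed = sd_observed, sd_imputed = sd_imputed,
  se_observed = se_observed, se_imputed = se_imputed)
```

Both the standard deviation and the standard error shrink after imputation, because the filled-in values sit exactly at their group means and the sample size in the denominator grows.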

Example: Student Study Hours Data

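| Method                        | Sample Size | Sample Mean | Std Dev | Std Error |
|:------------------------------|------------:|------------:|--------:|----------:|
| I. None                       |         130 |      11.400 |   6.790 |     0.569 |
| II. Random from entire sample |         135 |      11.526 |   6.762 |     0.555 |
| III. Random from females      |         135 |      11.689 |   6.930 |     0.548 |
| IV. Mean of females           |         135 |      11.436 |   6.665 |     0.537 |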

Adjusting Variances After Imputation

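  • Treating imputed values as if they were observed makes standard errors artificially small.
  • Adjust the variance to account for imputation error, where k is the number of imputed values and n is the sample size after imputation:

\[ \text{Adjusted Variance} = \text{Variance} \times \left(1 + 2 \frac{k}{n}\right) \]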

Example (Method III): with k = 5 imputed values and n = 135, the factor is 1 + 2(5/135) ≈ 1.07, so

\[ \text{Adjusted Variance} = (0.548^2) \times 1.07 \approx 0.3213 \]
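The same adjustment can be checked in R. The sketch below assumes k = 5 and n = 135, read off the sample sizes in the table (135 - 130 = 5 imputed values); `adjust_variance()` is a hypothetical helper name. Because the factor is not rounded to 1.07 here, the result comes out slightly above 0.3213.

```r
# k = number of imputed values, n = sample size after imputation
adjust_variance <- function(se, k, n) {
  se^2 * (1 + 2 * k / n)
}

# Method III: SE = 0.548, with k = 5 and n = 135 assumed from the table
adj_var <- adjust_variance(se = 0.548, k = 5, n = 135)
adj_se  <- sqrt(adj_var)  # adjusted standard error

round(c(adjusted_variance = adj_var, adjusted_se = adj_se), 4)
```

Taking the square root puts the adjusted value back on the standard-error scale, which makes it directly comparable with the unadjusted 0.548.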

Advanced Imputation: Regression and Nearest-Neighbor

  • Sequential Imputation:
    • Replace missing value with the last observed value.
    • Works well with ordered data but can lead to bias if missing values cluster.
  • Nearest-Neighbor:
    • Borrow values from the most similar complete cases (matched on variables such as GPA and gender).
  • Regression Imputation:
    • Fit a predictive model on the complete cases using other variables, then replace each missing value with its predicted value.
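The three approaches can be compared side by side in base R. This is only a sketch on hypothetical toy data: GPA is the single auxiliary variable, nearest-neighbor matching uses absolute GPA distance, and the regression is a simple linear model, all of which are illustrative choices.

```r
# Hypothetical toy data: study hours with GPA as an auxiliary variable
students <- data.frame(
  gpa   = c(3.1, 3.6, 2.8, 3.4, 3.9, 2.5, 3.0),
  hours = c(10, 14, 8, NA, 16, NA, 9)
)
miss <- is.na(students$hours)

# Sequential: carry the last observed value forward (in the current row order)
# A missing value in the first row would stay missing.
sequential <- students$hours
for (i in seq_along(sequential)) {
  if (is.na(sequential[i]) && i > 1) sequential[i] <- sequential[i - 1]
}

# Nearest-neighbor: borrow hours from the observed case with the closest GPA
donors  <- students[!miss, ]
nearest <- students$hours
for (i in which(miss)) {
  closest    <- which.min(abs(donors$gpa - students$gpa[i]))
  nearest[i] <- donors$hours[closest]
}

# Regression: predict hours from GPA with a model fit on the observed cases
fit <- lm(hours ~ gpa, data = students)  # lm() drops rows with NA by default
regression <- students$hours
regression[miss] <- predict(fit, newdata = students[miss, ])

data.frame(students, sequential, nearest, regression)
```

Sequential imputation depends entirely on row order, while the other two use the auxiliary variable, so the three columns can disagree noticeably for the same missing cell.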

Demonstrating Imputation with mice in R

#install.packages("mice")
library(mice)

# Example data with missing values
data <- data.frame(
  StudyHours = c(12, 10, 8, NA, 14, NA, 9),
  # Stored as a factor so that mice can use Gender as a predictor
  Gender = factor(c("Male", "Female", "Male", "Female", "Female", "Male", "Female"))
)
knitr::kable(data)

| StudyHours | Gender |
|-----------:|:-------|
|         12 | Male   |
|         10 | Female |
|          8 | Male   |
|         NA | Female |
|         14 | Female |
|         NA | Male   |
|          9 | Female |

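# Inspect missing data pattern
# Each row is one pattern (1 = observed, 0 = missing); the left margin counts
# the rows with that pattern and the bottom row counts missing values per variable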
md.pattern(data)

  Gender StudyHours  
5      1          1 0
2      1          0 1
       0          2 2

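# Apply multiple imputation: predictive mean matching, m = 5 imputed datasets
imputed_data <- mice(data, method = "pmm", m = 5, printFlag = FALSE)

# Extract the first of the five completed datasets
# (imputed values are random draws, so they vary from run to run)
completed_data <- complete(imputed_data)
knitr::kable(completed_data)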

| StudyHours | Gender |
|-----------:|:-------|
|         12 | Male   |
|         10 | Female |
|          8 | Male   |
|         12 | Female |
|         14 | Female |
|         12 | Male   |
|          9 | Female |
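
The table above shows only one of the m = 5 completed datasets. As a rough sketch of how the imputations differ, the code below recomputes the mean of StudyHours in each completed dataset. It rebuilds the example data so that it runs on its own; storing Gender as a factor and using `seed = 313` are assumptions made here for reproducibility, not part of the original run.

```r
library(mice)

# Same example data as above, with Gender stored as a factor
data <- data.frame(
  StudyHours = c(12, 10, 8, NA, 14, NA, 9),
  Gender = factor(c("Male", "Female", "Male", "Female", "Female", "Male", "Female"))
)

# seed = 313 is an arbitrary choice to make the run reproducible
imputed_data <- mice(data, method = "pmm", m = 5, printFlag = FALSE, seed = 313)

# Mean of StudyHours in each of the five completed datasets
means <- sapply(1:5, function(i) {
  mean(complete(imputed_data, action = i)$StudyHours)
})
means

# The spread across imputations reflects uncertainty about the missing values
range(means)
```

With predictive mean matching, every imputed value is copied from an observed donor, so each completed dataset stays within the observed range of 8 to 14 hours.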