Day 41

MATH 313: Survey Design and Sampling

Introduction to Bootstrap Methods

  • Overview of Traditional Estimation
    • Traditionally, estimators use variance formulas for estimation.
    • Standard errors are derived from these variances to construct confidence intervals, typically 95%.
  • Challenges with Traditional Methods
    • Bias in variance estimates can occur under complex survey conditions.
    • Difficulties arise in deriving closed-form expressions for variance in surveys involving clustering, stratification, and weighting.

Why Bootstrap?

  • Advantages of Bootstrap
    • Bootstrap methods do not rely on variance formulas, reducing potential biases.
    • These methods use resampling techniques to estimate the distribution of sample statistics directly.
  • Bootstrap Approach
    • Involves drawing repeated samples from the survey data with replacement.
    • Each resample is used to calculate estimates, building an empirical distribution of the estimator.

Implementing Bootstrap in R

  • Setting Up in R
    • Utilize the survey package to create a survey design object.
    • Convert the design object to a bootstrap design for resampling.
  • Bootstrap Computation
    • Generate bootstrap replicates of the statistic of interest.
    • Analyze the variability and stability of these estimates to gauge their reliability.

# Create a vector of size 10
obs <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)

# Define the number of bootstrap samples
n_bootstraps <- 50

# Generate bootstrap samples and compute their means
set.seed(123)  # For reproducibility
bootstrap_means <- replicate(n_bootstraps, {
  sample_data <- sample(obs, replace = TRUE)
  mean(sample_data)
})

# Calculate the original mean
original_mean <- mean(obs)
original_mean
[1] 27.5
bootstrap_means # collection of bootstrapped means
 [1] 29.0 36.5 32.0 32.0 30.5 21.5 30.5 31.5 28.0 30.0 30.5 32.5 29.5 28.0 28.5
[16] 36.5 37.0 35.5 37.0 31.0 21.0 31.5 29.0 14.5 31.5 24.0 28.0 21.5 30.5 31.0
[31] 21.5 19.0 30.0 22.0 36.0 34.5 25.5 31.5 29.0 33.5 22.5 21.5 27.5 19.5 31.5
[46] 27.5 31.0 35.0 34.5 25.5
mean(bootstrap_means) # mean of the bootstrapped means
[1] 28.97

Visualizing Bootstrap Results

  • Dot Plots
    • Use dot plots to visualize the distribution of bootstrap estimates.
    • Adjust dot size and stack direction to avoid overlap and improve clarity.
  • Density Curves
    • Overlay density curves on dot plots to highlight the overall distribution shape and central tendency.

library(survey)
data(api) 
design <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)
summary(design)
1 - level Cluster Sampling design (with replacement)
With (15) clusters.
svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02954 0.02954 0.02954 0.02954 0.02954 0.02954 
Data variables:
 [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
 [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
[13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
[19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
[25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
[31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
[37] "api.stu"  "fpc"      "pw"      

bootstrap_design <- as.svrepdesign(design, type = "bootstrap", replicates = 1000)
summary(bootstrap_design)
Call: as.svrepdesign.default(design, type = "bootstrap", replicates = 1000)
Survey bootstrap with 1000 replicates.
Variables: 
 [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
 [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
[13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
[19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
[25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
[31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
[37] "api.stu"  "fpc"      "pw"      
mean_estimate <- svymean(~api00, bootstrap_design)
mean_estimate
        mean     SE
api00 644.17 24.071

set.seed(113)
replicate_estimates <- replicate(1000, expr = {
    single_sample <- sample(1:nrow(apiclus1), replace = TRUE)
    svymean(~api00, svydesign(ids = ~dnum, weights = ~pw, data = apiclus1[single_sample, ]))
}, simplify = FALSE)

replicate_estimates[1:6] 
[[1]]
        mean     SE
api00 634.38 24.291

[[2]]
        mean     SE
api00 642.17 31.605

[[3]]
       mean     SE
api00 637.4 24.491

[[4]]
        mean     SE
api00 638.52 20.391

[[5]]
        mean     SE
api00 638.14 24.211

[[6]]
        mean     SE
api00 647.31 23.668

# Convert the list of estimates to a vector
estimate_values <- sapply(replicate_estimates, function(est) coef(est))
estimate_values %>% head(100)
   api00    api00    api00    api00    api00    api00    api00    api00 
634.3825 642.1694 637.4044 638.5246 638.1366 647.3115 647.3661 639.8470 
   api00    api00    api00    api00    api00    api00    api00    api00 
648.5410 632.2568 642.6393 638.1530 644.0874 638.0820 643.3005 636.0000 
   api00    api00    api00    api00    api00    api00    api00    api00 
647.8142 636.3005 643.9344 646.3224 639.5847 645.2842 657.2678 640.7158 
   api00    api00    api00    api00    api00    api00    api00    api00 
656.1694 643.5847 642.3607 642.7760 649.5137 653.8361 656.1967 643.5301 
   api00    api00    api00    api00    api00    api00    api00    api00 
645.5956 641.1585 652.3661 647.2131 636.3989 650.8087 648.2350 647.2404 
   api00    api00    api00    api00    api00    api00    api00    api00 
643.6120 645.6721 639.9016 637.1694 647.3333 643.8689 640.3060 642.0601 
   api00    api00    api00    api00    api00    api00    api00    api00 
645.7486 642.7814 651.0765 638.4645 647.2514 653.8470 638.2404 645.8689 
   api00    api00    api00    api00    api00    api00    api00    api00 
645.3224 634.3169 645.5246 649.1913 648.2077 649.2842 644.6230 644.6503 
   api00    api00    api00    api00    api00    api00    api00    api00 
644.8579 634.7268 645.9727 644.9836 663.5902 642.2568 640.4372 631.4372 
   api00    api00    api00    api00    api00    api00    api00    api00 
641.8197 624.6503 647.1585 636.7705 652.3825 636.6995 638.4317 649.0710 
   api00    api00    api00    api00    api00    api00    api00    api00 
641.4426 634.6557 637.2240 638.2678 643.2623 636.7322 645.6885 647.5137 
   api00    api00    api00    api00    api00    api00    api00    api00 
631.7650 640.7213 643.8962 647.0492 634.2896 648.8579 641.0000 635.7049 
   api00    api00    api00    api00 
653.0164 643.6940 645.6612 640.1421 

# Calculate mean and SE
bootstrap_mean <- mean(estimate_values)
bootstrap_se <- sd(estimate_values) 

# Plotting the bootstrap estimates
library(ggplot2)
ggplot(data = data.frame(estimate_values), aes(x = estimate_values)) +
    geom_dotplot(binwidth = 0.8, stackdir = "up", dotsize = 0.5, fill = "maroon", col = "gold") +
    geom_density(aes(y = ..scaled..), fill = "lightblue", alpha = 0.2, adjust = 1) +  
    geom_vline(xintercept = bootstrap_mean, color = "blue", linetype = "dashed") +
    annotate("text", x = bootstrap_mean + 5, y = 0.8, label = sprintf("Mean = %.2f\nSE = %.2f",
bootstrap_mean, bootstrap_se), color = "blue") +
    xlab("Bootstrap Estimates") +
    ggtitle("Distribution of API Score Estimates")

The American Housing Survey tracks housing characteristics in the U.S., including ownership costs and house values across 47 metropolitan statistical areas (MSAs). The 2002 survey sampled 13 MSAs, providing data on typical monthly ownership costs and house values for 2002 and 1994.

  1. Mean Value 2002: Simulate a bootstrap confidence interval for the mean typical house value in 2002.
  2. Median Value 2002: Simulate a bootstrap confidence interval for the median typical house value in 2002.
  3. Ratio of Mean Values 2002 to 1994: Simulate a bootstrap confidence interval for the ratio of mean house values from 2002 compared to 1994.