

Linear Model selection¶

Intro¶

Recently I was involved in a study that I collaborate with people from the program of international nutrition who have designated specialty on epidemiology and public health. Epidemiologists often attempt to answer questions such as, "does a treatment cause a significant effect on particular outcome in particular population?", "is a particular variable different in subgroups of a population?", or "is a particular variable associated with another in a population?" Scientists often perform observational studies over hundreds or thousands of people through a long period.

The study that I was involved in, was an interventional study done in Ghana. In this study, pregnant women were given either of two nutritional supplement IFA and LNS until infants were 6 months of age. Infants were then received the same supplementation from 6 to 18 month. In this particular case, we want to answer the question that whether the two supplementations have different effect on infants growth at 18 month. Thus the main effect model would be as following.

$y\ =\ \beta_0\ +\ \beta_1 x$

Where $y$ is any outcome variable of interest and $t$ is the treatment.

Observational studies on human subjects often have large variances. If those variances were not controlled well, target effect of interest might be masked and you may make type II error. Thus an anjusted model is usually used additional to the unadjusted model shown above, as following.

$y\ =\ \beta_0\ +\ \beta_1 x\ +\ \gamma_1 z_1\ +\ \gamma_2 z_2\ +\ ...\ +\ \gamma_n z_n$

Here I will use a subset of the data to demonstrate how to do model selection for covariates in R.

Data¶

A subset of the data is included in a package called mlms posted on my github account. The dataset has 80 observations of infants with their growth outcome variables at 18 months of age, treatment group, and mother's baseline characteristics at enrollment.

devtools::install_github("zhuchcn/mlms")
library(mlms)
data(growth）
head(growth)

#>    waz   laz   hcz treatment  momht mbhb      asset1 gaatdel totschyrs sex_updated
#> 1 -3.00 -1.45 -2.35       IFA 160.35  112  0.09214962    38.6         3           0
#> 2 -0.88 -0.45 -1.65       IFA 170.05  127  1.30253065    39.3        12           0
#> 3 -0.34 -1.20 -0.21       IFA 154.65  118  0.85202587    36.3         6           0
#> 4 -0.86 -0.38 -1.13       LNS 162.25  125  0.59085160    37.8         9           0
#> 5 -1.82 -1.87 -2.19       LNS 158.60  118 -1.59396219    38.6         4           1
#> 6 -0.40 -0.99 -0.42       IFA 158.70  108 -0.95266008    41.3         6           1

The first three columns are three growth outcome variables WAZ (weight for age z-score), LAZ (length for age z-score), and HCZ (head circumference z-score). The forth column is the nutritional supplement treatment, either IFA or LNS. Column 5 to 10 are mother's baseline characteristics. The description of each column can be found from the data's documentation using the following command.

?growth

Unadjusted model¶

We are now fitting an unadjusted model between treatment and the laz. We are doing so with the lm function.

fit1 = lm(waz ~ treatment + 1, data = growth)
fit1

#> Call:
#> lm(formula = waz ~ treatment + 1, data = growth)
#> 
#> Coefficients:
#>  (Intercept)  treatmentLNS  
#>      -0.9382        0.2438

To access the statistics of the linear model result, we use the summary function.

sum_fit1 = summary(fit1)
sum_fit1

#> Call:
#> lm(formula = laz ~ treatment + 1, data = growth)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.98775 -0.58025 -0.08275  0.69225  2.49225 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -0.9723     0.1606  -6.055 4.63e-08 ***
#> treatmentLNS   0.3400     0.2271   1.497    0.138    
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 1.016 on 78 degrees of freedom
#> Multiple R-squared:  0.02794,    Adjusted R-squared:  0.01548 
#> F-statistic: 2.242 on 1 and 78 DF,  p-value: 0.1384

As shown above, the p value is 0.138 in this model.

If you want to only access the statistic part.

sum_fit1$coefficients

#>              Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept)  -0.97225  0.1605677 -6.055078 4.626456e-08
#> treatmentLNS  0.34000  0.2270770  1.497289 1.383539e-01

Adjusted model¶

We are next going to fit an adjusted model variables in column 5-10 as potential covariates. We are first going to test the correlation between the covariates and the outcome variable laz using cor.test function. In the code chunk below, we loop through column 5 to 10 and apply the cor.test between the laz and the covariate, and only output the p value.

sapply(growth[5:10], function(covar){
    cor.test(growth$laz, covar)$p.value
})

#>      momht        mbhb      asset1     gaatdel   totschyrs sex_updated 
#> 0.00192442  0.70791953  0.24337094  0.04841732  0.78633435  0.12737051

From the result above we see the two variables momht (mother's height) and gaatdel (gestational age at enrollment) are correlated with laz (p < 0.1), thus they should be included in the complete model.

We now can fit a complete model.

fit2 = lm(laz ~ treatment + momht + gaatdel + 1, data = growth)

#> Call:
#> lm(formula = laz ~ treatment + momht + gaatdel + 1, data = growth)
#> 
#> Coefficients:
#>  (Intercept)  treatmentLNS         momht       gaatdel  
#>    -12.78807       0.17006       0.05754       0.06975

And get the statistics using summary

sum_fit2 = summary(fit2)
sum_fit2

Call:
lm(formula = laz ~ treatment + momht + gaatdel + 1, data = growth)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5506 -0.6183 -0.1242  0.5209  2.3776 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -12.78807    3.70382  -3.453 0.000933 ***
treatmentLNS   0.17006    0.23103   0.736 0.464067    
momht          0.05754    0.02130   2.702 0.008590 ** 
gaatdel        0.06975    0.05797   1.203 0.232821    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9793 on 72 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.1481,    Adjusted R-squared:  0.1126 
F-statistic: 4.171 on 3 and 72 DF,  p-value: 0.008817

We can see that momht is unfortunately significant, and there isn't a significant treatment effect (p = 0.464).

mlms¶

Above I demonstrated how to perform model selection for a covariates adjusted model for a single outcome variable. One can simply write some loops and generalize this idea to multiple variables. The mlms is an R package with the model selection implemented that allows users to perform model selection for multiple outcome variables quickly as below.

# A data.frame for the outcome variables
X = growth[,1:3]
# The design matrix for the primary effect
design = model.matrix(~ treatment, data = growth)
# A data.frame for the covariates
Z = growth[,5:10]
# the coefficient for the major effect. This must be one of the column
# names in design
coef = "treatmentLNS"
fit = fit_mlms(X, design, Z, coef)
summary(fit)

#> $bivars
#>          momht       mbhb    asset1    gaatdel totschyrs sex_updated
#> waz 0.16195149 0.73601858 0.3755778 0.59073138 0.8681225   0.3665692
#> laz 0.00192442 0.70791953 0.2433709 0.04841732 0.7863344   0.1273705
#> hcz 0.43447850 0.09449692 0.3143394 0.75104026 0.2396203   0.3651060
#> 
#> $covars
#>           momht       mbhb asset1   gaatdel totschyrs sex_updated
#> waz          NA         NA     NA        NA        NA          NA
#> laz 0.008589524         NA     NA 0.2328213        NA          NA
#> hcz          NA 0.09057698     NA        NA        NA          NA
#> 
#> $models
#>   var      model   Estimate Std. Error   CI 2.5 %  CI 97.5 % df   t value  Pr(>|t|)
#> 1 waz unadjusted 0.24375000  0.2353540 -0.2248039  0.7123039 78 1.0356738 0.3035546
#> 2 waz   adjusted 0.24375000  0.2353540 -0.2248039  0.7123039 78 1.0356738 0.3035546
#> 3 laz unadjusted 0.34000000  0.2270770 -0.1120756  0.7920756 78 1.4972894 0.1383539
#> 4 laz   adjusted 0.17005553  0.2310253 -0.2904851  0.6305961 72 0.7360904 0.4640671
#> 5 hcz unadjusted 0.07858974  0.2159208 -0.3513634  0.5085429 77 0.3639749 0.7168743
#> 6 hcz   adjusted 0.10390902  0.2137643 -0.3218394  0.5296574 76 0.4860916 0.6283009
#> 
#> attr(,"class")
#> [1] "summary.mlms"