Linear Model selection¶
Intro¶
Recently I was involved in a study that I collaborate with people from the program of international nutrition who have designated specialty on epidemiology and public health. Epidemiologists often attempt to answer questions such as, "does a treatment cause a significant effect on particular outcome in particular population?", "is a particular variable different in subgroups of a population?", or "is a particular variable associated with another in a population?" Scientists often perform observational studies over hundreds or thousands of people through a long period.
The study that I was involved in, was an interventional study done in Ghana. In this study, pregnant women were given either of two nutritional supplement IFA and LNS until infants were 6 months of age. Infants were then received the same supplementation from 6 to 18 month. In this particular case, we want to answer the question that whether the two supplementations have different effect on infants growth at 18 month. Thus the main effect model would be as following.
Where y is any outcome variable of interest and t is the treatment.
Observational studies on human subjects often have large variances. If those variances were not controlled well, target effect of interest might be masked and you may make type II error. Thus an anjusted model is usually used additional to the unadjusted model shown above, as following.
Here I will use a subset of the data to demonstrate how to do model selection for covariates in R.
Data¶
A subset of the data is included in a package called mlms posted on my github account. The dataset has 80 observations of infants with their growth outcome variables at 18 months of age, treatment group, and mother's baseline characteristics at enrollment.
devtools::install_github("zhuchcn/mlms") library(mlms) data(growth) head(growth)
#> waz laz hcz treatment momht mbhb asset1 gaatdel totschyrs sex_updated #> 1 -3.00 -1.45 -2.35 IFA 160.35 112 0.09214962 38.6 3 0 #> 2 -0.88 -0.45 -1.65 IFA 170.05 127 1.30253065 39.3 12 0 #> 3 -0.34 -1.20 -0.21 IFA 154.65 118 0.85202587 36.3 6 0 #> 4 -0.86 -0.38 -1.13 LNS 162.25 125 0.59085160 37.8 9 0 #> 5 -1.82 -1.87 -2.19 LNS 158.60 118 -1.59396219 38.6 4 1 #> 6 -0.40 -0.99 -0.42 IFA 158.70 108 -0.95266008 41.3 6 1
The first three columns are three growth outcome variables WAZ (weight for age z-score), LAZ (length for age z-score), and HCZ (head circumference z-score). The forth column is the nutritional supplement treatment, either IFA or LNS. Column 5 to 10 are mother's baseline characteristics. The description of each column can be found from the data's documentation using the following command.
?growth
Unadjusted model¶
We are now fitting an unadjusted model between treatment and the laz
. We are doing so with the lm
function.
fit1 = lm(waz ~ treatment + 1, data = growth) fit1
#> Call: #> lm(formula = waz ~ treatment + 1, data = growth) #> #> Coefficients: #> (Intercept) treatmentLNS #> -0.9382 0.2438
To access the statistics of the linear model result, we use the summary
function.
sum_fit1 = summary(fit1) sum_fit1
#> Call: #> lm(formula = laz ~ treatment + 1, data = growth) #> #> Residuals: #> Min 1Q Median 3Q Max #> -2.98775 -0.58025 -0.08275 0.69225 2.49225 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -0.9723 0.1606 -6.055 4.63e-08 *** #> treatmentLNS 0.3400 0.2271 1.497 0.138 #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> Residual standard error: 1.016 on 78 degrees of freedom #> Multiple R-squared: 0.02794, Adjusted R-squared: 0.01548 #> F-statistic: 2.242 on 1 and 78 DF, p-value: 0.1384
As shown above, the p value is 0.138 in this model.
If you want to only access the statistic part.
sum_fit1$coefficients
#> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -0.97225 0.1605677 -6.055078 4.626456e-08 #> treatmentLNS 0.34000 0.2270770 1.497289 1.383539e-01
Adjusted model¶
We are next going to fit an adjusted model variables in column 5-10 as potential covariates. We are first going to test the correlation between the covariates and the outcome variable laz
using cor.test
function. In the code chunk below, we loop through column 5 to 10 and apply the cor.test
between the laz
and the covariate, and only output the p value.
sapply(growth[5:10], function(covar){ cor.test(growth$laz, covar)$p.value })
#> momht mbhb asset1 gaatdel totschyrs sex_updated #> 0.00192442 0.70791953 0.24337094 0.04841732 0.78633435 0.12737051
From the result above we see the two variables momht
(mother's height) and gaatdel
(gestational age at enrollment) are correlated with laz
(p < 0.1), thus they should be included in the complete model.
We now can fit a complete model.
fit2 = lm(laz ~ treatment + momht + gaatdel + 1, data = growth)
#> Call: #> lm(formula = laz ~ treatment + momht + gaatdel + 1, data = growth) #> #> Coefficients: #> (Intercept) treatmentLNS momht gaatdel #> -12.78807 0.17006 0.05754 0.06975
And get the statistics using summary
sum_fit2 = summary(fit2) sum_fit2
Call: lm(formula = laz ~ treatment + momht + gaatdel + 1, data = growth) Residuals: Min 1Q Median 3Q Max -2.5506 -0.6183 -0.1242 0.5209 2.3776 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -12.78807 3.70382 -3.453 0.000933 *** treatmentLNS 0.17006 0.23103 0.736 0.464067 momht 0.05754 0.02130 2.702 0.008590 ** gaatdel 0.06975 0.05797 1.203 0.232821 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.9793 on 72 degrees of freedom (4 observations deleted due to missingness) Multiple R-squared: 0.1481, Adjusted R-squared: 0.1126 F-statistic: 4.171 on 3 and 72 DF, p-value: 0.008817
We can see that momht
is unfortunately significant, and there isn't a significant treatment effect (p = 0.464).
mlms¶
Above I demonstrated how to perform model selection for a covariates adjusted model for a single outcome variable. One can simply write some loops and generalize this idea to multiple variables. The mlms is an R package with the model selection implemented that allows users to perform model selection for multiple outcome variables quickly as below.
# A data.frame for the outcome variables X = growth[,1:3] # The design matrix for the primary effect design = model.matrix(~ treatment, data = growth) # A data.frame for the covariates Z = growth[,5:10] # the coefficient for the major effect. This must be one of the column # names in design coef = "treatmentLNS" fit = fit_mlms(X, design, Z, coef) summary(fit)
#> $bivars #> momht mbhb asset1 gaatdel totschyrs sex_updated #> waz 0.16195149 0.73601858 0.3755778 0.59073138 0.8681225 0.3665692 #> laz 0.00192442 0.70791953 0.2433709 0.04841732 0.7863344 0.1273705 #> hcz 0.43447850 0.09449692 0.3143394 0.75104026 0.2396203 0.3651060 #> #> $covars #> momht mbhb asset1 gaatdel totschyrs sex_updated #> waz NA NA NA NA NA NA #> laz 0.008589524 NA NA 0.2328213 NA NA #> hcz NA 0.09057698 NA NA NA NA #> #> $models #> var model Estimate Std. Error CI 2.5 % CI 97.5 % df t value Pr(>|t|) #> 1 waz unadjusted 0.24375000 0.2353540 -0.2248039 0.7123039 78 1.0356738 0.3035546 #> 2 waz adjusted 0.24375000 0.2353540 -0.2248039 0.7123039 78 1.0356738 0.3035546 #> 3 laz unadjusted 0.34000000 0.2270770 -0.1120756 0.7920756 78 1.4972894 0.1383539 #> 4 laz adjusted 0.17005553 0.2310253 -0.2904851 0.6305961 72 0.7360904 0.4640671 #> 5 hcz unadjusted 0.07858974 0.2159208 -0.3513634 0.5085429 77 0.3639749 0.7168743 #> 6 hcz adjusted 0.10390902 0.2137643 -0.3218394 0.5296574 76 0.4860916 0.6283009 #> #> attr(,"class") #> [1] "summary.mlms"