---
title: "Chapter 5. Multivariate Linear Models"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 500)
library(glue)
library(rethinking)
library(patchwork)
library(tidyverse)
library(conflicted)
conflict_prefer("filter", "dplyr")
theme_set(theme_minimal())
source_scripts()
set.seed(0)
```
- correlation is very common in the real world
- *multivariate regression* is using more than one predictor variable to model an outcome
- reasons to use multivariate regression:
* a way of "controlling" for confounding variables
* multiple causation
* interaction between variables
- this chapter focuses on two things multivariate models can help with:
* revealing spurious correlations
* revealing important correlations masked by hidden correlations of other variables
- this chapter will also discuss:
* multicollinearity
* categorical variables
## 5.1 Spurious association
- example: correlation between divorce rate and marriage rate
* need to be married to get divorced
* perhaps higher rates of marriage indicate that marriage is more important, leading to fewer divorces
* another predictor: median age at marriage
- we can fit a model of median age predicting divorce rate
* this is the same as in the previous chapter
* $D_i$: divorce rate for state $i$; $A_i$: median age at marriage in state $i$
$$
D_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_A A_i \\
\alpha \sim \text{Normal}(10, 10) \\
\beta_A \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
```{r}
# Load data.
data("WaffleDivorce")
d <- WaffleDivorce
# Standardize predictor.
d$MedianAgeMarriage_std <- (d$MedianAgeMarriage - mean(d$MedianAgeMarriage)) / sd(d$MedianAgeMarriage)
m5_1 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bA * MedianAgeMarriage_std,
a ~ dnorm(10, 10),
bA ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_1)
```
```{r}
mam_seq <- seq(-3, 3.5, length.out = 30)
mu <- link(m5_1, data = data.frame(MedianAgeMarriage_std = mam_seq))
mu_map <- apply(mu, 2, chainmode)
m5_1_pred <- apply(mu, 2, PI) %>%
t() %>%
as.data.frame() %>%
set_names(c("pi_5", "pi_94")) %>%
as_tibble() %>%
mutate(mam_std = mam_seq,
mu_map = mu_map)
d %>%
ggplot() +
geom_point(aes(x = MedianAgeMarriage_std, y = Divorce)) +
geom_ribbon(data = m5_1_pred,
aes(x = mam_std, ymin = pi_5, ymax = pi_94),
fill = "black", alpha = 0.2, color = NA) +
geom_line(data = m5_1_pred,
aes(x = mam_std, y = mu_map),
color = "blue", alpha = 0.6, lty = 2, size = 1.3)
```
- and we can model the divorce rate on the number of marriages in a state:
* $R_i$: rate of marriage in state $i$
$$
D_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_R R_i \\
\alpha \sim \text{Normal}(10, 10) \\
\beta_R \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
```{r}
# Standardize predictor.
d$Marriage_std <- (d$Marriage - mean(d$Marriage)) / sd(d$Marriage)
m5_2 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bR * Marriage_std,
a ~ dnorm(10, 10),
bR ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_2)
```
- but separate bivariate models cannot tell us which variable is more important or whether their effects cancel each other out
- the question we want to answer: *"What is the predictive value of a variable, once I already know all of the other predictor variables?"*
* after I know the marriage rate, what additional value is there in also knowing the age at marriage?
* after I know the age at marriage, what additional value is there in also knowing the marriage rate?
### 5.1.1 Multivariate notation
- the strategy for building a multivariate model:
1. nominate the predictor variables you want in the linear model of the mean
2. for each predictor, make a parameter that will measure its association with the outcome
3. multiply the parameter by the variable and add that term to the linear model
- the formula for our multivariate model example on divorce rate:
$$
D_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_R R_i + \beta_A A_i \\
\alpha \sim \text{Normal}(10, 10) \\
\beta_A \sim \text{Normal}(0, 1) \\
\beta_R \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
- what does $\mu_i = \alpha + \beta_R R_i + \beta_A A_i$ mean:
* the expected outcome for any state with marriage rate $R_i$ and median age at marriage $A_i$ is the sum of three independent terms
* $\alpha$ is a constant that every state gets
* $\beta_R R_i$ is the marriage rate multiplied against a coefficient $\beta_R$ that measures the association between the marriage rate and divorce rate
* $\beta_A A_i$ is similar to the second term, but for the association with median age at marriage
### 5.1.2 Fitting the model
- we can use the quadratic approximation to fit the model
```{r}
m5_3 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bR * Marriage_std + bA * MedianAgeMarriage_std,
a ~ dnorm(10, 10),
bA ~ dnorm(0, 1),
bR ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_3)
```
```{r}
plot(summary(m5_3))
```
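- as a quick numeric illustration of the linear model of the mean (not in the original notes), we can plug the posterior means into $\mu_i = \alpha + \beta_R R_i + \beta_A A_i$ for a hypothetical state
```{r}
# Sketch: expected divorce rate for a hypothetical state with a marriage rate
# 1 SD above average and a median age at marriage 1 SD below average,
# using the posterior means of m5_3.
coefs <- coef(m5_3)
coefs["a"] + coefs["bR"] * 1 + coefs["bA"] * (-1)
```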
- now, the coefficient for the marriage rate predictor is about zero, while the coefficient for median age is confidently below zero
- we can interpret these to mean: *"Once we know the median age at marriage for a state, there is little additional predictive power in also knowing the rate of marriage in that state."*
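- a compact way to see this shift is to plot the coefficient estimates from all three models side by side (a sketch using rethinking's `coeftab()`; this chunk is not in the original notes)
```{r}
# Sketch: bA barely moves between m5_1 and the multivariate m5_3,
# while bR collapses toward zero once bA is in the model.
plot(coeftab(m5_1, m5_2, m5_3), par = c("bA", "bR"))
```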
- we can make some plots to investigate how the model came to this conclusion
### 5.1.3 Plotting the multivariate posteriors
- we will use three types of interpretive plots
1. *predictor residual plots*: show the outcome against residual predictor values
2. *counterfactual plots*: show the implied predictions for imaginary experiments in which the different predictor variables can be changed independently of one another
3. *posterior prediction plots*: show model-based predictions against raw data, or otherwise display the error in prediction
#### 5.1.3.1 Predictor residual plots
- *predictor variable residual*: the average prediction error when using all other predictor variables to model a predictor of interest
* plotting this against the outcome shows something like a bivariate regression that has already been "controlled" for all of the other predictors
* it leaves the variation not expected by the model of the mean of the other predictors
- this is best illustrated by an example:
* we will model the marriage rate using the median age at marriage
$$
R_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta A_i \\
\alpha \sim \text{Normal}(0, 10) \\
\beta \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
- since we are using standardized variables, $\alpha$ should be zero
```{r}
m5_4 <- quap(
alist(
Marriage_std ~ dnorm(mu, sigma),
mu <- a + b*MedianAgeMarriage_std,
a ~ dnorm(0, 10),
b ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_4)
```
- we then compute the residuals by subtracting the predicted marriage rate (given the median age) from the observed rate in each state
* a positive residual means the observed rate was greater than that expected given the median age in that state
```{r}
mu <- coef(m5_4)["a"] + coef(m5_4)["b"] * d$MedianAgeMarriage_std
m_resid <- d$Marriage_std - mu
str(m_resid)
```
```{r}
d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot() +
geom_linerange(aes(x = MedianAgeMarriage_std,
ymin = mu, ymax = resid_diff),
size = 0.8, color = "grey60") +
geom_point(aes(x = MedianAgeMarriage_std, y = Marriage_std),
color = "black", size = 2) +
geom_line(aes(x = MedianAgeMarriage_std, y = mu),
color = "tomato", size = 1.3, alpha = 0.7) +
labs(title = "Residual marriage rate estimated using the median age at marriage",
subtitle = "The red line is the estimate, and the vertical lines are the residuals")
```
- we can then plot these residuals against the divorce rate
* this is the linear relationship between divorce and marriage rates after "controlling" for median age
```{r}
d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot() +
geom_point(aes(x = resid, y = Divorce),
color = "black", size = 2) +
geom_vline(xintercept = 0, lty = 2, color = "dodgerblue", size = 0.9) +
labs(x = "residual marriage rate",
title = "Residual marriage rate and Divorce",
subtitle = "The linear relationship of marriage and divorce rates after correcting for median age at marriage")
```
- we can do the same calculation in the other direction: find the residual of the median age modeled on the rate
```{r}
m5_4_2 <- quap(
alist(
MedianAgeMarriage_std ~ dnorm(mu, sigma),
mu <- a + b*Marriage_std,
a ~ dnorm(0, 10),
b ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_4_2)
```
```{r}
mu <- coef(m5_4_2)["a"] + coef(m5_4_2)["b"] * d$Marriage_std
m_resid <- d$MedianAgeMarriage_std - mu
p1 <- d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot(aes(x = Marriage_std)) +
geom_linerange(aes(ymin = mu, ymax = resid_diff),
size = 0.8, color = "grey60") +
geom_point(aes(y = MedianAgeMarriage_std),
color = "black", size = 2) +
geom_line(aes(y = mu),
color = "tomato", size = 1.3, alpha = 0.7) +
labs(title = "Residual median age estimated\nusing the marriage rate",
subtitle = "The red line is the estimate,\nand the vertical lines are the residuals")
p2 <- d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot() +
geom_point(aes(x = resid, y = Divorce),
color = "black", size = 2) +
geom_vline(xintercept = 0, lty = 2, color = "dodgerblue", size = 0.9) +
labs(x = "residual median age",
title = "Residual median age and Divorce",
subtitle = "The linear relationship of median age\nand divorce after correcting for marriage rate")
p1 | p2
```
- the negative slope of the residual median age vs. Divorce (on the right in the above plot) indicates that the median age contains information even after adjusting for marriage rate
#### 5.1.3.2 Counterfactual plots
- this plot displays the *implied* predictions of the model
* we can make predictions for inputs that were never seen or are technically impossible
* e.g.: a high marriage rate and high median age
- the simplest counterfactual plot is to see how predictions change while changing only one predictor
- we will draw two counterfactual plots, one for each predictor
```{r}
# Make new "data" while holding median age constant.
A_avg <- mean(d$MedianAgeMarriage_std)
R_seq <- seq(-3, 3, length.out = 30)
pred_data <- tibble(Marriage_std = R_seq,
MedianAgeMarriage_std = A_avg)
# Compute the counterfactual mu values.
mu <- link(m5_3, data = pred_data)
mu_mean <- apply(mu, 2, mean)
mu_pi <- apply(mu, 2, PI) %>%
pi_to_df() %>%
set_names(c("mu_5_pi", "mu_94_pi"))
# Simulate counterfactual divorce outcomes
R_sim <- sim(m5_3, data = pred_data, n = 1e4)
R_pi <- apply(R_sim, 2, PI) %>%
pi_to_df() %>%
set_names(c("Rsim_5_pi", "Rsim_94_pi"))
R_counterfactual <- pred_data %>%
mutate(mu = mu_mean) %>%
bind_cols(mu_pi, R_pi)
R_counterfactual %>%
ggplot(aes(x = Marriage_std)) +
geom_ribbon(aes(ymin = Rsim_5_pi, ymax = Rsim_94_pi),
color = NA, fill = "black", alpha = 0.2) +
geom_ribbon(aes(ymin = mu_5_pi, ymax = mu_94_pi),
color = NA, fill = "black", alpha = 0.4) +
geom_line(aes(y = mu), color = "black", size = 1.4) +
labs(x = "Marriage_std",
y = "Divorce",
title = "Counterfactual holding median age constant",
subtitle = "The line is the mean divorce rate over marriage rate, holding age constant.
The inner ribbon (darker) is the 89% PI for the mean over marriage rate.
The outer ribbon is the 89% PI of the simulated divorce rates.")
```
```{r}
# Make new "data" while holding marriage rate constant.
R_avg <- mean(d$Marriage_std)
A_seq <- seq(-3, 3, length.out = 30)
pred_data <- tibble(Marriage_std = R_avg,
MedianAgeMarriage_std = A_seq)
# Compute the counterfactual mu values.
mu <- link(m5_3, data = pred_data)
mu_mean <- apply(mu, 2, mean)
mu_pi <- apply(mu, 2, PI) %>%
pi_to_df() %>%
set_names(c("mu_5_pi", "mu_94_pi"))
# Simulate counterfactual divorce outcomes
A_sim <- sim(m5_3, data = pred_data, n = 1e4)
A_pi <- apply(A_sim, 2, PI) %>%
pi_to_df() %>%
set_names(c("Asim_5_pi", "Asim_94_pi"))
A_counterfactual <- pred_data %>%
mutate(mu = mu_mean) %>%
bind_cols(mu_pi, A_pi)
A_counterfactual %>%
ggplot(aes(x = MedianAgeMarriage_std)) +
geom_ribbon(aes(ymin = Asim_5_pi, ymax = Asim_94_pi),
color = NA, fill = "black", alpha = 0.2) +
geom_ribbon(aes(ymin = mu_5_pi, ymax = mu_94_pi),
color = NA, fill = "black", alpha = 0.4) +
geom_line(aes(y = mu), color = "black", size = 1.4) +
labs(x = "MedianAgeMarriage_std",
y = "Divorce",
title = "Counterfactual holding marriage rate constant",
subtitle = "The line is the mean divorce rate over median age, holding marriage rate constant.
The inner ribbon (darker) is the 89% PI for the mean over median age.
The outer ribbon is the 89% PI of the simulated divorce rates.")
```
- these show:
* that changing the marriage rate while holding the median age constant does not cause much change in the predicted divorce rate
* but that changing the median age while holding the marriage rate constant does predict substantial changes in the divorce rate
- counterfactual plots are contentious because of their small-world nature
#### 5.1.3.3 Posterior prediction plots
- it is important to check the model fit against the observed data:
1. Did the model fit correctly?
2. How does the model fail?
* all models fail in some way; important to find the limitations of the predictions
- begin by simulating predictions, averaging over the posterior
```{r}
# Simulate mu using the original data (by not providing new data)
mu <- link(m5_3)
mu_mean <- apply(mu, 2, mean)
mu_pi <- apply(mu, 2, PI)
# Simulate divorce rates using the original data (by not providing new data)
divorce_sim <- sim(m5_3, n = 1e4)
divorce_pi <- apply(divorce_sim, 2, PI)
```
- for multivariate models, there are many ways to display these simulations
- first, we can plot the predictions against the observed
* the dashed line represents perfect correlation between the predicted and observed
* we see that the model under-predicts for states with high divorce rates and over-predicts for those with low rates
```{r}
tibble(mu_mean = mu_mean,
obs_divorce = d$Divorce,
state = as.character(d$Loc)) %>%
mutate(label = ifelse(state %in% c("ID", "UT"), state, NA_character_)) %>%
bind_cols(pi_to_df(mu_pi)) %>%
ggplot(aes(x = obs_divorce, y = mu_mean)) +
geom_pointrange(aes(ymin = x5_percent, ymax = x94_percent)) +
geom_abline(slope = 1, intercept = 0, lty = 2, alpha = 0.6) +
ggrepel::geom_text_repel(aes(label = label), family = "Arial") +
labs(x = "observed divorce",
y = "predicted divorce")
```
- we can also show the residual of the predictions (the prediction error)
* we can arrange this plot from least to most prediction error
```{r, fig.height=4, fig.width=3}
tibble(divorce = d$Divorce,
mu_mean = mu_mean,
state = as.character(d$Loc)) %>%
mutate(divorce_resid = divorce - mu_mean,
state = fct_reorder(state, divorce_resid)) %>%
bind_cols(pi_to_df(mu_pi)) %>%
mutate(x5_percent = divorce - x5_percent,
x94_percent = divorce - x94_percent) %>%
ggplot(aes(x = divorce_resid, y = state)) +
geom_pointrange(aes(xmin = x5_percent, xmax = x94_percent)) +
labs(x = "residual divorce", y = "state")
```
- a third type of plot is to compare the divorce residual against new predictors
* this can show if there are additional predictors that add information
* in the following example, I compare the divorce residual against the number of Waffle Houses per capita in the state
```{r}
label_states <- c("ME", "AR", "AL", "MS", "GA", "SC", "ID", "UT")
tibble(divorce = d$Divorce,
mu_mean = mu_mean,
waffle_houses = d$WaffleHouses,
population = d$Population,
state = as.character(d$Loc)) %>%
mutate(divorce_resid = divorce - mu_mean,
waffles_per_capita = waffle_houses / population,
label = ifelse(state %in% label_states, state, NA_character_)) %>%
bind_cols(pi_to_df(mu_pi)) %>%
ggplot(aes(x = waffles_per_capita, y = divorce_resid)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ x") +
ggrepel::geom_text_repel(aes(label = label), family = "Arial") +
labs(x = "Waffle Houses per capita",
y = "residual divorce")
```
## 5.2 Masked relationship
- another reason to use a multivariate model is to identify influences of multiple factors when none of them show an individual bivariate relationship with the outcome
- for this section, we will use a new data set: the composition of milk across primate species
* for now, we will use:
- `kcal_per_g`: kilocalories of energy per gram of milk (kcal/g)
- `mass`: average female body mass (in kg)
- `neocortex_perc`: percent of total brain mass that is the neocortex
```{r}
data(milk)
d <- janitor::clean_names(milk)
str(d)
```
- the following plots provide a little information about these columns - this was not included in the course
```{r}
d %>%
select(clade, kcal_per_g, mass, neocortex_perc) %>%
pivot_longer(-clade, names_to = "name", values_to = "value") %>%
ggplot(aes(x = clade, y = value, color = clade)) +
facet_wrap(~ name, nrow = 1, scales = "free") +
geom_jitter() +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "none")
```
```{r}
d %>%
ggplot(aes(x = kcal_per_g, y = neocortex_perc, color = clade, size = mass)) +
geom_point()
```
- the first model is just a bivariate regression between kilocalories and neocortex percent
* we dropped rows with missing values (in neocortex percent)
```{r}
dcc <- d[complete.cases(d), ]
dim(dcc)
```
```{r}
m5_5 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bn*neocortex_perc,
a ~ dnorm(0, 100),
bn ~ dnorm(0, 1),
sigma ~ dunif(0, 1)
),
data = dcc
)
precis(m5_5, digits = 4)
```
- the estimate for the coefficient for the neocortex percent is near zero and not very precise
* the 89% interval is wide on both sides of 0
- the next model will use the logarithm of the mother's body mass to predict the kilocalories of the milk
* use the logarithm because we are curious about changes in magnitude
```{r}
dcc$log_mass <- log(dcc$mass)
m5_6 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bm*log_mass,
a ~ dnorm(0, 100),
bm ~ dnorm(0, 1),
sigma ~ dunif(0, 1)
),
data = dcc
)
precis(m5_6)
```
- the mean coefficient is still very small and imprecise (large 89% interval)
- the next model includes both the neocortex percent and mother's mass as predictors of kilocalories
$$
k_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_n n_i + \beta_m \log(m_i) \\
\alpha \sim \text{Normal}(0, 100) \\
\beta_n \sim \text{Normal}(0, 1) \\
\beta_m \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 1)
$$
```{r}
m5_7 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bn*neocortex_perc + bm*log_mass,
a ~ dnorm(0, 100),
bn ~ dnorm(0, 1),
bm ~ dnorm(0, 1),
sigma ~ dunif(0, 1)
),
data = dcc
)
precis(m5_7)
```
- by incorporating both predictor variables, the estimated association of each with the kilocalories of the milk has increased
* the posterior mean for the neocortex percent has increased by almost an order of magnitude, and the 89% interval is confidently above 0
* the posterior mean for the log mass is even more negative
- the following plot shows this relationship with the size of the point as the kilocalories of the milk
```{r}
dcc %>%
ggplot(aes(x = log_mass, y = neocortex_perc, size = kcal_per_g)) +
geom_point()
```
- we can make counterfactual plots for each variable
```{r}
# Counterfactual holding log mass constant.
avg_log_mass <- mean(dcc$log_mass)
neocortex_seq <- seq(50, 80, length.out = 30)
data_n <- tibble(neocortex_perc = neocortex_seq,
log_mass = avg_log_mass)
mu_n <- link(m5_7, data = data_n)
mu_n_mean <- apply(mu_n, 2, mean)
mu_n_pi <- apply(mu_n, 2, PI) %>%
pi_to_df()
# Counterfactual holding neocortex percent constant.
avg_neopct <- mean(dcc$neocortex_perc)
logmass_seq <- seq(-2.2, 4.5, length.out = 30)
data_m <- tibble(neocortex_perc = avg_neopct,
log_mass = logmass_seq)
mu_m <- link(m5_7, data = data_m)
mu_m_mean <- apply(mu_m, 2, mean)
mu_m_pi <- apply(mu_m, 2, PI) %>%
pi_to_df()
# Combine data and plot.
bind_rows(
tibble(counterfactual = "neocortex percent",
mu_mean = mu_n_mean,
x = neocortex_seq) %>%
bind_cols(mu_n_pi),
tibble(counterfactual = "log mass",
mu_mean = mu_m_mean,
x = logmass_seq) %>%
bind_cols(mu_m_pi)
) %>%
ggplot(aes(x = x)) +
facet_wrap(~ counterfactual, scales = "free", nrow = 1) +
geom_ribbon(aes(ymin = x5_percent, ymax = x94_percent),
fill = "black", alpha = 0.3, color = "black", lty = 2) +
geom_line(aes(y = mu_mean),
color = "black", lty = 1) +
labs(title = "Counterfactual plots for model 'm5_7'",
subtitle = "For each plot, the other variable was set constant",
x = "variable value",
y = "kilocalorie energy")
```
- these variables only become useful when used together because both correlate with the outcome, but one association is positive and the other is negative
* also, these two variables are positively correlated
* therefore, they tend to "mask" each other until both are used in the model
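- the claimed correlation structure can be checked directly (a quick sketch; not in the original notes)
```{r}
# Sketch: neocortex percent and log mass are positively correlated with each
# other, while their associations with kcal_per_g run in opposite directions.
pairs(~ kcal_per_g + log_mass + neocortex_perc, data = dcc)
cor(dcc$log_mass, dcc$neocortex_perc)
```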
## 5.3 When adding variables hurts
- why not use all of the available predictors for the model?
1. *multicollinearity*
2. *post-treatment bias*
3. *overfitting*
### 5.3.1 Multicollinear legs
- we start with an example: predicting height using the lengths of a person's legs
* they are correlated, but problems happen when both legs are included in the model
- we simulate 100 individuals
* height is sampled from a Gaussian
* each person gets a simulated proportion of height for their legs, ranging from 0.4 to 0.5
* the second leg is made from the first with a touch of error
```{r}
# Create data.
N <- 100
height <- rnorm(N, 10, 2)
leg_prop <- runif(N, 0.4, 0.5)
leg_left <- leg_prop * height + rnorm(N, 0, 0.02)
leg_right <- leg_prop * height + rnorm(N, 0, 0.02)
d <- tibble(height, leg_left, leg_right)
# Model.
m5_8 <- quap(
alist(
height ~ dnorm(mu, sigma),
mu <- a + bl*leg_left + br*leg_right,
a ~ dnorm(10, 100),
bl ~ dnorm(2, 10),
br ~ dnorm(2, 10),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_8)
```
```{r}
plot(summary(m5_8))
```
- the model fit correctly; remember the question that linear regression answers:
* "What is the value of knowing each predictor, after already knowing all of the other predictors?"
* for this example: "What is the value of knowing each leg's length, after already knowing the other leg's length?"
- another way of thinking about this is that we have fitted the following formula
* there is effectively an infinite number of pairs of $\beta_1$ and $\beta_2$ that produce the same $\mu$ for a given $x$
$$
y_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_1 x_i + \beta_2 x_i \\
\mu_i = \alpha + (\beta_1 + \beta_2) x_i = \alpha + \beta_3 x_i
$$
- *when two predictor variables are very strongly correlated, including both in a model may lead to confusion*
- importantly, this model still makes fine predictions of height given the lengths of both legs; it just cannot say which leg is more important
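- we can see this non-identifiability in the posterior of `m5_8` (a sketch along the lines of the book's approach): the individual coefficients are poorly pinned down, but their sum is estimated tightly
```{r}
# Sketch: the joint posterior of bl and br forms a narrow, negatively sloped
# ridge; only the sum bl + br is well identified by the data.
post <- extract.samples(m5_8)
plot(bl ~ br, data = post, col = col.alpha(rangi2, 0.1), pch = 16)
sum_blbr <- post$bl + post$br
precis(data.frame(sum_blbr))
```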
### 5.3.2 Multicollinear milk
- the leg example above is a bit obvious, but how do we identify this collinear effect in real data?
* for this we return to the milk example
- we will model `kcal_per_g` using `perc_fat` and `perc_lactose` in two bivariate regressions
```{r}
d <- milk %>% as_tibble() %>% janitor::clean_names()
# kilocalories per gram regressed on percent fat
m5_10 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bf*perc_fat,
a ~ dnorm(0.6, 10),
bf ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
# kilocalories per gram regressed on percent lactose
m5_11 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bl*perc_lactose,
a ~ dnorm(0.6, 10),
bl ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_10, digits = 3)
precis(m5_11, digits = 3)
```
- summary of `m5_10` and `m5_11`:
* posterior mean for $\beta_f$ is 0.010 with 89% interval [0.008, 0.012]
* posterior mean for $\beta_l$ is -0.011 with 89% interval [-0.012, -0.009]
- now we model kilocalories per gram using both predictors
```{r}
# kilocalories per gram regressed on percent fat and percent lactose
m5_12 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bf*perc_fat + bl*perc_lactose,
a ~ dnorm(0.6, 10),
bf ~ dnorm(0, 1),
bl ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_12, digits = 3)
```
- summary of `m5_12`:
* both posterior means for $\beta_f$ and $\beta_l$ are closer to 0
* the standard deviations of both coefficient's posteriors are larger, too
* the estimate of $\beta_f$ is effectively 0
- the problem is the same as in the leg example: the percent fat and lactose contain roughly the same information
* the plot below shows these correlations
* in the middle-right scatter plot we can see the strong correlation between fat and lactose
* below that the correlations are shown
```{r}
pairs(~ kcal_per_g + perc_fat + perc_lactose, data = d)
```
```{r}
cor(d$perc_fat, d$perc_lactose)
```
- not only does the correlation between pairs of variables matter, but also the correlation that remains after accounting for any other predictors
- if we simulate fitting the model of kilocalories per gram on percent fat plus a new variable correlated with percent fat to varying degrees, we can plot the standard deviation of the percent-fat coefficient against the correlation between the two predictors (a sketch of this simulation follows this list)
* this shows that the effect is not linear
* instead, as the correlation coefficient increases, the standard deviation of the posterior of the estimate increases exponentially
- it is difficult to identify collinearity, especially with many variables in the model
* just keep an eye out for large standard deviations of the posteriors
* one technique for demonstrating multicollinearity is to fit a model with each variable separately and show that the models make nearly the same predictions
- multicollinearity belongs to a family of model-fitting problems called *non-identifiability*
* when a parameter is non-identifiable, it means the structure of the data and model do not make it possible to estimate the parameter's value
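- a sketch of the simulation described above (not in the original notes; it uses `lm()` for speed rather than `quap()`)
```{r}
# Sketch: add a predictor x correlated with perc_fat at level rho, refit the
# model, and record the standard error of the perc_fat coefficient.
sim_coll <- function(rho) {
  d$x <- rnorm(nrow(d),
               mean = rho * d$perc_fat,
               sd = sqrt((1 - rho^2) * var(d$perc_fat)))
  m <- lm(kcal_per_g ~ perc_fat + x, data = d)
  sqrt(diag(vcov(m)))["perc_fat"]
}

# Average the standard error over repeated simulations at each rho.
rep_sim_coll <- function(rho, n = 100) {
  mean(replicate(n, sim_coll(rho)))
}

rho_seq <- seq(0, 0.99, by = 0.01)
stddev <- purrr::map_dbl(rho_seq, rep_sim_coll)
plot(rho_seq, stddev, type = "l",
     xlab = "correlation with perc_fat",
     ylab = "sd of perc_fat coefficient")
```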
### 5.3.3 Post-treatment bias
- *omitted variable bias*: mistaken inferences from omitting predictor variables
- *post-treatment bias*: mistaken inferences from including variables that are consequences of other variables
- example with plants:
* growing some plants in a greenhouse
* want to know the difference in growth under different antifungal soil treatments
* plants are grown from seed, their heights measured, treated with antifungals, final measures of heights are taken
* the presence of fungus is also measured
* if we are trying to make a causal inference on height, we should not include the presence of fungus
- the presence of fungus is a *post-treatment effect*
- below is a simulation to help show what goes wrong when a post-treatment effect is included
* the probability of fungus is 0.5 without treatment and 0.1 with treatment
* the final height is the starting height plus a draw from a normal distribution with mean 5 if there is no fungus or mean 2 if there is fungus
```{r}
set.seed(123)
# Number of plants
N <- 100
# Simulate initial heights
h0 <- rnorm(N, 10, 2)
# Assign treatments and simulate fungus and growth
treatment <- rep(0:1, each = N/2)
fungus <- rbinom(N, size = 1, prob = 0.5 - treatment * 0.4)
h1 <- h0 + rnorm(N, 5 - 3 * fungus)
# Final data frame
d <- tibble(h0, h1, treatment, fungus)
d
```
- we can model the final height on the starting height, if treated, and if fungus
```{r}
m5_13 <- quap(
alist(
h1 ~ dnorm(mu, sigma),
mu <- a + bh*h0 + bt*treatment + bf*fungus,
a ~ dnorm(0, 100),
c(bh, bt, bf) ~ dnorm(0, 10),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_13, digits = 3)
```
- summary of `m5_13`:
* the coefficient for treatment is basically 0 with the 89% intervals straddling 0
* the coefficients for starting height and presence of fungus were very strong
- the problem is that we built this data manually and know that treatment should matter
* however, fungus is mostly a consequence of treatment and should be omitted
* this is done below, and now the impact of treatment is strong and positive, as expected
```{r}
m5_14 <- quap(
alist(
h1 ~ dnorm(mu, sigma),
mu <- a + bh*h0 + bt*treatment,
a ~ dnorm(0, 100),
c(bh, bt) ~ dnorm(0, 10),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_14, digits = 3)
```
## 5.4 Categorical variables
### 5.4.1 Binary categories
- the simplest case is when the variable has only two cases
- we will use the `Howell1` Kalahari data again
* the `male` predictor is binary
* it is an example of a *dummy variable*
```{r}
data("Howell1")
d <- Howell1
skimr::skim(d)
```
- the dummy variable turns a parameter "on" for those cases in the category
- here is the formula for a model predicting height using `male`
* the parameter $\beta_m$ only influences the cases where $m_i = 1$ and is canceled out when $m_i = 0$
$$
h_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_m m_i \\
\alpha \sim \text{Normal}(178, 100) \\
\beta_m \sim \text{Normal}(0, 10) \\
\sigma \sim \text{Uniform}(0, 50)
$$
```{r}
m5_15 <- quap(
alist(
height ~ dnorm(mu, sigma),
mu <- a + bm*male,
a ~ dnorm(178, 100),
bm ~ dnorm(0, 10),
sigma ~ dunif(0, 50)
),
data = d
)
summary(m5_15)
```
- summary of `m5_15`:
* $\alpha$ is now the average height of females:
- $\mu_i = \alpha + \beta_m (0) = \alpha$
- therefore, the expected height of the average female is 135 cm
* the parameter $\beta_m$ is the average difference between males and females
- therefore, the average male height is 135 cm + 7.27 cm = 142.3 cm
- we can get the posterior distribution of average male height by sampling $\alpha$ and $\beta_m$ from the model and adding them together
* we cannot just add the 89% interval values together because $\alpha$ and $\beta_m$ are correlated
```{r}
post <- extract.samples(m5_15)
mu_male <- post$a + post$bm
chainmode(mu_male)
PI(mu_male)
```
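- the correlation that makes naive interval addition invalid can be inspected directly (a sketch; not in the original notes)
```{r}
# Sketch: a and bm are negatively correlated in the quadratic approximation,
# so interval endpoints cannot simply be summed.
round(cov2cor(vcov(m5_15)), 2)
```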
### 5.4.2 Many categories
- we need one dummy variable per category, except for one category that is absorbed into the intercept
- we will use the primate milk data for an example
```{r}
d <- as_tibble(milk) %>% janitor::clean_names()
unique(d$clade)
```
- (for now) we will make the dummy variables by hand
```{r}
d$clade_nwm <- as.numeric(d$clade == "New World Monkey")
d$clade_owm <- as.numeric(d$clade == "Old World Monkey")
d$clade_s <- as.numeric(d$clade == "Strepsirrhine")
```
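- as an aside (not in the original notes), R can build the same dummy columns automatically with `model.matrix()`; a minimal sketch:
```{r}
# Sketch: model.matrix() creates one indicator column per non-reference level;
# the "(Intercept)" column plays the role of the ape baseline absorbed into alpha.
head(model.matrix(~ clade, data = d))
```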
- the model we will fit is below
* it really defines 4 separate linear models, one for each clade (ape is in $\alpha$)
$$
k_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_{\text{NWM}} \text{NWM}_i + \beta_{\text{OWM}} \text{OWM}_i + \beta_{S} S_i \\
\alpha \sim \text{Normal}(0.6, 10) \\
\beta_{\text{NWM}} \sim \text{Normal}(0, 1) \\
\beta_{\text{OWM}} \sim \text{Normal}(0, 1) \\
\beta_{\text{S}} \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
```{r}
m5_16 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bNWM * clade_nwm + bOWM * clade_owm + bS * clade_s,
a ~ dnorm(0.6, 10),
c(bNWM, bOWM, bS) ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_16, digits = 4)
```
- sample from the posterior to get distributions of the average milk energy for each clade
```{r}
post <- extract.samples(m5_16)
mu_ape <- post$a
mu_nwm <- post$a + post$bNWM
mu_owm <- post$a + post$bOWM
mu_s <- post$a + post$bS
precis(tibble(mu_ape, mu_nwm, mu_owm, mu_s))
```
- additional comparisons between the clades can be made with these estimates
* we can estimate the difference between new world and old world monkeys
```{r}
diff_nwm_old <- mu_nwm - mu_owm
quantile(diff_nwm_old, probs = c(0.025, 0.5, 0.975))
```