---
title: "Chapter 5. Multivariate Linear Models"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 500)
library(glue)
library(rethinking)
library(patchwork)
library(tidyverse)
library(conflicted)
conflict_prefer("filter", "dplyr")
theme_set(theme_minimal())
source_scripts()
set.seed(0)
```
- correlation is very common in the real world
- *multivariate regression* is using more than one predictor variable to model an outcome
- reasons to use multivariate regression:
* a way of "controlling" for confounding variables
* multiple causation
* interaction between variables
- this chapter focuses on two things multivariate models can help with:
* revealing spurious correlations
* revealing important correlations masked by hidden correlations of other variables
- this chapter will also discuss:
* multicollinearity
* categorical variables
## 5.1 Spurious association
- example: correlation between divorce rate and marriage rate
* need to be married to get divorced
* perhaps higher rates of marriage indicate that marriage is more important, leading to fewer divorces
* another predictor: median age at marriage
- we can fit a model of median age predicting divorce rate
* this is the same as in the previous chapter
* $D_i$: divorce rate for state $i$; $A_i$: median age at marriage in state $i$
$$
D_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_A A_i \\
\alpha \sim \text{Normal}(10, 10) \\
\beta_A \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
```{r}
# Load data.
data("WaffleDivorce")
d <- WaffleDivorce
# Standardize predictor.
d$MedianAgeMarriage_std <- (d$MedianAgeMarriage - mean(d$MedianAgeMarriage)) / sd(d$MedianAgeMarriage)
m5_1 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bA * MedianAgeMarriage_std,
a ~ dnorm(10, 10),
bA ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_1)
```
```{r}
mam_seq <- seq(-3, 3.5, length.out = 30)
mu <- link(m5_1, data = data.frame(MedianAgeMarriage_std = mam_seq))
mu_map <- apply(mu, 2, chainmode)
m5_1_pred <- apply(mu, 2, PI) %>%
t() %>%
as.data.frame() %>%
set_names(c("pi_5", "pi_94")) %>%
as_tibble() %>%
mutate(mam_std = mam_seq,
mu_map = mu_map)
d %>%
ggplot() +
geom_point(aes(x = MedianAgeMarriage_std, y = Divorce)) +
geom_ribbon(data = m5_1_pred,
aes(x = mam_std, ymin = pi_5, ymax = pi_94),
fill = "black", alpha = 0.2, color = NA) +
geom_line(data = m5_1_pred,
aes(x = mam_std, y = mu_map),
color = "blue", alpha = 0.6, lty = 2, size = 1.3)
```
- and we can model the divorce rate on the number of marriages in a state:
* $R_i$: rate of marriage in state $i$
$$
D_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_R R_i \\
\alpha \sim \text{Normal}(10, 10) \\
\beta_R \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
```{r}
# Standardize predictor.
d$Marriage_std <- (d$Marriage - mean(d$Marriage)) / sd(d$Marriage)
m5_2 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bR * Marriage_std,
a ~ dnorm(10, 10),
bR ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_2)
```
- but separate bivariate models cannot tell us which variable is more important or whether their effects cancel each other out
- the question we want to answer: *"What is the predictive value of a variable, once I already know all of the other predictor variables?"*
* after I know the marriage rate, what additional value is there in also knowing the age at marriage?
* after I know the age at marriage, what additional value is there in also knowing the marriage rate?
### 5.1.1 Multivariate notation
- the strategy for building a multivariate model:
1. nominate the predictor variables you want in the linear model of the mean
2. for each predictor, make a parameter that will measure its association with the outcome
3. multiply the parameter by the variable and add that term to the linear model
- the formula for our multivariate model example on divorce rate:
$$
D_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_R R_i + \beta_A A_i \\
\alpha \sim \text{Normal}(10, 10) \\
\beta_A \sim \text{Normal}(0, 1) \\
\beta_R \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
- what does $\mu_i = \alpha + \beta_R R_i + \beta_A A_i$ mean:
* the expected outcome for any state with marriage rate $R_i$ and median age at marriage $A_i$ is the sum of three independent terms
* $\alpha$ is a constant that every state gets
* $\beta_R R_i$ is the marriage rate multiplied against a coefficient $\beta_R$ that measures the association between the marriage rate and divorce rate
* $\beta_A A_i$ is similar to the second term, but for the association with median age at marriage
### 5.1.2 Fitting the model
- we can use the quadratic approximation to fit the model
```{r}
m5_3 <- quap(
alist(
Divorce ~ dnorm(mu, sigma),
mu <- a + bR * Marriage_std + bA * MedianAgeMarriage_std,
a ~ dnorm(10, 10),
bA ~ dnorm(0, 1),
bR ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_3)
```
```{r}
plot(summary(m5_3))
```
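- as a quick numeric illustration of the linear model of the mean (not in the original notes), we can plug the posterior means into $\mu_i = \alpha + \beta_R R_i + \beta_A A_i$ for a hypothetical state
```{r}
# Sketch: expected divorce rate for a hypothetical state with a marriage rate
# 1 SD above average and a median age at marriage 1 SD below average,
# using the posterior means of m5_3.
coefs <- coef(m5_3)
coefs["a"] + coefs["bR"] * 1 + coefs["bA"] * (-1)
```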
- now, the coefficient for the marriage rate predictor is about zero, while the coefficient for median age is confidently below zero
- we can interpret these to mean: *"Once we know the median age at marriage for a state, there is little additional predictive power in also knowing the rate of marriage in that state."*
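- a compact way to see this shift is to plot the coefficient estimates from all three models side by side (a sketch using rethinking's `coeftab()`; this chunk is not in the original notes)
```{r}
# Sketch: bA barely moves between m5_1 and the multivariate m5_3,
# while bR collapses toward zero once bA is in the model.
plot(coeftab(m5_1, m5_2, m5_3), par = c("bA", "bR"))
```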
- we can make some plots to investigate how the model came to this conclusion
### 5.1.3 Plotting the multivariate posteriors
- we will use three types of interpretive plots
1. *predictor residual plots*: show the outcome against residual predictor values
2. *counterfactual plots*: show the implied predictions for imaginary experiments in which the different predictor variables can be changed independently of one another
3. *posterior prediction plots*: show model-based predictions against raw data, or otherwise display the error in prediction
#### 5.1.3.1 Predictor residual plots
- *predictor variable residual*: the average prediction error when using all other predictor variables to model a predictor of interest
* plotting this against the outcome shows something like a bivariate regression that has already been "controlled" for all of the other predictors
* it leaves the variation not expected by the model of the mean of the other predictors
- this is best illustrated by an example:
* we will model the marriage rate using the median age at marriage
$$
R_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta A_i \\
\alpha \sim \text{Normal}(0, 10) \\
\beta \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
- since we are using standardized variables, $\alpha$ should be zero
```{r}
m5_4 <- quap(
alist(
Marriage_std ~ dnorm(mu, sigma),
mu <- a + b*MedianAgeMarriage_std,
a ~ dnorm(0, 10),
b ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_4)
```
- we then compute the residuals by subtracting the predicted marriage rate (given the median age) from the observed rate in each state
* a positive residual means the observed rate was greater than that expected given the median age in that state
```{r}
mu <- coef(m5_4)["a"] + coef(m5_4)["b"] * d$MedianAgeMarriage_std
m_resid <- d$Marriage_std - mu
str(m_resid)
```
```{r}
d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot() +
geom_linerange(aes(x = MedianAgeMarriage_std,
ymin = mu, ymax = resid_diff),
size = 0.8, color = "grey60") +
geom_point(aes(x = MedianAgeMarriage_std, y = Marriage_std),
color = "black", size = 2) +
geom_line(aes(x = MedianAgeMarriage_std, y = mu),
color = "tomato", size = 1.3, alpha = 0.7) +
labs(title = "Residual marriage rate estimated using the median age at marriage",
subtitle = "The red line is the estimate, and the vertical lines are the residuals")
```
- we can then plot these residuals against the divorce rate
* this is the linear relationship between divorce and marriage rates after "controlling" for median age
```{r}
d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot() +
geom_point(aes(x = resid, y = Divorce),
color = "black", size = 2) +
geom_vline(xintercept = 0, lty = 2, color = "dodgerblue", size = 0.9) +
labs(x = "residual marriage rate",
title = "Residual marriage rate and Divorce",
subtitle = "The linear relationship of marriage and divorce rates after correcting for median age at marriage")
```
- we can do the same calculation in the other direction: find the residual of the median age modeled on the rate
```{r}
m5_4_2 <- quap(
alist(
MedianAgeMarriage_std ~ dnorm(mu, sigma),
mu <- a + b*Marriage_std,
a ~ dnorm(0, 10),
b ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_4_2)
```
```{r}
mu <- coef(m5_4_2)["a"] + coef(m5_4_2)["b"] * d$Marriage_std
m_resid <- d$MedianAgeMarriage_std - mu
p1 <- d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot(aes(x = Marriage_std)) +
geom_linerange(aes(ymin = mu, ymax = resid_diff),
size = 0.8, color = "grey60") +
geom_point(aes(y = MedianAgeMarriage_std),
color = "black", size = 2) +
geom_line(aes(y = mu),
color = "tomato", size = 1.3, alpha = 0.7) +
labs(title = "Residual median age estimated\nusing the marriage rate",
subtitle = "The red line is the estimate,\nand the vertical lines are the residuals")
p2 <- d %>%
mutate(mu = mu,
resid = m_resid,
resid_diff = mu + m_resid) %>%
ggplot() +
geom_point(aes(x = resid, y = Divorce),
color = "black", size = 2) +
geom_vline(xintercept = 0, lty = 2, color = "dodgerblue", size = 0.9) +
labs(x = "residual median age",
title = "Residual median age and Divorce",
subtitle = "The linear relationship of median age\nand divorce after correcting for marriage rate")
p1 | p2
```
- the negative slope of the residual median age vs. Divorce (on the right in the above plot) indicates that the median age contains information even after adjusting for marriage rate
#### 5.1.3.2 Counterfactual plots
- this plot displays the *implied* predictions of the model
* we can make predictions for inputs that were never seen or are technically impossible
* e.g.: a high marriage rate and high median age
- the simplest counterfactual plot is to see how predictions change while changing only one predictor
- we will draw two counterfactual plots, one for each predictor
```{r}
# Make new "data" while holding median age constant.
A_avg <- mean(d$MedianAgeMarriage_std)
R_seq <- seq(-3, 3, length.out = 30)
pred_data <- tibble(Marriage_std = R_seq,
MedianAgeMarriage_std = A_avg)
# Compute the counterfactual mu values.
mu <- link(m5_3, data = pred_data)
mu_mean <- apply(mu, 2, mean)
mu_pi <- apply(mu, 2, PI) %>%
pi_to_df() %>%
set_names(c("mu_5_pi", "mu_94_pi"))
# Simulate counterfactual divorce outcomes
R_sim <- sim(m5_3, data = pred_data, n = 1e4)
R_pi <- apply(R_sim, 2, PI) %>%
pi_to_df() %>%
set_names(c("Rsim_5_pi", "Rsim_94_pi"))
R_counterfactual <- pred_data %>%
mutate(mu = mu_mean) %>%
bind_cols(mu_pi, R_pi)
R_counterfactual %>%
ggplot(aes(x = Marriage_std)) +
geom_ribbon(aes(ymin = Rsim_5_pi, ymax = Rsim_94_pi),
color = NA, fill = "black", alpha = 0.2) +
geom_ribbon(aes(ymin = mu_5_pi, ymax = mu_94_pi),
color = NA, fill = "black", alpha = 0.4) +
geom_line(aes(y = mu), color = "black", size = 1.4) +
labs(x = "Marriage_std",
y = "Divorce",
title = "Counterfactual holding median age constant",
subtitle = "The line is the mean divorce rate over marriage rate, holding age constant.
The inner ribbon (darker) is the 89% PI for the mean over marriage rate.
The outer ribbon is the 89% PI of the simulated divorce rates.")
```
```{r}
# Make new "data" while holding marriage rate constant.
R_avg <- mean(d$Marriage_std)
A_seq <- seq(-3, 3, length.out = 30)
pred_data <- tibble(Marriage_std = R_avg,
MedianAgeMarriage_std = A_seq)
# Compute the counterfactual mu values.
mu <- link(m5_3, data = pred_data)
mu_mean <- apply(mu, 2, mean)
mu_pi <- apply(mu, 2, PI) %>%
pi_to_df() %>%
set_names(c("mu_5_pi", "mu_94_pi"))
# Simulate counterfactual divorce outcomes
A_sim <- sim(m5_3, data = pred_data, n = 1e4)
A_pi <- apply(A_sim, 2, PI) %>%
pi_to_df() %>%
set_names(c("Asim_5_pi", "Asim_94_pi"))
A_counterfactual <- pred_data %>%
mutate(mu = mu_mean) %>%
bind_cols(mu_pi, A_pi)
A_counterfactual %>%
ggplot(aes(x = MedianAgeMarriage_std)) +
geom_ribbon(aes(ymin = Asim_5_pi, ymax = Asim_94_pi),
color = NA, fill = "black", alpha = 0.2) +
geom_ribbon(aes(ymin = mu_5_pi, ymax = mu_94_pi),
color = NA, fill = "black", alpha = 0.4) +
geom_line(aes(y = mu), color = "black", size = 1.4) +
labs(x = "MedianAgeMarriage_std",
y = "Divorce",
title = "Counterfactual holding marriage rate constant",
subtitle = "The line is the mean divorce rate over median age, holding marriage rate constant.
The inner ribbon (darker) is the 89% PI for the mean over median age.
The outer ribbon is the 89% PI of the simulated divorce rates.")
```
- these show:
* that changing the marriage rate while holding the median age constant does not cause much change in the predicted divorce rate
* but that changing the median age while holding the marriage rate constant does predict substantial changes in the divorce rate
- counterfactual plots are contentious because of their small-world nature
#### 5.1.3.3 Posterior prediction plots
- it is important to check the model fit against the observed data:
1. Did the model fit correctly?
2. How does the model fail?
* all models fail in some way; important to find the limitations of the predictions
- begin by simulating predictions, averaging over the posterior
```{r}
# Simulate mu using the original data (by not providing new data)
mu <- link(m5_3)
mu_mean <- apply(mu, 2, mean)
mu_pi <- apply(mu, 2, PI)
# Simulate divorce rates using the original data (by not providing new data)
divorce_sim <- sim(m5_3, n = 1e4)
divorce_pi <- apply(divorce_sim, 2, PI)
```
- for multivariate models, there are many ways to display these simulations
- first, we can plot the predictions against the observed
* the dashed line represents perfect correlation between the predicted and observed
* we see that the model under-predicts for states with high divorce rates and over-predicts for those with low rates
```{r}
tibble(mu_mean = mu_mean,
obs_divorce = d$Divorce,
state = as.character(d$Loc)) %>%
mutate(label = ifelse(state %in% c("ID", "UT"), state, NA_character_)) %>%
bind_cols(pi_to_df(mu_pi)) %>%
ggplot(aes(x = obs_divorce, y = mu_mean)) +
geom_pointrange(aes(ymin = x5_percent, ymax = x94_percent)) +
geom_abline(slope = 1, intercept = 0, lty = 2, alpha = 0.6) +
ggrepel::geom_text_repel(aes(label = label), family = "Arial") +
labs(x = "observed divorce",
y = "predicted divorce")
```
- we can also show the residual of the predictions (the prediction error)
* we can arrange this plot from least to most prediction error
```{r, fig.height=4, fig.width=3}
tibble(divorce = d$Divorce,
mu_mean = mu_mean,
state = as.character(d$Loc)) %>%
mutate(divorce_resid = divorce - mu_mean,
state = fct_reorder(state, divorce_resid)) %>%
bind_cols(pi_to_df(mu_pi)) %>%
mutate(x5_percent = divorce - x5_percent,
x94_percent = divorce - x94_percent) %>%
ggplot(aes(x = divorce_resid, y = state)) +
geom_pointrange(aes(xmin = x5_percent, xmax = x94_percent)) +
labs(x = "residual divorce", y = "state")
```
- a third type of plot is to compare the divorce residual against new predictors
* this can show if there are additional predictors that add information
* in the following example, I compare the divorce residual against the number of Waffle Houses per capita in the state
```{r}
label_states <- c("ME", "AR", "AL", "MS", "GA", "SC", "ID", "UT")
tibble(divorce = d$Divorce,
mu_mean = mu_mean,
waffle_houses = d$WaffleHouses,
population = d$Population,
state = as.character(d$Loc)) %>%
mutate(divorce_resid = divorce - mu_mean,
waffles_per_capita = waffle_houses / population,
label = ifelse(state %in% label_states, state, NA_character_)) %>%
bind_cols(pi_to_df(mu_pi)) %>%
ggplot(aes(x = waffles_per_capita, y = divorce_resid)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ x") +
ggrepel::geom_text_repel(aes(label = label), family = "Arial") +
labs(x = "Waffle Houses per capita",
y = "residual divorce")
```
## 5.2 Masked relationship
- another reason to use a multivariate model is to identify influences of multiple factors when none of them show an individual bivariate relationship with the outcome
- for this section, we will use a new data set: the composition of milk across primate species
* for now, we will use:
- `kcal_per_g`: kilocalories of energy per gram of milk (kcal/g)
- `mass`: average female body mass (in kg)
- `neocortex_perc`: percent of total brain mass that is the neocortex
```{r}
data(milk)
d <- janitor::clean_names(milk)
str(d)
```
- the following plots provide a little information about these columns - this was not included in the course
```{r}
d %>%
select(clade, kcal_per_g, mass, neocortex_perc) %>%
pivot_longer(-clade, names_to = "name", values_to = "value") %>%
ggplot(aes(x = clade, y = value, color = clade)) +
facet_wrap(~ name, nrow = 1, scales = "free") +
geom_jitter() +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "none")
```
```{r}
d %>%
ggplot(aes(x = kcal_per_g, y = neocortex_perc, color = clade, size = mass)) +
geom_point()
```
- the first model is just a bivariate regression between kilocalories and neocortex percent
* we dropped rows with missing values (in neocortex percent)
```{r}
dcc <- d[complete.cases(d), ]
dim(dcc)
```
```{r}
m5_5 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bn*neocortex_perc,
a ~ dnorm(0, 100),
bn ~ dnorm(0, 1),
sigma ~ dunif(0, 1)
),
data = dcc
)
precis(m5_5, digits = 4)
```
- the estimate for the coefficient for the neocortex percent is near zero and not very precise
* the 89% interval is wide on both sides of 0
- the next model will use the logarithm of the mother's body mass to predict the kilocalories of the milk
* use the logarithm because we are curious about changes in magnitude
```{r}
dcc$log_mass <- log(dcc$mass)
m5_6 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bm*log_mass,
a ~ dnorm(0, 100),
bm ~ dnorm(0, 1),
sigma ~ dunif(0, 1)
),
data = dcc
)
precis(m5_6)
```
- the mean coefficient is still very small and imprecise (large 89% interval)
- the next model includes both the neocortex percent and mother's mass as predictors of kilocalories
$$
k_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_n n_i + \beta_m \log(m_i) \\
\alpha \sim \text{Normal}(0, 100) \\
\beta_n \sim \text{Normal}(0, 1) \\
\beta_m \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 1)
$$
```{r}
m5_7 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bn*neocortex_perc + bm*log_mass,
a ~ dnorm(0, 100),
bn ~ dnorm(0, 1),
bm ~ dnorm(0, 1),
sigma ~ dunif(0, 1)
),
data = dcc
)
precis(m5_7)
```
- by incorporating both predictor variables, the estimated association of each with the kilocalories of the milk has increased
* the posterior mean for the neocortex percent has increased by almost an order of magnitude, and the 89% interval is confidently above 0
* the posterior mean for the log mass is even more negative
- the following plot shows this relationship with the size of the point as the kilocalories of the milk
```{r}
dcc %>%
ggplot(aes(x = log_mass, y = neocortex_perc, size = kcal_per_g)) +
geom_point()
```
- we can make counterfactual plots for each variable
```{r}
# Counterfactual holding log mass constant.
avg_log_mass <- mean(dcc$log_mass)
neocortex_seq <- seq(50, 80, length.out = 30)
data_n <- tibble(neocortex_perc = neocortex_seq,
log_mass = avg_log_mass)
mu_n <- link(m5_7, data = data_n)
mu_n_mean <- apply(mu_n, 2, mean)
mu_n_pi <- apply(mu_n, 2, PI) %>%
pi_to_df()
# Counterfactual holding neocortex percent constant.
avg_neopct <- mean(dcc$neocortex_perc)
logmass_seq <- seq(-2.2, 4.5, length.out = 30)
data_m <- tibble(neocortex_perc = avg_neopct,
log_mass = logmass_seq)
mu_m <- link(m5_7, data = data_m)
mu_m_mean <- apply(mu_m, 2, mean)
mu_m_pi <- apply(mu_m, 2, PI) %>%
pi_to_df()
# Combine data and plot.
bind_rows(
tibble(counterfactual = "neocortex percent",
mu_mean = mu_n_mean,
x = neocortex_seq) %>%
bind_cols(mu_n_pi),
tibble(counterfactual = "log mass",
mu_mean = mu_m_mean,
x = logmass_seq) %>%
bind_cols(mu_m_pi)
) %>%
ggplot(aes(x = x)) +
facet_wrap(~ counterfactual, scales = "free", nrow = 1) +
geom_ribbon(aes(ymin = x5_percent, ymax = x94_percent),
fill = "black", alpha = 0.3, color = "black", lty = 2) +
geom_line(aes(y = mu_mean),
color = "black", lty = 1) +
labs(title = "Counterfactual plots for model 'm5_7'",
subtitle = "For each plot, the other variable was set constant",
x = "variable value",
y = "kilocalorie energy")
```
- these variables only become useful when used together because both correlate with the outcome, but one association is positive and the other is negative
* also, these two variables are positively correlated
* therefore, they tend to "mask" each other until both are used in the model
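- the claimed correlation structure can be checked directly (a quick sketch; not in the original notes)
```{r}
# Sketch: neocortex percent and log mass are positively correlated with each
# other, while their associations with kcal_per_g run in opposite directions.
pairs(~ kcal_per_g + log_mass + neocortex_perc, data = dcc)
cor(dcc$log_mass, dcc$neocortex_perc)
```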
## 5.3 When adding variables hurts
- why not use all of the available predictors for the model?
1. *multicollinearity*
2. *post-treatment bias*
3. *overfitting*
### 5.3.1 Multicollinear legs
- we start with an example: predicting height using the lengths of a person's legs
* they are correlated, but problems happen when both legs are included in the model
- we simulate 100 individuals
* height is sampled from a Gaussian
* each person gets a simulated proportion of height for their legs, ranging from 0.4 to 0.5
* the second leg is made from the first with a touch of error
```{r}
# Create data.
N <- 100
height <- rnorm(N, 10, 2)
leg_prop <- runif(N, 0.4, 0.5)
leg_left <- leg_prop * height + rnorm(N, 0, 0.02)
leg_right <- leg_prop * height + rnorm(N, 0, 0.02)
d <- tibble(height, leg_left, leg_right)
# Model.
m5_8 <- quap(
alist(
height ~ dnorm(mu, sigma),
mu <- a + bl*leg_left + br*leg_right,
a ~ dnorm(10, 100),
bl ~ dnorm(2, 10),
br ~ dnorm(2, 10),
sigma ~ dunif(0, 10)
),
data = d
)
summary(m5_8)
```
```{r}
plot(summary(m5_8))
```
- the model fit correctly; remember the question that linear regression answers:
* "What is the value of knowing each predictor, after already knowing all of the other predictors?"
* for this example: "What is the value of knowing each leg's length, after already knowing the other leg's length?"
- another way of thinking about this is that we have fitted the following formula
* there is effectively an infinite number of pairs of $\beta_1$ and $\beta_2$ that produce the same $\mu$ for a given $x$
$$
y_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_1 x_i + \beta_2 x_i \\
\mu_i = \alpha + (\beta_1 + \beta_2) x_i = \alpha + \beta_3 x_i
$$
- *when two predictor variables are very strongly correlated, including both in a model may lead to confusion*
- importantly, this model still makes fine predictions of height given the lengths of both legs; it just cannot say which leg is more important
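- we can see this non-identifiability in the posterior of `m5_8` (a sketch along the lines of the book's approach): the individual coefficients are poorly pinned down, but their sum is estimated tightly
```{r}
# Sketch: the joint posterior of bl and br forms a narrow, negatively sloped
# ridge; only the sum bl + br is well identified by the data.
post <- extract.samples(m5_8)
plot(bl ~ br, data = post, col = col.alpha(rangi2, 0.1), pch = 16)
sum_blbr <- post$bl + post$br
precis(data.frame(sum_blbr))
```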
### 5.3.2 Multicollinear milk
- the leg example above is a bit obvious, but how do we identify this collinear effect in real data?
* for this we return to the milk example
- we will model `kcal_per_g` using `perc_fat` and `perc_lactose` in two bivariate regressions
```{r}
d <- milk %>% as_tibble() %>% janitor::clean_names()
# kilocalories per gram regressed on percent fat
m5_10 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bf*perc_fat,
a ~ dnorm(0.6, 10),
bf ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
# kilocalories per gram regressed on percent lactose
m5_11 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bl*perc_lactose,
a ~ dnorm(0.6, 10),
bl ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_10, digits = 3)
precis(m5_11, digits = 3)
```
- summary of `m5_10` and `m5_11`:
* posterior mean for $\beta_f$ is 0.010 with 89% interval [0.008, 0.012]
* posterior mean for $\beta_l$ is -0.011 with 89% interval [-0.012, -0.009]
- now we model kilocalories per gram using both predictors
```{r}
# kilocalories per gram regressed on percent fat and percent lactose
m5_12 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bf*perc_fat + bl*perc_lactose,
a ~ dnorm(0.6, 10),
bf ~ dnorm(0, 1),
bl ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_12, digits = 3)
```
- summary of `m5_12`:
* both posterior means for $\beta_f$ and $\beta_l$ are closer to 0
* the standard deviations of both coefficient's posteriors are larger, too
* the estimate of $\beta_f$ is effectively 0
- the problem is the same as in the leg example: the percent fat and lactose contain roughly the same information
* the plot below shows these correlations
* in the middle-right scatter plot we can see the strong correlation between fat and lactose
* below that the correlations are shown
```{r}
pairs(~ kcal_per_g + perc_fat + perc_lactose, data = d)
```
```{r}
cor(d$perc_fat, d$perc_lactose)
```
- not only does the correlation between pairs of variables matter, but also the correlation that remains after accounting for any other predictors
- if we simulate fitting the model of kilocalories per gram on percent fat plus a new variable correlated with percent fat to varying degrees, we can plot the standard deviation of the percent-fat coefficient against the correlation between the two predictors (a sketch of this simulation follows this list)
* this shows that the effect is not linear
* instead, as the correlation coefficient increases, the standard deviation of the posterior of the estimate increases exponentially
- it is difficult to identify collinearity, especially with many variables in the model
* just keep an eye out for large standard deviations of the posteriors
* one technique for demonstrating multicollinearity is to fit a model with each variable separately and show that the models make nearly the same predictions
- multicollinearity belongs to a family of model-fitting problems called *non-identifiability*
* when a parameter is non-identifiable, it means the structure of the data and model do not make it possible to estimate the parameter's value
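- a sketch of the simulation described above (not in the original notes; it uses `lm()` for speed rather than `quap()`)
```{r}
# Sketch: add a predictor x correlated with perc_fat at level rho, refit the
# model, and record the standard error of the perc_fat coefficient.
sim_coll <- function(rho) {
  d$x <- rnorm(nrow(d),
               mean = rho * d$perc_fat,
               sd = sqrt((1 - rho^2) * var(d$perc_fat)))
  m <- lm(kcal_per_g ~ perc_fat + x, data = d)
  sqrt(diag(vcov(m)))["perc_fat"]
}

# Average the standard error over repeated simulations at each rho.
rep_sim_coll <- function(rho, n = 100) {
  mean(replicate(n, sim_coll(rho)))
}

rho_seq <- seq(0, 0.99, by = 0.01)
stddev <- purrr::map_dbl(rho_seq, rep_sim_coll)
plot(rho_seq, stddev, type = "l",
     xlab = "correlation with perc_fat",
     ylab = "sd of perc_fat coefficient")
```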
### 5.3.3 Post-treatment bias
- *omitted variable bias*: mistaken inferences from omitting predictor variables
- *post-treatment bias*: mistaken inferences from including variables that are consequences of other variables
- example with plants:
* growing some plants in a greenhouse
* want to know the difference in growth under different antifungal soil treatments
* plants are grown from seed, their heights measured, treated with antifungals, final measures of heights are taken
* the presence of fungus is also measured
* if we are trying to make a causal inference on height, we should not include the presence of fungus
- the presence of fungus is a *post-treatment effect*
- below is a simulation to help show what goes wrong when a post-treatment effect is included
* the probability of fungus is 0.5 without treatment and 0.1 with treatment
* the final height is the starting height plus a draw from a normal distribution with mean 5 if there is no fungus or mean 2 if there is fungus
```{r}
set.seed(123)
# Number of plants
N <- 100
# Simulate initial heights
h0 <- rnorm(N, 10, 2)
# Assign treatments and simulate fungus and growth
treatment <- rep(0:1, each = N/2)
fungus <- rbinom(N, size = 1, prob = 0.5 - treatment * 0.4)
h1 <- h0 + rnorm(N, 5 - 3 * fungus)
# Final data frame
d <- tibble(h0, h1, treatment, fungus)
d
```
- we can model the final height on the starting height, if treated, and if fungus
```{r}
m5_13 <- quap(
alist(
h1 ~ dnorm(mu, sigma),
mu <- a + bh*h0 + bt*treatment + bf*fungus,
a ~ dnorm(0, 100),
c(bh, bt, bf) ~ dnorm(0, 10),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_13, digits = 3)
```
- summary of `m5_13`:
* the coefficient for treatment is basically 0 with the 89% intervals straddling 0
* the coefficients for starting height and presence of fungus were very strong
- the problem is that we built this data manually and know that treatment should matter
* however, fungus is mostly a consequence of treatment and should be omitted
* this is done below, and now the impact of treatment is strong and positive, as expected
```{r}
m5_14 <- quap(
alist(
h1 ~ dnorm(mu, sigma),
mu <- a + bh*h0 + bt*treatment,
a ~ dnorm(0, 100),
c(bh, bt) ~ dnorm(0, 10),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_14, digits = 3)
```
## 5.4 Categorical variables
### 5.4.1 Binary categories
- the simplest case is when the variable has only two cases
- we will use the `Howell1` Kalahari data again
* the `male` predictor is binary
* it is an example of a *dummy variable*
```{r}
data("Howell1")
d <- Howell1
skimr::skim(d)
```
- the dummy variable turns a parameter "on" for those cases in the category
- here is the formula for a model predicting height using `male`
* the parameter $\beta_m$ only influences the cases where $m_i = 1$ and is canceled out when $m_i = 0$
$$
h_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_m m_i \\
\alpha \sim \text{Normal}(178, 100) \\
\beta_m \sim \text{Normal}(0, 10) \\
\sigma \sim \text{Uniform}(0, 50)
$$
```{r}
m5_15 <- quap(
alist(
height ~ dnorm(mu, sigma),
mu <- a + bm*male,
a ~ dnorm(178, 100),
bm ~ dnorm(0, 10),
sigma ~ dunif(0, 50)
),
data = d
)
summary(m5_15)
```
- summary of `m5_15`:
* $\alpha$ is now the average height of females:
- $\mu_i = \alpha + \beta_m (0) = \alpha$
- therefore, the expected height of the average female is 135 cm
* the parameter $\beta_m$ is the average difference between males and females
- therefore, the average male height is 135 cm + 7.27 cm = 142.3 cm
- we can get the posterior distribution of average male height by sampling $\alpha$ and $\beta_m$ from the model and adding them together
* we cannot just add the 89% interval values together because $\alpha$ and $\beta_m$ are correlated
```{r}
post <- extract.samples(m5_15)
mu_male <- post$a + post$bm
chainmode(mu_male)
PI(mu_male)
```
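- the correlation that makes naive interval addition invalid can be inspected directly (a sketch; not in the original notes)
```{r}
# Sketch: a and bm are negatively correlated in the quadratic approximation,
# so interval endpoints cannot simply be summed.
round(cov2cor(vcov(m5_15)), 2)
```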
### 5.4.2 Many categories
- we need one dummy variable per category, except for one category that is absorbed into the intercept
- we will use the primate milk data for an example
```{r}
d <- as_tibble(milk) %>% janitor::clean_names()
unique(d$clade)
```
- (for now) we will make the dummy variables by hand
```{r}
d$clade_nwm <- as.numeric(d$clade == "New World Monkey")
d$clade_owm <- as.numeric(d$clade == "Old World Monkey")
d$clade_s <- as.numeric(d$clade == "Strepsirrhine")
```
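- as an aside (not in the original notes), R can build the same dummy columns automatically with `model.matrix()`; a minimal sketch:
```{r}
# Sketch: model.matrix() creates one indicator column per non-reference level;
# the "(Intercept)" column plays the role of the ape baseline absorbed into alpha.
head(model.matrix(~ clade, data = d))
```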
- the model we will fit is below
* it really defines 4 separate linear models, one for each clade (ape is in $\alpha$)
$$
k_i \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i = \alpha + \beta_{\text{NWM}} \text{NWM}_i + \beta_{\text{OWM}} \text{OWM}_i + \beta_{S} S_i \\
\alpha \sim \text{Normal}(0.6, 10) \\
\beta_{\text{NWM}} \sim \text{Normal}(0, 1) \\
\beta_{\text{OWM}} \sim \text{Normal}(0, 1) \\
\beta_{\text{S}} \sim \text{Normal}(0, 1) \\
\sigma \sim \text{Uniform}(0, 10)
$$
```{r}
m5_16 <- quap(
alist(
kcal_per_g ~ dnorm(mu, sigma),
mu <- a + bNWM * clade_nwm + bOWM * clade_owm + bS * clade_s,
a ~ dnorm(0.6, 10),
c(bNWM, bOWM, bS) ~ dnorm(0, 1),
sigma ~ dunif(0, 10)
),
data = d
)
precis(m5_16, digits = 4)
```
- sample from the posterior to get distributions of the average milk energy for each clade
```{r}
post <- extract.samples(m5_16)
mu_ape <- post$a
mu_nwm <- post$a + post$bNWM
mu_owm <- post$a + post$bOWM
mu_s <- post$a + post$bS
precis(tibble(mu_ape, mu_nwm, mu_owm, mu_s))
```
- additional comparisons between the clades can be made with these estimates
* we can estimate the difference between new world and old world monkeys
```{r}
diff_nwm_old <- mu_nwm - mu_owm
quantile(diff_nwm_old, probs = c(0.025, 0.5, 0.975))
```