# Doubly Robust Simulation {#mc_standdr}
```{r }
library(tidyr, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(gt, quietly = TRUE)
library(geepack, quietly = TRUE)
library(MonteCarlo, quietly = TRUE)
```
```{r include=FALSE}
# directory of data files
dir_data <- file.path(getwd(), "data")
# directory for functions
dir_lib <- file.path(getwd(), "lib")
```
```{r }
#| include: false
source(file = file.path(dir_lib, "gt_utils.R"),
       local = knitr::knit_global())
source(file = file.path(dir_lib, "gt_measures.R"),
       local = knitr::knit_global())
source(file = file.path(dir_lib, "fci_06-H_mc_standdr.R"),
       local = knitr::knit_global())
source(file = file.path(dir_lib, "fci_06-I_simdr.R"),
       local = knitr::knit_global())
```
## Analyse $T = \sum_{i=1}^I{T_i}$ (Errata) {#errata6a}
Section 6.3 claims that $T=600$ and $P(T=1 \mid H) \in [0.041, 0.468]$, but the results obtained from running `simdr` as shown in the book are different. Not by much, but enough to puzzle the avid reader (like me, for example).
So let's go through `simdr` to explain the differences.
```{r}
simdr.out <- simdr(seed = 1009)$stats
```
> One might think that standardization with the exposure model would be preferable when the outcome indicates a rare condition. To see this, first suppose the condition is not rare. We might have 3000 individuals and 50%, or 1500, with the condition. Using the rule of thumb for logistic regression of Peduzzi et al. presented in chapter 2, we should be able to include 150 covariates in the outcome model.
The Peduzzi rule can be found at the end of section 2.3, at the top of p. 29, where it states that
> both the numbers of individuals with $Y=0$ and $Y=1$ need to be larger than ten times the number of parameters.
and therefore
> Now suppose the exposure, $T$, is divided more evenly: that is, we have 600 with $T=1$. This would suggest we can include 60 covariates in the exposure model.
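To make this rule of thumb concrete, here is a small helper (the function `peduzzi_max_covariates` is ours, for illustration only, and not part of Brumback's code):
```{r}
# Peduzzi et al. rule of thumb: the number of parameters should not exceed
# one tenth of the smaller of the two counts (events vs. non-events).
# Illustrative helper only; not from the book.
peduzzi_max_covariates <- function(n_ones, n_zeros) {
  floor(min(n_ones, n_zeros) / 10)
}
peduzzi_max_covariates(n_ones = 1500, n_zeros = 1500)  # outcome model: 150 covariates
peduzzi_max_covariates(n_ones = 600, n_zeros = 2400)   # exposure model: 60 covariates
```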
Brumback uses a linear function of $H$ to create a distribution of $T_i$ that depends on $H$ and that should give $\sum_i T_i \approx 600$. However, *the simulation with `simdr` returns* $\sum_i T_i \approx 540$.
```{r}
# sum of T
simdr.out$`T`$sum
```
::: {#errata1 .greeting .message style="color: red;"}
We observe that $\sum_i T_i \approx 540 \not \approx 600$.
:::
This is the part that needs explaining: the difference seems too large to be explained by the usual culprit, random variation.
To do so, let's introduce some notation and go through the mechanics of `simdr`.
Let $H_j$ denote covariate $j$
$$
\begin{align*}
H_j \text{ i.i.d. } \mathcal{Bernoulli}(p), \, j = 1, \ldots, J
\end{align*}
$$
where $J$ is the same variable as $ss$ from Brumback, i.e. $J = ss = 100$.
Also, $H_{i, j}$ denotes the value of covariate $H_j$ for individual $i$.
Let $T_i$ be the treatment indicator of individual $i$
$$
\begin{align*}
T_i \sim \mathcal{Bernoulli}(prob = P_i), \, i = 1, \ldots, I
\end{align*}
$$
where $I$ is set at $I=3000$ by Brumback.
In addition, we let
$$
S_i = \sum_{j=1}^{J}H_{i, j} \\
\text{where } J = ss \text{ as mentioned above}
$$
and `simdr` defines $P_i$ as a random variable
$$
\begin{align*}
&P_i = \alpha \frac{\beta}{J} S_i + p \times X_i \\
\\
&\text{where} \\
&\alpha = 0.13 \\
&\beta = 20 \\
&p = 0.05 \\
&X_i \sim \mathcal{Normal}(mean = 1, sd = 0.1)
\end{align*}
$$
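These definitions can be sketched directly in R (a minimal sketch of the data-generating step described above, using the notation of this section; it is not Brumback's `simdr` code and, because the random draws are made in a different order, it will not reproduce its output exactly):
```{r}
# Minimal sketch of the data generation described above (not Brumback's code).
# H is an I x J matrix of i.i.d. Bernoulli(p) covariates, S_i its row sums,
# P_i the treatment probability and T_i the treatment indicator.
set.seed(1009)
I <- 3000; J <- 100
alpha <- 0.13; beta <- 20; p <- 0.05
H <- matrix(rbinom(I * J, size = 1, prob = p), nrow = I, ncol = J)
S <- rowSums(H)
X <- rnorm(I, mean = 1, sd = 0.1)
P <- alpha * (beta / J) * S + p * X
T_i <- rbinom(I, size = 1, prob = P)
sum(T_i)  # close to the expected value derived below (about 540)
```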
Now, since
$$
\begin{align*}
H_{i, j} \text{ i.i.d. } \mathcal{Bernoulli}(p) &\implies E(H_{i, j})=p \\
T_i \sim \mathcal{Bernoulli}(P_{i}) &\implies E(T_i) = E(P_i) \\
X_i \sim \mathcal{Normal}(mean = 1, sd = 0.1) &\implies E(X_i) = 1
\end{align*}
$$
and since $E(\cdot)$ is linear, then
$$
\begin{align*}
E(T_i) &= E(P_i) \\
&= E(\alpha \cdot \frac{\beta}{J} S_i + p \cdot X_i) \\
&= \alpha \cdot \frac{\beta}{J} \sum_{j=1}^{J}E(H_{i, j}) + p \cdot E(X_i) \\
&= \alpha \cdot \frac{\beta}{J} \sum_{j=1}^{J}p + p \cdot 1 \\
&= \alpha \cdot \frac{\beta}{J} \cdot J \cdot p + p \\
&= \alpha \cdot \beta \cdot p + p
\end{align*}
$$
and the total number of people treated is $T = \sum_{i=1}^I T_i$, so
$$
\begin{align*}
E(T) &= \sum_{i=1}^I{E(T_i)} \\
&= \sum_{i=1}^I{(\alpha \cdot \beta \cdot p + p)} \\
&= I (\alpha \cdot \beta \cdot p + p) \\
\\
&\text{and since} \\
&I = 3000, \, \alpha = 0.13, \, \beta = 20, \, p = 0.05 \\
\\
&\text{then} \\
E(T) &= 3000 \cdot (0.13 \cdot 20 \cdot 0.05 + 0.05) \\
&= 540
\end{align*}
$$
which proves that the expected value of $T$ is not really close to 600 but rather 540, which is also what the simulation returns. By the Peduzzi rule, we could therefore include 54 covariates rather than 60.
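The same arithmetic as a quick check in R:
```{r}
# expected number treated: E(T) = I * (alpha * beta * p + p)
I <- 3000; alpha <- 0.13; beta <- 20; p <- 0.05
I * (alpha * beta * p + p)
```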
We note that this *does not change the conclusions drawn from the simulation significantly*.
Finally, we note that the formula $\alpha \cdot \beta \cdot p + p$ could be adjusted to obtain $E(T)=600$ by changing the coefficient $\alpha=0.13$ and/or $\beta=20$. If we change only $\alpha$ and keep $\beta=20$, we can use the result from above and a bit of algebra:
$$
\begin{align*}
E(T)&= I (\alpha \cdot \beta \cdot p + p) \\
\\
&\text{and since we want } E(T) = 600 \\
&\text{and that} \\
&\beta = 20, \, I = 3000, \, p = 0.05 \\
\\
&\text{then} \\
600 &= 3000 \cdot (\alpha \cdot 20 \cdot 0.05 + 0.05) \\
\alpha &= \frac{600 - 0.05 \cdot 3000}{3000 \cdot 20 \cdot0.05} = \frac{450}{3000} = 0.15
\end{align*}
$$
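The same algebra in R:
```{r}
# alpha required for E(T) = 600, keeping I, beta and p fixed
I <- 3000; beta <- 20; p <- 0.05
(600 / I - p) / (beta * p)
```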
and we simulate using $\alpha = 0.15$ to validate
```{r}
simdr600.out <- simdr(alpha = 0.15, seed = 1009)$stats
simdr600.out$`T`$sum
```
::: {#suggest1 .greeting .message style="color: green;"}
We suggest that `simdr` be modified to **use 0.15 instead of 0.13**.
:::
## Analyse $Y = \sum_{i=1}^I Y_i$
First we analyse $Y = \sum Y_i$ mathematically using the same notation as above.
$$
\begin{align*}
&\text{as defined in simdr} \\
Y_i &\sim \mathcal{Bernoulli}\left(0.01 \, T_i + 0.01 \frac{\beta}{J} S_i\right) \\
\\
&\text{therefore} \\
E(Y_i) &= 0.01 \, E(T_i) + 0.01 \frac{\beta}{J} E(S_i) \\
&= 0.01 \, E(P_i) + 0.01 \frac{\beta}{J} \sum_{j=1}^J{E(H_{i, j})} \\
&= 0.01 \cdot(\alpha \cdot \beta \cdot p + p) + 0.01 \frac{\beta}{J} \sum_{j=1}^J{p} \\
&= 0.01 \cdot(\alpha \cdot \beta \cdot p + p) + 0.01 \frac{\beta}{J} \cdot J \cdot p\\
&= 0.01 \cdot(\alpha \cdot \beta \cdot p + p) + 0.01 \cdot \beta \cdot p \\
\\
&\text{and given} \\
&\alpha = 0.13, \, \beta = 20, \, p = 0.05 \\
\\
&\text{therefore} \\
E(Y_i) &= 0.01 \cdot(0.13 \cdot 20 \cdot 0.05 + 0.05) + 0.01 \cdot 20 \cdot 0.05 \\
&= 0.0013 + 0.0005 + 0.01 \\
&= 0.0118
\end{align*}
$$
and since $Y = \sum_{i=1}^I Y_i$ then
$$
\begin{align*}
Y &= \sum_{i=1}^I Y_i \\
E(Y) &= \sum_{i=1}^I E(Y_i) = I \cdot E(Y_i) \\
\\
&\text{and from above} \\
E(Y_i) &= 0.0118 \\
\\
&\text{therefore} \\
E(Y) &= I \cdot E(Y_i) = 3000 \cdot 0.0118 = 35.4 \\
&\approx 35
\end{align*}
$$
which mathematically proves the following
> We simulated $Y_i$ as a function of $T_i$ and $\sum_{k=1}^{ss}H_{i,k}$, such that approximately 35 individuals had $Y=1$.
and the result we get from `simdr` is 38, which is close enough.
```{r}
simdr.out$Y$sum
```
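For comparison, the expected count implied by the derivation above (a quick arithmetic check):
```{r}
# E(Y) = I * (0.01 * (alpha * beta * p + p) + 0.01 * beta * p)
I <- 3000; alpha <- 0.13; beta <- 20; p <- 0.05
I * (0.01 * (alpha * beta * p + p) + 0.01 * beta * p)
```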
> The mean of $\sum_{k=1}^{ss}{H_{ik}}$ was fixed at one, but when $ss$ was set to 100, it ranged from 0.00 to 2.80.
That is exactly what we get.
```{r}
c("min" = simdr.out$sumH$min, "max" = simdr.out$sumH$max)
```
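This is also consistent with the theoretical moments of the scaled sum $\frac{\beta}{J}\sum_{k=1}^{J}H_{i,k}$ (assuming, as the quote suggests, that `sumH` reports this scaled sum): its mean is $\beta p = 1$ and its standard deviation is about 0.44, so a minimum of 0.00 and a maximum near 2.80 among 3000 individuals are plausible.
```{r}
# theoretical mean and sd of (beta / J) * sum_j H_ij, with H_ij i.i.d. Bernoulli(p)
J <- 100; beta <- 20; p <- 0.05
c("mean" = beta * p,
  "sd"   = (beta / J) * sqrt(J * p * (1 - p)))
```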
The book then states that
> $P(T = 1 \mid H)$ ranged from 0.041 to 0.468.
::: {#errata2 .greeting .message style="color: red;"}
The results are different.
:::
```{r}
c("min" = simdr.out$probT$min, "max" = simdr.out$probT$max)
```
This is probably because the original simulation produced $T \approx 600$ rather than $T \approx 540$, as discussed above. Indeed, if we check against the simulation with $\alpha = 0.15$, which gives $T \approx 600$, the results are now reasonably close.
```{r}
c("min" = simdr600.out$probT$min, "max" = simdr600.out$probT$max)
```
> $E(Y \mid T, H)$ ranged from 0.000 to 0.036.
and the results are almost identical
```{r}
c("min" = simdr.out$probY$min, "max" = simdr.out$probY$max)
```
## Functions
The function `simdr` is found on pp. 127-128; its script is reproduced in the last section of this appendix. We only added some arguments and some output to facilitate the analysis in the sections above.
The Monte Carlo simulation is done with a non-parametric Monte Carlo function, `mc_standdr`, whose script is given in the *Scripts* section below.
Note that `simdr` was rewritten as the function `mc_standdr` to
- facilitate the analysis of the algorithm
- perform a Monte Carlo simulation with the `MonteCarlo` package
- separate the data simulation from the computation of the estimates into two sub-functions:
  - `standdr_sim`: Simulate the data
  - `standdr_est`: Calculate the estimates using the simulated data from `standdr_sim`
`mc_standdr` and `simdr` give exactly the same results for the simulated data
```{r}
simdr.out <- simdr(seed = 1009)
new_simdr.out <- standdr_sim(seed = 1009)
stopifnot(identical(simdr.out$stats, new_simdr.out$stats))
```
as well as the measurements
```{r}
new_simdr.est <- standdr_est(Y = new_simdr.out$data$Y,
                             `T` = new_simdr.out$data$`T`,
                             H = new_simdr.out$data$H)
stopifnot(identical(simdr.out$est, new_simdr.est))
```
## Scripts
### `simdr`
```{r}
#| file: "lib\\fci_06-I_simdr.R"
```
### `mc_standdr`
```{r}
#| file: "lib\\fci_06-H_mc_standdr.R"
```