# Doubly Robust Simulation {#mc_standdr}
```{r }
library(tidyr, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(gt, quietly = TRUE)
library(geepack, quietly = TRUE)
library(MonteCarlo, quietly = TRUE)
```
```{r include=FALSE}
# directory of data files
dir_data <- file.path(getwd(), "data")
# directory for functions
dir_lib <- file.path(getwd(), "lib")
```
```{r }
#| include: false
source(file = file.path(dir_lib, "gt_utils.R"),
       local = knitr::knit_global())
source(file = file.path(dir_lib, "gt_measures.R"),
       local = knitr::knit_global())
source(file = file.path(dir_lib, "fci_06-H_mc_standdr.R"),
       local = knitr::knit_global())
source(file = file.path(dir_lib, "fci_06-I_simdr.R"),
       local = knitr::knit_global())
```
## Analyse $T = \sum_{i=1}^I{T_i}$ (Errata) {#errata6a}
Section 6.3 claims that $T=600$ and $P(T=1 \mid H) \in [0.041, 0.468]$, but the results obtained from running `simdr` as shown in the book are different. Not by much, but enough to puzzle the avid reader (like me, for example).
So let's go through `simdr` to explain the differences.
```{r}
simdr.out <- simdr(seed = 1009)$stats
```
> One might think that standardization with the exposure model would be preferable when the outcome indicates a rare condition. To see this, first suppose the condition is not rare. We might have 3000 individuals and 50%, or 1500, with the condition. Using the rule of thumb for logistic regression of Peduzzi et al. presented in chapter 2, we should be able to include 150 covariates in the outcome model.
The Peduzzi rule can be found at the end of section 2.3, at the top of p. 29, where it states that
> both the numbers of individuals with $Y=0$ and $Y=1$ need to be larger than ten times the number of parameters.
and therefore
> Now suppose the exposure, $T$, is divided more evenly: that is, we have 600 with $T=1$. This would suggest we can include 60 covariates in the exposure model.
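To make this rule of thumb concrete, here is a small helper (the function `peduzzi_max_covariates` is ours, for illustration only, and not part of Brumback's code):
```{r}
# Peduzzi et al. rule of thumb: the number of parameters should not exceed
# one tenth of the smaller of the two counts (events vs. non-events).
# Illustrative helper only; not from the book.
peduzzi_max_covariates <- function(n_ones, n_zeros) {
  floor(min(n_ones, n_zeros) / 10)
}
peduzzi_max_covariates(n_ones = 1500, n_zeros = 1500)  # outcome model: 150 covariates
peduzzi_max_covariates(n_ones = 600, n_zeros = 2400)   # exposure model: 60 covariates
```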
Brumback uses a linear function of $H$ to create a distribution of $T_i$ that depends on $H$ and that should give $\sum_i T_i \approx 600$. However, *the simulation with `simdr` returns* $\sum_i T_i \approx 540$.
```{r}
# sum of T
simdr.out$`T`$sum
```
::: {#errata1 .greeting .message style="color: red;"}
We observe that $\sum_i T_i \approx 540 \not \approx 600$.
:::
This is the part that needs explaining: the difference seems too large to be explained by the usual culprit, random variation.
To do so, let's introduce some notation and go through the mechanics of `simdr`.
Let $H_j$ denote covariate $j$
$$
\begin{align*}
H_j \text{ i.i.d. } \mathcal{Bernoulli}(p), \, j = 1, \ldots, J
\end{align*}
$$
where $J$ is the same variable as $ss$ from Brumback, i.e. $J = ss = 100$.
Also, $H_{i, j}$ denotes the value of covariate $H_j$ for individual $i$.
Let $T_i$ be the treatment indicator of individual $i$
$$
\begin{align*}
T_i \sim \mathcal{Bernoulli}(prob = P_i), \, i = 1, \ldots, I
\end{align*}
$$
where $I$ is set at $I=3000$ by Brumback.
In addition, we let
$$
S_i = \sum_{j=1}^{J}H_{i, j} \\
\text{where } J = ss \text{ as mentioned above}
$$
and `simdr` defines $P_i$ as a random variable
$$
\begin{align*}
&P_i = \alpha \frac{\beta}{J} S_i + p \times X_i \\
\\
&\text{where} \\
&\alpha = 0.13 \\
&\beta = 20 \\
&p = 0.05 \\
&X_i \sim \mathcal{Normal}(mean = 1, sd = 0.1)
\end{align*}
$$
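These definitions can be sketched directly in R (a minimal sketch of the data-generating step described above, using the notation of this section; it is not Brumback's `simdr` code and, because the random draws are made in a different order, it will not reproduce its output exactly):
```{r}
# Minimal sketch of the data generation described above (not Brumback's code).
# H is an I x J matrix of i.i.d. Bernoulli(p) covariates, S_i its row sums,
# P_i the treatment probability and T_i the treatment indicator.
set.seed(1009)
I <- 3000; J <- 100
alpha <- 0.13; beta <- 20; p <- 0.05
H <- matrix(rbinom(I * J, size = 1, prob = p), nrow = I, ncol = J)
S <- rowSums(H)
X <- rnorm(I, mean = 1, sd = 0.1)
P <- alpha * (beta / J) * S + p * X
T_i <- rbinom(I, size = 1, prob = P)
sum(T_i)  # close to the expected value derived below (about 540)
```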
Now, since
$$
\begin{align*}
H_{i, j} \text{ i.i.d. } \mathcal{Bernoulli}(p) &\implies E(H_{i, j})=p \\
T_i \sim \mathcal{Bernoulli}(P_{i}) &\implies E(T_i) = E(P_i) \\
X_i \sim \mathcal{Normal}(mean = 1, sd = 0.1) &\implies E(X_i) = 1
\end{align*}
$$
and since $E(\cdot)$ is linear, then
$$
\begin{align*}
E(T_i) &= E(P_i) \\
&= E(\alpha \cdot \frac{\beta}{J} S_i + p \cdot X_i) \\
&= \alpha \cdot \frac{\beta}{J} \sum_{j=1}^{J}E(H_{i, j}) + p \cdot E(X_i) \\
&= \alpha \cdot \frac{\beta}{J} \sum_{j=1}^{J}p + p \cdot 1 \\
&= \alpha \cdot \frac{\beta}{J} \cdot J \cdot p + p \\
&= \alpha \cdot \beta \cdot p + p
\end{align*}
$$
and the total number of people treated is $T = \sum_{i=1}^I T_i$, so
$$
\begin{align*}
E(T) &= \sum_{i=1}^I{E(T_i)} \\
&= \sum_{i=1}^I{(\alpha \cdot \beta \cdot p + p)} \\
&= I (\alpha \cdot \beta \cdot p + p) \\
\\
&\text{and since} \\
&I = 3000, \, \alpha = 0.13, \, \beta = 20, \, p = 0.05 \\
\\
&\text{then} \\
E(T) &= 3000 \cdot (0.13 \cdot 20 \cdot 0.05 + 0.05) \\
&= 540
\end{align*}
$$
which proves that the expected value of $T$ is not really close to 600 but rather 540, which is also what the simulation returns. By the Peduzzi rule, we could therefore include 54 covariates rather than 60.
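The same arithmetic as a quick check in R:
```{r}
# expected number treated: E(T) = I * (alpha * beta * p + p)
I <- 3000; alpha <- 0.13; beta <- 20; p <- 0.05
I * (alpha * beta * p + p)
```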
We note that this *does not change the conclusions drawn from the simulation significantly*.
Finally, we note that the formula $\alpha \cdot \beta \cdot p + p$ could be adjusted to obtain $E(T)=600$ by changing the coefficient $\alpha=0.13$ and/or $\beta=20$. If we change only $\alpha$ and keep $\beta=20$, we can use the result from above and a bit of algebra:
$$
\begin{align*}
E(T)&= I (\alpha \cdot \beta \cdot p + p) \\
\\
&\text{and since we want } E(T) = 600 \\
&\text{and that} \\
&\beta = 20, \, I = 3000, \, p = 0.05 \\
\\
&\text{then} \\
600 &= 3000 \cdot (\alpha \cdot 20 \cdot 0.05 + 0.05) \\
\alpha &= \frac{600 - 0.05 \cdot 3000}{3000 \cdot 20 \cdot0.05} = \frac{450}{3000} = 0.15
\end{align*}
$$
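The same algebra in R:
```{r}
# alpha required for E(T) = 600, keeping I, beta and p fixed
I <- 3000; beta <- 20; p <- 0.05
(600 / I - p) / (beta * p)
```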
and we simulate using $\alpha = 0.15$ to validate
```{r}
simdr600.out <- simdr(alpha = 0.15, seed = 1009)$stats
simdr600.out$`T`$sum
```
::: {#suggest1 .greeting .message style="color: green;"}
We suggest that `simdr` be modified to **use 0.15 instead of 0.13**.
:::
## Analyse $Y = \sum_{i=1}^I Y_i$
First we analyse $Y = \sum Y_i$ mathematically using the same notation as above.
$$
\begin{align*}
&\text{as defined in simdr} \\
Y_i &\sim \mathcal{Bernoulli}\left(0.01 \, T_i + 0.01 \frac{\beta}{J} S_i\right) \\
\\
&\text{therefore} \\
E(Y_i) &= 0.01 \, E(T_i) + 0.01 \frac{\beta}{J} E(S_i) \\
&= 0.01 \, E(P_i) + 0.01 \frac{\beta}{J} \sum_{j=1}^J{E(H_{i, j})} \\
&= 0.01 \cdot(\alpha \cdot \beta \cdot p + p) + 0.01 \frac{\beta}{J} \sum_{j=1}^J{p} \\
&= 0.01 \cdot(\alpha \cdot \beta \cdot p + p) + 0.01 \frac{\beta}{J} \cdot J \cdot p\\
&= 0.01 \cdot(\alpha \cdot \beta \cdot p + p) + 0.01 \cdot \beta \cdot p \\
\\
&\text{and given} \\
&\alpha = 0.13, \, \beta = 20, \, p = 0.05 \\
\\
&\text{therefore} \\
E(Y_i) &= 0.01 \cdot(0.13 \cdot 20 \cdot 0.05 + 0.05) + 0.01 \cdot 20 \cdot 0.05 \\
&= 0.0013 + 0.0005 + 0.01 \\
&= 0.0118
\end{align*}
$$
and since $Y = \sum_{i=1}^I Y_i$ then
$$
\begin{align*}
Y &= \sum_{i=1}^I Y_i \\
E(Y) &= \sum_{i=1}^I E(Y_i) = I \cdot E(Y_i) \\
\\
&\text{and from above} \\
E(Y_i) &= 0.0118 \\
\\
&\text{therefore} \\
E(Y) &= I \cdot E(Y_i) = 3000 \cdot 0.0118 = 35.4 \\
&\approx 35
\end{align*}
$$
which mathematically proves the following
> We simulated $Y_i$ as a function of $T_i$ and $\sum_{k=1}^{ss}H_{i,k}$, such that approximately 35 individuals had $Y=1$.
and the result we get from `simdr` is 38, which is close enough.
```{r}
simdr.out$Y$sum
```
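For comparison, the expected count implied by the derivation above (a quick arithmetic check):
```{r}
# E(Y) = I * (0.01 * (alpha * beta * p + p) + 0.01 * beta * p)
I <- 3000; alpha <- 0.13; beta <- 20; p <- 0.05
I * (0.01 * (alpha * beta * p + p) + 0.01 * beta * p)
```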
> The mean of $\sum_{k=1}^{ss}{H_{ik}}$ was fixed at one, but when $ss$ was set to 100, it ranged from 0.00 to 2.80.
That is exactly what we get.
```{r}
c("min" = simdr.out$sumH$min, "max" = simdr.out$sumH$max)
```
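This is also consistent with the theoretical moments of the scaled sum $\frac{\beta}{J}\sum_{k=1}^{J}H_{i,k}$ (assuming, as the quote suggests, that `sumH` reports this scaled sum): its mean is $\beta p = 1$ and its standard deviation is about 0.44, so a minimum of 0.00 and a maximum near 2.80 among 3000 individuals are plausible.
```{r}
# theoretical mean and sd of (beta / J) * sum_j H_ij, with H_ij i.i.d. Bernoulli(p)
J <- 100; beta <- 20; p <- 0.05
c("mean" = beta * p,
  "sd"   = (beta / J) * sqrt(J * p * (1 - p)))
```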
The book then states that
> $P(T = 1 \mid H)$ ranged from 0.041 to 0.468.
::: {#errata2 .greeting .message style="color: red;"}
The results are different.
:::
```{r}
c("min" = simdr.out$probT$min, "max" = simdr.out$probT$max)
```
This is probably because the original simulation produced $T \approx 600$ rather than $T \approx 540$, as discussed above. Indeed, if we check against the simulation with $\alpha = 0.15$, which gives $T \approx 600$, the results are now reasonably close.
```{r}
c("min" = simdr600.out$probT$min, "max" = simdr600.out$probT$max)
```
> $E(Y \mid T, H)$ ranged from 0.000 to 0.036.
and the results are almost identical
```{r}
c("min" = simdr.out$probY$min, "max" = simdr.out$probY$max)
```
## Functions
The function `simdr` is found on pp. 127-128; its script is reproduced in the last section of this appendix. We only added some arguments and some output to facilitate the analysis in the sections above.
The Monte Carlo simulation is done with a non-parametric Monte Carlo function, `mc_standdr`, whose script is given in the *Scripts* section below.
Note that `simdr` was rewritten as the function `mc_standdr` to
- facilitate the analysis of the algorithm
- perform a Monte Carlo simulation with the `MonteCarlo` package
- separate the data simulation from the computation of the estimates into two sub-functions:
  - `standdr_sim`: Simulate the data
  - `standdr_est`: Calculate the estimates using the simulated data from `standdr_sim`
`mc_standdr` and `simdr` give exactly the same results for the simulated data
```{r}
simdr.out <- simdr(seed = 1009)
new_simdr.out <- standdr_sim(seed = 1009)
stopifnot(identical(simdr.out$stats, new_simdr.out$stats))
```
as well as the measurements
```{r}
new_simdr.est <- standdr_est(Y = new_simdr.out$data$Y,
                             `T` = new_simdr.out$data$`T`,
                             H = new_simdr.out$data$H)
stopifnot(identical(simdr.out$est, new_simdr.est))
```
## Scripts
### `simdr`
```{r}
#| file: "lib\\fci_06-I_simdr.R"
```
### `mc_standdr`
```{r}
#| file: "lib\\fci_06-H_mc_standdr.R"
```