-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path4. Data Analysis Part 2.Rmd
798 lines (553 loc) · 46.7 KB
/
4. Data Analysis Part 2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
---
title: "Data Analysis #2 Version 2"
author: "Gowani, Ali"
output:
html_document: default
---
```{r setup, include = FALSE}
# DO NOT ADD OR REVISE CODE HERE
knitr::opts_chunk$set(echo = FALSE, eval = TRUE)
```
### Instructions
R markdown is a plain-text file format for integrating text and R code, and creating transparent, reproducible and interactive reports. An R markdown file (.Rmd) contains metadata, markdown and R code "chunks", and can be "knit" into numerous output types. Answer the test questions by adding R code to the fenced code areas below each item. There are questions that require a written answer that also need to be answered. Enter your comments in the space provided as shown below:
***Answer: Ali Gowani***
Once completed, you will "knit" and submit the resulting .html document and the .Rmd file. The .html will present the output of your R code and your written answers, but your R code will not appear. Your R code will appear in the .Rmd file. The resulting .html document will be graded and a feedback report returned with comments. Points assigned to each item appear in the template.
**Before proceeding, look to the top of the .Rmd for the (YAML) metadata block, where the *title*, *author* and *output* are given. Please change *author* to include your name, with the format 'lastName, firstName.'**
If you encounter issues with knitting the .html, please send an email via Canvas to your TA.
Each code chunk is delineated by six (6) backticks; three (3) at the start and three (3) at the end. After the opening ticks, arguments are passed to the code chunk and in curly brackets. **Please do not add or remove backticks, or modify the arguments or values inside the curly brackets**. An example code chunk is included here:
```{r exampleCodeChunk, eval = FALSE, echo = TRUE}
# Comments are included in each code chunk, simply as prompts
#...R code placed here
#...R code placed here
```
R code only needs to be added inside the code chunks for each assignment item. However, there are questions that follow many assignment items. Enter your answers in the space provided. An example showing how to use the template and respond to a question follows.
-----
**Example Problem with Solution:**
Use *rbinom()* to generate two random samples of size 10,000 from the binomial distribution. For the first sample, use p = 0.45 and n = 10. For the second sample, use p = 0.55 and n = 10. Convert the sample frequencies to sample proportions and compute the mean number of successes for each sample. Present these statistics.
```{r Example, eval = TRUE, echo = TRUE}
set.seed(123)
sample.one <- table(rbinom(10000, 10, 0.45)) / 10000
sample.two <- table(rbinom(10000, 10, 0.55)) / 10000
successes <- seq(0, 10)
round(sum(sample.one*successes), digits = 1) # [1] 4.5
round(sum(sample.two*successes), digits = 1) # [1] 5.5
```
**Question: How do the simulated expectations compare to calculated binomial expectations?**
***Answer: The calculated binomial expectations are 10(0.45) = 4.5 and 10(0.55) = 5.5. After rounding the simulated results, the same values are obtained.***
-----
Submit both the .Rmd and .html files for grading. You may remove the instructions and example problem above, but do not remove the YAML metadata block or the first, "setup" code chunk. Address the steps that appear below and answer all the questions. Be sure to address each question with code and comments as needed. You may use either base R functions or ggplot2 for the visualizations.
-----
##Data Analysis #2
```{r analysis_setup1, message = FALSE, warning = FALSE}
# Perform the following steps to start the assignment.
# 1) Load/attach the following packages via library(): flux, ggplot2, gridExtra, moments, rockchalk, car.
# NOTE: packages must be installed via install.packages() before they can be loaded.
library(flux)
library(ggplot2)
library(gridExtra)
library(moments)
# library(rockchalk) # base R code replaces requirement for this package
library(car)
library(kableExtra)
library(dplyr)
# 2) Use the "mydata.csv" file from Assignment #1 or use the file posted on the course site. Reading
# the files into R will require sep = "" or sep = " " to format data properly. Use str() to check file
# structure.
mydata_3 <- read.csv("mydata.csv", Header=TRUE, sep = ",", nrow=3)
mydata_2 <- read.csv("mydata.csv", sep = ",")
# mydata <- read.csv(file.path("c:...", "mydata.csv"), sep = ",")
# mydata <- read.csv(file.path("c:/Rabalone/", "mydata.csv"), sep = ",")
str(mydata)
```
### Test Items starts from here - There are 10 sections - total of 75 points ##############
##### Section 1: (5 points)
(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using 'rockchalk.' Be aware that with 'rockchalk', the kurtosis value has 3.0 subtracted from it which differs from the 'moments' package.
```{r Part_1a}
hist_ratio <- ggplot(mydata, aes(x = mydata$RATIO)) +
geom_histogram(fill = "#56B4E9", alpha = .6, bins = 10) +
labs(y = "Frequency",
x = "Ratio",
subtitle = "Histogram of Ratio")
qq_ratio <- ggplot(mydata, aes(sample = RATIO)) +
stat_qq(show.legend = FALSE, colour="#E69F00", alpha = .6) +
stat_qq_line(show.legend = FALSE, colour="#555555")+
labs(y = "Sample Quantiles",
x = "Theoretical Quantiles",
subtitle = "Normal Q-Q Plot (Ratio)")
grid.arrange(hist_ratio, qq_ratio, ncol = 2, nrow = 1)
skewness <- rockchalk::skewness(mydata$RATIO)
kurtosis <- rockchalk::kurtosis(mydata$RATIO)
kurtosis_wo_excess <- rockchalk::kurtosis(mydata$RATIO, excess = FALSE)
table_skewness_kurtosis <- tribble(~"Type", ~"Calculation",
"Skewness", round(skewness, 4),
"Kurtosis with Excess", round(kurtosis, 4),
"Kurtosis without Excess", round(kurtosis_wo_excess, 4),
)
kable(table_skewness_kurtosis, format = "html", caption = "Ratio: Skewness & Kurtosis") %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
font_size = 12,
position = "left") %>%
add_footnote(c("Data Analysis #2, Section 1.A"))
```
(1)(b) Tranform RATIO using *log10()* to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.
```{r Part_1b}
mydata$L_RATIO <- log10(mydata$RATIO)
hist_lratio <- ggplot(mydata, aes(x = mydata$L_RATIO)) +
geom_histogram(fill = "#56B4E9", alpha = .6, bins = 10) +
labs(y = "Frequency",
x = "Ratio",
subtitle = "Histogram of L_Ratio")
qq_lratio <- ggplot(mydata, aes(sample = L_RATIO)) +
stat_qq(show.legend = FALSE, colour="#E69F00", alpha = .6) +
stat_qq_line(show.legend = FALSE, colour="#555555")+
labs(y = "Sample Quantiles",
x = "Theoretical Quantiles",
subtitle = "Normal Q-Q Plot (L_Ratio)")
grid.arrange(hist_lratio, qq_lratio, ncol = 2, nrow = 1)
skewness_lratio <- rockchalk::skewness(mydata$L_RATIO)
kurtosis_lratio <- rockchalk::kurtosis(mydata$L_RATIO)
kurtosis_wo_excess_lratio <- rockchalk::kurtosis(mydata$L_RATIO, excess = FALSE)
table_skewness_kurtosis <- tribble(~"Type", ~"Calculation",
"Skewness", round(skewness_lratio, 4),
"Kurtosis with Excess", round(kurtosis_lratio, 4),
"Kurtosis without Excess", round(kurtosis_wo_excess_lratio, 4),
)
kable(table_skewness_kurtosis, format = "html", caption = "L_Ratio: Skewness & Kurtosis") %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
font_size = 12,
position = "left") %>%
add_footnote(c("Data Analysis #2, Section 1.B"))
gg_color_hue <- function(n) {
hues = seq(15, 375, length = n + 1)
hcl(h = hues, l = 65, c = 100)[1:n]
}
box_lratio <- ggplot(mydata, aes(y = mydata$L_RATIO, x = mydata$CLASS)) +
stat_boxplot(geom ='errorbar', linetype = 1, width = 0.25) +
geom_boxplot(fill = gg_color_hue(5), alpha = .9) +
labs(y = "L_Ratio",
x = "Class",
subtitle = "Boxplot of L_Ratio by Class")
box_lratio
```
(1)(c) Test the homogeneity of variance across classes using *bartlett.test()* (Kabacoff Section 9.2.2, p. 222).
```{r Part_1c}
bartlett_lratio <- bartlett.test(L_RATIO ~ CLASS, data = mydata)
bartlett_lratio
```
**Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?**
***Answer: When we look at L_RATIO, it seems to exhibit better conformance to a normal distribution with homogeneous variances across age classes. This conclusion can be derived by looking at the Histogram and QQ plot. The distribution in the Histogram seems to show minimal skews and relatively normal distribution. The QQ plot shows most data points on the qq line with relatively even outliers on both ends. In addition, the boxplot shows a similar pattern with few outliers. In addition, when we transform the variable to log10, there is more of a normal distribution with kurtosis and skewness. The Bartlett Test of Homogeneity of Variances shows us that we can reject the null hypothesis.***
##### Section 2 (10 points) ###############################
(2)(a) Perform an analysis of variance with *aov()* on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use *summary()* to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).
```{r Part_2a}
anova_lratio_1 <- aov(L_RATIO ~ CLASS*SEX, data = mydata)
summary(anova_lratio_1)
anova_lratio_2 <- aov(L_RATIO ~ CLASS + SEX, data = mydata)
summary(anova_lratio_2)
```
**Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?**
***Answer: When we compare the two analyses and look at the interactions between the factors (CLASS and SEX) and L_Ratio, we can see that there is no statistically significant interaction between the two variables as indicated by the pvalue (0.87709). However, the primary effect of the variables CLASS and SEX seem to be statistically significant.***
(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the *TukeyHSD()* function. Interpret the results at the 95% confidence level (*TukeyHSD()* will adjust for unequal sample sizes).
```{r Part_2b}
TukeyHSD(anova_lratio_2)
```
**Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as 'adults?' If not, why not?**
***Answer: Looking at the CLASS coefficients, there seems to be a significant interaction between all of them except, perhaps A2-A1. When we look at the results between the sexes, we can see reject the null hypothesis for Infant and Females as well as, Infant and Males. However, looking at the p-value for Males and Females, we can conclude that these can be combined into a single group: Adults.***
###### Section 3: (10 points) ##################
(3)(a1) We combine "M" and "F" into a new level, "ADULT". (While this could be accomplished using *combineLevels()* from the 'rockchalk' package, we use base R code because many students do not have access to the rockchalk package.) This necessitated defining a new variable, TYPE, in mydata which had two levels: "I" and "ADULT".
```{r Part_3a1}
# here we show how to define the new variable TYPE using only base R functions (no need for outside packages)
mydata$TYPE <- character(nrow(mydata)) # initialize the TYPE column as all blanks
for (i in seq(along = mydata$SEX)) {
mydata$TYPE[i] <- 'I'
if (mydata$SEX[i] == 'M' || mydata$SEX[i] == 'F') mydata$TYPE[i] <- 'ADULT'
}
mydata$TYPE <- factor(mydata$TYPE)
cat('\nCheck on definition of TYPE object (should be an integer): ', typeof(mydata$TYPE))
cat('\nmydata$TYPE is treated as a factor: ', is.factor(mydata$TYPE), '\n')
table(mydata$SEX, mydata$TYPE)
mean(as.numeric(mydata$CLASS))
mean(as.numeric(as.factor(mydata$CLASS)))
mydata$CLASS <- factor(mydata$CLASS)
mean(as.numeric(mydata$TYPE))
mean(as.numeric(mydata$TYPE))
res <- ddply(mydata, "CLASS", transform, result=c(NA, diff(log(score))))
res
res=diff(log(mydata$LENGTH))s
summary(mydata)
```
(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.
```{r Part_3a2}
par(mfrow = c(1, 2))
break_infants <- hist(mydata$VOLUME[mydata$TYPE == 'I'], col = "#56B4E9", xlab = "Volume",
main = "Infant Volumes")
break_adults <- hist(mydata$VOLUME[mydata$TYPE == 'ADULT'], col = "#E69F00", xlab = "Volume",
main = "Adult Volumes")
hist_gg_infant <- ggplot(data = NULL, aes(x = mydata$VOLUME[mydata$TYPE == "I"])) +
geom_histogram(fill = "#56B4E9", alpha = .6, breaks = break_infants$breaks) +
labs(y = "Frequency",
x = "Volume",
subtitle = "Infant Volumes") +
coord_cartesian(xlim = c(0, 800))
hist_gg_adult <- ggplot(data = NULL, aes(x = mydata$VOLUME[mydata$TYPE == "ADULT"])) +
geom_histogram(fill = "#E69F00", alpha = .6, breaks = break_adults$breaks) +
labs(y = "Frequency",
x = "Volume",
subtitle = "Adult Volumes") +
coord_cartesian(xlim = c(0, 1000))
grid.arrange(hist_gg_infant, hist_gg_adult, ncol = 2, nrow = 1)
```
**Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?**
***Answer: The histogram of Infant Volumes skews to the right, whereas, the histogram of Adult Volumes seems to be normally distributed. In the histogram of Infant Volumes, we also see that at the higher end of the volume axis, there may be few outliers. This may cause some overlap between the two groups. However, most of the Infant population seems to be less than 200, while the majority of the Adult population seems to be greater than 200. This would indicate that volume may be a useful factor in separating infants from adults.***
(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.
```{r Part_3b}
mydata$L_VOLUME <- log10(mydata$VOLUME)
mydata$L_SHUCK <- log10(mydata$SHUCK)
scat_volume_class <- ggplot(data = mydata, aes(mydata$VOLUME)) +
geom_point(aes(y = mydata$SHUCK, color = mydata$CLASS), alpha = .6) +
labs(subtitle="Shuck and Volume by Class",
y = "Shuck",
x = "Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right", legend.title=element_text(size = 9)) +
theme(legend.position = c(0.85, 0.3)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 1200)) +
scale_color_discrete(name = "Class")
scat_lvolume_class <- ggplot(data = mydata, aes(mydata$L_VOLUME)) +
geom_point(aes(y = mydata$L_SHUCK, color = mydata$CLASS), alpha = .6) +
labs(subtitle="L_Shuck and L_Volume by Class",
y = "L_Shuck",
x = "L_Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right", legend.title=element_text(size = 9)) +
theme(legend.position = c(0.85, 0.3)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 3.5)) +
scale_color_discrete(name = "Class")
scat_volume_type <- ggplot(data = mydata, aes(mydata$VOLUME)) +
geom_point(aes(y = mydata$SHUCK, color = mydata$TYPE), alpha = .6) +
labs(subtitle="Shuck and Volume by Type",
y = "Shuck",
x = "Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="bottom", legend.title=element_text(size = 9)) +
theme(legend.position = c(0.85, 0.2)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 1200)) +
scale_color_discrete(name = "Type", labels = c("Adult", "Infant"))
scat_lvolume_type <- ggplot(data = mydata, aes(mydata$L_VOLUME)) +
geom_point(aes(y = mydata$L_SHUCK, color = mydata$TYPE), alpha = .6) +
labs(subtitle="L_Shuck and L_Volume by Type",
y = "L_Shuck",
x = "L_Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="bottom", legend.title=element_text(size = 9)) +
theme(legend.position = c(0.85, 0.2)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 3.5)) +
scale_color_discrete(name = "Type", labels = c("Adult", "Infant"))
grid.arrange(scat_volume_class, scat_lvolume_class, ncol = 2, nrow = 1)
grid.arrange(scat_volume_type, scat_lvolume_type, ncol = 2, nrow = 1)
```
**Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?**
***Answer: The plots for Shuck and Volume display a lot of overlap for both Class and Type. For example, on the Type plot for Shuck and Volume, we can barely see the Infants in the group as Adults are all over the plot, indicating that there is not much separation. However, when we look at the log-transformed plots, we see a better relationship between Class and Type. For example, the Adults in the log-transformed plot are towards the top-right quadrant of the plot, whereas the Infants are on the bottom-left quadrant. The same thing can be said of the log-transformed plot for Class, where we see Class A5 to the top-right quadrant and Class A1 to the bottom-left quadrant.***
###### Section 4: (5 points) ###################################
(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. Reclassify the infants in classes A4 and A5 as ADULTS. This reclassification could have been achieved using *combineLevels()*, but only on the abalones in classes A4 and A5. We will do this recoding of the TYPE variable using base R functions. We will use this recoded TYPE variable, in which the infants in A4 and A5 are reclassified as ADULTS, for the remainder of this data analysis assignment.
```{r Part_4a1}
for (i in seq(along = mydata$TYPE)) {
if (mydata$CLASS[i] == 'A4' || mydata$CLASS[i] == 'A5') mydata$TYPE[i] <- 'ADULT'
}
mydata$TYPE <- factor(mydata$TYPE)
cat('\nCheck on redefinition of TYPE object (should be an integer): ', typeof(mydata$TYPE))
cat('\nmydata$TYPE is treated as a factor: ', is.factor(mydata$TYPE), '\n')
cat('\nThree-way contingency table for SEX, CLASS, and TYPE:\n')
print(table(mydata$SEX, mydata$CLASS, mydata$TYPE))
```
(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply *summary()* to the model object to produce results.
```{r Part_4a2}
model <- lm(L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
summary(model)
```
**Essay Question: Interpret the trend in CLASS levelcoefficient estimates? (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).**
***Answer: Based on the CLASS level coefficient estimates, it can be interpreted that as CLASS increases then there is a greater decrease in L_SHUCK. This would also indicate that L_SHUCK increases more significantly with the CLASSA2 and CLASSA3 than with the CLASSA4 and CLASSA5.***
**Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.**
***Answer: In the previous question, our analysis showed that TYPE plays a significant factor. However, it does not seem that TYPE has a strong coefficient and may not be as important predictor as we thought earlier.***
-----
The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).
-----
###### Section 5: (5 points) #################################
(5)(a) If "model" is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with 'rockchalk,' the kurtosis value has 3.0 subtracted from it which differs from the 'moments' package.
```{r Part_5a}
hist_residuals <- ggplot(mydata, aes(x = model$residuals)) +
geom_histogram(fill = "#56B4E9", alpha = .6, bins = 10) +
labs(y = "Frequency",
x = "Residuals",
subtitle = "Histogram of Residuals")
qq_residuals <- ggplot(mydata, aes(sample=model$residuals)) +
stat_qq(show.legend = FALSE, colour="#E69F00", alpha = .6) +
stat_qq_line(show.legend = FALSE, colour="#555555")+
labs(y = "Sample Quantiles",
x = "Theoretical Quantiles",
subtitle = "QQ Plot of Residuals")
grid.arrange(hist_residuals, qq_residuals, ncol = 2, nrow = 1)
residuals_skewness <- rockchalk::skewness(model$residuals)
residuals_kurtosis <- rockchalk::kurtosis(model$residuals)
residuals_kurtosis_wo_excess <- rockchalk::kurtosis(model$residuals, excess = FALSE)
table_residuals <- tribble(~"Type", ~"Calculation",
"Skewness", round(residuals_skewness, 4),
"Kurtosis with Excess", round(residuals_kurtosis, 4),
"Kurtosis without Excess", round(residuals_kurtosis_wo_excess, 4),
)
kable(table_residuals, format = "html", caption = "Residuals: Skewness & Kurtosis") %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
font_size = 12,
position = "left") %>%
add_footnote(c("Data Analysis #2, Section 5.A"))
```
(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE (These four plots can be conveniently presented on one page using *par(mfrow..)* or *grid.arrange()*. Test the homogeneity of variance of the residuals across classes using *bartlett.test()* (Kabacoff Section 9.3.2, p. 222).
```{r Part_5b}
scat_residual_class <- ggplot(data = mydata, aes(mydata$L_VOLUME)) +
geom_point(aes(y = model$residuals, color = mydata$CLASS), alpha = .6) +
labs(subtitle="Residual and L_Volume by Class",
y = "Residual",
x = "L_Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right", legend.title=element_text(size = 9)) +
theme(legend.position = c(0.9, 0.25)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 4), ylim = c(-0.4, 0.4)) +
scale_color_discrete(name = "Class")
scat_residual_type <- ggplot(data = mydata, aes(mydata$L_VOLUME)) +
geom_point(aes(y = model$residuals, color = mydata$TYPE), alpha = .6) +
labs(subtitle="Residual and L_Volume by Type",
y = "Residual",
x = "L_Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="bottom", legend.title=element_text(size = 9)) +
theme(legend.position = c(0.9, 0.15)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 4), ylim = c(-0.4, 0.4)) +
scale_color_discrete(name = "Type", labels = c("Infant", "Adult"))
grid.arrange(scat_residual_class, scat_residual_type, ncol = 2, nrow = 1)
box_residual_class <- ggplot(mydata, aes(y = model$residuals, x = mydata$CLASS)) +
stat_boxplot(geom ='errorbar', linetype = 1, width = 0.25) +
geom_boxplot(fill = gg_color_hue(5), alpha = .9, outlier.color = "#AF4425") +
labs(y = "Residuals",
x = "Class",
subtitle = "Boxplot of Residuals by Class") +
coord_cartesian(ylim = c(-0.4, 0.4))
box_residual_type <- ggplot(mydata, aes(y = model$residuals, x = mydata$TYPE)) +
stat_boxplot(geom ='errorbar', linetype = 1, width = 0.25) +
geom_boxplot(fill = c("#56B4E9", "#E69F00"), alpha = .9, outlier.color = "#AF4425") +
labs(y = "Residuals",
x = "Type",
subtitle = "Boxplot of Residuals by Type") +
coord_cartesian(ylim = c(-0.4, 0.4)) +
scale_x_discrete(labels = c("Infant","Adult"))
grid.arrange(box_residual_class, box_residual_type, ncol = 2, nrow = 1)
bartlett_residuals <- bartlett.test(model$residuals ~ CLASS, data = mydata)
bartlett_residuals
```
**Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model 'fit'? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.**
***Answer: Looking at the scatter plots of Residual and L_Volume by Class and Type, there seems to be a lot of overlap and over-crowding to the right. There is some separation to the left but not enough to be able to differentiate between Class and Type. When we evaluate the boxplot, it shows even distribution between the Class and Residual as all the Classes are towards the middle and near 0, with very few outliers. The same can be said of the boxplot between Residual and Type.***
-----
There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a "cutoff" (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.
The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar "for-loops" will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement "for(k in 1:10000)." Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.
-----
#### Section 6: (5 points) ########################
(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a "for loop" to determine how the harvest proportions change as the "cutoff" changes. Code for doing this is provided.
```{r Part_6a}
idxi <- mydata$TYPE == "I"
idxa <- mydata$TYPE == "ADULT"
max.v <- max(mydata$VOLUME)
min.v <- min(mydata$VOLUME)
delta <- (max.v - min.v)/10000
prop.infants <- numeric(10000)
prop.adults <- numeric(10000)
volume.value <- numeric(10000)
total.infants <- sum(idxi)
total.adults <- sum(idxa)
for (k in 1:10000) {
value <- min.v + k*delta
volume.value[k] <- value
prop.infants[k] <- sum(mydata$VOLUME[idxi] <= value)/total.infants
prop.adults[k] <- sum(mydata$VOLUME[idxa] <= value)/total.adults
}
# prop.infants shows the impact of increasing the volume cutoff for
# harvesting. The following code shows how to "split" the population at
# a 50% harvest of infants.
n.infants <- sum(prop.infants <= 0.5)
split.infants <- min.v + (n.infants + 0.5)*delta # This estimates the desired volume.
n.adults <- sum(prop.adults <= 0.5)
split.adults <- min.v + (n.adults + 0.5)*delta
head(volume.value)
head(prop.infants)
head(prop.adults)
```
(6)(b) Present a plot showing the infant proportions and the adult proportions versus volume.value. Compute the 50% "split" volume.value for each and show on the plot.
```{r Part_6b}
ggplot(data = NULL, aes(volume.value)) +
geom_line(aes(y = prop.adults, color = "Adults"), alpha = .6) +
geom_line(aes(y = prop.infants, color = "Infants"), alpha = .6) +
labs(subtitle="Proportion of Adults and Infants Protected",
y = "Proportion",
x = "Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right", legend.title=element_text(size = 9)) +
theme(legend.position = c(0.9, 0.15)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 1000), ylim = c(0, 1)) +
scale_color_manual(name = "Type", values = c("#6AB187", "#DE7A22")) +
geom_hline(aes(yintercept = .5), color = "#324851", alpha = .6) +
geom_vline(aes(xintercept = split.infants), color = "#324851", alpha = .6) +
geom_vline(aes(xintercept = split.adults), color = "#324851", alpha = .6) +
geom_text(size = 3.3, aes((split.adults + 50), .47), label = round(split.adults, 2), colour = "#324851") +
geom_text(size = 3.3, aes((split.infants + 50), .47), label = round(split.infants, 2), colour = "#324851")
```
**Essay Question: The two 50% "split" values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?**
***Answer: We can see the volume values for the two 50% splits to be quite different (133.82 versus 384.51) thus indicating that the 50% "split" values are good cutoffs for harvesting.***
-----
This part will address the determination of a volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from "not harvested" to "harvested" proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.
-----
###### Section 7: (10 points) #######################
(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% "split" points determined in (6)(a). There is considerable variability present in the peak area of this plot. The observed "peak" difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.
```{r Part_7a}
difference <- ((1 - prop.adults) - (1 - prop.infants))
head(difference)
ggplot(data = NULL, aes(volume.value)) +
geom_line(aes(y = difference), color = "#DE7A22", alpha = .6) +
labs(subtitle="Difference in Population Harvested",
y = "Proportion",
x = "Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right") +
theme(legend.position = c(0.9, 0.15)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 1000), ylim = c(0, .6))
```
(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.
```{r Part_7b}
y.loess.a <- loess(1 - prop.adults ~ volume.value, span = 0.25,
family = c("symmetric"))
y.loess.i <- loess(1 - prop.infants ~ volume.value, span = 0.25,
family = c("symmetric"))
smooth.difference <- predict(y.loess.a) - predict(y.loess.i)
ggplot(data = NULL, aes(volume.value)) +
stat_smooth(aes(y = difference), method = "gam", formula = y ~ s(x, bs = "cs"), color = "#8D230F", alpha = .6, linetype = 2, size = .5) +
geom_line(aes(y = difference), color = "#DE7A22", alpha = .6) +
labs(subtitle="Difference in Population Harvested",
y = "Proportion",
x = "Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right") +
theme(legend.position = c(0.9, 0.15)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 1000), ylim = c(0, .6))
```
(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use *which.max()*). Show the estimated peak location corresponding to the cutoff determined.
```{r Part_7c}
max_smooth_difference <- volume.value[which.max(smooth.difference)]
paste("Cutoff:", max_smooth_difference)
ggplot(data = NULL, aes(volume.value)) +
stat_smooth(aes(y = difference), method = "gam", formula = y ~ s(x, bs = "cs"), color = "#8D230F", alpha = .6, linetype = 2, size = .5) +
geom_line(aes(y = difference), color = "#DE7A22", alpha = .6) +
labs(subtitle="Difference in Population Harvested",
y = "Proportion",
x = "Volume",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right") +
theme(legend.position = c(0.9, 0.15)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
coord_cartesian(xlim = c(0, 1000), ylim = c(0, .6)) +
scale_color_manual(name = "Type", values = c("#6AB187", "#DE7A22")) +
geom_vline(aes(xintercept = max_smooth_difference), color = "#324851", alpha = .6, linetype = 2) +
geom_text(size = 3.1, aes((max_smooth_difference + 15), .45), angle = 90, label = paste0("Volume: ", round(max_smooth_difference, 2)), colour = "#324851", fontface = "bold")
```
(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions (NOTE: the adult harvest proportion is the "true positive rate" and the infant harvest proportion is the "false positive rate").
Code for calculating the adult harvest proportion is provided.
```{r Part_7d}
max.diff.TPR <- (1 - prop.adults)[which.max(smooth.difference)] # [1] 0.7416332
paste("True Positive Rate (TPR):", max.diff.TPR)
max.diff.FPR <- (1 - prop.infants)[which.max(smooth.difference)] # [1] 0.1764706
paste("False Positive Rate (FPR):", max.diff.FPR)
```
-----
There are alternative ways to determine cutoffs. Two such cutoffs are described below.
-----
###### Section 8: (10 points) ###################
(8)(a) Harvesting of infants in CLASS "A1" must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS "A1" may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS "A1."
Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions.
```{r Part_8a}
infant_cutoff <- volume.value[volume.value > max(mydata[mydata$CLASS == "A1" & mydata$TYPE == "I", "VOLUME"])][1] # [1] 206.786
paste("Cutoff:", infant_cutoff)
zero_A1_adults_TPR <- sum(mydata$VOLUME[mydata$TYPE == "ADULT"] > infant_cutoff) / sum(idxa)
paste("True Positive Rate (TPR):", zero_A1_adults_TPR)
zero_A1_infants_FPR <- sum(mydata$VOLUME[mydata$TYPE == "I"] > infant_cutoff) / sum(idxi)
paste("False Positive Rate (FPR):", zero_A1_infants_FPR)
```
(8)(b) Another cutoff is one for which the proportion of adults not harvested equals the proportion of infants harvested. This cutoff would equate these rates; effectively, our two errors: 'missed' adults and wrongly-harvested infants. This leaves for discussion which is the greater loss: a larger proportion of adults not harvested or infants harvested? This cutoff is 237.7383. Calculate the separate harvest proportions for infants and adults using this cutoff. Show these proportions. Code for determining this cutoff is provided.
```{r Part_8b}
equal_error_vol <- volume.value[which.min(abs(prop.adults - (1-prop.infants)))] # [1] 237.6391
paste("Cutoff:", equal_error_vol)
equal_error_TPR <- sum(mydata$VOLUME[mydata$TYPE == "ADULT"] > equal_error_vol) / sum(idxa)
paste("True Positive Rate (TPR):", equal_error_TPR)
equal_error_FPR <- sum(mydata$VOLUME[mydata$TYPE == "I"] > equal_error_vol) / sum(idxi)
paste("False Positive Rate (FPR):", equal_error_FPR)
```
##### Section 9: (5 points) ###########
(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (7) and (8) on this plot and label each.
```{r Part_9}
ggplot(data = NULL, aes(1-prop.infants, 1-prop.adults)) +
geom_line(aes(y = 1-prop.adults), color = "#DE7A22", alpha = .6) +
labs(subtitle="ROC Curve of Adult and Infant Harvest Proprotions",
y = "Adult Harvest Proportion",
x = "Infant Harvest Proportion",
caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
theme(legend.position="right") +
theme(legend.position = c(0.9, 0.15)) +
theme(panel.background = element_blank(), axis.line = element_line(colour = "#dddddd")) +
geom_abline(slope = 1, intercept = 0, alpha = .6, linetype = 2, colour = "#8D230F") +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
geom_point(aes(x = equal_error_FPR, equal_error_TPR), size = 3, shape = 1, color = "#324851") +
geom_point(aes(x = zero_A1_infants_FPR, zero_A1_adults_TPR), size = 3, shape = 1, color = "#324851") +
geom_point(aes(x = max.diff.FPR, max.diff.TPR), size = 3, shape = 1, color = "#324851") +
geom_text(size = 3.5, aes(.17, .7), label = "Max difference volume: 262.1", colour = "#324851", hjust = 0) +
geom_text(size = 3.5, aes(.23, .76), label = "Equal harvest volume: 237.6", colour = "#324851", hjust = 0) +
geom_text(size = 3.5, aes(.3, .81), label = "Zero A1 infants volume: 206.8", colour = "#324851", hjust = 0)
```
(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the *auc()* function from the "flux" package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.
```{r Part_9b}
area_roc <- flux::auc(x = (1 - prop.infants), y = (1 - prop.adults))
paste("Area Under the ROC Curve:", area_roc)
```
##### Section 10: (10 points) ###################
(10)(a) Prepare a table showing each cutoff along with the following:
1) true positive rate (1-prop.adults,
2) false positive rate (1-prop.infants),
3) harvest proportion of the total population
```{r Part_10}
yld_max_diff <- (max.diff.TPR * total.adults + max.diff.FPR * total.infants) / (total.adults + total.infants)
yld_zero_A1_inf <- (zero_A1_adults_TPR * total.adults + zero_A1_infants_FPR * total.infants) / (total.adults + total.infants)
yld_equal_error <- (equal_error_TPR * total.adults + equal_error_FPR * total.infants) / (total.adults + total.infants)
table_cutoffs <- tribble(~"Potential Cutoffs", ~"Volume", ~"TPR", ~"FPR", ~"PropYield",
"max.difference", round(max_smooth_difference, 3), round(max.diff.TPR, 3), round(max.diff.FPR, 3), round(yld_max_diff, 3),
"zero.A1.infants", round(infant_cutoff, 3), round(zero_A1_adults_TPR, 3), round(zero_A1_infants_FPR, 3), round(yld_zero_A1_inf, 3),
"equal.error", round(equal_error_vol, 3), round(equal_error_TPR, 3), round(equal_error_FPR, 3), round(yld_equal_error, 3)
)
kable(table_cutoffs, format = "html", caption = "Table: Cutoffs") %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
font_size = 12,
position = "left") %>%
add_footnote(c("Data Assignment #2, Section 10"))
```
**Essay Question: Based on the ROC curve, it is evident a wide range of possible "cutoffs" exist. Compare and discuss the three cutoffs determined in this assignment.**
***Answer: We want to minimize the False Positive Rate (FPR) while maximize the yield from the harvest proportion of the total population. The cutoff for max.difference has the lowest FPR as well as the lowest PropYield. The zero.A1.infant is the exact opposite, where it has the highest FPR and also the highest PropYield. The equal.error cutoff seems to be in the middle of both cutoffs when it comes to FPR and PropYield. We should evaluate our objective to determine what is more important to us. Whether it is to minimize over harvesting by looking at the lower FPR or to harvest with the highest yield possible within reason. Understanding the objective and looking at the trade-off will help us determine which cutoff we should select.***
**Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators How would you do so? Consider the following in your answer:**
1. Would you make a specific recommendation or outline various choices and tradeoffs?
2. What qualifications or limitations would you present regarding your analysis?
3. If it is necessary to proceed based on the current analysis, what suggestions would you have for implementation of a cutoff? 4) What suggestions would you have for planning future abalone studies of this type?
***Answer: 1. As a consultant, we are asked for our recommendation in various business problems so I would prefer to apply a similar approach. I would first discuss our objective to the investigators and ensure we are aligned. Then I would present our findings and share the different factors and independent variables we evaluated to determine significance and the existance correlations and any relationships. I would then present the cutoffs with the appropriate tradeoffs within each scenario. Lastly, I would make a recommendation based on the analysis and our findings. However, my recommendation is just that: a recommendation. The final decision would be for the investigators to take. 2. There were several limitations to our analysis from quality of our data to the scope of our analysis. In addition, we had outliers in our data analysis that would need to be further evaluated. 3. My suggestion would focus on reducing the False Positive Rate and minimize over harvesting. Thus, I would recommend the cutoff of either Zero A1 Harvest or Equal Error Harvest. 4. We need to collect more data, whether that means to gather data from a greater number of abalones or capture other features of the abalones that we had not considered. This would help us identify more patterns and see whether we can correlate the data with new independent variables.***