---
title: 'Data Analysis Assignment #1'
author: "Gowani, Ali"
output:
  html_document: default
---
```{r setup, include = FALSE}
# DO NOT ADD OR REVISE CODE HERE
knitr::opts_chunk$set(echo = FALSE, eval = TRUE)
```
-----
The following code chunk will:
(a) load the "ggplot2", "gridExtra" and "knitr" packages, assuming each has been installed on your machine,
(b) read-in the abalones dataset, defining a new data frame, "mydata,"
(c) return the structure of that data frame, and
(d) calculate new variables, VOLUME and RATIO.
Do not include package installation code in this document. Packages should be installed via the Console or 'Packages' tab. You will also need to download abalones.csv from the course site to a known location on your machine. Unless a *file.path()* is specified, R will look in the directory where this .Rmd is stored when knitting.
```{r analysis_setup1, message = FALSE, warning = FALSE}
# a) Load the ggplot2 and gridExtra packages.
library(ggplot2)
library(gridExtra)
library(knitr)
# b) Use read.csv() to read the abalones.csv into R, assigning the data frame to "mydata."
mydata <- read.csv("abalones.csv", sep = ",")
# c) Use the str() function to verify the structure of "mydata." You should have 1036 observations
# of eight variables.
str(mydata)
# d) Define two new variables, VOLUME and RATIO. Use the following statements to define VOLUME and
# RATIO as variables appended to the data frame "mydata."
mydata$VOLUME <- mydata$LENGTH * mydata$DIAM * mydata$HEIGHT
mydata$RATIO <- mydata$SHUCK / mydata$VOLUME
```
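As a quick check on the arithmetic in step (d), the sketch below applies the same two formulas to a small made-up data frame. The `toy` data frame and all of its values are hypothetical and do not come from abalones.csv.

```r
# Hypothetical measurements for three abalones (values invented for illustration).
toy <- data.frame(
  LENGTH = c(10, 12, 8),
  DIAM   = c(8, 9, 6),
  HEIGHT = c(3, 4, 2),
  SHUCK  = c(24, 54, 9.6)
)

# Same definitions as in the chunk above.
toy$VOLUME <- toy$LENGTH * toy$DIAM * toy$HEIGHT
toy$RATIO  <- toy$SHUCK / toy$VOLUME

toy
```

Because RATIO divides a weight by a volume, it behaves like a density, which is why it is compared across sexes and classes later on.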
-----
### Test Items start from here - There are 6 sections ##########################
##### Section 1: (6 points) Summarizing the data.
(1)(a) (1 point) Use *summary()* to obtain and present descriptive statistics from mydata. Use table() to present a frequency table using CLASS and RINGS. There should be 115 cells in the table you present.
```{r Part_1a}
# mydata <- read.csv("abalones.csv", sep = ",")
summary(mydata)
mydata_table <- (table(mydata$CLASS,mydata$RINGS))
mydata_table
```
**Question (1 point): Briefly discuss the variable types and distributional implications such as potential skewness and outliers.**
***Answer: The summary statistics show that the data set contains integer, numeric and factor variables. The Sex column shows that there are approximately 1,000 observations, distributed roughly equally among the three Sex segments (Female, Infant and Male), at around 300 each. The wide ranges of Whole, Shuck and Volume indicate that there may be a few outliers, which will need to be evaluated as the analysis proceeds.***
(1)(b) (1 point) Generate a table of counts using SEX and CLASS. Add margins to this table (Hint: There should be 15 cells in this table plus the marginal totals. Apply *table()* first, then pass the table object to *addmargins()* (Kabacoff Section 7.2 pages 144-147)). Lastly, present a barplot of these data; ignoring the marginal totals.
```{r Part_1b}
table_SEX_CLASS <- table(mydata$SEX,mydata$CLASS)
table_SEX_CLASS
table_SEX_CLASS_MARGINS <- addmargins(table_SEX_CLASS)
table_SEX_CLASS_MARGINS
library(RColorBrewer)
barplot(table_SEX_CLASS,
        main = "CLASS membership, SEX-differentiated",
        sub = "Data Analysis Assignment #1, Problem 1.B",
        xlab = "Class", ylab = "Frequency",
        col = brewer.pal(3, name = "Set3"),
        legend.text = rownames(table_SEX_CLASS), beside = TRUE)
```
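The *table()* then *addmargins()* pattern used above can be sketched on a tiny example. The `toy` data frame below is hypothetical; it only illustrates how the marginal totals are appended as an extra "Sum" row and column.

```r
# Hypothetical sex and class labels for seven animals.
toy <- data.frame(
  SEX   = c("F", "I", "M", "M", "F", "I", "M"),
  CLASS = c("A1", "A1", "A1", "A2", "A2", "A2", "A2")
)

counts <- table(toy$SEX, toy$CLASS)   # 3 x 2 table of counts
with_margins <- addmargins(counts)    # adds a "Sum" row and a "Sum" column

with_margins
```

The barplot should be drawn from `counts`, not `with_margins`, so that the totals are not plotted as if they were an extra sex or class.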
**Essay Question (2 points): Discuss the sex distribution of abalones. What stands out about the distribution of abalones by CLASS?**
***Answer: The bar plot shows the distribution of the abalones by sex and class: the bars are grouped by Class and further separated by Sex. Infants are most numerous in Classes A1 and A2, and their number then decreases as the abalones age, possibly because older abalones may not have the support to reproduce. It is also interesting that in Classes A1, A2 and A3 there are more males than females, so the male-to-female ratio is higher in those segments. The differences narrow in the A4 and A5 segments, which may indicate that males die off earlier in their life cycle. This is only a hypothesis and would need further investigation.***
(1)(c) (1 point) Select a simple random sample of 200 observations from "mydata" and identify this sample as "work." Use *set.seed(123)* prior to drawing this sample. Do not change the number 123. Note that *sample()* "takes a sample of the specified size from the elements of x." We cannot sample directly from "mydata." Instead, we need to sample from the integers, 1 to 1036, representing the rows of "mydata." Then, select those rows from the data frame (Kabacoff Section 4.10.5 page 87).
Using "work", construct a scatterplot matrix of variables 2-6 with *plot(work[, 2:6])* (these are the continuous variables excluding VOLUME and RATIO). The sample "work" will not be used in the remainder of the assignment.
```{r Part_1c}
set.seed(123)
work <- mydata[sample(nrow(mydata), 200), ]
plot(work[, 2:6])
```
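The row-index sampling described in (1)(c) can be sketched on a small data frame. Everything here (the `toy` data frame, its columns and the sample size of 4) is a hypothetical stand-in for "mydata" and the sample of 200.

```r
set.seed(123)

# Hypothetical data frame with 10 rows.
toy <- data.frame(id = 1:10, value = (1:10)^2)

# sample(n, size) draws `size` distinct integers from 1..n (without replacement).
rows <- sample(nrow(toy), 4)

# Use the drawn integers as row indices; whole rows are kept intact.
toy_sample <- toy[rows, ]
toy_sample
```

Sampling the indices rather than the data frame itself matters because *sample()* applied to a data frame would treat its columns, not its rows, as the elements to draw from.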
-----
##### Section 2: (5 points) Summarizing the data using graphics.
(2)(a) (1 point) Use "mydata" to plot WHOLE versus VOLUME. Color code data points by CLASS.
```{r Part_2a}
# Refer to columns by name inside aes(); ggplot2 looks them up in `data`.
ggplot(data = mydata, aes(x = VOLUME, y = WHOLE)) +
  geom_point(aes(color = CLASS)) +
  ggtitle("Abalones: WHOLE versus VOLUME") +  # overridden by labs(title = ...) below
  labs(subtitle = expression(Whole~Weight~(grams)~and~Volume~(cm^3)),
       y = "Whole Weight",
       x = "Volume",
       title = "Data Analysis Assignment #1, Section 2.A",
       caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
  scale_color_discrete(name = "Class")
```
(2)(b) (2 points) Use "mydata" to plot SHUCK versus WHOLE with WHOLE on the horizontal axis. Color code data points by CLASS. As an aid to interpretation, determine the maximum value of the ratio of SHUCK to WHOLE. Add to the chart a straight line with zero intercept using this maximum value as the slope of the line. If you are using the 'base R' *plot()* function, you may use *abline()* to add this line to the plot. Use *help(abline)* in R to determine the coding for the slope and intercept arguments in the functions. If you are using ggplot2 for visualizations, *geom_abline()* should be used.
```{r Part_2b}
# Maximum SHUCK-to-WHOLE ratio; used as the slope of a reference line through the origin.
max_shuck_whole_ratio <- max(mydata$SHUCK / mydata$WHOLE)
ggplot(data = mydata, aes(x = WHOLE, y = SHUCK)) +
  geom_point(aes(color = CLASS)) +
  ggtitle("Abalones: WHOLE versus SHUCK") +  # overridden by labs(title = ...) below
  labs(subtitle = "Shuck weight in grams | Whole weight in grams",
       y = "Shuck Weight",
       x = "Whole Weight",
       title = "Data Analysis Assignment #1, Section 2.B",
       caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
  geom_abline(slope = max_shuck_whole_ratio, intercept = 0) +
  scale_color_discrete(name = "Class")
```
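The reasoning behind the reference line in (2)(b) can be checked numerically: if the slope is the maximum SHUCK/WHOLE ratio, every observation lies on or below the line, and at least one lies exactly on it. The sketch below uses a hypothetical four-row data frame, not the abalone data.

```r
# Hypothetical whole and shuck weights in grams.
toy <- data.frame(
  WHOLE = c(50, 100, 150, 200),
  SHUCK = c(20, 45, 60, 70)
)

ratio <- toy$SHUCK / toy$WHOLE
max_ratio <- max(ratio)

# Height of the zero-intercept line at each WHOLE value.
line_height <- max_ratio * toy$WHOLE

# Small tolerance guards against floating-point rounding.
on_or_below <- toy$SHUCK <= line_height + 1e-9
touching    <- abs(toy$SHUCK - line_height) < 1e-9

data.frame(toy, ratio, on_or_below, touching)
```

With base R graphics the same line would be drawn with `abline(a = 0, b = max_ratio)`, where `a` is the intercept and `b` is the slope.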
**Essay Question (2 points): How does the variability in this plot differ from the plot in (a)? Compare the two displays. Keep in mind that SHUCK is a part of WHOLE. Consider the location of the different age classes.**
***Answer: Comparing the scatter plots in 2.A and 2.B, both show a positive correlation. Whole Weight increases with Volume, which makes sense, and Shuck Weight increases with Whole Weight. In 2.A, weight increases as Class increases, while Classes A1 and A2 sit at the lower end of the plot because there are many infants in those segments. In 2.B, the ratio of Shuck Weight to Whole Weight decreases for the A4 and A5 class segments. This is apparent because those points fall far below the line, whose slope is the highest observed ratio of Shuck Weight to Whole Weight. All abalones except one fall below the straight line, because the slope is the maximum ratio rather than the average ratio. The line passes through the middle of the Class A1 points but then rises quickly, so that all the other points fall below it.***
-----
##### Section 3: (8 points) Getting insights about the data using graphs.
(3)(a) (2 points) Use "mydata" to create a multi-figured plot with histograms, boxplots and Q-Q plots of RATIO differentiated by sex. This can be done using *par(mfrow = c(3,3))* and base R or *grid.arrange()* and ggplot2. The first row would show the histograms, the second row the boxplots and the third row the Q-Q plots. Be sure these displays are legible.
```{r Part_3a}
# Subset the data by sex.
mydata_with_ratio_F <- subset(mydata, SEX == "F")
mydata_with_ratio_I <- subset(mydata, SEX == "I")
mydata_with_ratio_M <- subset(mydata, SEX == "M")
# Row 1: histograms of RATIO.
FH <- ggplot(mydata_with_ratio_F, aes(x = RATIO)) +
  geom_histogram(fill = "red", alpha = .6, bins = 10) +
  labs(y = "Frequency",
       x = "Ratio",
       subtitle = "Female Ratio")
IH <- ggplot(mydata_with_ratio_I, aes(x = RATIO)) +
  geom_histogram(fill = "green", alpha = .6, bins = 10) +
  labs(y = "Frequency",
       x = "Ratio",
       subtitle = "Infant Ratio")
MH <- ggplot(mydata_with_ratio_M, aes(x = RATIO)) +
  geom_histogram(fill = "blue", alpha = .6, bins = 10) +
  labs(y = "Frequency",
       x = "Ratio",
       subtitle = "Male Ratio")
# Row 2: boxplots of RATIO, with error bars on the whiskers.
FB <- ggplot(mydata_with_ratio_F, aes(x = SEX, y = RATIO)) +
  geom_boxplot(fill = "red", alpha = .6) +
  labs(y = "Ratio",
       x = "Females",
       subtitle = "Female Ratio") +
  stat_boxplot(geom = "errorbar")
IB <- ggplot(mydata_with_ratio_I, aes(x = SEX, y = RATIO)) +
  geom_boxplot(fill = "green", alpha = .6) +
  labs(y = "Ratio",
       x = "Infants",
       subtitle = "Infant Ratio") +
  stat_boxplot(geom = "errorbar")
MB <- ggplot(mydata_with_ratio_M, aes(x = SEX, y = RATIO)) +
  geom_boxplot(fill = "blue", alpha = .6) +
  labs(y = "Ratio",
       x = "Males",
       subtitle = "Male Ratio") +
  stat_boxplot(geom = "errorbar")
# Row 3: normal Q-Q plots of RATIO; colour and alpha are fixed settings, not aesthetics.
FQ <- ggplot(mydata_with_ratio_F, aes(sample = RATIO)) +
  stat_qq(colour = "red", alpha = .6) +
  stat_qq_line() +
  labs(y = "Sample Quantiles",
       x = "Theoretical Quantiles",
       subtitle = "Female Ratio")
IQ <- ggplot(mydata_with_ratio_I, aes(sample = RATIO)) +
  stat_qq(colour = "green", alpha = .6) +
  stat_qq_line() +
  labs(y = "Sample Quantiles",
       x = "Theoretical Quantiles",
       subtitle = "Infant Ratio")
MQ <- ggplot(mydata_with_ratio_M, aes(sample = RATIO)) +
  stat_qq(colour = "blue", alpha = .6) +
  stat_qq_line() +
  labs(y = "Sample Quantiles",
       x = "Theoretical Quantiles",
       subtitle = "Male Ratio")
grid.arrange(FH, IH, MH, FB, IB, MB, FQ, IQ, MQ, ncol = 3, nrow = 3)
```
**Essay Question (2 points): Compare the displays. How do the distributions compare to normality? Take into account the criteria discussed in the sync sessions to evaluate non-normality.**
***Answer: The graphs in general correspond well to a normal distribution. However, a few outliers in each group make the distributions skewed and therefore not strictly normal. The histograms show a few outliers in each Sex segment (Female, Infant and Male), and the box plots show them clearly. The same holds for the Q-Q plots, where certain data points do not follow the straight reference line. If the outliers, including the extreme outliers, were removed, the charts might show a normal distribution.***
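The criteria from the sync sessions are not reproduced in this document, so the following is only one common numerical complement to the visual checks, not necessarily the course's method. It is a minimal sketch that computes moment-based sample skewness and excess kurtosis on simulated data; the helper names `sample_skewness` and `sample_excess_kurtosis`, the seed and the simulated vectors are all assumptions made for illustration.

```r
# Hypothetical helpers: moment-based skewness and excess kurtosis.
# Both are near 0 for a normal sample.
sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}
sample_excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

set.seed(42)                 # arbitrary seed, for reproducibility only
symmetric    <- rnorm(500)   # simulated normal data
right_skewed <- rexp(500)    # simulated data with a long right tail

c(normal = sample_skewness(symmetric),
  skewed = sample_skewness(right_skewed))
c(normal = sample_excess_kurtosis(symmetric),
  skewed = sample_excess_kurtosis(right_skewed))
```

Applied to RATIO within each sex, values well away from zero would point in the same direction as points drifting off the Q-Q reference line.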
(3)(b) (2 points) Use the boxplots to identify RATIO outliers (mild and extreme both) for each sex. Present the abalones with these outlying RATIO values along with their associated variables in "mydata" (Hint: display the observations by passing a data frame to the kable() function).
```{r Part_3b}
library(kableExtra)
# Left: default whiskers (1.5 * IQR). Right: range = 3 flags only extreme outliers.
par(mfrow = c(1, 2))
boxplot(mydata_with_ratio_F$RATIO,
        mydata_with_ratio_I$RATIO,
        mydata_with_ratio_M$RATIO,
        names = c("Female", "Infant", "Male"),
        col = c("deepskyblue3", "gold", "darkgreen"),
        main = "Sex, mild outliers")
boxplot(mydata_with_ratio_F$RATIO,
        mydata_with_ratio_I$RATIO,
        mydata_with_ratio_M$RATIO,
        names = c("Female", "Infant", "Male"),
        col = c("deepskyblue3", "gold", "darkgreen"),
        main = "Sex, extreme outliers",
        range = 3)
# Female outliers (mild and extreme) with their associated variables.
female.outliers <- boxplot.stats(mydata$RATIO[mydata$SEX == "F"])$out
female_outliers.df <- mydata[mydata$SEX == "F" & mydata$RATIO %in% female.outliers, ]
kable(female_outliers.df, format = "html", caption = "Female Outliers") %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE,
                font_size = 12,
                position = "left") %>%
  add_footnote(c("Data Analysis Assignment #1, Section 3.B"))
# Infant outliers.
infant.outliers <- boxplot.stats(mydata$RATIO[mydata$SEX == "I"])$out
infant_outliers.df <- mydata[mydata$SEX == "I" & mydata$RATIO %in% infant.outliers, ]
kable(infant_outliers.df, format = "html", caption = "Infant Outliers") %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE,
                font_size = 12,
                position = "left") %>%
  add_footnote(c("Data Analysis Assignment #1, Section 3.B"))
# Male outliers.
male.outliers <- boxplot.stats(mydata$RATIO[mydata$SEX == "M"])$out
male_outliers.df <- mydata[mydata$SEX == "M" & mydata$RATIO %in% male.outliers, ]
kable(male_outliers.df, format = "html", caption = "Male Outliers") %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE,
                font_size = 12,
                position = "left") %>%
  add_footnote(c("Data Analysis Assignment #1, Section 3.B"))
```
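The distinction between mild and extreme outliers in the chunk above comes from the whisker multiplier: 1.5 by default and 3 in the second boxplot. The sketch below shows the same distinction with *boxplot.stats()* and its `coef` argument on a hypothetical vector; the values are invented so that one point is a mild outlier and one is extreme.

```r
# Hypothetical values: ten ordinary points plus two unusually large ones.
x <- c(1:10, 20, 40)

# Points beyond 1.5 times the hinge spread (mild and extreme outliers together).
all_outliers <- boxplot.stats(x, coef = 1.5)$out

# Points beyond 3 times the hinge spread (extreme outliers only).
extreme_outliers <- boxplot.stats(x, coef = 3)$out

# Mild outliers are those flagged at 1.5 but not at 3.
mild_outliers <- setdiff(all_outliers, extreme_outliers)

list(all = all_outliers, mild = mild_outliers, extreme = extreme_outliers)
```

The same `coef = 3` call applied to RATIO within each sex would separate the extreme outliers from the full outlier tables presented above.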
**Essay Question (2 points): What are your observations regarding the results in (3)(b)?**
***Answer: There are outliers in each Sex segment, and even extreme outliers in the Female and Infant segments. The outliers are based on RATIO, which we defined earlier as Shuck over Volume. Most of the outliers are in the Female and Infant segments, so perhaps there are biological or developmental reasons for them. It is interesting to note that there are no extreme outliers in the Male segment.***
-----
##### Section 4: (8 points) Getting insights about possible predictors.
(4)(a) (3 points) With "mydata," display side-by-side boxplots for VOLUME and WHOLE, each differentiated by CLASS. There should be five boxes for VOLUME and five for WHOLE. Also, display side-by-side scatterplots: VOLUME and WHOLE versus RINGS. Present these four figures in one graphic: the boxplots in one row and the scatterplots in a second row. Base R or ggplot2 may be used.
```{r Part_4a}
# Reproduce ggplot2's default discrete hues so the boxes match the scatterplot colors.
gg_color_hue <- function(n) {
  hues <- seq(15, 375, length.out = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
Box_VC <- ggplot(mydata, aes(x = CLASS, y = VOLUME)) +
  geom_boxplot(fill = gg_color_hue(5), alpha = .9) +
  labs(y = "Volume",
       x = "Class",
       subtitle = "Volume by Class") +
  stat_boxplot(geom = "errorbar")
Box_WC <- ggplot(mydata, aes(x = CLASS, y = WHOLE)) +
  geom_boxplot(fill = gg_color_hue(5), alpha = .9) +
  labs(y = "Whole",
       x = "Class",
       subtitle = "Whole by Class") +
  stat_boxplot(geom = "errorbar")
Scat_VR <- ggplot(data = mydata, aes(x = RINGS, y = VOLUME)) +
  geom_point(aes(color = CLASS)) +
  labs(subtitle = "Volume and Rings",
       y = "Volumes",
       x = "Rings",
       caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
  scale_color_discrete(name = "Class")
Scat_WR <- ggplot(data = mydata, aes(x = RINGS, y = WHOLE)) +
  geom_point(aes(color = CLASS)) +
  labs(subtitle = "Whole and Rings",
       y = "Whole",
       x = "Rings",
       caption = "2019SU_MSDS_401-DL_SEC55 - Prof. Srinivasan") +
  scale_color_discrete(name = "Class")
grid.arrange(Box_VC, Box_WC, Scat_VR, Scat_WR, ncol = 2, nrow = 2)
```
**Essay Question (5 points) How well do you think these variables would perform as predictors of age? Explain.**
***Answer: When evaluating the box plots and the scatter plots, we notice two things: patterns and outliers. The box plots show that the Volume of Class A2 is higher than that of A1 and that the Volume of Class A3 is higher than that of A2. Volume and Whole Weight then either flatten or decrease after A4. This is a reasonably clear pattern on which to base a prediction. However, we should also evaluate the outliers. Outliers tend to break the pattern, and we must determine whether they would have a significant impact on our prediction. It is interesting to note that Class A4 has no outliers and A5 has just one. The scatter plots also show a pattern: the Class A1 segment has fewer rings than the other classes, and its Volume and Whole Weight are on average lower than those of the other classes. There also seems to be a significant positive correlation between Volume and Whole Weight: as Volume increases, so does Whole Weight. Using the four charts, we may be able to make a few predictions in certain scenarios but, as with all models, they won't be perfect!***
-----
##### Section 5: (12 points) Getting insights regarding different groups in the data.
(5)(a) (2 points) Use *aggregate()* with "mydata" to compute the mean values of VOLUME, SHUCK and RATIO for each combination of SEX and CLASS. Then, using *matrix()*, create matrices of the mean values. Using the "dimnames" argument within *matrix()* or the *rownames()* and *colnames()* functions on the matrices, label the rows by SEX and columns by CLASS. Present the three matrices (Kabacoff Section 5.6.2, p. 110-111). The *kable()* function is useful for this purpose. You do not need to be concerned with the number of digits presented.
```{r Part_5a}
# Mean Volume by Sex and Class
mean_volume <- aggregate(VOLUME ~ SEX + CLASS, data = mydata, FUN = mean)
# aggregate() returns SEX varying fastest, so a column-wise fill gives SEX rows and CLASS columns.
mean_volume_matrix <- matrix(mean_volume$VOLUME, nrow = 3, byrow = FALSE)
dimnames(mean_volume_matrix) <- list(c("Female", "Infant", "Male"), c("A1", "A2", "A3", "A4", "A5"))
print("Mean Volume by Sex and Class")
mean_volume_matrix
kable(mean_volume_matrix, format = "html", caption = "Mean Volume by Sex and Class") %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE,
                font_size = 12,
                position = "left") %>%
  row_spec(0, bold = TRUE, color = "white", background = "darkgrey") %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  add_footnote(c("Data Analysis Assignment #1, Section 5.A"))
# Mean Shuck by Sex and Class
mean_shuck <- aggregate(SHUCK ~ SEX + CLASS, data = mydata, FUN = mean)
mean_shuck_matrix <- matrix(mean_shuck$SHUCK, nrow = 3, byrow = FALSE)
dimnames(mean_shuck_matrix) <- list(c("Female", "Infant", "Male"), c("A1", "A2", "A3", "A4", "A5"))
print("Mean Shuck by Sex and Class")
mean_shuck_matrix
kable(mean_shuck_matrix, format = "html", caption = "Mean Shuck by Sex and Class") %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE,
                font_size = 12,
                position = "left") %>%
  row_spec(0, bold = TRUE, color = "white", background = "darkgrey") %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  add_footnote(c("Data Analysis Assignment #1, Section 5.A"))
# Mean Ratio by Sex and Class
mean_ratio <- aggregate(RATIO ~ SEX + CLASS, data = mydata, FUN = mean)
mean_ratio_matrix <- matrix(mean_ratio$RATIO, nrow = 3, byrow = FALSE)
dimnames(mean_ratio_matrix) <- list(c("Female", "Infant", "Male"), c("A1", "A2", "A3", "A4", "A5"))
print("Mean Ratio by Sex and Class")
mean_ratio_matrix
kable(mean_ratio_matrix, format = "html", caption = "Mean Ratio by Sex and Class") %>%
  kable_styling(bootstrap_options = "striped",
                full_width = FALSE,
                font_size = 12,
                position = "left") %>%
  row_spec(0, bold = TRUE, color = "white", background = "darkgrey") %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  add_footnote(c("Data Analysis Assignment #1, Section 5.A"))
```
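Reshaping the *aggregate()* output with *matrix()* only works if the fill order matches the row order of the aggregated data frame. The sketch below checks that on a hypothetical data set with three sexes and two classes; the `toy` data frame and its values are invented, and *tapply()* is used purely as an independent cross-check.

```r
# Hypothetical data: 3 sexes x 2 classes x 2 replicates.
toy <- expand.grid(SEX = c("F", "I", "M"), CLASS = c("A1", "A2"), rep = 1:2)
toy$VOLUME <- c(10, 20, 30, 40, 50, 60,
                14, 24, 34, 44, 54, 64)

# One row per SEX/CLASS combination, with SEX varying fastest.
agg <- aggregate(VOLUME ~ SEX + CLASS, data = toy, FUN = mean)

# A column-wise fill (byrow = FALSE) therefore puts SEX on rows and CLASS on columns.
m <- matrix(agg$VOLUME, nrow = 3, byrow = FALSE,
            dimnames = list(levels(toy$SEX), levels(toy$CLASS)))

# Independent cross-check of the same group means.
check <- tapply(toy$VOLUME, list(toy$SEX, toy$CLASS), mean)

m
```

Taking the labels from `levels()` rather than typing them by hand is a small design choice that keeps the dimnames aligned with the factor order if the data ever change.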
(5)(b) (3 points) Present three graphs. Each graph should include three lines, one for each sex. The first should show mean RATIO versus CLASS; the second, mean VOLUME versus CLASS; the third, mean SHUCK versus CLASS. This may be done with the 'base R' *interaction.plot()* function or with ggplot2 using *grid.arrange()*.
```{r Part_5b, fig.width = 9}
mean_ratio_df <- data.frame(mean_ratio)
mean_volume_df <- data.frame(mean_volume)
mean_shuck_df <- data.frame(mean_shuck)
# One line per sex; columns are referenced by name so each plot uses its own data frame.
mean_ratio_line_graph <- ggplot(data = mean_ratio_df, aes(x = CLASS, y = RATIO, group = SEX, color = SEX)) +
  geom_line() +
  geom_point() +
  labs(subtitle = "Mean Ratio per Class",
       y = "Ratio",
       x = "Class") +
  scale_color_discrete(name = "Sex")
mean_volume_line_graph <- ggplot(data = mean_volume_df, aes(x = CLASS, y = VOLUME, group = SEX, color = SEX)) +
  geom_line() +
  geom_point() +
  labs(subtitle = "Mean Volume per Class",
       y = "Volume",
       x = "Class") +
  scale_color_discrete(name = "Sex")
# The original colored this plot by mean_volume_df$SEX; it now uses its own SEX column.
mean_shuck_line_graph <- ggplot(data = mean_shuck_df, aes(x = CLASS, y = SHUCK, group = SEX, color = SEX)) +
  geom_line() +
  geom_point() +
  labs(subtitle = "Mean Shuck Weight per Class",
       y = "Shuck",
       x = "Class") +
  scale_color_discrete(name = "Sex")
grid.arrange(mean_ratio_line_graph, mean_volume_line_graph, mean_shuck_line_graph, ncol = 3, nrow = 1)
```
**Essay Question (2 points): What questions do these plots raise? Consider aging and sex differences.**
***Answer: Considering each graph separately, the Mean Ratio per Class graph shows that the ratio increases for Females and Males up to Class A2 and then decreases. For Infants, however, the ratio of Shuck over Volume decreases from the beginning (Class A1). In the Mean Volume per Class graph, the volume for all three Sex segments increases until Class A4 and then decreases for Females and Males, while the Infant segment stabilizes. Mean Shuck Weight per Class follows a similar trend, but this time all three Sex segments (Female, Infant and Male) lose weight after Class A4. Volume and Shuck Weight are strongly related: as the abalones age, both increase up to A4 and then either plateau or start to decrease. It is also interesting that Females consistently have a higher mean Volume and Shuck Weight than Males and Infants across all Class segments. Lastly, the abalones seem to continue growing well into their later years, as opposed to other species that may be fully grown in weight and volume at an earlier age.***
(5)(c) (3 points) Present four boxplots using *par(mfrow = c(2, 2))* or *grid.arrange()*. The first line should show VOLUME by RINGS for the infants and, separately, for the adults (factor levels "M" and "F" combined). The second line should show WHOLE by RINGS for the infants and, separately, for the adults. Since the data are sparse beyond 15 rings, limit the displays to less than 16 rings. One way to accomplish this is to generate a new data set using subset() to select RINGS < 16. Use ylim = c(0, 1100) for VOLUME and ylim = c(0, 400) for WHOLE. If you wish to reorder the displays for presentation purposes or use ggplot2 go ahead.
```{r Part_5c}
# Keep abalones with fewer than 16 rings, then split infants from adults (M and F combined).
df_new_mydata <- mydata[mydata$RINGS < 16, ]
infant_new_mydata <- df_new_mydata[df_new_mydata$SEX == "I", ]
adult_new_mydata <- df_new_mydata[df_new_mydata$SEX != "I", ]
# coord_cartesian() applies the y-axis limits given in the prompt without dropping data.
Box_Infant_Volume <- ggplot(infant_new_mydata, aes(x = RINGS, y = VOLUME, group = RINGS)) +
  geom_boxplot(fill = "gold", alpha = .9) +
  labs(y = "Volume",
       x = "Rings",
       subtitle = "Infant Volume | Rings") +
  stat_boxplot(geom = "errorbar") +
  coord_cartesian(ylim = c(0, 1100))
Box_Adult_Volume <- ggplot(adult_new_mydata, aes(x = RINGS, y = VOLUME, group = RINGS)) +
  geom_boxplot(fill = "forestgreen", alpha = .9) +
  labs(y = "Volume",
       x = "Rings",
       subtitle = "Adult Volume | Rings") +
  stat_boxplot(geom = "errorbar") +
  coord_cartesian(ylim = c(0, 1100))
Box_Infant_Whole <- ggplot(infant_new_mydata, aes(x = RINGS, y = WHOLE, group = RINGS)) +
  geom_boxplot(fill = "gold", alpha = .9) +
  labs(y = "Whole Weight",
       x = "Rings",
       subtitle = "Infant Whole Weight | Rings") +
  stat_boxplot(geom = "errorbar") +
  coord_cartesian(ylim = c(0, 400))
Box_Adult_Whole <- ggplot(adult_new_mydata, aes(x = RINGS, y = WHOLE, group = RINGS)) +
  geom_boxplot(fill = "forestgreen", alpha = .9) +
  labs(y = "Whole Weight",
       x = "Rings",
       subtitle = "Adult Whole Weight | Rings") +
  stat_boxplot(geom = "errorbar") +
  coord_cartesian(ylim = c(0, 400))
grid.arrange(Box_Infant_Volume, Box_Adult_Volume, Box_Infant_Whole, Box_Adult_Whole, ncol = 2, nrow = 2)
```
**Essay Question (2 points): What do these displays suggest about abalone growth? Also, compare the infant and adult displays. What differences stand out?**
***Answer: Comparing Infants with the Adult population of Females and Males, we can see high variation in Volume and Whole Weight in Infants when compared to Adults. For both Infants and Adults there also seems to be a positive correlation: as the number of Rings increases, so do Whole Weight and Volume. However, this no longer holds once Rings reach approximately 12. Beyond that point, for both Infants and Adults, Volume and Whole Weight either flatten or start to decrease.***
-----
##### Section 6: (11 points) Conclusions from the Exploratory Data Analysis (EDA).
**Conclusions**
**Essay Question 1) (5 points) Based solely on these data, what are plausible statistical reasons that explain the failure of the original study? Consider to what extent physical measurements may be used for age prediction.**
***Answer: Since the assignment focused on data derived from a study of abalones, we need to look at the various graphs to determine why the study was not successful. The purpose of the study was to predict the age of abalones from physical measurements, such as volume and weight, instead of counting rings. The plots in 2.A (Whole Weight versus Volume) and 2.B (Shuck Weight versus Whole Weight) clearly show strong, positive correlations among the physical measurements. However, that is not sufficient to conclude that age can be predicted from them. We need to evaluate the data from a different perspective: sex. When we break the data down by Sex and compare Volume and Weight across Classes, it becomes clear that age will be difficult to predict from Volume and Weight alone. For example, the graphs in Section 5.B show that Mean Shuck Weight decreases for Infants, Females and Males after Class A4, to such an extent that the Mean Shuck Weight in Class A5 is approximately what abalones of the same Sex have in Class A3. By Mean Shuck Weight alone, we would therefore not be able to tell whether an abalone belonged to Class A3 or Class A5. Knowing the Sex of an abalone may give a general idea of which Class it falls into, but that requires determining the Sex first, and even then Sex may not be an ideal predictor.***
**Essay Question 2) (3 points) Do not refer to the abalone data or study. If you were presented with an overall histogram and summary statistics from a sample of some population or phenomenon and no other information, what questions might you ask before accepting them as representative of the sampled population or phenomenon?**
***Answer: If I were presented with a sample of a population with a few graphs (histograms) and summary statistics, I would ask the following questions to better understand the data and perhaps gain insight into the population that was sampled: 1. What is our understanding of the size of the population, and is the sample size sufficient? 2. What level of confidence (margin of error) is required for our study? 3. How were the data obtained? 4. Are there any outliers or extreme outliers? 5. How varied are the data (e.g., range, mean, median, standard deviation, variance)? 6. Are the data representative of all segments of the population, so that we can truly understand the phenomenon? 7. Are the data skewed in any particular direction?***
**Essay Question 3) (3 points) Do not refer to the abalone data or study. What do you see as difficulties analyzing data derived from observational studies? Can causality be determined? What might be learned from such studies?**
***Answer: Many challenges can be encountered in an observational study. For example, there could be biases, miscalculations, etc. Such challenges can be mitigated, but that comes at a cost: time, resources and money. It is sometimes easy to see the data points and start connecting the dots, especially if we are looking for a particular conclusion. However, we need to be careful not to mistake correlation for causality. Determining causality solely from an observational study can be quite difficult, especially if there are biases and other data-related errors. In addition, factors outside the study may taint the responses or the data collected. We would need to put tight controls in place in the observational study to have some level of confidence that the data will yield an answer, regardless of whether that answer aligns with our hypothesis.***
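The point about outside factors can be illustrated with a small simulation. This is a minimal sketch under invented assumptions: `z` is a hypothetical unobserved common cause, and `x` and `y` each depend on `z` but not on each other. None of the variables or numbers come from a real study.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 1000

z <- rnorm(n)                 # hypothetical factor outside the study
x <- z + rnorm(n, sd = 0.5)   # x depends on z only
y <- z + rnorm(n, sd = 0.5)   # y depends on z only; x has no effect on y

# x and y are strongly correlated even though neither causes the other.
raw_correlation <- cor(x, y)

# After removing the part of each variable explained by z, little association remains.
adjusted_correlation <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = raw_correlation, adjusted = adjusted_correlation)
```

In an observational study the common cause is often unmeasured, so the adjustment in the last step is not available, which is one reason causality is hard to establish from such data.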