This repository has been archived by the owner on Dec 22, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathEEB313 Updated Code (Nov 26).Rmd
428 lines (330 loc) · 18.1 KB
/
EEB313 Updated Code (Nov 26).Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
---
title: "EEB313 Mid Project Update"
output: pdf_document
date: "2024-11-14"
author: Gabi, Golshan, Samara, Jared
---
The original paper, “Investigating the role of biological characteristics on adaptive capacity within alien and rare plant species in China” had several flaws which prevented us from continuing to use its dataset for our project. Namely, the authors attempted to quantify adaptive capacity by simply asking "five experts" to rank categories of plant traits and their levels (e.g., in the category of reproductive mode there were three levels: (a) sexual, (b) asexual, or (c) combined sexual-asexual reproduction) they ranked from "most important: 5" to "least important: 1". Then all of these scores for all plant traits were compiled using a formula that was not fully disclosed in the main article, rather, in the complementary supplementary files. Despite the value of “expert opinions”, more robust quantitative methods of determining significant variables should have been utilized. Additionally, it is not a simple feat to quantify adaptive capacity through various “expert-informed” (as they mention) traits as this typically requires more comprehensive methods that include genetic, environmental, and evolutionary variables considered together. Considering these constraints, this dataset would have been incredibly hard to utilize in a unique and useful way.
Instead, we have chosen to analyze the data from a paper by Jiang et al. titled “Restoration of native saltmarshes can reverse arthropod assemblages and trophic interactions changed by a plant invasion”. The authors of the paper investigated how arthropod assemblages and trophic interactions changes in estuary salt marshes that are invaded with the non native Spartina alterniflora. They established 15 plots across 5 different plant communities which represents a gradient of stages of Spartina (exotic plant) invasion and removal: OP: Original Phragmites monoculture, TP: Threatened Phragmites monoculture, PS: Phragmites–Spartina mixture, IS: Invasive Spartina monoculture, and RP: Restored Phragmites monoculture. One of the datasets Jiang et al. (2022) provides is a set of raw data which includes four sheets: Arthropod list, Plant traits, Soil traits, and Stable isotopes. To test our hypothesis, we ultimately decided to focus on which traits, whether it be plant or soil traits, is/are most significant in explaining the variation in arthropod community (i.e. Shannon diversity, recall: relative abundance and evenness) across plant communities, and if there are differences in how these traits predict arthropod diversity between plant communities.
We can find the same conclusions of Jiang et al. (2022) in our own way to determine and verify the results. Jiang et al. found that the plant traits most responsible for driving changes in arthropod diversity was aboveground biomass, leaf N, and plant density. They found that the soil trait most responsible for driving changes in arthropod diversity was soil salinity. Here we state:
H0: Soil and plant traits represented by principal component axes (PC1 and PC2) do not significantly predict arthropod diversity across all communities.
H1: Soil and plant traits represented by principal component axes (PC1 and PC2) are strong predictors of arthropod diversity across all communities.
H0.2 Soil and Plant traits represented by principal component axes do not differ in how they predict arthropod diversity across different communities.
H2: Soil and Plant traits represented by principal component axes differ in how they predict arthropod diversity across different communities.
To prepare our data for analysis, we had to reformat the data frame. We started by separating each sheet (plant traits, soil traits, and arthropod list) on the provided Excel file to their own csv with altered simplified columns. Then, we worked with the arthropod data frame that listed the raw abundances of each arthropod species in every treatment in each of the 5 plant communities. The original data frame had each treatment as columns, and every arthropod species listed in the first column. We transformed the data so the first column would have the treatments and every subsequent column had species as headers. We also removed unnecessary columns (columns that had order, family and trophic group, as well as a column that only had NA values).
Once the data was properly formatted, we could continue with the data analysis:
The following method will be used to create a linear model to determine the strongest predictors of Shannon diversity of all our communities. We intend to compare our results with that of the original authors, in terms of determining which plant and soil traits are primarily responsible for driving differences in arthropod diversity across communities and groups
- We calculated Shannon diversity per treatment for all of the communities (15 treatments for 5 communities = 75 total Shannon diversity values) (we calculated Shannon diversity because the authors chose that diversity metric, we want to keep it consistent so our data is as comparable as possible)
- We then determined the distribution type that best fits the Shannon diversity indices by conducting maximum log-likelihood calculations for a normal, gamma, and inverse gaussian probability density distribution. We used dnorm(), dgamma(), and a hand-made inverse gaussian formula to loop over a range of parameter values.
- Our MLE for normal was -19.28 (when mean = 3.43 and sd = 0.31), the MLE for gamma was -20.56 (when shape = 116.1 and scale = 0.0295), the MLE for inverse gaussian was -21.73 (when mu = 3.4 and lambda = 379.6). Thus, a normal probability density distribution fits our data best. This allows us to use a linear regression model for our future analysis.
- We ran a pairwise correlation test between each of the 11 plant and soil traits.
- We found many of the soil traits to have high correlation values, and thus be linearly correlated. On the other hand, our plant traits did not show high correlations between other traits. We intend to use this data further in our analysis to interpret our results of the linear models to understand the variable aggregations within our PC axes.
- Due to the strong correlations between plant and soil traits, we decided to run a PCA to determine which of our 11 plant and soil traits contributed most strongly to the variation between our treatments to inform model selection, we generated a PCA of all possible predictor variables from the dataset (Soil salinity, soil N, soil C, soil P, soil moisture, soil pH, aboveground plant biomass, leaf C, leaf N, leaf P). We then determined which traits fall on each PC axis using contrib/pc.out function.
- The plant and soil traits that are associated with PC1 are the most responsible (55.64%) for driving variance between treatments, with Soil N content (12.6%), Soil moisture (12.5%) and soil pH (11.4%) being the most significant. For PC2 (10.5% of total variance), the variables, soil C content (20.1%), plant density (16.3%) and soil salinity (15.4%) contributed most strongly.
The analysis we still need to perform are:
- Perform model selection, using the PC axes as fixed effects
- Generate the following models:
- with only PC1
- with both PC1 and PC2,
- with both PC1, PC2 and their interaction
- Select the model with the lowest AIC score
- The fixed effect (PC axis) with the largest coefficient (absolute value) is the most responsible for driving changes in arthropod diversity (Test H1)
- Conduct a group-by-group analysis of each plant community - to see if there are any differences in the drivers of diversity in each community (Test H2)
- Compare results with the original paper
R Code so far:
# Loading packages
```{r, message=F, warning=F}
require(tidyverse)
require(deSolve)
require(vegan)
require(car)
require(rgl)
require(ggfortify)
require(fitdistrplus)
options(rgl.useNULL = TRUE)
rgl::setupKnitr(autoprint = TRUE)
```
# Loading csv files
```{r}
#Provided original csv files (separated sheets from the master excel file)
# column names were edited for simplicity sake
read_csv("Arthropod_List.csv") -> Arthropod_List
read_csv("Plant_Traits.csv") -> Plant_Traits
read_csv("Soil_Traits.csv") -> Soil_Traits
```
## Rearraged Arthropod_List dataframe (each species is a column, each treatment is a row)
```{r}
teatment_cols <- Arthropod_List[,5:79]
species_col <- Arthropod_List[c("Species")]
new_AL <- data.frame(species_col, teatment_cols)
AL_long <- new_AL %>%
pivot_longer(
cols = (starts_with("OP") | starts_with("RP") | starts_with("TP") |
starts_with("IS") | starts_with("PS")), # Select treatment columns
names_to = "Treatment", # Name of new "Treatment" column
values_to = "Count" # # Name of new "Count" column
)
```
```{r}
AL_wide <- AL_long %>%
pivot_wider(
names_from = Species, # Create new columns from Species
values_from = Count # Fill values from the "Count" column
)
fixed_arthro <- AL_wide
```
# Shannon Diversity Calculations
```{r}
#Excluding the first and last column (as they d not contain data)
for_shannon = fixed_arthro[,2:350]
#Shannon diversity
shannon_indices = diversity(for_shannon, index = "shannon")
```
```{r}
#Reformatting with column names specified
final_shannon = data.frame(Treatment = fixed_arthro$Treatment,
Shannon = shannon_indices)
```
## Histogram of Shannon Diversity
Preliminary to determining distribution type.
```{r}
ggplot(final_shannon, aes(Shannon)) +
geom_histogram(binwidth = 0.05, alpha = 0.7, colour="white")
```
## Another way to histogram... looking normal..
```{r}
hist(final_shannon$Shannon, probability = TRUE, col = "lightblue", border = "black", breaks = 25)
lines(density(final_shannon$Shannon), col = "red", lwd = 2) # Adds a density curve
```
This is looks normal, however we need to determine distribution type in order to make the correct analysis and conclusions about our findings...
#Data wrangling
##Group Treatments into Communities
```{r}
#Data is separated by into 75 treatments -> groups with 15 treatments each
#To draw better conclusions, we need to group by group
final_shannon$Community <- ifelse(startsWith(final_shannon$Treatment, "OP"), "OP",
ifelse(startsWith(final_shannon$Treatment, "RP"), "RP",
ifelse(startsWith(final_shannon$Treatment, "TP"), "TP",
ifelse(startsWith(final_shannon$Treatment, "IS"), "IS",
ifelse(startsWith(final_shannon$Treatment, "PS"), "PS", "Other")))))
```
Rearranging columns so "Community" is beside "Treatment"
```{r}
final_shannon <- final_shannon[, c("Treatment", "Community", "Shannon")]
```
Visualization of Shannon diversity for each community
```{r}
ggplot(final_shannon, aes(x = Community, y = Shannon)) +
geom_bar(stat = "identity", fill = "yellowgreen", color = "black") +
theme_minimal() +
labs(title = "Shannon Diversity for each Community", x = "Community",
y = "Shannon Diversity")
```
## Add all the predictor variable columns (soil and plant traits) onto final_shannon dataframe...
```{r}
final_shannon$soil_C <- Soil_Traits$Soil_C
final_shannon$soil_N <- Soil_Traits$Soil_N
final_shannon$soil_P <- Soil_Traits$Soil_P
final_shannon$soil_pH <- Soil_Traits$Soil_pH
final_shannon$soil_sal <- Soil_Traits$Soil_Salinity
final_shannon$soil_wat <- Soil_Traits$Soil_Water
final_shannon$biomass_above <- Plant_Traits$Aboveground_Biomass
final_shannon$plant_dens <- Plant_Traits$Plant_Density
final_shannon$leaf_N <- Plant_Traits$Leaf_N
final_shannon$leaf_C <- Plant_Traits$Leaf_C
final_shannon$leaf_P <- Plant_Traits$Leaf_P
```
## Making sure all the number columns are numeric!!
Note that this also converts 'n/a' cells into NULL -- we will have to replace this later with an estimate!
```{r}
final_shannon[, 3:14] <- lapply(final_shannon[, 3:14], as.numeric)
```
###Saving the final_shannon dataframe as a csv file for sharing and future edits...
```{r}
write.csv(final_shannon, "final_shannon.csv")
```
##Filtering data for our final dataset
OP14 and OP15 have NULL leaf_N values (due to equipment failure) -> here, we replace them with the mean of their community instead...
```{r}
replaced_FS <- final_shannon
```
```{r}
#First filter out the data to involve only the OP community and then
# exclude the 2 rows with "n/a" leaf_N values:
final_shannon %>%
filter(final_shannon$Community == "OP",
!is.na(final_shannon$leaf_N)) -> OPonly_noNA_FS
```
```{r}
# Replacing the 2 cells with the mean of the OP'S leaf_N column
replaced_FS <- final_shannon
replaced_FS[replaced_FS$Treatment == "OP14", "leaf_N"] <-
mean(OPonly_noNA_FS$leaf_N)
replaced_FS[replaced_FS$Treatment == "OP15", "leaf_N"] <-
mean(OPonly_noNA_FS$leaf_N)
```
###Renaming master data frame to a shorter name bc I will have to use it over and over again in log-likelihood evalutors!
```{r}
data <- replaced_FS
```
# Making ranges of parameter combinations for each distribution
## MLE: We test for normal, gamma, and inverse gaussian because they are the only distribution types that can reasonably fit our data and are generally accepted in a generalized linear mixed model!
```{r}
norm_combos <- expand.grid(mean = seq(2.4, 4.2, 0.01),
sd = seq(0.15, 0.7, 0.01))
gamma_combos <- expand.grid(shape = seq(100, 150, 0.1),
scale = seq(0.02, 0.04, 0.0005))
invgaus_combos <- expand.grid(mu = seq(2, 6, 0.1),
lambda = seq(200, 600, 0.1))
```
#Log-Likelihood evalutions
```{r}
norm_loglik <- function(x = data$Shannon, mean, sd){
return (sum(log(dnorm(x = x, mean = mean, sd = sd))))
}
```
```{r}
gamma_loglik <- function(x = data$Shannon, shape, scale){
return (sum(log(dgamma(x = x, shape = shape, scale = scale))))
}
```
```{r}
invgaus_loglik <- function(x = data$Shannon, mu, lambda){
return (sum(log(
sqrt(lambda/(2*pi*x^3)) * exp((-lambda*(x-mu)^2)/(2*x*mu^2))
)))
}
```
```{r}
LogLik_norm <- c() # empty log-likelihoods vector for NORMAL
for (i in 1:nrow(norm_combos)){
LogLik_norm[i] <- norm_loglik(mean = norm_combos$mean[i],
sd = norm_combos$sd[i])
}
LLs_norm <- data.frame(LogLik_Norm = LogLik_norm,
mean = norm_combos$mean,
sd = norm_combos$sd)
```
```{r}
LogLik_gamma <- c() # empty log-likelihoods vector for GAMMA
for (i in 1:nrow(gamma_combos)){
LogLik_gamma[i] <- gamma_loglik(shape = gamma_combos$shape[i],
scale = gamma_combos$scale[i])
}
LLs_gamma <- data.frame(LogLik_Gamma = LogLik_gamma,
shape = gamma_combos$shape,
scale = gamma_combos$scale)
```
```{r}
LogLik_invgaus <- c() # empty log-likelihoods vector for INVERSE GAUSSIAN
for (i in 1:nrow(invgaus_combos)){
LogLik_invgaus[i] <- invgaus_loglik(mu = invgaus_combos$mu[i],
lambda = invgaus_combos$lambda[i])
}
LLs_invgaus <- data.frame(LogLik_InvGaus = LogLik_invgaus,
mu = invgaus_combos$mu,
lambda = invgaus_combos$lambda)
```
```{r}
ggplot(LLs_norm, aes(x = mean, y = LogLik_Norm)) + geom_line()
ggplot(LLs_norm, aes(x = sd, y = LogLik_Norm)) + geom_line()
MLE_norm <- LLs_norm %>% subset(LogLik_Norm == max(LogLik_Norm))
MLE_norm
```
```{r}
ggplot(LLs_gamma, aes(x = shape, y = LogLik_Gamma)) + geom_line()
ggplot(LLs_gamma, aes(x = scale, y = LogLik_Gamma)) + geom_line()
MLE_gamma <- LLs_gamma %>% subset(LogLik_Gamma == max(LogLik_Gamma))
MLE_gamma
```
```{r}
ggplot(LLs_invgaus, aes(x = mu, y = LogLik_InvGaus)) + geom_line()
ggplot(LLs_invgaus, aes(x = lambda, y = LogLik_InvGaus)) + geom_line()
MLE_invgaus <- LLs_invgaus %>% subset(LogLik_InvGaus == max(LogLik_InvGaus))
MLE_invgaus
```
# Find which varaiables are linearly and positively correlated
```{r}
par(mfrow=c(1,1))
pairs(data[, c("soil_C", "soil_N", "soil_P", "soil_pH", "soil_sal", "soil_wat",
"biomass_above", "plant_dens", "leaf_N", "leaf_C", "leaf_P")])
```
#Test for co-linearality
```{r}
cor_matrix <- cor(data[, c("soil_C", "soil_N", "soil_P", "soil_pH", "soil_sal",
"soil_wat", "biomass_above", "plant_dens", "leaf_N",
"leaf_C", "leaf_P")],
use = "pairwise.complete.obs")
print(cor_matrix)
```
#PCA
```{r}
pca.out <- prcomp(data[, c("soil_C", "soil_N", "soil_P", "soil_pH", "soil_sal",
"soil_wat", "biomass_above", "plant_dens", "leaf_N",
"leaf_C", "leaf_P")], scale.=T)
```
```{r}
pca.out
```
```{r}
summary(pca.out)
```
```{r}
head(pca.out$rotation)
```
## Visualized PCA
```{r}
autoplot(pca.out, x=1,y=2,
data=data,
colour="Community",
frame=TRUE) +
theme_classic()
```
#Things to do after PCA: regression on PC1, PC2, and interaction!! Discriminate!...
```{r}
data_frame_test <- as.data.frame (pca.out$rotation)
```
```{r}
write.csv(data_frame_test, "/Users/gabi/Downloads/pca_dataframe.csv")
```
```{r}
axes <- predict(pca.out, newdata=data)
head(axes, 4)
```
```{r}
dat<- cbind (data, axes)
test_model1 <- lm(formula=Shannon~PC1, data=dat)
summary(test_model1)
test_model2 <- lm(formula=Shannon~PC1+PC2, data=dat)
summary(test_model2)
test_model3 <- lm(formula=Shannon~PC1*PC2, data=dat)
summary(test_model3)
test_model4 <- lm(formula=Shannon~PC2, data=dat)
summary(test_model4)
```
```{r}
AIC(test_model1)
AIC(test_model2)
AIC(test_model3)
AIC(test_model4)
AIC(test_model5)
AIC(test_model6)
AIC(test_model7)
AIC(test_model8)
```
tldr: model 3 is the best (PC1 and PC2 and their interaction)
```{r}
test_model5 <- lmer(formula=Shannon~PC1*PC2 + (1|Community), data=dat)
summary(test_model5)
test_model6 <- lmer(formula=Shannon~PC1+PC2 + (1|Community), data=dat)
summary(test_model6)
test_model7 <- lmer(formula=Shannon~PC1 + (1|Community), data=dat)
summary(test_model7)
test_model8 <- lmer(formula=Shannon~PC2 + (1|Community), data=dat)
summary(test_model8)
```
```{r}
ggplot(data=dat, aes(x=PC1, y=Shannon)) +
geom_point() +
geom_smooth()
```