This repository has been archived by the owner on Dec 22, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathEEB313 Project Final.Rmd
454 lines (340 loc) · 19.8 KB
/
EEB313 Project Final.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
---
title: "EEB313 Project Final"
output: html_document
date: "2024-12-09"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse) #First install/read in any necessary packages
library(lubridate)
library(stringr)
citation("stringr")
```
1. Loading in the salamanders dataset
```{r}
#Read in the data set
salamanders <- read.csv("redback_salamander_abundance.csv")
salamanders
```
2. Wrangling the data
- Lots of NAs in the dataset and we are interested in the abundances of each unique morph, so we have to filter out the observations that have NAs for those columns
```{r}
#Get rid of the first row (French Translation of column names) and filter out the NAs of the abundance counts for each morph type
present_salamanders <- salamanders %>%
slice(-c(1)) %>%
filter(!Redback.form.of.Eastern.Redback.Salamander.Count %in% "N/A") %>%
filter(!Leadback.form.of.Eastern.Redback.Salamander.Count %in% "N/A")
present_salamanders #lots of observations lost after filtering
```
We are interested in some potentially biologically significant variables for salamander abundances- lets look at potential ones we could incorporate or analyze in our future models
```{r}
#Looking at the soil data to see if we can use it and have consistent observations across time - soil metrics could be indicators for precipitation
salamanders %>%
filter(!Soil.pH %in% "N/A")
salamanders %>%
filter(!Soil.Moisture %in% "N/A") #even fewer observations, and none that can be used since there are no attached morph abundances - so soil moisture cannot be used in our model
```
Since the Dates Months and Years are in the dataset as characters, lets use lubridate to alter these columns and prepare them to be analyzed
```{r}
present_salamanders <- present_salamanders %>% #Using Lubridate to create dates and month columns in correct format
mutate(Date = dmy(Date),
Month = month(Date, label = TRUE))
present_salamanders
```
Next in our wrangling, we need to calculate Salamander abundances for each ecomorph type.
```{r}
#Calculating the redback abundances
redback_abundance <- present_salamanders %>%
mutate(Redback.form.of.Eastern.Redback.Salamander.Count = as.numeric(Redback.form.of.Eastern.Redback.Salamander.Count)) %>%
group_by(Year, Month, Plot.Name) %>%
summarize(redback_abun = sum(Redback.form.of.Eastern.Redback.Salamander.Count, na.rm = TRUE))
# Then check the result
head(redback_abundance)
#Calculating the leadback abundances
leadback_abundance <- present_salamanders %>%
mutate(Leadback.form.of.Eastern.Redback.Salamander.Count = as.numeric(Leadback.form.of.Eastern.Redback.Salamander.Count)) %>%
group_by(Year, Month, Plot.Name) %>%
summarize(leadback_abun = sum(Leadback.form.of.Eastern.Redback.Salamander.Count, na.rm = TRUE))
head(leadback_abundance)
```
Now, we want to combine these calulated abundances into one dataframe, joining the data correctly. We do this by joining by shared Year, Location, and Month to ensure the data points are matched correctly.
```{r}
#Combining the two abundances into one data frame and joining by shared rows
total_abundances <- full_join(redback_abundance, leadback_abundance,
by = c("Month", "Year", "Plot.Name"))
total_abundances
```
Next, we want to incorporate average temperatures across month and year, since we are interested in how temperature affects abundance specifically. We group by the same columns as earlier, month, year and location.
```{r}
#find the average temperature for a given month, year, and loaction
temp_by_month_and_year <- present_salamanders %>%
filter(!Air.Temperature..degC. %in% "N/A") %>%
mutate(Air.Temperature..degC. = as.numeric(Air.Temperature..degC.)) %>%
group_by(Year, Month, Plot.Name) %>%
summarize(avg_temp = mean(Air.Temperature..degC., na.rm = TRUE))
temp_by_month_and_year$Year <- as.numeric(temp_by_month_and_year$Year) #convert this to numeric
temp_by_month_and_year
```
Now join the average temperature data to the abundances data:
```{r}
#join the temperatures with the abundance data
total_abundances$Year <- as.numeric(temp_by_month_and_year$Year) #convert this to numeric
abun_and_temps <- full_join(total_abundances, temp_by_month_and_year,
by = c("Year", "Plot.Name", "Month"))
abun_and_temps
```
3. Data visualization
Now that we have our data correctly situated, we aimed to visualize the raw data of abundance against temperature for each ecomoprh separately.
```{r}
ggplot(abun_and_temps, aes(x = avg_temp, y = redback_abun)) +
geom_point(size = 3, alpha = 0.7) +
# facet_wrap(~Plot.Name) +
labs(x = "Average Temperature",
y = "Redback Abundance",
title = "Redback Salamander Abundance with Average Temperature") +
theme_minimal()
ggplot(abun_and_temps, aes(x = avg_temp, y = leadback_abun)) +
geom_point(size = 3, alpha = 0.7) +
# facet_wrap(~Plot.Name) +
labs(x = "Average Temperature",
y = "Leadback Abundance",
title = "Leadback Salamander Abundance with Average Temperature") +
theme_minimal()
```
From this, we can see a general trend that as the average temperature at a given time increases, the observed abundances appear to be increasing. Based on the visualization, we decided to try to fit both a linear model and a quadratic model. There appears to be some data points at higher temperatures that indicate lower abundance, and that abundance increases with temperature but then decreases, which could suggest that a quadratic regression would fit the data better. But we will test for this directly in our next steps.
4. Fitting the models to our data First, we want to combine the two abundances into one column, and create a new column specifying ecomorph type. This makes it simpler in the long run, when we want to visualize ecomorph as a covariate directly and interaction terms incorporating ecomorph type. This also helps to include more observations into our model, which can help increase the statistical power.
```{r}
rbind(
data.frame(abundance = as.numeric(abun_and_temps$leadback_abun), eco_morph = "leadback", temp = as.numeric(abun_and_temps$avg_temp)),
data.frame(abundance = as.numeric(abun_and_temps$redback_abun), eco_morph = "redback", temp = as.numeric(abun_and_temps$avg_temp))
) -> data
```
Now, lets fit linear models first to the data:
```{r}
lmmodel <- glm(abundance ~ temp + eco_morph + temp*eco_morph, data = data, family = "poisson")
summary(lmmodel)
AIC(lmmodel)
```
Let's fit quadratic:
```{r}
quadmodel <- glm(abundance ~ temp + eco_morph+ I(temp^2) + temp*eco_morph , data = data, family = "poisson")
summary(quadmodel)
AIC(quadmodel)
```
Based off of this, our quadratic model appears to be the better fit to the data. We based this by comparing the AICs, which assesses which model is a better fit to the data. The lower the AIC, the superior the model fit.
Now, we want to create a prediction grid for our model to predict how abundance changes in response to temperature changes for each ecomorph.
```{r}
new_data <- expand.grid(
temp = seq(min(data$temp), max(data$temp), length.out = 100),
eco_morph = unique(data$eco_morph)
)
```
Here we are creating a column for predicted values, that we can add a line for in the model.
```{r}
predictions <- predict(quadmodel, newdata = new_data, type = "response")
new_data$predicted <- predictions
```
```{r}
plot(abundance ~ temp, data = data, col = as.factor(eco_morph), pch = 16,
main = "Temperature Effects on Salamander Abundance",
xlab = "Temperature (°C)",
ylab = "Abundance")
#adding prediction lines for each ecomorph
for(morph in unique(data$eco_morph)) {
subset_data <- new_data[new_data$eco_morph == morph,]
lines(subset_data$temp, subset_data$predicted,
col = as.factor(morph))
}
legend("topright",
legend = unique(data$eco_morph),
col = as.factor(unique(data$eco_morph)),
pch = 16)
```
Here, we can see lines showing the predicted abundances at a given temperature for each ecomorph.
5. Now, we want to analyze other potential predictors for our model, such as precipitation. But first, we have to backtrack and wrangle the data to clean it for analysis.
```{r}
precipitation <- salamanders %>%
slice(-c(1)) %>%
mutate(precipitation = as.numeric(str_extract(Precipitation.in.the.Last.24.hours..mm., "[0-9]+"))) %>%
mutate(Date = dmy(Date),
Year = as.numeric(Year),
Month = month(Date, label = TRUE))
```
Now, we want to calculate the average precipitation for a given year, month, and location.
```{r}
precipitation <- precipitation %>%
group_by(Year, Month, Plot.Name) %>%
summarize(avg_precip = mean(precipitation, na.rm = TRUE))
```
Now, we want to join the precipitation to the abundance temps and precipitation data, and ensure that any NAs are filtered that could have been reintroduced in combining datasets.
```{r}
abun_temps_precip <- full_join(abun_and_temps, precipitation,
by = c("Year", "Plot.Name", "Month"))
abun_temps_precip <- abun_temps_precip%>%
filter(!is.na(avg_temp)) %>%
filter(!is.na(leadback_abun)) %>%
filter(!is.na(leadback_abun))
```
Lets visualize how abundances change in response to precipitation with the raw data:
```{r}
abun_temps_precip %>%
ggplot(aes(x = avg_precip, y = redback_abun)) +
geom_point(size = 3, alpha = 0.7) +
# facet_wrap(~Plot.Name) +
labs(x = "Average Precipitation",
y = "Redback Abundance",
title = "Redback Salamander Abundance with Average Precipitation") +
theme_minimal()
abun_temps_precip %>%
ggplot(aes(x = avg_precip, y = leadback_abun)) +
geom_point(size = 3, alpha = 0.7) +
# facet_wrap(~Plot.Name) +
labs(x = "Average Precipitation",
y = "Leadback Abundance",
title = "Leadback Salamander Abundance with Average Precipitation") +
theme_minimal()
```
There appears to not be a lot of observations at various precipitation levels and lots of clustering, but we will incorporate it in our model as a covariate to see if there is any statistically significant relationship.
Now to create the data frame that will be used to fit this new model:
```{r}
rbind(
data.frame(abundance = as.numeric(abun_temps_precip$leadback_abun), eco_morph = "leadback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), location = abun_temps_precip$Plot.Name),
data.frame(abundance = as.numeric(abun_temps_precip$redback_abun), eco_morph = "redback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), location = abun_temps_precip$Plot.Name)
) -> data
```
```{r}
quadmodel2 <- glm(abundance ~ temp + I(temp^2) + eco_morph + temp*eco_morph + precipitation, data = data, family = "poisson")
summary(quadmodel2)
AIC(quadmodel2)
```
Introducing precipitation as a covariate decreases the AIC by around 4 units, so we will incorporate precipitation in the models going forward.
6. Sampling bias and Ensuring Independence of Observations
Since we decided to use abundance as a response variable, this could raise the concern that abundance observations are not independent of each other (or could be attributed to abundance patterns from previous years.) To negate this, we first visualized what were the general abundance patterns year to year for the morphs.
```{r}
#First some wranging to calculate the yearly abundances
new_abun_temp_precip <- abun_temps_precip %>%
group_by(Year) %>%
summarize(yearly_abun_lead = sum(leadback_abun),
yearly_abun_red = sum(redback_abun)) %>%
arrange((Year))
#lets first look at how abundance patterns are changing annually
plot(new_abun_temp_precip$Year, new_abun_temp_precip$yearly_abun_lead, type = 'l',
col = 'blue3',
main = "Salamander Abundance by Year and Ectomorph",
xlab = 'Year',
ylab = 'Salamander Abundance',
ylim = range(c(0, 450)))
lines(new_abun_temp_precip$Year, new_abun_temp_precip$yearly_abun_red, col = 'red3')
legend("topright",
legend = c("Leadback", "Redback"),
col = c("blue3", "red3"),
lty = 1)
```
From this we can visually see that there doesn't seem that abundance is steadily increasing with year (as we could maybe expect to see if abundances were not independent of each other). In other words, abundance being higher in one year does not appear to mean abundance is going to be higher in the year following it. However, we will account for this in our model by including year as a covariate.
Further, through our project concerns of sampling bias have come up. We have evidence to suggest that sampling effort has not remained consistent across the duration of the study.
```{r}
unique(present_salamanders$Date)
```
Looking at all the unique dates where observations were recorded from, we can see that the dates across the duration of the study are very few overall. Certain months are sampled more consistently over time (May) while others (June) are sampled less consistently. This leads us to question if there are simply less observations or sampling periods at higher temperatures, due to more of the sampling effort taking place at months that likely wouldn't have higher temperatures (\> 25 degrees Celsius).
```{r}
testforfrequency <- present_salamanders %>%
mutate(Air.Temperature..degC. = as.numeric(Air.Temperature..degC.)) %>%
mutate(Year = as.numeric(Year))
testforfrequency %>%
ggplot() +
geom_histogram(aes(x = Air.Temperature..degC.), bins = 30) +
labs(title = "Frequency of Observations by Temperature", x = "Temperature (°C)", y = "Number of Observations") +
theme_minimal()
```
Here we see that the majority of the observations takes place within the 10-20 degree celsius range, with very few observations taking place in the 25-30+ degree range. Since this histogram accounts for all observations (the total number of recorded observations/samples at a given temperature range, not abundance), it raises the concern that the trends we see with higher abundances of salamanders are attributed less so to a temperature response and more so to the increased sampling effort in certain conditions over others.
We now want to add year as a covariate in the model (to ensure that abundance observations are independent of each other) and number of observations (which acts as measure of sampling bias by summing the total number of observations in a given condition- grouped by month, year, and location)
```{r}
present_salamanders1 <- present_salamanders%>%
group_by(Year, Month, Plot.Name) %>%
summarize(number_of_observations = n()) #creating number of observations
present_salamanders1
num_observations<- data.frame(present_salamanders1$Year, present_salamanders1$Month, present_salamanders1$Plot.Name, present_salamanders1$number_of_observations)
num_observations <- num_observations %>%
mutate(Year = as.numeric(num_observations$present_salamanders1.Year)) %>%
mutate(Month = present_salamanders1.Month) %>%
mutate(Plot.Name = present_salamanders1.Plot.Name) %>%
mutate(number_of_observations = present_salamanders1.number_of_observations) #some wrangling to simplify column names
abun_temps_precip <- full_join(
full_join(abun_and_temps, precipitation, by = c("Year", "Plot.Name", "Month")),
num_observations,
by = c("Year", "Plot.Name", "Month")
) #joining the data frames
abun_temps_precip <- abun_temps_precip%>% #filtering any NAs added by combining dataframes
filter(!is.na(avg_temp)) %>%
filter(!is.na(leadback_abun)) %>%
filter(!is.na(leadback_abun))
abun_temps_precip <- abun_temps_precip[, -c(8, 9, 10, 11)] #removing redundant columns
colnames(abun_temps_precip) #verifying accurate column removal
```
Now we want to create the final dataframe to analyze the model accounting for year and sampling bias:
```{r}
rbind(
data.frame(abundance = abun_temps_precip$leadback_abun, eco_morph = "leadback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), year = as.numeric(abun_temps_precip$Year)),
data.frame(abundance = abun_temps_precip$redback_abun, eco_morph = "redback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), year = as.numeric(abun_temps_precip$Year))) -> data
rbind(
data.frame(data, num_observations = as.numeric(abun_temps_precip$number_of_observations, Month = abun_temps_precip$Month))
) -> data
new_model <- glm(abundance ~ temp + I(temp^2) + eco_morph + temp*eco_morph + precipitation + num_observations + year, data = data, family = "poisson")
summary(new_model)
plot(new_model)
AIC(new_model)
```
This new model now has the lowest AIC, with the number of observations, temperature, interaction of temperature and ecomorph, ecomorph type, and year having extremely low probabilities that the true effect is 0 (albeit some of these coefficients appear numerically really small). To further assess the impacts each covariate has on the model, we want to now perform some analyses to observe how each covariate uniquely affects abundance when all other covariates are held constant.
First, what happens when we look specifically at how sampling effort (number of observations) affects observed abundances?
```{r}
# Define fixed values for predictors - define sampling effort
new_observations <- seq(min(data$num_observations), max(data$num_observations), by = 10)
fixed_temp <- mean(data$temp) # Hold temperature constant at the average
fixed_year <- median(data$year) # Hold year constant using the median year
eco_morph <- c("redback", "leadback")
precipitation <- mean(data$precipitation) #fixed precipitation at mean precipitation
new_data <- expand.grid(
temp = fixed_temp,
eco_morph = eco_morph,
num_observations = new_observations,
year = fixed_year,
precipitation = precipitation
)
new_data$predicted_abundance <- predict(new_model, new_data, type = "response")
ggplot(data, aes(x = num_observations, y = abundance, color = eco_morph)) +
geom_point() +
geom_line(data = new_data, aes(x = num_observations, y = predicted_abundance, color = eco_morph)) +
labs(
x = "Sampling Effort (Number of Observations)",
y = "Abundance",
color = "Ecomorph"
) +
ggtitle("Effect of Sampling Effort on Abundance") +
theme_minimal()
```
This graph helps visualize that when all other covariates are held constant, there is a clear positive relationship between sampling effort and observed abundances, with the prediction line for each ecomorph displayed.
Now, we want to look at how abundance patterns with year change, with all other covariates held constant.
```{r}
#Change fixed params
fixed_observations <- max(data$num_observations) #holding number of observations constant at highest number
changing_year <- data$year
new_data <- expand.grid(
temp = fixed_temp,
eco_morph = eco_morph,
num_observations = fixed_observations,
year = changing_year,
precipitation = precipitation
)
new_data$predicted_abundance <- predict(new_model, new_data, type = "response")
ggplot(data, aes(x = year, y = abundance, color = eco_morph)) +
geom_point() +
geom_line(data = new_data, aes(x = year, y = predicted_abundance, color = eco_morph)) +
labs(
x = "Year",
y = "Abundance",
color = "Ecomorph"
) +
ggtitle("Effect of Year on Abundance") +
theme_minimal()
```
From this, we can see that when all other covariates are held constant, abundance is overall predicted to slightly increase with year. If we accept this relationship as significant at alpha \< 0.05, this could call into question a fundamental assumption of our model, which is that abundances are independent year to year and not dependent on each other. However, given the earlier analysis showing the extent of the relationship with sampling effort, and the and argue that the relationships we observe are better explained by the systemic bias and inconsistencies in sampling effort across locations, months and times, and due to random variation year to year, rather than a lack of independence in the abundance observations.