EEB313 Project Final.Rmd

---
title: "EEB313 Project Final"
output: html_document
date: "2024-12-09"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse) #First install/read in any necessary packages
library(lubridate)
library(stringr)
citation("stringr")
```

1.  Loading in the salamanders dataset

```{r}
#Read in the data set
salamanders <- read.csv("redback_salamander_abundance.csv")
salamanders
```

2.  Wrangling the data

-   Lots of NAs in the dataset and we are interested in the abundances of each unique morph, so we have to filter out the observations that have NAs for those columns

```{r}
#Get rid of the first row (French Translation of column names) and filter out the NAs of the abundance counts for each morph type
present_salamanders <- salamanders %>%
  slice(-c(1)) %>%
  filter(!Redback.form.of.Eastern.Redback.Salamander.Count %in% "N/A") %>%
  filter(!Leadback.form.of.Eastern.Redback.Salamander.Count %in% "N/A") 
present_salamanders #lots of observations lost after filtering 
```

We are interested in some potentially biologically significant variables for salamander abundances- lets look at potential ones we could incorporate or analyze in our future models

```{r}
#Looking at the soil data to see if we can use it and have consistent observations across time - soil metrics could be indicators for precipitation 
salamanders %>%
  filter(!Soil.pH %in% "N/A") 

salamanders %>%
  filter(!Soil.Moisture %in% "N/A") #even fewer observations, and none that can be used since there are no attached morph abundances - so soil moisture cannot be used in our model 

```

Since the Dates Months and Years are in the dataset as characters, lets use lubridate to alter these columns and prepare them to be analyzed

```{r}
present_salamanders <- present_salamanders %>% #Using Lubridate to create dates and month columns in correct format 
   mutate(Date = dmy(Date),  
         Month = month(Date, label = TRUE)) 
present_salamanders 
```

Next in our wrangling, we need to calculate Salamander abundances for each ecomorph type.

```{r}
#Calculating the redback abundances 
redback_abundance <- present_salamanders %>%
  mutate(Redback.form.of.Eastern.Redback.Salamander.Count = as.numeric(Redback.form.of.Eastern.Redback.Salamander.Count)) %>%
  group_by(Year, Month, Plot.Name) %>%
  summarize(redback_abun = sum(Redback.form.of.Eastern.Redback.Salamander.Count, na.rm = TRUE))
# Then check the result
head(redback_abundance)
#Calculating the leadback abundances
leadback_abundance <- present_salamanders %>%
  mutate(Leadback.form.of.Eastern.Redback.Salamander.Count = as.numeric(Leadback.form.of.Eastern.Redback.Salamander.Count)) %>%
  group_by(Year, Month, Plot.Name) %>%
  summarize(leadback_abun = sum(Leadback.form.of.Eastern.Redback.Salamander.Count, na.rm = TRUE))
head(leadback_abundance)

```

Now, we want to combine these calulated abundances into one dataframe, joining the data correctly. We do this by joining by shared Year, Location, and Month to ensure the data points are matched correctly.

```{r}
#Combining the two abundances into one data frame and joining by shared rows 
total_abundances <- full_join(redback_abundance, leadback_abundance, 
                             by = c("Month", "Year", "Plot.Name"))
total_abundances 
```

Next, we want to incorporate average temperatures across month and year, since we are interested in how temperature affects abundance specifically. We group by the same columns as earlier, month, year and location.

```{r}
#find the average temperature for a given month, year, and loaction
temp_by_month_and_year <- present_salamanders %>%
  filter(!Air.Temperature..degC. %in% "N/A") %>%
  mutate(Air.Temperature..degC. = as.numeric(Air.Temperature..degC.)) %>%
  group_by(Year, Month, Plot.Name) %>%
  summarize(avg_temp = mean(Air.Temperature..degC., na.rm = TRUE))

temp_by_month_and_year$Year <- as.numeric(temp_by_month_and_year$Year) #convert this to numeric
temp_by_month_and_year
```

Now join the average temperature data to the abundances data:

```{r}
#join the temperatures with the abundance data 
total_abundances$Year <- as.numeric(temp_by_month_and_year$Year) #convert this to numeric


abun_and_temps <- full_join(total_abundances, temp_by_month_and_year,
                             by = c("Year", "Plot.Name", "Month"))

abun_and_temps
```

3.  Data visualization

Now that we have our data correctly situated, we aimed to visualize the raw data of abundance against temperature for each ecomoprh separately.

```{r}
ggplot(abun_and_temps, aes(x = avg_temp, y = redback_abun)) +
  geom_point(size = 3, alpha = 0.7) + 
  # facet_wrap(~Plot.Name) +  
  labs(x = "Average Temperature", 
       y = "Redback Abundance",
       title = "Redback Salamander Abundance with Average Temperature") +
  theme_minimal() 


ggplot(abun_and_temps, aes(x = avg_temp, y = leadback_abun)) +
  geom_point(size = 3, alpha = 0.7) + 
  # facet_wrap(~Plot.Name) +  
  labs(x = "Average Temperature", 
       y = "Leadback Abundance",
       title = "Leadback Salamander Abundance with Average Temperature") +
  theme_minimal() 

```

From this, we can see a general trend that as the average temperature at a given time increases, the observed abundances appear to be increasing. Based on the visualization, we decided to try to fit both a linear model and a quadratic model. There appears to be some data points at higher temperatures that indicate lower abundance, and that abundance increases with temperature but then decreases, which could suggest that a quadratic regression would fit the data better. But we will test for this directly in our next steps.

4.  Fitting the models to our data First, we want to combine the two abundances into one column, and create a new column specifying ecomorph type. This makes it simpler in the long run, when we want to visualize ecomorph as a covariate directly and interaction terms incorporating ecomorph type. This also helps to include more observations into our model, which can help increase the statistical power.

```{r}
rbind(
  data.frame(abundance = as.numeric(abun_and_temps$leadback_abun), eco_morph = "leadback", temp = as.numeric(abun_and_temps$avg_temp)),
  data.frame(abundance = as.numeric(abun_and_temps$redback_abun), eco_morph = "redback", temp = as.numeric(abun_and_temps$avg_temp))
) -> data

```


Now, lets fit linear models first to the data:

```{r}
lmmodel <- glm(abundance ~ temp + eco_morph + temp*eco_morph, data = data, family = "poisson")
summary(lmmodel)
AIC(lmmodel)
```

Let's fit quadratic:

```{r}
quadmodel <- glm(abundance ~ temp + eco_morph+ I(temp^2) + temp*eco_morph , data = data, family = "poisson")
summary(quadmodel)
AIC(quadmodel)
```

Based off of this, our quadratic model appears to be the better fit to the data. We based this by comparing the AICs, which assesses which model is a better fit to the data. The lower the AIC, the superior the model fit.

Now, we want to create a prediction grid for our model to predict how abundance changes in response to temperature changes for each ecomorph.

```{r}
new_data <- expand.grid(
    temp = seq(min(data$temp), max(data$temp), length.out = 100),
    eco_morph = unique(data$eco_morph)
)
```

Here we are creating a column for predicted values, that we can add a line for in the model.

```{r}
predictions <- predict(quadmodel, newdata = new_data, type = "response")
new_data$predicted <- predictions
```

```{r}
plot(abundance ~ temp, data = data, col = as.factor(eco_morph), pch = 16, 
     main = "Temperature Effects on Salamander Abundance",
     xlab = "Temperature (°C)",
     ylab = "Abundance")
     
#adding prediction lines for each ecomorph 
  for(morph in unique(data$eco_morph)) {
    subset_data <- new_data[new_data$eco_morph == morph,]
    lines(subset_data$temp, subset_data$predicted, 
          col = as.factor(morph))
    }
legend("topright", 
       legend = unique(data$eco_morph),
       col = as.factor(unique(data$eco_morph)), 
       pch = 16)

```

Here, we can see lines showing the predicted abundances at a given temperature for each ecomorph.

5.  Now, we want to analyze other potential predictors for our model, such as precipitation. But first, we have to backtrack and wrangle the data to clean it for analysis.

```{r}
precipitation <- salamanders %>%
  slice(-c(1)) %>%
  mutate(precipitation = as.numeric(str_extract(Precipitation.in.the.Last.24.hours..mm., "[0-9]+"))) %>%
  mutate(Date = dmy(Date), 
         Year = as.numeric(Year),
         Month = month(Date, label = TRUE)) 

```

Now, we want to calculate the average precipitation for a given year, month, and location.

```{r}
precipitation <- precipitation %>%
  group_by(Year, Month, Plot.Name) %>%
    summarize(avg_precip = mean(precipitation, na.rm = TRUE))
```

Now, we want to join the precipitation to the abundance temps and precipitation data, and ensure that any NAs are filtered that could have been reintroduced in combining datasets.

```{r}
abun_temps_precip <- full_join(abun_and_temps, precipitation,
                             by = c("Year", "Plot.Name", "Month"))

abun_temps_precip <- abun_temps_precip%>%
  filter(!is.na(avg_temp)) %>%
  filter(!is.na(leadback_abun)) %>%
  filter(!is.na(leadback_abun))


```
Lets visualize how abundances change in response to precipitation with the raw data: 
```{r}
abun_temps_precip %>%
  ggplot(aes(x = avg_precip, y = redback_abun)) + 
  geom_point(size = 3, alpha = 0.7) + 
  # facet_wrap(~Plot.Name) +  
  labs(x = "Average Precipitation", 
       y = "Redback Abundance",
       title = "Redback Salamander Abundance with Average Precipitation") +
  theme_minimal() 

abun_temps_precip %>%
  ggplot(aes(x = avg_precip, y = leadback_abun)) + 
  geom_point(size = 3, alpha = 0.7) + 
  # facet_wrap(~Plot.Name) +  
  labs(x = "Average Precipitation", 
       y = "Leadback Abundance",
       title = "Leadback Salamander Abundance with Average Precipitation") +
  theme_minimal() 
```
There appears to not be a lot of observations at various precipitation levels and lots of clustering, but we will incorporate it in our model as a covariate to see if there is any statistically significant relationship. 

Now to create the data frame that will be used to fit this new model:

```{r}
rbind(
  data.frame(abundance = as.numeric(abun_temps_precip$leadback_abun), eco_morph = "leadback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), location = abun_temps_precip$Plot.Name),
  data.frame(abundance = as.numeric(abun_temps_precip$redback_abun), eco_morph = "redback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), location = abun_temps_precip$Plot.Name)
) -> data

```

```{r}

quadmodel2 <- glm(abundance ~ temp +  I(temp^2) + eco_morph + temp*eco_morph + precipitation, data = data, family = "poisson")
summary(quadmodel2)

AIC(quadmodel2)
```

Introducing precipitation as a covariate decreases the AIC by around 4 units, so we will incorporate precipitation in the models going forward. 

6.  Sampling bias and Ensuring Independence of Observations

Since we decided to use abundance as a response variable, this could raise the concern that abundance observations are not independent of each other (or could be attributed to abundance patterns from previous years.) To negate this, we first visualized what were the general abundance patterns year to year for the morphs.

```{r}
#First some wranging to calculate the yearly abundances 
new_abun_temp_precip <- abun_temps_precip %>%
  group_by(Year) %>%
  summarize(yearly_abun_lead = sum(leadback_abun),
            yearly_abun_red = sum(redback_abun)) %>%
  arrange((Year))
#lets first look at how abundance patterns are changing annually 

plot(new_abun_temp_precip$Year, new_abun_temp_precip$yearly_abun_lead, type = 'l', 
     col = 'blue3', 
     main = "Salamander Abundance by Year and Ectomorph",
     xlab = 'Year',
     ylab = 'Salamander Abundance',
    ylim = range(c(0, 450)))
lines(new_abun_temp_precip$Year, new_abun_temp_precip$yearly_abun_red, col = 'red3')
legend("topright", 
       legend = c("Leadback", "Redback"),
       col = c("blue3", "red3"),
       lty = 1)
```

From this we can visually see that there doesn't seem that abundance is steadily increasing with year (as we could maybe expect to see if abundances were not independent of each other). In other words, abundance being higher in one year does not appear to mean abundance is going to be higher in the year following it. However, we will account for this in our model by including year as a covariate.

Further, through our project concerns of sampling bias have come up. We have evidence to suggest that sampling effort has not remained consistent across the duration of the study.

```{r}
unique(present_salamanders$Date)
```

Looking at all the unique dates where observations were recorded from, we can see that the dates across the duration of the study are very few overall. Certain months are sampled more consistently over time (May) while others (June) are sampled less consistently. This leads us to question if there are simply less observations or sampling periods at higher temperatures, due to more of the sampling effort taking place at months that likely wouldn't have higher temperatures (\> 25 degrees Celsius).

```{r}
testforfrequency <- present_salamanders %>%
  mutate(Air.Temperature..degC. = as.numeric(Air.Temperature..degC.)) %>%
  mutate(Year = as.numeric(Year)) 
  

testforfrequency %>%
  ggplot() +
  geom_histogram(aes(x = Air.Temperature..degC.), bins = 30) + 
  labs(title = "Frequency of Observations by Temperature", x = "Temperature (°C)", y = "Number of Observations") +
  theme_minimal()
```

Here we see that the majority of the observations takes place within the 10-20 degree celsius range, with very few observations taking place in the 25-30+ degree range. Since this histogram accounts for all observations (the total number of recorded observations/samples at a given temperature range, not abundance), it raises the concern that the trends we see with higher abundances of salamanders are attributed less so to a temperature response and more so to the increased sampling effort in certain conditions over others.

We now want to add year as a covariate in the model (to ensure that abundance observations are independent of each other) and number of observations (which acts as measure of sampling bias by summing the total number of observations in a given condition- grouped by month, year, and location)

```{r}
present_salamanders1 <- present_salamanders%>%
  group_by(Year, Month, Plot.Name) %>%
  summarize(number_of_observations = n()) #creating number of observations
present_salamanders1


num_observations<- data.frame(present_salamanders1$Year, present_salamanders1$Month, present_salamanders1$Plot.Name, present_salamanders1$number_of_observations)
num_observations <- num_observations %>%
  mutate(Year = as.numeric(num_observations$present_salamanders1.Year)) %>%
  mutate(Month = present_salamanders1.Month) %>%
  mutate(Plot.Name = present_salamanders1.Plot.Name) %>%
  mutate(number_of_observations = present_salamanders1.number_of_observations) #some wrangling to simplify column names

abun_temps_precip <- full_join(
  full_join(abun_and_temps, precipitation, by = c("Year", "Plot.Name", "Month")),
  num_observations, 
  by = c("Year", "Plot.Name", "Month")
) #joining the data frames


abun_temps_precip <- abun_temps_precip%>% #filtering any NAs added by combining dataframes 
  filter(!is.na(avg_temp)) %>%
  filter(!is.na(leadback_abun)) %>%
  filter(!is.na(leadback_abun)) 

abun_temps_precip <- abun_temps_precip[, -c(8, 9, 10, 11)] #removing redundant columns

colnames(abun_temps_precip) #verifying accurate column removal 

```

Now we want to create the final dataframe to analyze the model accounting for year and sampling bias:

```{r}
rbind(
  data.frame(abundance = abun_temps_precip$leadback_abun, eco_morph = "leadback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), year = as.numeric(abun_temps_precip$Year)), 
  data.frame(abundance = abun_temps_precip$redback_abun, eco_morph = "redback", temp = as.numeric(abun_temps_precip$avg_temp), precipitation = as.numeric(abun_temps_precip$avg_precip), year = as.numeric(abun_temps_precip$Year))) -> data
  
rbind(
  data.frame(data, num_observations = as.numeric(abun_temps_precip$number_of_observations, Month = abun_temps_precip$Month))
) -> data


new_model <-  glm(abundance ~ temp +  I(temp^2) + eco_morph + temp*eco_morph + precipitation + num_observations + year, data = data, family = "poisson")
summary(new_model)
plot(new_model) 
AIC(new_model)
```

This new model now has the lowest AIC, with the number of observations, temperature, interaction of temperature and ecomorph, ecomorph type, and year having extremely low probabilities that the true effect is 0 (albeit some of these coefficients appear numerically really small). To further assess the impacts each covariate has on the model, we want to now perform some analyses to observe how each covariate uniquely affects abundance when all other covariates are held constant.

First, what happens when we look specifically at how sampling effort (number of observations) affects observed abundances?

```{r}
# Define fixed values for predictors - define sampling effort
new_observations <- seq(min(data$num_observations), max(data$num_observations), by = 10) 
fixed_temp <- mean(data$temp) # Hold temperature constant at the average 
fixed_year <- median(data$year) # Hold year constant using the median year
eco_morph <- c("redback", "leadback") 
precipitation <- mean(data$precipitation) #fixed precipitation at mean precipitation 


new_data <- expand.grid(
  temp = fixed_temp, 
  eco_morph = eco_morph, 
  num_observations = new_observations, 
  year = fixed_year,
  precipitation = precipitation
)

new_data$predicted_abundance <- predict(new_model, new_data, type = "response")


ggplot(data, aes(x = num_observations, y = abundance, color = eco_morph)) +
  geom_point() +  
  geom_line(data = new_data, aes(x = num_observations, y = predicted_abundance, color = eco_morph)) +  
  labs(
    x = "Sampling Effort (Number of Observations)",
    y = "Abundance",
    color = "Ecomorph"
  ) +
  ggtitle("Effect of Sampling Effort on Abundance") +
  theme_minimal()

```

This graph helps visualize that when all other covariates are held constant, there is a clear positive relationship between sampling effort and observed abundances, with the prediction line for each ecomorph displayed.

Now, we want to look at how abundance patterns with year change, with all other covariates held constant.

```{r}
#Change fixed params
fixed_observations <- max(data$num_observations) #holding number of observations constant at highest number 
changing_year <- data$year 

new_data <- expand.grid(
  temp = fixed_temp, 
  eco_morph = eco_morph, 
  num_observations = fixed_observations, 
  year = changing_year, 
  precipitation = precipitation
)

new_data$predicted_abundance <- predict(new_model, new_data, type = "response")


ggplot(data, aes(x = year, y = abundance, color = eco_morph)) +
  geom_point() +  
  geom_line(data = new_data, aes(x = year, y = predicted_abundance, color = eco_morph)) +  
  labs(
    x = "Year",
    y = "Abundance",
    color = "Ecomorph"
  ) +
  ggtitle("Effect of Year on Abundance") +
  theme_minimal()

```

From this, we can see that when all other covariates are held constant, abundance is overall predicted to slightly increase with year. If we accept this relationship as significant at alpha \< 0.05, this could call into question a fundamental assumption of our model, which is that abundances are independent year to year and not dependent on each other. However, given the earlier analysis showing the extent of the relationship with sampling effort, and the and argue that the relationships we observe are better explained by the systemic bias and inconsistencies in sampling effort across locations, months and times, and due to random variation year to year, rather than a lack of independence in the abundance observations.