---
title: "Analyzing Tillage Practices on Squash Bees"
author: "Grace Lumsden-Cook"
date: "5/7/2022"
output: html_document
---

## First, lets load all necessary libraries in.  
```{r}
# For reading in the data
library(readr)
# For visualizing the data
library(ggplot2)
# For manipulating the data 
library(plyr)

```
If you receive an error message when trying to read a library in, you may need to install them (install.packages()) before moving ahead. 

## Lets load our data into R
```{r}
SB_resp <- read.csv("squash_bee_resp.csv")
SB_resp

```
The data is comprised of 242 squash bee observations on farms throughout Michigan between the years of 2017 and 2019 (between the months of June and September). 

**Notice that TillageType is a categorical factor with three levels, and the SquashBees is a numerical variable that is either zero or greater than zero.

### Exploring the survey response data ###

In this lesson, we will be looking at how squash bee abundance (count data) is affected by three types of tillage practices which vary in their levels of soil disturbance. Before we start analyzing this data using statistical modeling, lets look at the summary statistics of the raw data.    
```{r}
SB_sum <- ddply(SB_resp, "TillageType", summarise,
               N = length(SquashBees),
               mean = mean(SquashBees),
               median = median(SquashBees),
               sum = sum(SquashBees),
               sd = sd(SquashBees), 
               se = sd/sqrt(N)
               
)               
SB_sum
```
There looks to be less observations on farms that reported Full Tillage (n = 24) when compared to those that reported No (n = 114) and Reduced Tillage (n = 104) farming practices. 

## Question 1: Which tillage type resulted in the highest number of squash bee observations total?


Visualize the distribution of squash bees for each of the three tillage types using a boxplot.
```{r}
#These libraries are necessary for creating the boxplot.
library(viridis)
library(hrbrthemes)

SB_resp %>%
  ggplot( aes(x= TillageType, y=SquashBees, fill=TillageType)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Squash bee abundance by varying tillage types") +
    xlab("")
```
## Question 2: Does there seem to be a best and worse tillage type for promoting squash bee abundance? 


### Finding a model to fit our data ###

Because of the way that this survey was designed, there are a few statistical tests that cannot be used to test this relationship between tillage and squash bees. One of them being linear, or multiple regression (y = mx + b...). This statistical model would not work because it assumes that the variables are continuous, but obviously there cannot be negative count data.

### Do we have random effects? ###

Lets take a step back and think about the experimental design of this survey, and how it implicates the ways we analyze this data. 

We are trying to understand how well the predictor variable (tillage type) explains the response veriable (squash bee abundance). However, due to the data collection methods there is some randomness that must be built into the model in order to account for the "noise", or natural variability of the data. These random effects include spatial and temporal factors such as:

(1) Some counties may have more agriculture land, and thus more pollinators.
(2) Some counties may be highly developed, and have very little farmland for hosting squash bees.
(3) A large number of observations may have been made during the height of the foraging season for the bees.
(4) There are multiple observations for the same study sites (farms) throughout the three years.

How do we account for these effects, you may ask?

### Linear mixed effect model ###

Using a linear mixed effect model, we are able to account for this temporal and spatial variance mentioned above. These random effects, or grouping factors are written into the model by using the syntax (1|groupingfactor). 

```{r}
# Creating the model factors
date.sb <- as.factor(SB_resp$Date)
year.sb <- as.factor(SB_resp$Year)
tilltype.sb <- as.factor(SB_resp$TillageType)
county.sb <- as.factor(SB_resp$County)

```


```{r}
#For running the model
library(lme4)

# First, add in the spatial variability. Pay attention to the results for the random effects.
SB_mixed_spat <- lmer(SquashBees ~ tilltype.sb + (1|county.sb), data=SB_resp) 
summary(SB_mixed_spat)

```
The variance due to this random effect can be calculated by dividing the value for county.sb by the total variance. 
(1.657 / (6.222 + 1.657)) =
## ~20% of the total variance in the data is due to this spatial factor.


```{r}

# Lets add just the temporal random effects. 
SB_mixed_date <- lmer(SquashBees ~ tilltype.sb + (1|date.sb), data=SB_resp) 

summary(SB_mixed_date)

```
## Question 3: How much variance is due to the temporal random effect? 


```{r}
plot(SB_mixed_date)
plot(SB_mixed_spat)
```


```{r}
qqnorm(resid(SB_mixed_date))
qqnorm(resid((SB_mixed_spat)))
```
## Question 4: Looking at the previous plots, does either model appropriately fit the data? Why? (Hint: are there any patterns or clumping of points in the first, and do the points follow a straight line in the second plot)


### Are these random effects crossed or nested? ###

We still have about 50% of the variability unaccounted for by the two random effects that have been added in. While it is good that the random effects are explaining some of the "noise", there is a better way to fit these to the model using our understanding of how the data was collected.

A random effect can either be crossed or nested depending on the relationships between variables. 

Crossed random effects occur when the study subject(s) interact with different study areas (experimental units), thus causing the same subject to be sampled within multiple contexts. Nested random effects, on the other hand, occur when variables have environmental (seasonal, land use patterns, etc.) properties that can lead to unexplained similarities between observations.   

Date and County are not crossed because the squash bees tend to visit flowers that are close to their ground nests (within the same county). So, we can infer that these random effects are nested!


### Fitting the mixed effect model with the nested random effects ###

```{r}
# nested random effects are added in by using the syntax (1|Date/County) 

SB_nested_mixed <- lmer(SquashBees ~ tilltype.sb + (1|date.sb/county.sb), data=SB_resp) 
summary(SB_nested_mixed)

```
## Question 5: What is the explained variance attributed to the nested random effect?


** Note that the focal paper, Appenfellar et al, 2020, used a generalized linear mixed model (GLMMM) for their analysis of tillage types on squash bee visits instead of a LMM. This statistical approach adds onto linear mixed-effects modeling, and allows for not only analyzing data with non-independent variables, but also non-normal (Gaussian) distribution of data points. 
```{r}
hist(SB_resp$SquashBees)
```
They chose a GLMM because this data is not normally distributed, hence the large number of observations resulting in zero or less than two visits shown in the histogram above. It allowed them to create a model that was tailor-fit to the specific distribution of these data points, as well as the nested random effects that we account for in this lesson. 


## Let's assess the model fit ##

In order to access the fit, we must compare the full model with a reduced model that does not have any fixed effects (TillageType). We replace this predictor factor with "1"
```{r}

full_SB_mixed <- lmer(SquashBees ~ tilltype.sb + (1|date.sb/county.sb), data=SB_resp, REML = FALSE)

reduced_SB_mixed <- lmer(SquashBees ~ 1 + (1|date.sb/county.sb), data=SB_resp, REML = FALSE)

```


Using anova(), compare the two models. The point of this comparison is to show how well the mixed model with nested random effects predicts the squash bee abundance for a given tillage type.  
```{r}
anova(reduced_SB_mixed, full_SB_mixed)
```
## Question 6: Are the two models significantly different from eachother? (Hint: look at p-value) What does this mean for the fitness of the model?


```{r}
# Finally, load in the (emmeans) package and look at the estimated marginal means for this model. 
library(emmeans)

means_nestedmix <- emmeans(SB_nested_mixed, pairwise ~ tilltype.sb, adjust="fdr")
means_nestedmix

```

Plotting the results of this estimation helps to visualize the significance of the model outputs. Having values that overlap with zero indicate that there is less than 95% confidence in significance (or validity) of that estimation.   
```{r}

plot(means_nestedmix) + theme_bw() + 
  labs(x = "Estimated marginal mean for squash bee abundance", y = "Tillage Type")

```
## Question 7: Are there any significant relationships between tillage and squash bee abundance based on the different types of tillage? Any non-significant relationships?