---
title: "Amphibian size on Population"
author: "Madison Whitehurst"
date: "4/30/2021"
output: html_document
---

First you will need to load in all the necessary libraries, if you do not have the libraries make sure you install them by using the install.packages function. 
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2) #Visualization tools
library(readr) #Allows R to read your csv files
library(tidyverse) #Package designed for data science to make the data clean and easy to use
library(dplyr) #Allows for the manipulation of data
```

## Female Wood Frogs

### Import and check data
The next step is to import your data. The data is in a CSV file called *female_wood_frog_data.csv*. Make sure that when you download the data it is in the same repository that you are working in and that the name is correct (number might be added to the end).
```{r}
female_wood <- read_csv("female_wood_frog_data.csv")  #This function allows for R to read the data and import it into Rstudio
summary(female_wood)
#Lets make the data more presentable. We just mutated the names, we changed POP to Population. This just lets us be able to understand the data better. We also filtered for ages 3 & 4. 
female_wood %>%
  mutate(Population = POP) %>% #mutate allows you to change the name of your variables. 
  filter(Age %in% c(3,4)) %>%  #filter allows you to select certain values within your data
  dplyr::select(Population, Weight, SVL, Age) -> female_wood #We created a new dataframe with only specifically selected data.  
summary(female_wood)
```
When completing ANOVA's you need to have factored variables. With this being said we need to convert our two variables into factors (a grouping variable). 
```{r}
#Using the as.factor function allows us to convert Age to a factor. The is.factor function allows us to see if the code actually worked. 
female_wood$Age <- as.factor(female_wood$Age)
female_wood$Population <- as.factor(female_wood$Population)
is.factor(female_wood$Age)
#Check to see if Population was successfully changed to a factor. 
```

### Visualizing the data 
Before we do anything, we should take a look at the data just to get a glimpse at how it is acting.
```{r}
ggplot(data = female_wood, aes(x = Population, y = SVL, color = Age)) + geom_boxplot() #ggplot allows us to visualize our data. We are using a boxplot because this is the best way of displaying the distribution of data. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
```
The main purpose of this plot is to help us to understand what the treatments are doing. We want to quickly assess things like: How big are the main effects? What direction do they work in? Is there likely to be an interaction? It looks like that the Snout-vent length tends to be larger on older frogs. The more interesting observation is that it looks like the Snout-vent length for ages tend to be higher for Rural and Suburban Areas than for Urban areas.

### Constructing the Two-Way Anova
At this point, we need to formally specify the null hypothesis we are testing. 
  
  Ho: The effect of population type (urban, suburban, or rural) on the snout-vent length does *not*        depend on the age of the frog. 
  Ha: The effect of population type (urban, suburban, or rural) on the snout-vent length *does* depend on   the age of the frog.

This is the hypothesis of the interaction, but there are also hypotheses for the main effects: size does not depend on age, size does not depend on site, and the effect of age on size does not depend on site, etc. 

For the purpose of this exercise, we are looking mainly at the interaction effect but it is important to know the other hypotheses. This means that we want to fit a model with the interaction of Population and Age. To build the model we use a * symbol between the two explanatory variables. Specifically, if we write Population * Age, it will expand the combination of variable to read Population + Age + Population:Age, such that we have a main effect of Population, a main effect of Age, and the interaction (:) between Population and Age. As with all linear models, we use the lm() function, specifying this trick formula and the data frame:
```{r}
fw_model <- lm(SVL ~ Population * Age , data = female_wood) #The lm function is used to create our model,                                                               we just need to add in our interactions. 
```


### Examine Assumptions
We are now going to produce two diagnostic plots: a normal probability plot to evaluate the normality assumption and a scale-location plot to evaluate the constant variance assumption. The normal probability plot is a graphical tool for comparing a dataset with the normal distribution to assess whether the data is normally distributed. The data is plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality.
```{r}
#Normal probability plot
plot(fw_model, which = 2, add.smooth = FALSE)
```
The points on this normal probability plot form a nearly linear pattern which indicates that the normal distribution is a good model for this data set. 

We are now going to look at the scale-location plot. This plot allows us to see whether or not the variability of the residuals is roughly constant within each group, if it follows the constant variance assumption. The constant variance assumption is that the spread of residuals is roughly equal per treatment level
```{r}
#Scale-location plot
plot(fw_model, which = 3, add.smooth = TRUE)
```
We’re on the lookout for a systematic pattern in the size of the residuals and the fitted values. Does the variability go up or down with the fitted values? There does not appear to be too much of a pattern, so it looks like the constant variance assumption is satisfied here. There are a few outliers (15, 17, and 36) which can severely affect normality and homogeneity of variance, so you have to be careful with outliers. With this being said, we can continue.

### Interpreting the ANOVA results
We are now going to calculate the degrees of freedom, sums of squares, mean squares, the F-ratio, and p-values for the main effects and the interaction terms through the use of the ANOVA function.
```{r}
#ANOVA Model
anova(fw_model)  #Using the ANOVA function allows us to actually run the ANOVA model. 
```
The ANOVA table presents a sums-of-squares analysis-of-variance table. Because we conducted and are analyzing an experiment, we have a specific hypothesis in mind: The effect of population type on snout-vent length depends on age. Testing this hypothesis is embodied in the Population:Age row. This table reveals that there is statistically significant variation in snout-vent length explained by allowing the effect of population type to vary by age (F = 6.4913; df = 2; p = 0.002346).
```{r}
#Summary Table
summary(fw_model)
```
We are now going to go over some of the important components of this summary table. *Call* shows what function and parameters were used to create the model, it shows the two-way anova model we created. *Residuals* are the difference between what the model predicted and the actual value of y. *Coefficients* are the interactions within our model and this includes their estimates, standard error, t statistic, and p-value. The *intercept* tells us that when all the features are at 0, the expected response is the intercept. The *standard error* is the standard error of our estimate, which allows us to construct marginal confidence intervals for the estimate of that particular feature. The *t-value* measures the size of the difference relative to the variation in your sample data. The *p-value* tells us the probability of observing a value. If this probability is sufficiently low (< 0.05), we can reject the null hypothesis. The *R-squared* tells us what proportion of the variance is explained by our model.These are the main important details of this summary model. 

### Understanding the model Graphically
Now that we have the results from the model, how should we go about interpreting the significant effects? Well we can use R to produce an interaction diagram for a two-way design.
```{r}
# Step 1. We need to calculate means for each treatment combination
fw_means <-  #We are creating a new data frame
  female_wood %>%  #Calling in our original data frame
  group_by(Population, Age) %>%   #remember to group by *both* factors
  summarise(MeanSVL = mean(SVL),  #Doing calculations to our data
            SE = sd(SVL)/sqrt(n())) 
fw_means
# Step 2. We then plot these as an interaction plot. We are visualizing what we created in step 1 by using ggplot and plotting the new dataframe created. 
ggplot(fw_means, aes(x = Population, 
                     y = MeanSVL, 
                     colour = Age, 
                     group = Age)) +   #Plotting population and their mean SVL, color coding based on age
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = MeanSVL - SE,
                    ymax = MeanSVL + SE), 
                    width = 0.1) +  #Creating error bars so we can see the data distribution
  theme_bw()
```
There is statistical evidence that the population source and age are not independent of each other, they depend on one another. Now lets look at this figure and see if we can interpret the main effects. It can be seen that as you go along the urban-rural gradient, the mean snout-vent length changes for each of its age group, 3 & 4. In one case, when looking at age 3 frogs, it seems like the population source matters quite a lot in determining the size of the frog at its given age. You can see that the urban population frogs are significantly smaller for their age than the other populations of frogs. But, when looking at age group 4, there isn't that much difference between the size of the frog from this age in all the different populations. With this being said, there is this support of the hypothesis that in populations of urban-sensitive wood frogs, they would have younger age-structures comprised of individuals displaying reduced size at metamorphosis in association with urbanization. 

## Female American Toads
Complete this section on your own to see if American Toads are different among the urban-rural gradient in size at different ages. Follow all the steps that were done for the wood frogs, the only difference is you will be using the toad data set *female_american_toad_data.csv*. 

### Import and check data
```{r}
female_toad <- read_csv("female_american_toad_data.csv")

summary(female_toad)
female_toad %>%
  mutate(Population = POP) %>%
  filter(Age %in% c(4,5)) %>%
  select(Population, Weight, SVL, Age) -> female_toad

  summary(female_toad)

head(female_toad)
#Factor the data
female_toad$Age <- as.factor(female_toad$Age)
female_toad$Population <- as.factor(female_toad$Population)
is.factor(female_toad$Age)
```

### Visualizing the data 
```{r}
ggplot(data = female_toad, aes(x = Population, y = SVL, color = Age)) + geom_boxplot()
```

### Constructing the Two-Way Anova
```{r}
ft_model <- lm(SVL ~ Population * Age , data = female_toad)
ft_model
summary(ft_model)
```

### Examine Assumptions
```{r}
#Normal probability plot
plot(ft_model, which = 2, add.smooth = FALSE)

#Scale-location plot
plot(ft_model, which = 3, add.smooth = TRUE)
```

### Interpreting the ANOVA results
```{r}
anova(ft_model)
```

### Understanding the model Graphically
```{r}
# Step 1. We need to calculate means for each treatment combination
ft_means <- 
  female_toad %>% 
  group_by(Population, Age) %>% # <- remember to group by *both* factors
  summarise(MeanSVL = mean(SVL),
            SE = sd(SVL)/sqrt(n())) 
ft_means
# Step 2. We then plot these as an interaction plot
ggplot(ft_means, aes(x = Population, y = MeanSVL, colour = Age, group = Age)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = MeanSVL - SE,
                    ymax = MeanSVL + SE), width = 0.1) +
  theme_bw()
```