---
title: "Fluoxetine and Crab Behavior"
author: "Leah Shotwell"
date: "5/9/2022"
output: html_document
---

IMPORTANT: Please make sure the downloaded data file is saved in CSV format as "EE_crab_behav_data.csv". The code will not work if it is named anything else!

First, let's make sure all the necessary packages are installed on the computer.

```{r install}

# install.packages() expects the package names as a single character vector
install.packages(c("readr", "tidyr", "dplyr", "ggplot2", "tidyverse", "tinytex", "lubridate", "readxl", "ggpubr", "moments"))


```

Now that all the packages have been installed, let's load them into this document.


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(tinytex)
library(lubridate)
library(readxl)
library(ggpubr)
library(moments)
```

In this lesson, we will use a multiple linear regression model to assess the relationship between the "still" behavior and two other behaviors. More specifically, we will analyze the occurrence of foraging and active behaviors in the studied crabs.

1. What is a multiple linear regression model? How does it differ from a simple regression model?

```{r import, message=FALSE, warning=FALSE}
# reading in the data file
EE_crab_behav_data <- read_csv("~/QUBES Project/EE_crab_behav_data.csv")
# opening the data in a viewer to inspect it
view(EE_crab_behav_data)

```

For our regression, we are going to focus solely on the recorded rates of still behavior exhibited by the crabs in the presence of predators. More specifically, we want to determine the strength of the relationships between foraging and active behaviors (independent variables) and stillness/inactivity (denoted as Still in the data). To simplify our analysis, we will limit our data to only larger male crabs (Sex = M and Status = Dom) and will only use observations from the nighttime (when the crabs are more active). 
```{r data wrangling}

# filtering the data to keep only the observations we want to analyze
filtered_data <- filter(EE_crab_behav_data, Sex == "M", Status == "Dom", Time == "Night")
```

## Checking Assumptions 1-3

First, we need to check the first three assumptions: independence of observations, normality, and linearity. We will wait until after fitting the model to check the last assumption (homoscedasticity). To begin, let's compute the correlation between the two independent variables we want to study, to confirm that they are not strongly correlated with each other.
```{r independence of observations}
cor(filtered_data$Foraging, filtered_data$Active)
# assumption number 1
```
The correlation is 0.0167, so the two independent variables are essentially uncorrelated with each other, and we can treat the first assumption (independence of observations) as met.
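
One way to put a number on "not highly correlated" is the variance inflation factor (VIF). The VIF is not part of the original lesson; the sketch below is only an illustration, and it uses the correlation value quoted above. With just two independent variables, the VIF reduces to 1 / (1 - r^2), and values close to 1 suggest that the predictors carry almost no overlapping information.

```{r vif-sketch}
# Correlation between Foraging and Active, as quoted in the text
r <- 0.0167

# With only two predictors, the variance inflation factor is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)
vif
```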


Next, let's make a histogram to confirm that the dependent variable (still behavior, denoted as Still) is distributed normally. Remember, we are ideally looking for something similar to a bell curve.
```{r normality}
hist(filtered_data$Still)
# assumption 2
```
At first glance, this histogram does not appear to be normally distributed. Let's check to see if the data is positively or negatively skewed.

```{r skew check}

skewness(filtered_data$Still, na.rm = TRUE)

```
This negative value means that our data is negatively/left skewed. Generally, this means that Mode > Median > Mean. Thankfully, the value itself is small, indicating only a slight departure from a normal distribution. We also tried log and square-root transformations, but the untransformed data was still the least skewed. Therefore, we will not be transforming our data.
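
The comparison of transformations can be sketched as follows. This is a minimal illustration, not the exact code used for the lesson: `skew()` and `compare_skew()` are hypothetical helper names, the `+ 1` inside the log is an assumption made to avoid `log(0)`, and `toy_still` is made-up data rather than the crab data. The `skew()` helper uses the same moment-based formula as `moments::skewness()`.

```{r skew-transform-sketch}
# Moment-based skewness (same formula as moments::skewness), ignoring NAs
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / mean((x - mean(x))^2)^(3 / 2)
}

# Skewness of a variable before and after two common transformations
compare_skew <- function(x) {
  c(original    = skew(x),
    square_root = skew(sqrt(x)),
    log         = skew(log(x + 1)))  # + 1 is an assumption to avoid log(0)
}

# Hypothetical left-skewed percentages, NOT the crab data
toy_still <- c(20, 55, 70, 78, 84, 88, 91, 94, 96, 98)
toy_skew <- compare_skew(toy_still)
toy_skew
```

To apply the same check to the lesson's data, `compare_skew(filtered_data$Still)` would return the three values side by side; the version closest to zero is the least skewed.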

Now, let's create a scatterplot of Still against each of our independent variables to assess linearity.
```{r linearity}
plot(Still ~ Foraging, data=filtered_data)
plot(Still ~ Active, data=filtered_data)
# assumption 3
```
The relationships in both of these graphs appear to be linear. Therefore, we can continue with our analysis.

## Linear Model

Now, let's form our linear model.

```{r formation of linear model}
mortality.rate.lm<-lm(Still ~ Foraging + Active, data = filtered_data)

summary(mortality.rate.lm)


```
2. In this model, which variables are independent? Which are dependent?

Let's interpret the output of this summary. The estimated effects of the foraging and active variables were -0.953 and -1.000, respectively. This means that for every 1 percent increase in foraging, still behavior decreases by about 0.95 percentage points, and for every 1 percent increase in active behavior, still behavior decreases by about 1 percentage point, in each case holding the other variable constant. This makes sense, as a crab cannot be both still and foraging or still and active.
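
Written as an equation, the fitted model described by these estimates takes the form below. The intercept is left as the symbol $\hat{\beta}_0$ because its value is not quoted in this lesson; it can be read from the `summary()` output.

$$
\widehat{\text{Still}} = \hat{\beta}_0 - 0.953 \times \text{Foraging} - 1.000 \times \text{Active}
$$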

For both variables, the standard error is small (0.01994 and 0.01445), and the t-statistic is large in magnitude (-47.81 and -69.23). Additionally, the p-value is very small (below 2.2e-16). Together, these values indicate that the estimated relationships are very unlikely to be the result of chance alone.



## Checking Assumption 4
Now that we have our linear model, let's confirm that it meets the assumption of homoscedasticity (homogeneity of variances).
```{r homoscedasticity}
par(mfrow=c(2,2))
plot(mortality.rate.lm)
par(mfrow=c(1,1))
# assumption 4
```
For our purposes, let's focus on the top two graphs. In the residuals vs fitted graph, we are looking for the red line (a smoothed average of the residuals) to be horizontal and approximately zero. Homoscedasticity also shows up in this graph as points spread evenly above and below the line across the range of fitted values. The red line requirement is satisfied, so we can move on to the normal Q-Q plot. In this graph, the points on the left side deviate from the straight dotted line, which marks where the points would fall if the residuals were normally distributed. This indicates that the residuals are left skewed/negatively skewed, which matches the skew we already found in the data. The right side appears normal. From this, we can say that the assumption of homoscedasticity is being met to the best of our ability.
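
As a rough numerical companion to the residuals vs fitted graph, the sketch below compares the spread of the residuals in the lower and upper halves of the fitted values. It is only an illustration under stated assumptions: `resid_spread()` is a hypothetical helper, splitting at the median is an arbitrary choice, and `toy` is simulated data rather than the crab data. Similar standard deviations in the two halves are consistent with homoscedasticity.

```{r residual-spread-sketch}
# Compare residual spread in the lower and upper halves of the fitted values
resid_spread <- function(model) {
  res <- residuals(model)
  fit <- fitted(model)
  upper <- fit > median(fit)
  c(mean_resid = mean(res),  # ~0 by construction when the model has an intercept
    sd_lower   = sd(res[!upper]),
    sd_upper   = sd(res[upper]))
}

# Simulated data for illustration only, NOT the crab data
set.seed(1)
toy <- data.frame(Foraging = runif(50, 0, 40), Active = runif(50, 0, 40))
toy$Still <- 100 - toy$Foraging - toy$Active + rnorm(50)
toy_lm <- lm(Still ~ Foraging + Active, data = toy)
toy_spread <- resid_spread(toy_lm)
toy_spread
```

For the lesson's model, the equivalent call would be `resid_spread(mortality.rate.lm)`.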

Congrats! All of the previous steps contributed towards the creation of a multiple linear regression model.

3. What do the results tell us about the relationship between still and foraging + active?