######################## ASSIGNMENT INFO ########################
## Fish Habitat Selection   
## LAB _: Analyzing Resource Selection Functions
## Created 2023 by C. Teal and L. Diaz


####### NOTES FOR INSTRUCTOR

# This script is designed using the Jenney et al. (2023) modified dataset provided with the lesson. 

# If you are using data collected in the field, make sure the database sheet has been saved as a csv file as "Working data.csv" and that the format of the input data matches the format of the example data.

# Modify names of species and covariates as needed. 

####### NOTES ABOUT CODING IN THE R PROGRAM

# R is one of the most common softwares used for statistical analysis in all fields of research, as it is open access and is extremely flexible and user friendly.

# We have pre-coded everything you need to successfully answer the questions in the lab assignment, so you shouldn't need to change any of the code. However, we highly encourage running it line by line rather than all at once to better understand your analysis and more easily troubleshoot errors if any should occur. 

# TO RUN CODE: either select the line of code and hit "Ctrl+Enter" OR highlight the line of code and click "Run" on the top right of the script window.

# In the console, you should see the line of code you just ran, output of the code if called for, and any warning/error messages.

# A function is a command you wish to use on an object. For example, in plot(x), plot is the function and x is the object.

# R is case sensitive. If you change accidentally change or delete a function, refer to the original script for the correct case. R also does not like spaces. Use underscores, dots, or dashes instead. 

# Any text that is not commented out using a "#" symbol is included in the code and will lead to error messages if it is not meant to be in there. It is good practice to heavily annotate your code, so feel free to keep track of results using #comments. If you wish to keep a line of code for reference but do not want it to run, you may comment it out. 

# If you need help understanding the syntax of a function, you may type ?functionname into the code. For example:
?table

######################## SETUP ########################

####### SET WORKING DIRECTORY 

#Your working directory is the folder on your computer where you would like the script and associated input/output to live
setwd('C:/FILE PATH') 
#IMPORTANT: R only recognizes file paths that use either backslash "/" or double forward slash "\\". The default address copied from your computer will most likely only use one forward slash "\". Modify throughout, OR: 
#You can also go to Session > Set working directory > To source file location, assuming you want script output in the same folder as your script. 

####### READ IN DATA 

fish <- read.csv("Working data.csv")
View(fish) #this opens a new window 

####### GET TO KNOW THE DATA AND MODIFY AS NECESSARY

summary(fish) 
#It is important here to verify the class of each variable, as errors occur when, for example, continuous numerical values are stored as characters or vice versa. Categories, such as species, life stage, and macrohab, should be stored as characters. Numerical values should show the distribution of values (quantiles). 
#See "Metadata" in the example database for descriptions of the data and their units. 

#If you need to re-classify other variable types, use the following code:
#fish$Substrate <- as.numeric(fish$Substrate) #if going from character to number
#fish$LifeStage <- as.factor(fish$LifeStage) #if going from other class to character

table(fish$Species) #use to check for errors in input data (ex. misspelled cells that are seen as a separate category) and look at the spread of your data (does each category have a somewhat balanced sample size?)
table(fish$LifeStage)
table(fish$MacroHab)


######################## MODEL 1 ########################
#This model will evaluating the drivers of habitat selection for Species 1 (Roundtail Chub in the example data). 

#Here, because we are evaluating used habitat relative to available habitat, we can make inferences regarding the actual habitat preferences of Species 1. If we did not collect data at available points, we would not be able to make inference about the mechanisms driving habitat use via selection. 

#If using data collected in the field, we suggest using the species with the highest sample size. 

####### FILTER THE DATA 

#Give the model only the data you want to include
m1data <- subset(fish, Species == c("Roundtail Chub","Available"))
View(m1data) #verify that the filtered data has the rows you intended to include)
summary(m1data)

####### CHECK FOR CORRELATIONS IN CONTINUOUS DATA 

#If two continuous variables are highly correlated, the model may be biased or may not converge. Check for correlations and remove variables with a Pearson's correlation coefficient > |0.7|. When selecting which variable to remove, consider the biological hypotheses associated with each. It is a subjective decision. 

names(m1data) #checking to see which column numbers correspond to our continuous predictors 
cor(m1data[4:8]) #As you can see, none of the variables are problematically correlated. You can now move on to fitting a global model.  

####### FIT THE GLOBAL MODEL 

#For the purpose of this activity, we will only be fitting one model that includes all predictors of interest (commonly called the global model). In practice, we would formulate a set of models, each with different subsets of predictor variables that correspond with different hypotheses about the system, and compare them to find the model that best fits our data. By only looking at a global model, we are presenting a simplified version of this analysis in order to focus on the interpretation of model output.  

install.packages('lme4') #this function installs a package of statistical functions that are not included in the base R program. If you've already installed this package, comment it out of the code using "#". 
library(lme4) #this function tells R to load the package 

#here, we are forming an object, m1, which will be the generalized linear model (glm) results of our global model. 
#we specify the family as binomial to tell the model that Use is a binary variable
m1<-glm(Use ~ Depth + Velo + Substrate + Cover + MacroHab ,data = m1data, family= binomial)
#The intercept refers to the responses coded as 0s as well as the categorical variables included in the model. In this case, the intercept represents available habitat and used pool habitat. 
#If you hypothetically want to change which macrohabitat category is selected as the reference category, you can do so by altering the alphabetical order of the category names. For this activity, leave the reference category as is. 

####### INSPECT OUTPUT

m1 #while we can see the model results by simply calling the object we created, we can get a more organized and in-depth output by using the function 'summary'
summary(m1)
#use this output to fill in the table for model 1 - question #2 
#Estimate = beta coefficients
#Std. Error = SE
#Pr(>|z|) = p-value ; rather than copying the exact number, report the significance code level (ex. 0.001 if there are two asterisks **)

#there is no way to export the entire summary.glm output as a csv file, but you may export specific components, such as the beta-coefficients 
betas <- coef(m1)
write.csv(betas, "m1.betas.csv") #this saves the csv file to your working directory if no other file path is specified

####### INTERPRETATION HINTS

#The estimates for the predictor variables are relative to the intercept. 
#When analyzing selection, a positive effect of depth means that fish are using deeper habitats RELATIVE to what is available, a positive effect of velocity means fish are selecting faster habitats relative to what is available, etc. 
#A negative effect of a categorical variable (ex. macrohabitat_run) means that fish are selecting runs less than they are selecting pools.
#Food for thought- how can you use these relationships to describe avoidance and preference of different habitat features? What kinds of hypotheses would you come up with as to why fish are avoiding or preferring different habitats? 

######################## MODEL 2 ########################
#This model will be comparing the the used habitat between adult Roundtail Chub and adult Desert Suckers.

#Here, because we are not comparing habitat use and habitat availability for each species, we cannot make inferences about the mechanisms driving habitat use via selection. We can only make inferences about observable patterns in the habitats where the species were observed. 

#The code can easily be modified to instead compare used habitat between juvenile and adults of a single species if only one species was observed in the field. If using data collected in the field, we suggest using the two groups with the highest sample size.

#There will be less annotation associated with this model, so remember what you learned with model 1! 

####### FILTER THE DATA 

m2data <- subset(fish, Species == c("Roundtail Chub", "Desert Sucker"))
m2data <- subset(m2data, LifeStage == "Adult")
View(m2data) #verify that the filtered data has the rows you intended to include)
summary(m2data)

####### CHECK FOR CORRELATIONS 

cor(m2data[4:8]) 

####### MAKE DUMMY VARIABLES 

#Dummy variables code categorical variables (here, Species 1 and Species 2) into binary variables that are more compatible with coding a logistic model

install.packages("fastDummies") #this package efficiently adds dummy variables to your data frame
library(fastDummies)
m2data1<- dummy_cols(m2data, select_columns = 'Species')
m2data1 
#As you can see, this created two new columns in the data, where each species is coded as 0s or 1s. Because we only have two categories, we can ignore one of the columns, as the information included within one already exists in the other. 
#The species you do not use as your response variable will be your reference category/intercept. You want to make the species with the most data the intercept. In this case, that is Roundtail Chub.Therefore, the response variable will be "Species_Desert Sucker" where chub = 0 and sucker = 1. 

#You also want to rename the variable to remove spaces, which R does not like
names(m2data1) #to identify which column number we want to change
names(m2data1)[10] <- "Sucker"
names(m2data1) #verify that it has been changed

summary(m2data1) #Verify that the new dummy variable is correctly classified as an integer and reclassify if necessary

write.csv(m2data1, 'm2data1_sucker.csv') #it is good practice to save the data file you will use to fit the model. Again, this saves it to your working directory. If you'd like it somewhere else, specify the folder address. 

####### FIT THE GLOBAL MODEL 

library(lme4) #should already be loaded from model 1, but doesn't hurt to call it in again just in case

m2<-glm(Sucker ~  Depth + Velo + Substrate + Cover + MacroHab ,data = m2data1, family= binomial)
#In this case, the intercept represents sucker habitat and pool habitat. 

####### INSPECT OUTPUT 
summary(m2) #answers to model 2 - question #2

#if you wish to export the coefficients into a csv file, use the code from model 1. 


####### INTERPRETATION HINTS

#The estimates for the predictor variables are relative to the intercept. 
#When comparing habitat use between two groups, a positive effect of depth means that the response group (here, suckers) are using deeper habitats RELATIVE to the reference group (chub) are using, a positive effect of velocity means suckers are using faster habitats than chub, etc. 
#A negative effect of a categorical variable (ex. macrohabitat_run) means that suckers are using runs less than they are using pools