---
title: "Investigating the Drivers of Western Monarch Decline Using PLSR"
author: "Jackie Luu"
date: "Created on 2023-04-16"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

#If you don't have the necessary packages listed below, remove the hashtag to install the package.
#install.packages("pls")
#install.packages("tidyverse")
#install.packages("caret")
#install.packages("vip")

#Load the packages. 
library(pls) #Used for creating the PLS model. 
library(tidyverse) #Used for manipulating the data. 
library(caret) #Helps create the Variable Importance Plots in conjunction with the vip package. 
library(vip) #Used for creating the Variable Importance Plot. 

```

Before proceeding with the analysis, please read the results and discussion sections of the focal paper if you haven't already. Focus on the results for the a priori PLSR analysis, but I encourage you to read the other sections as well, since it highlights how PLSR compares to other analyses. 

```{r}
#Load in the data. 
MonarchData <- read_csv("MonarchData.csv")
```

```{r}
#View the first six rows of the data.
head(MonarchData)
```

Here is a summary of the metadata describing each of the columns. 

Year: The years included in the dataset ranging from 1982 to 2016.

Log_N: The natural log of the estimated western monarch abundance in the overwintering habitat for the respective year. 

Log_PrevN: The natural log of the estimated western monarch abundance for the previous year. 

Coast_Dev: The estimated proportion of developed lands in the overwintering habitat in coastal California.

Br_PDSI: The Palmer Drought Severity Index averaged from January to September in the breeding habitat.

Br_Temp: The average monthly maximum temperature in degrees Celsius from June to August in the breeding habitat. 

Gly_Ag: The summed amount of glyphosate in pounds applied for agricultural purposes in California for the respective year.

Gly_NonAg: The summed amount of glyphosate in pounds applied for non-agricultural purposes in California for the respective year.

NN_Ag: The summed amount of neonicotinoids in pounds applied for agricultural purposes in California for the respective year.

NN_NonAg: The summed amount of neonicotinoids in pounds applied for non-agricultural purposes in California for the respective year.

Coast_P: Precipitation in inches averaged across the overwintering habitat in coastal California from December to February.

Coast_T: The minimum monthly temperature in degrees Celsius averaged across the overwintering habitat in coastal California from December to February.


The original dataset was manipulated to remove columns that are not relevant for the a priori PLSR. Columns were also renamed for simplicity. The Log_PrevN column was created using the Year and Log_N columns provided in the dataset. Data for the year 1981 was removed, since the focal paper only analyzes data from 1982 to 2016. 

With this new dataset, we have most of the information we need to conduct an a priori PLSR. However, the amount of pesticides applied is separated between agricultural and non-agricultural applications. Following the methods used in the focal paper, we want to sum the agricultural and non-agricultural applications to get the total amount of glyphosate used and the total amount of neonicotinoids used in California each year. 

```{r}
MonarchData$Gly <- MonarchData$Gly_Ag + MonarchData$Gly_NonAg #Creates a new column representing the sum of the glyphosate used for both agricultural and non-agricultural purposes

MonarchData$NN <- MonarchData$NN_Ag + MonarchData$NN_NonAg
#Creates a new column representing the sum of the neonicotinoids used for both agricultural and non-agricultural purposes

#We can remove the original four columns from the data frame.
MonarchData <- subset(MonarchData, select = -c(Gly_Ag, Gly_NonAg, NN_Ag, NN_NonAg))

#Check the new data to see the new columns.
head(MonarchData)
```

Our data is now ready to be used to create a PLS model. The log of the western monarch abundance is our dependent variable and the remaining variables in our dataset (except year) are the eight predictors.  

```{r}
#Create a PLS model using the leave-one-out cross-validation method.
MonarchModel <- plsr(Log_N ~ Log_PrevN + Coast_Dev + Br_PDSI + Br_Temp + Coast_P + Coast_T + Gly + NN, data = MonarchData, scale = TRUE, validation = "LOO")

#View the summary statistics for the PLS model.
summary(MonarchModel)
```

After evaluating the RMSE and percent variance explained, how many components do you think should be included in the final PLS model?

Now, let's see the number of components suggested by the selectNcomp function. Try running the selectNcomp function yourself using the randomization method. Also, try to create a cross-validation plot using that function.

```{r}
#Find the optimal number of components.

```

```{r}
#We can now create the final PLS model using two components. 
FinalModel <- plsr(Log_N ~ Log_PrevN + Coast_Dev + Br_PDSI + Br_Temp + Coast_P + Coast_T + Gly + NN, data = MonarchData, scale = TRUE, validation = "LOO", ncomp = 2)

#View the summary statistics for the new model. 
summary(FinalModel)
```

How much variation in the monarch abundance is explained by the first component? How much variation in the monarch abundance is explained by the second component?

We will now plot and interpret a scoreplot. 

```{r}
#Create a scoreplot. 
scoreplot(FinalModel, xlab="Comp 1 (64.21%)", ylab="Comp 2 (18.52%)", cex = (MonarchData$Log_N)/3) #cex is used to set the size of the points proportional to the log of the monarch abundance for each year
```

In a scoreplot, points are clustered based on their similarity. Points close to the average value will appear near the origin of the score plot. Whereas scores further out are more extreme values. 

For this example, each point represents a year ranging from 1982 to 2016. Years that have a monarch abundance close to the average value for all the years will be closer to the origin. Years with larger monarch abundances are clustered towards the top right. Years with smaller monarch abundances are clustered around the left side of the plot. Scoreplots are usually used to observe the structure of our data. We can locate similar observations and any outliers using the scoreplot. 

We will now plot the loadings for the PLS model. The loadings represent how much a variable contributes to a component. 

```{r}
#Plot loadings.
loadings(FinalModel)
```

Briefly observe the magnitude and direction of the loadings. 

Finally, we will create a biplot representing both the scores and loadings. 

```{r}
biplot(FinalModel, cex = 0.6, main = "", var.axes = TRUE)
```

A biplot combines a scoreplot (represented by the numbers) and a loading plot (represented by the arrows). For the scores, the number 1 starts represents the first year in our dataset (1982) and number 35 represents the last year in our dataset (2016).

The length of the arrows represent the magnitude of the loadings. Since the loadings represents how much the variable contributes to a component, a longer arrow has a stronger influence on a component than a shorter arrow. 

The direction/orientation of the arrow *along* a component axis represents which component the variable influences. For example, let's look at Glyphosate and Coastal Development. For those two variables, we can see that the arrows are nearly parallel to component 1 and perpendicular to component 2. This indicates that Glyphosate and Coastal Development have a stronger correlation with component 1 than component 2. 

The direction of the arrow also indicates whether the relationship with the component is positive or negative. We can see that the Glyphosate and Coastal Development loadings point towards the left where there are negative values for component 1. This suggests that Glyphosate and Coastal Development have a *negative* correlation with component 1. 

Remember that component 1 explains 64.21% of the variation in the log of the monarch abundance. Component 2 explains 18.52% of the variation in the log of the monarch abundance. Since component 1 explains the majority of the variation in the monarch abundance, this suggests that variables that have a strong correlation with component 1 will likely be a strong driver for monarch abundance. Based on the biplot, do land use variables or climate variables appear to have a stronger effect on monarch abundance (i.e. which variables have a strong correlation with component 1)?

We will now create a Variable Importance Plot (VIP) for the PLS model. A VIP shows the relative contribution of each predictor to the first two PLS components. A high VIP score suggests that a predictor could potentially be a driver for the changes in western monarch abundance. 

```{r}
#Create a Variable Importance Plot. 
vip(FinalModel)
```

While the numbers for our VIP are different from the focal paper, let's take a look at how the overall trends for the plot compare to the focal paper. We can see that the log of the monarch abundance in the previous year has the strongest influence on the components in our model. Looking at the other variables in the VIP, do land use or climate factors contribute more to our model?