--- title: "Working with plant phenology data and fitting a nonlinear model using least squares in R" author: "Madeleine Bonsma-Fisher and Ahmed Hasan" output: html_document: df_print: paged bibliography: NEON_phenology.bib urlcolor: blue --- ## Introduction *Phenology* is the study of periodic or cyclic natural phenomena, and this dataset contains observations of the seasonal cycles of plants at three NEON sites in the US: Blandy Experimental Farm ([BLAN](https://www.neonscience.org/field-sites/field-sites-map/BLAN)) and the Smithsonian Conservation Biology Institute ([SCBI](https://www.neonscience.org/field-sites/field-sites-map/scbi)) in Virginia, and the Smithsonian Environmental Research Center ([SERC](https://www.neonscience.org/field-sites/field-sites-map/serc)) in Maryland. In this tutorial, we'll plot some phenology data and fit an oscillatory model to determine when different species get and lose their leaves. ## Load and organize data ### Clean up and merge the two data frames The first part of this tutorial is based on a data cleanup from https://www.neonscience.org/neon-plant-pheno-data-r. There are two data files we'll be combining: the first (`phe_perindividual.csv`) contains data for each individual organism observed such as its location and scientific name, and the second (`phe_statusintensity.csv`) contains observations of several phenophases and their intensity. ```{r, warning=FALSE, message=FALSE} library(tidyverse) # load the two data files ind <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_perindividual.csv') status <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_statusintensity.csv') names(ind) names(status) ``` We'll remove the UID field which uniquely identifies each row -- we don't need to uniquely identify entries. ```{r} ind <- select(ind, -uid) status <- select(status, -uid) ``` Next, we remove duplicates that result from stacking tables over many months -- we don't need duplicate observations from the same date, tree, and site. ```{r} ind_noD <- distinct(ind) nrow(ind) - nrow(ind_noD) # how many rows were removed ``` ```{r} status_noD <- distinct(status) nrow(status) - nrow(status_noD) # how many rows were removed ``` Some of the columns in `status_noD` that have the same name as those in `ind_noD` should be renamed so that we can join the two tables and keep those fields separate. When using `rename()`, remember that each new column name is written first, followed by the previous name (i.e. `new_name = old_name`). ```{r} # where is there an intersection of names? sameName <- intersect(names(status_noD), names(ind_noD)) sameName # rename status editedDate status_noD <- rename(status_noD, editedDateStat = editedDate, measuredByStat = measuredBy, recordedByStat = recordedBy, samplingProtocolVersionStat = samplingProtocolVersion, remarksStat = remarks, dataQFStat = dataQF) ``` Next, we'll convert the date column to R's `Date` object type. This makes R correctly interpret dates when performing operations down the line. 
```{r}
# convert columns to Date class
ind_noD$editedDate <- as.Date(ind_noD$editedDate)
str(ind_noD$editedDate)

status_noD$date <- as.Date(status_noD$date)
str(status_noD$date)
```

```{r}
# retain only the most recent editedDate for each individualID
ind_last <- ind_noD %>%
  group_by(individualID) %>%
  filter(editedDate == max(editedDate))

# if an individual has several rows with the same editedDate, retain only one
ind_lastnoD <- ind_last %>%
  group_by(editedDate, individualID) %>%
  filter(row_number() == 1)
```

Remove two columns from `status_noD` that are also present in the individual data frame -- otherwise they would be duplicated when we join the tables.

```{r}
status_noD <- select(status_noD, -taxonID, -scientificName)
```

Join the dataframes:

```{r}
# create a new dataframe "phe_ind" with all the data from status and some from ind_lastnoD
phe_ind <- left_join(status_noD, ind_lastnoD)

glimpse(phe_ind)
```

### Dealing with NAs

Some columns look like they might only have NAs. We can check the proportion of NAs by first summing the number of NA values per column with `map_df()` from the `purrr` package, followed by standard `dplyr`/`tidyr` operations to reshape the data frame. `map_df` works by performing an operation on each of the columns in its input data frame and returning a data frame as output. The `~` character is shorthand for 'function', and tells R that the `.` within `is.na()` refers to individual columns and not the entire input data frame (as would otherwise be the case with the `%>%` operator).

```{r}
# check which columns might only contain NAs
phe_ind %>%
  map_df(~ sum(is.na(.))) %>% # counts NA values in each column
  gather(column, values) %>% # reshapes data frame
  mutate(proportion_na = values / nrow(phe_ind)) %>%
  arrange(desc(proportion_na)) # arrange in descending order of proportion_na
```

Now that we've identified which columns are only NA, let's remove them from our dataset. `dplyr` offers a helpful function called `select_if()` that allows us to only keep columns satisfying a certain condition. `select_if` is an example of a _scoped_ function; scoped functions apply operations on all variables within a specified subset. Where `select` requires us to specify column names to keep, `select_if` allows us to instead retain or remove columns based on a condition -- which in our case pertains to whether the columns contain only NAs.

```{r}
# first, define a function that checks whether a column is not entirely NA
not_all_na <- function(x) {
  return(!all(is.na(x))) # if all of the vector is just NA, return FALSE
}

# select_if can now use the above function
phe_ind <- phe_ind %>%
  select_if(not_all_na)

glimpse(phe_ind)
```

### Subsetting the dataset

Let's look at just the "Leaves" phenophase for "Deciduous broadleaf" plants.

```{r}
plant_pheno <- phe_ind %>%
  filter(phenophaseName == "Leaves" & growthForm == "Deciduous broadleaf")
```

Many of the columns aren't that relevant for us, but some that we're definitely interested in are `date`, `phenophaseIntensity`, and `scientificName`. Let's take a look at what kind of factors we have in the last two columns.

```{r}
plant_pheno %>%
  count(scientificName)

plant_pheno %>%
  count(phenophaseIntensity)
```

There are 7 species of tree / shrub in this subsetted dataset. Feel free to look up what the common names are for each of these; for example, *Liriodendron tulipifera*, the tulip tree, can be found along the US east coast as well as in Southern Ontario.
![Liriodendron tulipifera, by Jean-Pol GRANDMONT - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=9873223](image/Liriodendron_tulipifera.png)

We're going to look at how the phenophase intensity for "Leaves" changes over time. Let's look at the structure of the `phenophaseIntensity` column:

```{r}
str(plant_pheno$phenophaseIntensity)
```

We want to be able to use these as numbers, and since they're binned in windows of percentage, we can manually convert each bin label to the numeric midpoint of the bin. This is approximate, and there are multiple valid choices you could make for how to assign a number to each bin.

To do this, we'll make use of the `case_when()` function, which allows us to `mutate()` on multiple conditions. `case_when` is very useful if there are multiple conditions that our `mutate()` operation needs to account for, as opposed to `ifelse()`, which can only handle one condition at a time. To use `case_when`, we first write out the condition followed by a `~` and the corresponding operation for that condition.

```{r}
# convert phenophase intensities to numbers for plotting:
# create a new column called phenophaseIntensityMidpoint
# and fill it with the numeric bin midpoint
plant_pheno <- plant_pheno %>%
  mutate(
    phenophaseIntensityMidpoint = case_when(
      phenophaseIntensity == "< 5%" ~ 0.05 / 2,
      phenophaseIntensity == "5-24%" ~ (0.05 + 0.24) / 2,
      phenophaseIntensity == "25-49%" ~ (0.25 + 0.49) / 2,
      phenophaseIntensity == "50-74%" ~ (0.50 + 0.74) / 2,
      phenophaseIntensity == "75-94%" ~ (0.75 + 0.94) / 2,
      phenophaseIntensity == ">= 95%" ~ (0.95 + 1) / 2
    )
  )

# convert to numeric type explicitly
plant_pheno$phenophaseIntensityMidpoint <- as.numeric(plant_pheno$phenophaseIntensityMidpoint)

# check the structure of the new column
str(plant_pheno$phenophaseIntensityMidpoint)
```

The last processing step is to make sure that the `date` column is a `Date` object.

```{r}
# convert the date column to a Date
plant_pheno$date <- as.Date(plant_pheno$date)
```

## Plotting and fitting the data

Let's plot the phenophase intensity over time, grouped by species.

```{r, warning=FALSE}
ggplot(plant_pheno, aes(x = date, y = phenophaseIntensityMidpoint,
                        color = scientificName)) +
  geom_point()
```

The pattern we would expect is already visible in this first plot: the leaves come out in the spring, then disappear again in October. But there might be differences between species and between individuals of a species. One way we could try to assess this is to fit the same model to subgroups of the data and then compare the fitted parameters to see if there are differences.

### Fit an oscillatory model to the data

Let's try to fit a sine wave to the phenophase intensity. A generic sine wave has four parameters:

$$y = A \sin(kx - b) + c$$

where $k = 2\pi / T$, $T$ is the period or length of time between peaks, $A$ is the amplitude, $b$ is the phase, and $c$ is the vertical shift or offset. Let's plot a sine wave:

```{r}
# plot a sine wave
x <- seq(0, 3, 0.01)
A <- 1
k <- 2*pi
b <- 0.5
c <- 0

qplot(x, A*sin(k*x - b) + c) + geom_line()
```

In order to use the date as a numeric x-axis for fitting purposes, we need to convert it from a `Date` object to a numeric object.
```{r}
# check the structure of the date column
str(plant_pheno$date)
```

```{r, message=FALSE}
# create a data frame of all the dates in ascending order
dates <- plant_pheno %>%
  arrange(date) %>%
  select(date)

# create a function that subtracts two dates and returns the difference in days
subtract_dates <- function(date1, date2) {
  return(as.numeric(date1 - date2))
}

# create a numeric dates vector to use in fitting
dates_numeric <- mapply(subtract_dates,
                        dates$date, # first argument in subtract_dates function
                        dates$date[1]) # second argument - the first date in the dataset

# add numeric dates to the dataframe
plant_pheno$date_numeric <- mapply(subtract_dates,
                                   plant_pheno$date,
                                   dates$date[1]) # subtract the first date
```

```{r}
# drop rows that have NA in the phenophaseIntensityMidpoint column
plant_pheno <- plant_pheno[!is.na(plant_pheno$phenophaseIntensityMidpoint), ]
```

Let's fit a particular individual's phenophase intensity.

```{r}
# find an individual
plant_pheno %>%
  filter(scientificName == "Liriodendron tulipifera L.") %>%
  select(scientificName, individualID) %>%
  head(1)

# make a data frame with one individual
tulip_tree <- plant_pheno %>%
  filter(individualID == "NEON.PLA.D02.SCBI.06071")
```

Now we'll define a sine model function and plot a first guess for the parameters to get a rough idea of how well it matches the data.

```{r}
# period = 365 days
# wavenumber of sine function = 2 * pi / period
sine_model <- function(x, amplitude, period, phase, offset) {
  return(amplitude*sin(2*pi/period*x + phase) + offset)
}

A <- 0.5 # phenophase intensity values go between 0 and 1
offset <- 0.5 # add 0.5 to make min 0 and max 1
period <- 365 # number of days in a year
phase <- 0.5 # this is a guess

guess_curve <- sine_model(dates_numeric, A, period, phase, offset)
```

```{r}
qplot(x = dates$date, y = guess_curve) + # guess data
  geom_point(data = tulip_tree, # actual data
             aes(x = date, y = phenophaseIntensityMidpoint,
                 colour = scientificName))
```

We will only fit $b$, the horizontal shift, since we already know the other parameters: the minimum and maximum must be 0 and 1 since the intensity goes from 0 to 100%, and this sets both $c$ and $A$. We also know $k$, since the oscillation should go around once in a year (365 days). If we wanted to be fancier, we might take into account things like temperature and leap years; there is a `pheno` package in R specifically for plotting and analyzing phenological data.

```{r}
# make a range of b parameter values to try
b_vals <- seq(0.2, 0.8, by = 0.01)

# use sapply to loop over the b_vals vector
resids_sq <- sapply(b_vals, function(b) {
  prediction <- 0.5*sin(2*pi/365*tulip_tree$date_numeric + b) + 0.5
  residuals <- prediction - tulip_tree$phenophaseIntensityMidpoint
  sum(residuals^2)
})
```

Plotting the sum of squared residuals vs. the fit parameter will show us whether it worked: the best fit is at the minimum of the curve.

```{r}
qplot(b_vals, resids_sq)
```

We can see visually that the minimum is around $b = 0.5$, but to extract that number from the vector we can use the function `which`:

```{r}
best_fit <- which(resids_sq == min(resids_sq))
b_fit <- b_vals[best_fit]
b_fit
```

Finally, let's plot the fit against the original data:

```{r}
ggplot(data = tulip_tree, aes(x = date, y = phenophaseIntensityMidpoint)) +
  geom_point() +
  geom_point(aes(x = date, y = 0.5*sin(2*pi/365*date_numeric + b_fit) + 0.5),
             colour = 'red')
```

Not bad! Next, we'll do this for every single individual in the entire dataset.
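Before scaling up, a quick aside: the grid search over `b_vals` works fine here, but base R also provides a one-dimensional optimizer, `optimize()`, that can find the least-squares minimum directly. Here is a minimal sketch using the same model and the `tulip_tree` data frame from above:

```{r}
# sum of squared residuals as a function of the phase b (same model as the grid search)
sum_sq_resid <- function(b) {
  prediction <- 0.5*sin(2*pi/365*tulip_tree$date_numeric + b) + 0.5
  sum((prediction - tulip_tree$phenophaseIntensityMidpoint)^2)
}

# search the same range of b values as the grid search
optimize(sum_sq_resid, interval = c(0.2, 0.8))
```

The `minimum` element of the result should land close to the `b_fit` value from the grid search; the grid-search version is still useful, though, because it lets us look at the whole residual curve rather than just its minimum.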
With fits for each individual, we can look at the distribution of the phase shifts for each species to get a sense of when each species gets its leaves.

### Fit each individual separately

```{r}
# make a range of b parameter values to try
b_vals <- seq(0, 1.1, by = 0.015)

# create a function that does least squares for this model
least_squares <- function(df, b_vals) {
  resids_sq <- sapply(b_vals, function(b) {
    prediction <- 0.5*sin(2*pi/365*df$date_numeric + b) + 0.5
    residuals <- prediction - df$phenophaseIntensityMidpoint
    sum(residuals^2)
  })
  return(data.frame(b_vals, resids_sq))
}

# create a data frame that contains the residuals, grouped by individual and species
resids_sq_individuals <- plant_pheno %>%
  group_by(individualID, scientificName) %>%
  do(data.frame(val = least_squares(., b_vals)))
```

```{r}
ggplot(resids_sq_individuals,
       aes(x = val.b_vals, y = val.resids_sq, colour = scientificName)) +
  geom_point()
```

Get the best fit $b$ value for each individual:

```{r}
# store the best fit values in a new data frame
b_df_all <- resids_sq_individuals %>%
  group_by(individualID, scientificName) %>%
  summarize(phase = b_vals[which(val.resids_sq == min(val.resids_sq))])

head(b_df_all)
```

We now have best fit values for the phase for every individual plant in our dataset. To visualize this, we will make a violin plot of the phases, grouped by species.

```{r}
# violin plot
ggplot(b_df_all, aes(x = scientificName, y = phase, colour = scientificName)) +
  geom_violin(scale = 'count', draw_quantiles = 0.5) +
  geom_jitter(height = 0, width = 0.1) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
```

The phase $b$ is a positive number for each of these, and in our model we fit the curve like this:

$$y = A\sin(kx + b) + c$$

A positive $b$ shifts the curve to the left, so our model suggests that species with larger $b$ either get their leaves earlier, lose their leaves earlier, or peak earlier, or some combination of all three. I'm not a phenologist, but we can do some reading to see if the general pattern we observe is really there: *Lonicera maackii*, Amur honeysuckle (an invasive species), does in fact leaf out very early in the spring and keeps its leaves longer than other plants, according to [this paper](http://www.bioone.org/doi/abs/10.3159/08-RA-109.1) [@Mcewan2009].

This points to a limitation of our approach: because we fit a sine curve with the same fixed period to every individual, we get no information on how long the leaf season is for each plant, so we can't distinguish a plant that leafs out earlier and loses its leaves earlier from one that peaks earlier and keeps its leaves longer. On its own, $b$ is hard to interpret in terms of the phenophase, although we can at least convert it to an approximate date of peak leaf cover, as sketched at the end of this section.

Let's plot the species with the smallest and largest $b$ together:

```{r}
plant_pheno %>%
  filter(taxonID == "LOMA6" | taxonID == "JUNI") %>% # easier than typing the full scientific names
  ggplot(aes(x = date, y = phenophaseIntensityMidpoint, colour = scientificName)) +
  geom_point()
```

By examining these two species side by side, we can see that *Lonicera maackii* peaks at about the same time as *Juglans nigra*, but it gets leaves earlier and keeps its leaves later. To actually distinguish each of these events, we would need a more complicated model with more fit parameters.

When phenologists model phenophase changes, they often use *growing degree days* (GDD) instead of the true date, since plants respond to environmental signals like temperature and not to the calendar date.
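Before switching to growing degree days, here is one rough way to make the fitted phase more tangible. In our model, the sine curve peaks when $2\pi x/365 + b = \pi/2$, so the first peak falls $(\pi/2 - b) \cdot 365 / (2\pi)$ days after the first observation date (and repeats every 365 days after that). Below is a minimal sketch using `b_df_all` and the `dates` data frame from earlier; the column names `days_to_peak` and `peak_date` are just illustrative.

```{r}
# convert each fitted phase into an approximate date of peak leaf cover;
# the model peaks where 2*pi*x/365 + b = pi/2, and peaks repeat every 365 days.
# dates$date[1] is day zero of date_numeric.
b_df_all %>%
  mutate(days_to_peak = (pi/2 - phase) * 365 / (2*pi),
         peak_date = dates$date[1] + round(days_to_peak)) %>%
  head()
```

Larger $b$ values map to earlier peak dates, consistent with the interpretation above -- but keep in mind that this pins down only the timing of the peak within the year and says nothing about how long each species keeps its leaves.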
In the extension below, we will calculate cumulative growing degree days and plot the phenophase intensity against them, as we plotted it against date above.

## Extension - Growing Degree Days

There are several methods to calculate degree days [@Battel2017; @Nugent2004; @Miller2018]. Here we will use a simple method: average the daily maximum and minimum temperatures and subtract a base temperature below which no growth is expected:

$$\text{GDD} = \frac{\text{max temp} + \text{min temp}}{2} - \text{base temp}$$

Negative GDD values are set to zero. For example, with a base temperature of 10 degrees Celsius, a day with a maximum of 24 degrees and a minimum of 12 degrees contributes $(24 + 12)/2 - 10 = 8$ growing degree days.

To choose a base temperature, we ideally need to know how the species we're investigating grows. For illustration purposes here, we'll choose a base temperature of 10 degrees Celsius (50 degrees Fahrenheit).

To combine this information with the phenology information above, we will use air temperature measurements collected every 30 minutes at two of the NEON sites we're looking at.

```{r, message = FALSE}
# 30-minute averages of 1 Hz observations of air temperature in degrees Celsius
temp <- read_csv('NEON-pheno-temp-timeseries/temp/SAAT_30min.csv')

# what are the column names?
names(temp)
```

First, let's convert the date and time column to a date column named `date`, and the temperature measurements to a numeric column.

```{r}
temp <- temp %>%
  mutate(date = as.Date(startDateTime),
         tempSingleMean = as.numeric(tempSingleMean))
```

We need the max and min temperatures to calculate growing degree days using the formula above. Here, we'll create a new data frame called `temp_max_min` that has the date, site ID, and max and min temperatures for each day.

```{r}
# create temp_max_min data frame
temp_max_min <- temp %>%
  filter(!is.na(tempSingleMean)) %>% # remove NA values from temperature measurements
  group_by(date, siteID) %>%
  # create max and min columns
  summarize(max = max(tempSingleMean),
            min = min(tempSingleMean))

head(temp_max_min, 3)
```

Next, we'll actually calculate growing degree days. We define a baseline temperature of 10 degrees Celsius, then average the `max` and `min` columns and subtract the baseline temperature. Finally, we use `mutate` and `case_when` to convert negative GDD values to zero.

```{r}
# define baseline temp
baseline_temp <- 10 # degrees Celsius

# calculate GDD and assign to new column
temp_max_min$GDD <- (temp_max_min$max + temp_max_min$min) / 2 - baseline_temp

# cut off GDD at 0: negative values are set to 0
temp_max_min <- temp_max_min %>%
  mutate(
    GDD = case_when(
      GDD < 0 ~ 0,
      GDD >= 0 ~ GDD))
```

We want to be able to group these results by year, since GDD resets at the start of each growing season. We create a new column `year` by formatting the `date` column to select just the year.

```{r}
temp_max_min$year <- format(temp_max_min$date, "%Y") # make a new column with the year
```

Now we can calculate the cumulative growing degree days -- the cumulative effect of temperature that the plant will feel. We group by year and site, since temperature measurements are specific to each site and cumulative GDD should reset at the start of each year.

```{r}
# calculate cumulative GDD for each site and year separately
temp_max_min <- temp_max_min %>%
  group_by(siteID, year) %>%
  arrange(date) %>%
  mutate(GDD_cumulative = cumsum(GDD))

head(temp_max_min, 3)
```

To plot phenophase vs. GDD, we will join this temperature data with the `plant_pheno` data we had from before.
First, we find which columns overlap to confirm that we want to join on those, and then we filter for only the year 2016, since the 2015 temperature data didn't start until April.

```{r}
# where is there an intersection of names?
sameName <- intersect(names(temp_max_min), names(plant_pheno))
sameName

# take just the year 2016 - it's a complete year
temp_max_min_2016 <- filter(temp_max_min, year == 2016)

# create a new data frame with the results of the join
plant_pheno_temp <- left_join(temp_max_min_2016, plant_pheno)

head(plant_pheno_temp, 3)
```

Let's plot phenophase intensity, coloured by species, vs. cumulative GDD. We can see a nice pattern: all the plants accumulate leaf cover as GDD accumulates.

```{r, warning = FALSE}
ggplot(plant_pheno_temp, aes(x = GDD_cumulative, y = phenophaseIntensityMidpoint,
                             color = scientificName)) +
  geom_point()
```

Realistically we should only be looking at leaves coming out, not leaf loss in the fall -- there's probably a better model for leaf loss than GDD. We can restrict the data to `GDD_cumulative` $\le 1000$, since by that point all plants are at 95-100% cover.

```{r, warning = FALSE}
plant_pheno_temp %>%
  filter(GDD_cumulative <= 1000) %>%
  ggplot(aes(x = GDD_cumulative, y = phenophaseIntensityMidpoint,
             color = scientificName)) +
  geom_point(alpha = 0.5)
```

Here we get a hint of the same information we saw earlier, namely that some species get their leaves earlier than others. We could fit this data to get a quantitative estimate of how leaf cover depends on GDD; one possible approach is sketched below.
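For instance, a logistic (S-shaped) curve is a reasonable first model for how leaf cover increases with cumulative GDD. The minimal sketch below fits a single curve to all species pooled together using `nls()`, R's built-in nonlinear least squares routine. The parameter names `midpoint` (the cumulative GDD at which cover reaches 50%) and `scale`, along with the starting values, are illustrative guesses rather than part of the analysis above, and the starting values may need adjusting for the fit to converge.

```{r, warning = FALSE}
# fit a logistic curve of leaf cover vs. cumulative GDD, pooling all species:
# cover = 1 / (1 + exp(-(GDD_cumulative - midpoint) / scale))
leaf_out <- plant_pheno_temp %>%
  filter(GDD_cumulative <= 1000,
         !is.na(phenophaseIntensityMidpoint))

logistic_fit <- nls(phenophaseIntensityMidpoint ~
                      1 / (1 + exp(-(GDD_cumulative - midpoint) / scale)),
                    data = leaf_out,
                    start = list(midpoint = 300, scale = 100))

summary(logistic_fit)
```

Repeating this fit within `group_by()` groups, as we did for the sine model, would give a per-species estimate of the cumulative GDD at which each species leafs out -- a nice exercise to try on your own.

# References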