--- title: "Working with plant phenology data and fitting a nonlinear model using least squares in R" author: "Madeleine Bonsma-Fisher and Ahmed Hasan" output: html_document: df_print: paged bibliography: NEON_phenology.bib urlcolor: blue --- ## Introduction *Phenology* is the study of periodic or cyclic natural phenomena, and this dataset contains observations of the seasonal cycles of plants at three NEON sites in the US: Blandy Experimental Farm ([BLAN](https://www.neonscience.org/field-sites/field-sites-map/BLAN)) and the Smithsonian Conservation Biology Institute ([SCBI](https://www.neonscience.org/field-sites/field-sites-map/scbi)) in Virginia, and the Smithsonian Environmental Research Center ([SERC](https://www.neonscience.org/field-sites/field-sites-map/serc)) in Maryland. In this tutorial, we'll plot some phenology data and fit an oscillatory model to determine when different species get and lose their leaves. ## Load and organize data ### Clean up and merge the two data frames The first part of this tutorial is based on a data cleanup from https://www.neonscience.org/neon-plant-pheno-data-r. There are two data files we'll be combining: the first (`phe_perindividual.csv`) contains data for each individual organism observed such as its location and scientific name, and the second (`phe_statusintensity.csv`) contains observations of several phenophases and their intensity. ```{r, warning=FALSE, message=FALSE} library(tidyverse) # load the two data files ind <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_perindividual.csv') status <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_statusintensity.csv') names(ind) names(status) ``` We'll remove the UID field which uniquely identifies each row -- we don't need to uniquely identify entries. ```{r} ind <- select(ind, -uid) status <- select(status, -uid) ``` Next, we remove duplicates that result from stacking tables over many months -- we don't need duplicate observations from the same date, tree, and site. ```{r} ind_noD <- distinct(ind) nrow(ind) - nrow(ind_noD) # how many rows were removed ``` ```{r} status_noD <- distinct(status) nrow(status) - nrow(status_noD) # how many rows were removed ``` Some of the columns in `status_noD` that have the same name as those in `ind_noD` should be renamed so that we can join the two tables and keep those fields separate. When using `rename()`, remember that each new column name is written first, followed by the previous name (i.e. `new_name = old_name`). ```{r} # where is there an intersection of names? sameName <- intersect(names(status_noD), names(ind_noD)) sameName # rename status editedDate status_noD <- rename(status_noD, editedDateStat = editedDate, measuredByStat = measuredBy, recordedByStat = recordedBy, samplingProtocolVersionStat = samplingProtocolVersion, remarksStat = remarks, dataQFStat = dataQF) ``` Next, we'll convert the date column to R's `Date` object type. This makes R correctly interpret dates when performing operations down the line. 
```{r}
# convert columns to Date class
ind_noD$editedDate <- as.Date(ind_noD$editedDate)
str(ind_noD$editedDate)

status_noD$date <- as.Date(status_noD$date)
str(status_noD$date)
```

```{r}
# retain only the most recent editedDate for each individualID
ind_last <- ind_noD %>%
  group_by(individualID) %>%
  filter(editedDate == max(editedDate))

# if an individual has several rows with the same editedDate, retain only one
ind_lastnoD <- ind_last %>%
  group_by(editedDate, individualID) %>%
  filter(row_number() == 1)
```

Remove two columns from `status_noD` that are also present in the individual data frame -- otherwise they would be duplicated when we join the tables.

```{r}
status_noD <- select(status_noD, -taxonID, -scientificName)
```

Join the dataframes:

```{r}
# create a new dataframe "phe_ind" with all the data from status and some from ind_lastnoD
phe_ind <- left_join(status_noD, ind_lastnoD)

glimpse(phe_ind)
```

### Dealing with NAs

Some columns look like they might only have NAs. We can check the proportion of NAs by first summing the number of NA values per column with `map_df()` from the `purrr` package, followed by standard `dplyr`/`tidyr` operations to reshape the data frame. `map_df` works by performing an operation on each of the columns in its input data frame and returning a data frame as output. The `~` character is shorthand for 'function', and tells R that the `.` within `is.na()` refers to individual columns and not the entire input data frame (as would otherwise be the case with the `%>%` operator).

```{r}
# check which columns might only contain NAs
phe_ind %>%
  map_df(~ sum(is.na(.))) %>% # counts NA values in each column
  gather(column, values) %>% # reshapes data frame
  mutate(proportion_na = values / nrow(phe_ind)) %>%
  arrange(desc(proportion_na)) # arrange in descending order of proportion_na
```

Now that we've identified which columns are only NA, let's remove them from our dataset. `dplyr` offers a helpful function called `select_if()` that allows us to only keep columns satisfying a certain condition. `select_if` is an example of a _scoped_ function; scoped functions apply operations on all variables within a specified subset. Where `select` requires us to specify column names to keep, `select_if` allows us to instead retain or remove columns based on a condition -- which in our case pertains to whether the columns contain only NAs.

```{r}
# first, define a function that checks whether a column is not entirely NA
not_all_na <- function(x) {
  return(!all(is.na(x))) # if all of the vector is just NA, return FALSE
}

# select_if can now use the above function
phe_ind <- phe_ind %>%
  select_if(not_all_na)

glimpse(phe_ind)
```

### Subsetting the dataset

Let's look at just the "Leaves" phenophase for "Deciduous broadleaf" plants.

```{r}
plant_pheno <- phe_ind %>%
  filter(phenophaseName == "Leaves" & growthForm == "Deciduous broadleaf")
```

Many of the columns aren't that relevant for us, but some that we're definitely interested in are `date`, `phenophaseIntensity`, and `scientificName`. Let's take a look at what kind of factors we have in the last two columns.

```{r}
plant_pheno %>%
  count(scientificName)

plant_pheno %>%
  count(phenophaseIntensity)
```

There are 7 species of tree / shrub in this subsetted dataset. Feel free to look up what the common names are for each of these; for example, *Liriodendron tulipifera*, the tulip tree, can be found along the US east coast as well as in Southern Ontario.
![Liriodendron tulipifera, by Jean-Pol GRANDMONT - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=9873223](image/Liriodendron_tulipifera.png)

We're going to look at how the phenophase intensity for "Leaves" changes over time. Let's look at the structure of the `phenophaseIntensity` column:

```{r}
str(plant_pheno$phenophaseIntensity)
```

We want to be able to use these as numbers, and since they're binned in windows of percentage, we can manually convert each bin label to the numeric midpoint of the bin. This is approximate, and there are multiple valid choices you could make for how to assign a number to each bin.

To do this, we'll make use of the `case_when()` function, which allows us to `mutate()` on multiple conditions. `case_when` is very useful if there are multiple conditions that our `mutate()` operation needs to account for, as opposed to `ifelse()`, which can only handle one condition at a time. To use `case_when`, we first write out the condition followed by a `~` and the corresponding operation for that condition.

```{r}
# convert phenophase intensities to numbers for plotting:
# create a new column called phenophaseIntensityMidpoint
# and fill it with the numeric bin midpoint
plant_pheno <- plant_pheno %>%
  mutate(
    phenophaseIntensityMidpoint = case_when(
      phenophaseIntensity == "< 5%" ~ 0.05 / 2,
      phenophaseIntensity == "5-24%" ~ (0.05 + 0.24) / 2,
      phenophaseIntensity == "25-49%" ~ (0.25 + 0.49) / 2,
      phenophaseIntensity == "50-74%" ~ (0.50 + 0.74) / 2,
      phenophaseIntensity == "75-94%" ~ (0.75 + 0.94) / 2,
      phenophaseIntensity == ">= 95%" ~ (0.95 + 1) / 2
    )
  )

# convert to numeric type explicitly
plant_pheno$phenophaseIntensityMidpoint <- as.numeric(plant_pheno$phenophaseIntensityMidpoint)

# check the structure of the new column
str(plant_pheno$phenophaseIntensityMidpoint)
```

The last processing step is to make sure that the `date` column is a `Date` object.

```{r}
# convert the date column to a Date
plant_pheno$date <- as.Date(plant_pheno$date)
```

## Plotting and fitting the data

Let's plot the phenophase intensity over time, grouped by species.

```{r, warning=FALSE}
ggplot(plant_pheno, aes(x = date, y = phenophaseIntensityMidpoint,
                        color = scientificName)) +
  geom_point()
```

The pattern we would expect is already visible in this first plot: the leaves come out in the spring, then disappear again in October. But there might be differences between species and between individuals of a species. One way we could try to assess this is to fit the same model to subgroups of the data and then compare the fitted parameters to see if there are differences.

### Fit an oscillatory model to the data

Let's try to fit a sine wave to the phenophase intensity. A generic sine wave has four parameters:

$$y = A \sin(kx - b) + c$$

where $k = 2\pi / T$, $T$ is the period or length of time between peaks, $A$ is the amplitude, $b$ is the phase, and $c$ is the vertical shift or offset. Let's plot a sine wave:

```{r}
# plot a sine wave
x <- seq(0, 3, 0.01)
A <- 1
k <- 2*pi
b <- 0.5
c <- 0

qplot(x, A*sin(k*x - b) + c) + geom_line()
```

In order to use the date as a numeric x-axis for fitting purposes, we need to convert it from a `Date` object to a numeric object.
```{r}
# check the structure of the date column
str(plant_pheno$date)
```

```{r, message=FALSE}
# create a data frame of all the dates in ascending order
dates <- plant_pheno %>%
  arrange(date) %>%
  select(date)

# create a function that subtracts two dates and returns the difference in days
subtract_dates <- function(date1, date2) {
  return(as.numeric(date1 - date2))
}

# create a numeric dates vector to use in fitting
dates_numeric <- mapply(subtract_dates,
                        dates$date, # first argument in subtract_dates function
                        dates$date[1]) # second argument - the first date in the dataset

# add numeric dates to the dataframe
plant_pheno$date_numeric <- mapply(subtract_dates,
                                   plant_pheno$date,
                                   dates$date[1]) # subtract the first date
```

```{r}
# drop rows that have NA in the phenophaseIntensityMidpoint column
plant_pheno <- plant_pheno[!is.na(plant_pheno$phenophaseIntensityMidpoint), ]
```

Let's fit a particular individual's phenophase intensity.

```{r}
# find an individual
plant_pheno %>%
  filter(scientificName == "Liriodendron tulipifera L.") %>%
  select(scientificName, individualID) %>%
  head(1)

# make a data frame with one individual
tulip_tree <- plant_pheno %>%
  filter(individualID == "NEON.PLA.D02.SCBI.06071")
```

Now we'll define a sine model function and plot a first guess for the parameters to get a rough idea of how well it matches the data.

```{r}
# period = 365 days
# wavenumber of sine function = 2 * pi / period
sine_model <- function(x, amplitude, period, phase, offset) {
  return(amplitude*sin(2*pi/period*x + phase) + offset)
}

A <- 0.5 # phenophase intensity values go between 0 and 1
offset <- 0.5 # add 0.5 to make min 0 and max 1
period <- 365 # number of days in a year
phase <- 0.5 # this is a guess

guess_curve <- sine_model(dates_numeric, A, period, phase, offset)
```

```{r}
qplot(x = dates$date, y = guess_curve) + # guess data
  geom_point(data = tulip_tree, # actual data
             aes(x = date, y = phenophaseIntensityMidpoint,
                 colour = scientificName))
```

We will only fit $b$, the horizontal shift, since we already know the other parameters: the minimum and maximum must be 0 and 1 since the intensity goes from 0 to 100%, and this sets both $c$ and $A$. We also know $k$, since the oscillation should go around once in a year (365 days). If we wanted to be fancier, we might take into account things like temperature and leap years; there is a `pheno` package in R specifically for plotting and analyzing phenological data.

```{r}
# make a range of b parameter values to try
b_vals <- seq(0.2, 0.8, by = 0.01)

# use sapply to loop over the b_vals vector
resids_sq <- sapply(b_vals, function(b) {
  prediction <- 0.5*sin(2*pi/365*tulip_tree$date_numeric + b) + 0.5
  residuals <- prediction - tulip_tree$phenophaseIntensityMidpoint
  sum(residuals^2)
})
```

Plotting the sum of squared residuals vs. the fit parameter will show us whether it worked: the best fit is at the minimum of the curve.

```{r}
qplot(b_vals, resids_sq)
```

We can see visually that the minimum is around $b = 0.5$, but to extract that number from the vector we can use the function `which`:

```{r}
best_fit <- which(resids_sq == min(resids_sq))
b_fit <- b_vals[best_fit]
b_fit
```

Finally, let's plot the fit against the original data:

```{r}
ggplot(data = tulip_tree, aes(x = date, y = phenophaseIntensityMidpoint)) +
  geom_point() +
  geom_point(aes(x = date, y = 0.5*sin(2*pi/365*date_numeric + b_fit) + 0.5),
             colour = 'red')
```

Not bad! Next, we'll do this for every single individual in the entire dataset.
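Before scaling up, a quick aside: the grid search over `b_vals` works fine here, but base R also provides a one-dimensional optimizer, `optimize()`, that can find the least-squares minimum directly. Here is a minimal sketch using the same model and the `tulip_tree` data frame from above:

```{r}
# sum of squared residuals as a function of the phase b (same model as the grid search)
sum_sq_resid <- function(b) {
  prediction <- 0.5*sin(2*pi/365*tulip_tree$date_numeric + b) + 0.5
  sum((prediction - tulip_tree$phenophaseIntensityMidpoint)^2)
}

# search the same range of b values as the grid search
optimize(sum_sq_resid, interval = c(0.2, 0.8))
```

The `minimum` element of the result should land close to the `b_fit` value from the grid search; the grid-search version is still useful, though, because it lets us look at the whole residual curve rather than just its minimum.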
With fits for each individual, we can look at the distribution of the phase shifts for each species to get a sense of when each species gets its leaves.

### Fit each individual separately

```{r}
# make a range of b parameter values to try
b_vals <- seq(0, 1.1, by = 0.015)

# create a function that does least squares for this model
least_squares <- function(df, b_vals) {
  resids_sq <- sapply(b_vals, function(b) {
    prediction <- 0.5*sin(2*pi/365*df$date_numeric + b) + 0.5
    residuals <- prediction - df$phenophaseIntensityMidpoint
    sum(residuals^2)
  })
  return(data.frame(b_vals, resids_sq))
}

# create a data frame that contains the residuals, grouped by individual and species
resids_sq_individuals <- plant_pheno %>%
  group_by(individualID, scientificName) %>%
  do(data.frame(val = least_squares(., b_vals)))
```

```{r}
ggplot(resids_sq_individuals,
       aes(x = val.b_vals, y = val.resids_sq, colour = scientificName)) +
  geom_point()
```

Get the best fit $b$ value for each individual:

```{r}
# store the best fit values in a new data frame
b_df_all <- resids_sq_individuals %>%
  group_by(individualID, scientificName) %>%
  summarize(phase = b_vals[which(val.resids_sq == min(val.resids_sq))])

head(b_df_all)
```

We now have best fit values for the phase for every individual plant in our dataset. To visualize this, we will make a violin plot of the phases, grouped by species.

```{r}
# violin plot
ggplot(b_df_all, aes(x = scientificName, y = phase, colour = scientificName)) +
  geom_violin(scale = 'count', draw_quantiles = 0.5) +
  geom_jitter(height = 0, width = 0.1) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
```

The phase $b$ is a positive number for each of these, and in our model we fit the curve like this:

$$y = A\sin(kx + b) + c$$

A positive $b$ shifts the curve to the left, so our model suggests that species with larger $b$ either get their leaves earlier, lose their leaves earlier, or peak earlier, or some combination of all three. I'm not a phenologist, but we can do some reading to see if the general pattern we observe is really there: *Lonicera maackii*, Amur honeysuckle (an invasive species), does in fact leaf out very early in the spring and keeps its leaves longer than other plants, according to [this paper](http://www.bioone.org/doi/abs/10.3159/08-RA-109.1) [@Mcewan2009].

This points to a limitation of our approach: because we fit a sine curve with the same fixed period to every individual, we get no information on how long the leaf season is for each plant, so we can't distinguish a plant that leafs out earlier and loses its leaves earlier from one that peaks earlier and keeps its leaves longer. On its own, $b$ is hard to interpret in terms of the phenophase, although we can at least convert it to an approximate date of peak leaf cover, as sketched at the end of this section.

Let's plot the species with the smallest and largest $b$ together:

```{r}
plant_pheno %>%
  filter(taxonID == "LOMA6" | taxonID == "JUNI") %>% # easier than typing the full scientific names
  ggplot(aes(x = date, y = phenophaseIntensityMidpoint, colour = scientificName)) +
  geom_point()
```

By examining these two species side by side, we can see that *Lonicera maackii* peaks at about the same time as *Juglans nigra*, but it gets leaves earlier and keeps its leaves later. To actually distinguish each of these events, we would need a more complicated model with more fit parameters.

When phenologists model phenophase changes, they often use *growing degree days* (GDD) instead of the true date, since plants respond to environmental signals like temperature and not to the calendar date.
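Before switching to growing degree days, here is one rough way to make the fitted phase more tangible. In our model, the sine curve peaks when $2\pi x/365 + b = \pi/2$, so the first peak falls $(\pi/2 - b) \cdot 365 / (2\pi)$ days after the first observation date (and repeats every 365 days after that). Below is a minimal sketch using `b_df_all` and the `dates` data frame from earlier; the column names `days_to_peak` and `peak_date` are just illustrative.

```{r}
# convert each fitted phase into an approximate date of peak leaf cover;
# the model peaks where 2*pi*x/365 + b = pi/2, and peaks repeat every 365 days.
# dates$date[1] is day zero of date_numeric.
b_df_all %>%
  mutate(days_to_peak = (pi/2 - phase) * 365 / (2*pi),
         peak_date = dates$date[1] + round(days_to_peak)) %>%
  head()
```

Larger $b$ values map to earlier peak dates, consistent with the interpretation above -- but keep in mind that this pins down only the timing of the peak within the year and says nothing about how long each species keeps its leaves.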
In the extension below, we will calculate cumulative growing degree days and plot the phenophase intensity against them, as we plotted it against date above.

## Extension - Growing Degree Days

There are several methods to calculate degree days [@Battel2017; @Nugent2004; @Miller2018]. Here we will use a simple method: average the daily maximum and minimum temperatures and subtract a base temperature below which no growth is expected:

$$\text{GDD} = \frac{\text{max temp} + \text{min temp}}{2} - \text{base temp}$$

Negative GDD values are set to zero. For example, with a base temperature of 10 degrees Celsius, a day with a maximum of 24 degrees and a minimum of 12 degrees contributes $(24 + 12)/2 - 10 = 8$ growing degree days.

To choose a base temperature, we ideally need to know how the species we're investigating grows. For illustration purposes here, we'll choose a base temperature of 10 degrees Celsius (50 degrees Fahrenheit).

To combine this information with the phenology information above, we will use air temperature measurements collected every 30 minutes at two of the NEON sites we're looking at.

```{r, message = FALSE}
# 30-minute averages of 1 Hz observations of air temperature in degrees Celsius
temp <- read_csv('NEON-pheno-temp-timeseries/temp/SAAT_30min.csv')

# what are the column names?
names(temp)
```

First, let's convert the date and time column to a date column named `date`, and the temperature measurements to a numeric column.

```{r}
temp <- temp %>%
  mutate(date = as.Date(startDateTime),
         tempSingleMean = as.numeric(tempSingleMean))
```

We need the max and min temperatures to calculate growing degree days using the formula above. Here, we'll create a new data frame called `temp_max_min` that has the date, site ID, and max and min temperatures for each day.

```{r}
# create temp_max_min data frame
temp_max_min <- temp %>%
  filter(!is.na(tempSingleMean)) %>% # remove NA values from temperature measurements
  group_by(date, siteID) %>%
  # create max and min columns
  summarize(max = max(tempSingleMean),
            min = min(tempSingleMean))

head(temp_max_min, 3)
```

Next, we'll actually calculate growing degree days. We define a baseline temperature of 10 degrees Celsius, then average the `max` and `min` columns and subtract the baseline temperature. Finally, we use `mutate` and `case_when` to convert negative GDD values to zero.

```{r}
# define baseline temp
baseline_temp <- 10 # degrees Celsius

# calculate GDD and assign to new column
temp_max_min$GDD <- (temp_max_min$max + temp_max_min$min) / 2 - baseline_temp

# cut off GDD at 0: negative values are set to 0
temp_max_min <- temp_max_min %>%
  mutate(
    GDD = case_when(
      GDD < 0 ~ 0,
      GDD >= 0 ~ GDD))
```

We want to be able to group these results by year, since GDD resets at the start of each growing season. We create a new column `year` by formatting the `date` column to select just the year.

```{r}
temp_max_min$year <- format(temp_max_min$date, "%Y") # make a new column with the year
```

Now we can calculate the cumulative growing degree days -- the cumulative effect of temperature that the plant will feel. We group by year and site, since temperature measurements are specific to each site and cumulative GDD should reset at the start of each year.

```{r}
# calculate cumulative GDD for each site and year separately
temp_max_min <- temp_max_min %>%
  group_by(siteID, year) %>%
  arrange(date) %>%
  mutate(GDD_cumulative = cumsum(GDD))

head(temp_max_min, 3)
```

To plot phenophase vs. GDD, we will join this temperature data with the `plant_pheno` data we had from before.
First, we find which columns overlap to confirm that we want to join on those, and then we filter for only the year 2016, since the 2015 temperature data didn't start until April.

```{r}
# where is there an intersection of names?
sameName <- intersect(names(temp_max_min), names(plant_pheno))
sameName

# take just the year 2016 - it's a complete year
temp_max_min_2016 <- filter(temp_max_min, year == 2016)

# create a new data frame with the results of the join
plant_pheno_temp <- left_join(temp_max_min_2016, plant_pheno)

head(plant_pheno_temp, 3)
```

Let's plot phenophase intensity, coloured by species, vs. cumulative GDD. We can see a nice pattern: all the plants accumulate leaf cover as GDD accumulates.

```{r, warning = FALSE}
ggplot(plant_pheno_temp, aes(x = GDD_cumulative, y = phenophaseIntensityMidpoint,
                             color = scientificName)) +
  geom_point()
```

Realistically we should only be looking at leaves coming out, not leaf loss in the fall -- there's probably a better model for leaf loss than GDD. We can restrict the data to `GDD_cumulative` $\le 1000$, since by that point all plants are at 95-100% cover.

```{r, warning = FALSE}
plant_pheno_temp %>%
  filter(GDD_cumulative <= 1000) %>%
  ggplot(aes(x = GDD_cumulative, y = phenophaseIntensityMidpoint,
             color = scientificName)) +
  geom_point(alpha = 0.5)
```

Here we get a hint of the same information we saw earlier, namely that some species get their leaves earlier than others. We could fit this data to get a quantitative estimate of how leaf cover depends on GDD; one possible approach is sketched below.
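For instance, a logistic (S-shaped) curve is a reasonable first model for how leaf cover increases with cumulative GDD. The minimal sketch below fits a single curve to all species pooled together using `nls()`, R's built-in nonlinear least squares routine. The parameter names `midpoint` (the cumulative GDD at which cover reaches 50%) and `scale`, along with the starting values, are illustrative guesses rather than part of the analysis above, and the starting values may need adjusting for the fit to converge.

```{r, warning = FALSE}
# fit a logistic curve of leaf cover vs. cumulative GDD, pooling all species:
# cover = 1 / (1 + exp(-(GDD_cumulative - midpoint) / scale))
leaf_out <- plant_pheno_temp %>%
  filter(GDD_cumulative <= 1000,
         !is.na(phenophaseIntensityMidpoint))

logistic_fit <- nls(phenophaseIntensityMidpoint ~
                      1 / (1 + exp(-(GDD_cumulative - midpoint) / scale)),
                    data = leaf_out,
                    start = list(midpoint = 300, scale = 100))

summary(logistic_fit)
```

Repeating this fit within `group_by()` groups, as we did for the sine model, would give a per-species estimate of the cumulative GDD at which each species leafs out -- a nice exercise to try on your own.

# References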