Introduction

Phenology is the study of periodic or cyclic natural phenomena, and this dataset contains observations of the seasonal cycles of plants at three NEON sites in the US: Blandy Experimental Farm (BLAN) and the Smithsonian Conservation Biology Institute (SCBI) in Virginia, and the Smithsonian Environmental Research Center (SERC) in Maryland.

In this tutorial, we’ll plot some phenology data and fit an oscillatory model to determine when different species get and lose their leaves.

Load and organize data

Clean up and merge the two data frames

The first part of this tutorial is based on a data cleanup from https://www.neonscience.org/neon-plant-pheno-data-r. There are two data files we’ll be combining: the first (phe_perindividual.csv) contains data for each individual organism observed such as its location and scientific name, and the second (phe_statusintensity.csv) contains observations of several phenophases and their intensity.

library(tidyverse)

# load the two data files 
ind <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_perindividual.csv')

status <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_statusintensity.csv')

names(ind)
##  [1] "uid"                         "namedLocation"              
##  [3] "domainID"                    "siteID"                     
##  [5] "plotID"                      "decimalLatitude"            
##  [7] "decimalLongitude"            "geodeticDatum"              
##  [9] "coordinateUncertainty"       "elevation"                  
## [11] "elevationUncertainty"        "subtypeSpecification"       
## [13] "transectMeter"               "directionFromTransect"      
## [15] "ninetyDegreeDistance"        "sampleLatitude"             
## [17] "sampleLongitude"             "sampleGeodeticDatum"        
## [19] "sampleCoordinateUncertainty" "sampleElevation"            
## [21] "sampleElevationUncertainty"  "addDate"                    
## [23] "editedDate"                  "individualID"               
## [25] "taxonID"                     "scientificName"             
## [27] "identificationQualifier"     "taxonRank"                  
## [29] "growthForm"                  "vstTag"                     
## [31] "samplingProtocolVersion"     "measuredBy"                 
## [33] "identifiedBy"                "recordedBy"                 
## [35] "remarks"                     "dataQF"
names(status)
##  [1] "uid"                           "namedLocation"                
##  [3] "domainID"                      "siteID"                       
##  [5] "plotID"                        "date"                         
##  [7] "editedDate"                    "dayOfYear"                    
##  [9] "individualID"                  "taxonID"                      
## [11] "scientificName"                "growthForm"                   
## [13] "phenophaseName"                "phenophaseStatus"             
## [15] "phenophaseIntensityDefinition" "phenophaseIntensity"          
## [17] "samplingProtocolVersion"       "measuredBy"                   
## [19] "recordedBy"                    "remarks"                      
## [21] "dataQF"

We’ll remove the UID field which uniquely identifies each row – we don’t need to uniquely identify entries.

ind <- select(ind, -uid)
status <- select(status, -uid)

Next, we remove duplicates that result from stacking tables over many months – we don’t need duplicate observations from the same date, tree, and site.

ind_noD <- distinct(ind)
nrow(ind) - nrow(ind_noD) # how many rows were removed
## [1] 245
status_noD <- distinct(status)
nrow(status) - nrow(status_noD) # how many rows were removed
## [1] 1599

Some of the columns in status_noD that have the same name as those in ind_noD should be renamed so that we can join the two tables and keep those fields separate. When using rename(), remember that each new column name is written first, followed by the previous name (i.e. new_name = old_name).

# where is there an intersection of names?
sameName <- intersect(names(status_noD), names(ind_noD))
sameName
##  [1] "namedLocation"           "domainID"               
##  [3] "siteID"                  "plotID"                 
##  [5] "editedDate"              "individualID"           
##  [7] "taxonID"                 "scientificName"         
##  [9] "growthForm"              "samplingProtocolVersion"
## [11] "measuredBy"              "recordedBy"             
## [13] "remarks"                 "dataQF"
# rename status editedDate
status_noD <- rename(status_noD, editedDateStat = editedDate, 
        measuredByStat = measuredBy, recordedByStat = recordedBy, 
        samplingProtocolVersionStat = samplingProtocolVersion, 
        remarksStat = remarks, dataQFStat = dataQF)

Next, we’ll convert the date column to R’s Date object type. This makes R correctly interpret dates when performing operations down the line.

# convert column to date class
ind_noD$editedDate <- as.Date(ind_noD$editedDate)
str(ind_noD$editedDate)
##  Date[1:1557], format: "2015-07-22" "2015-07-22" "2015-07-22" "2015-07-22" "2015-07-22" ...
status_noD$date <- as.Date(status_noD$date)
str(status_noD$date)
##  Date[1:85693], format: "2015-06-25" "2015-06-25" "2015-06-25" "2015-06-25" "2015-06-25" ...
# retain only the max of the date for each individualID
ind_last <- ind_noD %>%
    group_by(individualID) %>%
    filter(editedDate == max(editedDate))

# oh wait, duplicate dates, retain only one
ind_lastnoD <- ind_last %>%
    group_by(editedDate, individualID) %>%
    filter(row_number() == 1)

Remove these two columns that will be duplicated in the ind_noD data frame.

status_noD <- select(status_noD, -taxonID, -scientificName)

Join the dataframes:

# Create a new dataframe "phe_ind" with all the data from status and some from ind_lastnoD
phe_ind <- left_join(status_noD, ind_lastnoD)
## Joining, by = c("namedLocation", "domainID", "siteID", "plotID", "individualID", "growthForm")
glimpse(phe_ind)
## Observations: 85,693
## Variables: 47
## $ namedLocation                 <chr> "BLAN_061.phenology.phe", "BLAN_...
## $ domainID                      <chr> "D02", "D02", "D02", "D02", "D02...
## $ siteID                        <chr> "BLAN", "BLAN", "BLAN", "BLAN", ...
## $ plotID                        <chr> "BLAN_061", "BLAN_061", "BLAN_06...
## $ date                          <date> 2015-06-25, 2015-06-25, 2015-06...
## $ editedDateStat                <date> 2015-07-22, 2015-07-22, 2015-07...
## $ dayOfYear                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ individualID                  <chr> "NEON.PLA.D02.BLAN.06286", "NEON...
## $ growthForm                    <chr> "Deciduous broadleaf", "Deciduou...
## $ phenophaseName                <chr> "Open flowers", "Increasing leaf...
## $ phenophaseStatus              <chr> "no", "no", "no", "no", "no", "n...
## $ phenophaseIntensityDefinition <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ phenophaseIntensity           <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ samplingProtocolVersionStat   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ measuredByStat                <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ recordedByStat                <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarksStat                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ dataQFStat                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decimalLatitude               <dbl> 39.05963, 39.05963, 39.05963, 39...
## $ decimalLongitude              <dbl> -78.07385, -78.07385, -78.07385,...
## $ geodeticDatum                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ coordinateUncertainty         <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ elevation                     <dbl> 183, 183, 183, 183, 183, 183, 18...
## $ elevationUncertainty          <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ subtypeSpecification          <chr> "primary", "primary", "primary",...
## $ transectMeter                 <dbl> 506, 491, 491, 498, 506, 476, 50...
## $ directionFromTransect         <chr> "Right", "Left", "Left", "Right"...
## $ ninetyDegreeDistance          <dbl> 1.0, 0.5, 0.5, 2.0, 1.0, 2.0, 0....
## $ sampleLatitude                <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleLongitude               <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleGeodeticDatum           <chr> "WGS84", "WGS84", "WGS84", "WGS8...
## $ sampleCoordinateUncertainty   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleElevation               <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleElevationUncertainty    <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ addDate                       <date> 2017-03-06, 2016-04-20, 2017-02...
## $ editedDate                    <date> 2017-04-07, 2016-05-09, 2017-03...
## $ taxonID                       <chr> "LOMA6", "RHDA", "LOMA6", "LOMA6...
## $ scientificName                <chr> "Lonicera maackii (Rupr.) Herder...
## $ identificationQualifier       <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ taxonRank                     <chr> "species", "species", "species",...
## $ vstTag                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ samplingProtocolVersion       <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ measuredBy                    <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ identifiedBy                  <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ recordedBy                    <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarks                       <chr> "dropped in order to have 30 LOM...
## $ dataQF                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...

Dealing with NAs

Some columns look like they might only have NAs. We can check the proportion of NAs by first summing the number of NA values per column with map_df() from the purrr package, followed by standard dplyr/tidyr operations to reshape the data frame.

map_df works by performing an operation on each of the columns in its input data frame and returning a data frame as output. The ~ character is shorthand for ‘function’, and tells R that the . within is.na() refers to individual columns and not the entire input data frame (as would otherwise be the case with the %>% operator).

# Some columns look like they might only have NAs. Check this:
phe_ind %>% 
  map_df(~ sum(is.na(.))) %>% # counts NA values in each column
  gather(column, values) %>% # reshapes data frame
  mutate(proportion_na = values / nrow(phe_ind)) %>% 
  arrange(desc(proportion_na)) # arrange in descending order of proportion_na

Now that we’ve identified which columns are only NA, let’s remove them from our dataset. dplyr offers a helpful function called select_if() that allows us to only keep columns satisfying a certain condition. select_if is an example of a scoped function; scoped functions apply operations on all variables within a specified subset. Where select requires us to specify column names to keep, select_if allows us to instead retain or remove columns based on a condition – which in our case pertains to whether the columns contain only NAs.

# first, define a function that checks whether not all elements are NA
not_all_na <- function(x) {
  return(!all(is.na(x))) # if all of the vector is just NA, return FALSE
}

# select_if can now use the above function
phe_ind <- phe_ind %>% 
  select_if(not_all_na)

glimpse(phe_ind)
## Observations: 85,693
## Variables: 33
## $ namedLocation                 <chr> "BLAN_061.phenology.phe", "BLAN_...
## $ domainID                      <chr> "D02", "D02", "D02", "D02", "D02...
## $ siteID                        <chr> "BLAN", "BLAN", "BLAN", "BLAN", ...
## $ plotID                        <chr> "BLAN_061", "BLAN_061", "BLAN_06...
## $ date                          <date> 2015-06-25, 2015-06-25, 2015-06...
## $ editedDateStat                <date> 2015-07-22, 2015-07-22, 2015-07...
## $ individualID                  <chr> "NEON.PLA.D02.BLAN.06286", "NEON...
## $ growthForm                    <chr> "Deciduous broadleaf", "Deciduou...
## $ phenophaseName                <chr> "Open flowers", "Increasing leaf...
## $ phenophaseStatus              <chr> "no", "no", "no", "no", "no", "n...
## $ phenophaseIntensityDefinition <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ phenophaseIntensity           <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ measuredByStat                <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ recordedByStat                <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarksStat                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decimalLatitude               <dbl> 39.05963, 39.05963, 39.05963, 39...
## $ decimalLongitude              <dbl> -78.07385, -78.07385, -78.07385,...
## $ elevation                     <dbl> 183, 183, 183, 183, 183, 183, 18...
## $ subtypeSpecification          <chr> "primary", "primary", "primary",...
## $ transectMeter                 <dbl> 506, 491, 491, 498, 506, 476, 50...
## $ directionFromTransect         <chr> "Right", "Left", "Left", "Right"...
## $ ninetyDegreeDistance          <dbl> 1.0, 0.5, 0.5, 2.0, 1.0, 2.0, 0....
## $ sampleGeodeticDatum           <chr> "WGS84", "WGS84", "WGS84", "WGS8...
## $ addDate                       <date> 2017-03-06, 2016-04-20, 2017-02...
## $ editedDate                    <date> 2017-04-07, 2016-05-09, 2017-03...
## $ taxonID                       <chr> "LOMA6", "RHDA", "LOMA6", "LOMA6...
## $ scientificName                <chr> "Lonicera maackii (Rupr.) Herder...
## $ identificationQualifier       <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ taxonRank                     <chr> "species", "species", "species",...
## $ measuredBy                    <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ identifiedBy                  <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ recordedBy                    <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarks                       <chr> "dropped in order to have 30 LOM...

Subsetting the dataset

Let’s look at just the “Leaves” phenophase for “Deciduous broadleaf” plants.

plant_pheno <- phe_ind %>% 
  filter(phenophaseName == "Leaves" & growthForm == "Deciduous broadleaf")

Many of the columns aren’t that relevant for us, but some that we’re definitely interested in are date, phenophaseIntensity, and scientificName. Let’s take a look at what kind of factors we have in the last two columns.

plant_pheno %>% 
  count(scientificName)
plant_pheno %>% 
  count(phenophaseIntensity)

These are 7 species of tree / shrub in this subsetted dataset. Feel free to look up what the common names are for each of these; for example, Liriodendron tulipifera, the tulip tree, can be found along the US east coast as well as in Southern Ontario.

Lipidoptera tulipifera, by Jean-Pol GRANDMONT - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=9873223

Lipidoptera tulipifera, by Jean-Pol GRANDMONT - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=9873223

We’re going to look at how the phenophase intensity for “Leaves” changes over the course of time. Let’s look at the structure of the phenophaseIntensity column:

str(plant_pheno$phenophaseIntensity)
##  chr [1:12573] ">= 95%" ">= 95%" ">= 95%" ">= 95%" ">= 95%" ">= 95%" ...

We want to be able to use these as numbers, and since they’re binned in windows of percentage, we can manually convert the bin label to the numeric midpoint of the bin. This is approximate, and there are multiple valid choices you could make for how to assign a number to each bin. To do this, we’ll make use of the case_when() function, which allows us to mutate() on multiple conditions. case_when is very useful if there are multiple conditions that our mutate() operation needs to account for, as opposed to ifelse(), which can only handle one condition at a time. To use case_when, we first write out the condition followed by a ~ and the corresponding operation for that condition.

# convert phenophase intensities to numbers for plotting

# create new column called phenophaseIntensityMidpoint 
# and fill it with the numeric bin midpoint
plant_pheno <- plant_pheno %>% 
  mutate(
    phenophaseIntensityMidpoint = case_when(
      phenophaseIntensity == "< 5%" ~ 0.05 / 2,
      phenophaseIntensity == "5-24%" ~ (0.05 + 0.24) / 2,
      phenophaseIntensity == "25-49%" ~ (0.20 + 0.49) / 2,
      phenophaseIntensity == "50-74%" ~ (0.5 + 0.74) / 2,
      phenophaseIntensity == "75-94%" ~ (0.85 + 0.94) / 2,
      phenophaseIntensity == ">= 95%" ~ (1 + 0.95) / 2
    )
  )

# convert to numeric type explicitly
plant_pheno$phenophaseIntensityMidpoint <- as.numeric(plant_pheno$phenophaseIntensityMidpoint)

# check the structure of the new column
str(plant_pheno$phenophaseIntensityMidpoint)
##  num [1:12573] 0.975 0.975 0.975 0.975 0.975 0.975 0.975 0.975 0.975 0.975 ...

The last processing step is to make sure that the date column is a Date object.

# convert the date column to a date
plant_pheno$date <- as.Date(plant_pheno$date)

Plotting and fitting the data

Let’s plot the phenophase intensity over time, grouped by the species.

ggplot(plant_pheno, aes(x = date, y = phenophaseIntensityMidpoint, color = scientificName)) +
  geom_point() 

The pattern we would expect is already visible in this first plot - the leaves come out in the spring, then disappear again in October. But there might be differences between species and between individuals in a species. One way we could try to assess this is to fit the same model to subgroups of the data and then compare the fitted parameters to see if there are differences.

Fit an oscillatory model to the data

Let’s try to fit a sine wave to the phenophase intensity. A generic sine wave has four parameters:

\[y = A \text{sin}(kx - b) + c\] where \(k = 2\pi / T\), \(T\) is the period or length of time between peaks, \(A\) is the amplitude, \(b\) is the phase, and \(c\) is the vertical shift or offset.

Let’s plot a sine wave:

# plot a sine wave
x <- seq(0, 3, 0.01)
A <- 1
k <- 2*pi
b <- 0.5
c <- 0

qplot(x, A*sin(k*x-b)+c) +
  geom_line()

In order to use the date as a numeric x-axis for fitting purposes, we need to convert it from a date object to a numeric object.

# check the structure of the date column
str(plant_pheno$date)
##  Date[1:12573], format: "2015-06-25" "2015-06-25" "2015-06-25" "2015-06-25" "2015-06-25" ...
# create a list of all the dates in ascending order
dates = plant_pheno %>% 
  arrange(date) %>% 
  select(date)

# create a function to subtract two dates
subtract_dates <- function(date1, date2) {
  result <- date1 - date2
}

# create a numeric dates list to use in fitting
dates_numeric <- mapply(subtract_dates,
                 dates$date, # first argument in subtract_dates function 
                 dates$date[1]) # second argument in subtract_dates function - the first date

# add numeric dates to the dataframe
plant_pheno$date_numeric <- mapply(subtract_dates, 
                                                plant_pheno$date,
                                                dates$date[1]) # subtract the first date
# drop rows that have NA in phenophaseIntensityMidpoint column
plant_pheno <- 
  plant_pheno[!is.na(plant_pheno$phenophaseIntensityMidpoint),]

Let’s fit a particular individual’s phenophase intensity.

# find an individual
plant_pheno %>% 
  filter(scientificName == "Liriodendron tulipifera L.") %>% 
  select(scientificName, individualID) %>% 
  head(1)
# make a data frame with one individual
tulip_tree <- plant_pheno %>% 
  filter(individualID == "NEON.PLA.D02.SCBI.06071")

Now we will create a test sine function to get a rough idea for the parameters.

# period = 365 days
# wavenumber of sine function = 2 * pi/period 

sine_model <- function(x, amplitude, period, phase, offset) {
  return(amplitude*sin(2*pi/period*x + phase) + offset)
}
A <- 0.5 # phenophase intensity values go between 0 and 1
offset <- 0.5 # add 0.5 to make min 0 and max 1
period <- 365 # number of days in a year
phase <- 0.5 # this is a guess

guess_curve <- sine_model(dates_numeric, A, period, phase, offset)
qplot(x = dates$date, y = guess_curve) + # guess data
  geom_point(data = tulip_tree,  # actual data
       aes(x = date, y = phenophaseIntensityMidpoint, colour = scientificName)) 

We will only fit \(b\), the horizontal shift, since we already know the other parameters: we know that the minimum and maximum must be 0 and 1 since the intensity goes from 0 to 100%, and this sets both \(c\) and \(A\). We also know \(k\), since we know that the oscillation should go around once in a year (365 days). If we were being more fancy, we might want to take into account things like temperature and leap years; there is a pheno package in R specifically for plotting and analyzing phenological data.

# Make a range of b parameter values to try
b_vals <- seq(0.2, 0.8, by = 0.01)

# use the function 'sapply' to loop over b_vals list
resids_sq <- sapply(b_vals, function(b) {
    prediction <- 0.5*sin(2*pi/365*tulip_tree$date_numeric + b) +0.5
    residuals <- prediction - tulip_tree$phenophaseIntensityMidpoint
    sum(residuals^2)
})

Plotting the sum of the residuals squards vs. the fit parameter will show us if it worked: the best fit is the minimum of the curve.

qplot(b_vals, resids_sq)

We can see visually that the minimum is around \(b = 0.5\), but to extract that number from the list we can use the function which:

best_fit <- which(resids_sq == min(resids_sq))
b_fit <- b_vals[best_fit] 
b_fit
## [1] 0.54

Finally, let’s plot the fit against the original data:

ggplot(data = tulip_tree, aes(x = date, y = phenophaseIntensityMidpoint)) +
  geom_point() + 
  geom_point(aes(x = date, y = 0.5*sin(2*pi/365*tulip_tree$date_numeric + b_fit) +0.5, colour = 'red'))

Not bad! Next, we’ll do this for every single individual in the entire dataset. With fits for each individual, we can look at the distribution of the phase shifts for each species to get a sense of when each species gets its leaves.

Fit each individual separately

# Make a range of b parameter values to try
b_vals <- seq(0, 1.1, by = 0.015)

# create a function that does least squares for this model
least_squares <- function(df, b_vals) {
  resids_sq <- sapply(b_vals, function(b) {
    prediction <- 0.5*sin(2*pi/365*df$date_numeric + b) +0.5
    residuals <- prediction - df$phenophaseIntensityMidpoint
    sum(residuals^2)
    })
  return(data.frame(b_vals, resids_sq))
}

# create a data frame that contains the residuals grouped by species
resids_sq_individuals <- plant_pheno %>%
  group_by(individualID, scientificName) %>% 
  do(data.frame(val=least_squares(., b_vals)))
ggplot(resids_sq_individuals, aes(x=val.b_vals, y = val.resids_sq, colour = scientificName)) +
  geom_point()

Get the best fit \(b\) value for each individual:

# store the best fit values in a new data frame
b_df_all <- resids_sq_individuals %>% 
  group_by(individualID, scientificName) %>% 
  summarize(phase = b_vals[which(val.resids_sq== min(val.resids_sq))])

head(b_df_all)

We have best fit values for the phase for every individual plant in our dataset. To visualize this, we will make a violin plot of the phases, grouped by species.

# violin plot
ggplot(b_df_all, aes(x = scientificName, y = phase, colour = scientificName)) +
  geom_violin(scale = 'count', draw_quantiles = 0.5) +
  geom_jitter(height = 0, width = 0.1) +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

The phase \(b\) is a positive number for each of these, and in our model we fit the curve like this:

\[y = A\text{sin}(kx + b) + c\] A positive \(b\) shifts the curve to the left, so our model suggests that species with larger \(b\) either get their leaves earlier, lose their leaves earlier, or peak earlier, or some combination of all three. I’m not a phenologist, but we can do some reading to see if the general pattern we observe is really there: Lonicera maackii, Amur honeysuckle (an invasive species), does in fact leaf out very early in the spring and keeps its leaves longer than other plants, according to this paper (McEwan et al. 2009).

This actually suggests a limitation to our approach: by fitting the same period of sine curve to each dataset, we have no information on how long the leaf season is for each plant: we have no way of distinguishing a plant that leafs earlier and loses its leaves earlier from a plant that peaks earlier and keeps its leaves longer. It’s hard to interpret what \(b\) actually means in terms of the phenophase. Let’s plot the smallest and largest \(b\) species together:

plant_pheno %>% 
  filter(taxonID == "LOMA6" | taxonID == "JUNI") %>% # easier than typing the full scientific name
  ggplot(aes(x=date, y = phenophaseIntensityMidpoint, colour = scientificName)) +
  geom_point()

By examining these two species side by side, we can see that Lonicera maackii peaks at about the same time as Juglans nigra, but it gets leaves earlier and keeps its leaves later. To actually distinguish each of these events, we would need a more complicated model with more fit parameters.

When phenologists model phenophase changes, they often use growing degree days (GDD) instead of the true date, since plants respond to environmental signals like temperature and not to the calendar date. In the extension below we will calculate cumulative growing degree days and plot the phenophase intensity as above vs. cumulative growing degree days.

Extension - Growing Degree Days

There are several methods to calculate degree days (Battel 2017; Nugent, n.d.; Miller, Lanier, and Brandt 2018). Here we will use a simple method calculated by averaging the maximum and minimum temperatures in a day and subtracting a base temperature below which no growth is expected:

\[\text{GDD} = \frac{\text{max temp} - \text{min temp}}{2} - \text{base temp}\]

Negative GDD values are set to zero. To choose a base temperature, we ideally need to know how the species we’re investigating grows. For illustration purposes here, we’ll choose a base temperature of 10 degrees Celsius (50 degrees Fahrenheit).

To combine this information with the phenology information above, we will use air temperature measurements collected every 30 minutes at two of the NEON sites we’re looking at.

# 30 minute averages of 1Hz observations of air temperature in degrees Celsius
temp <- read_csv('NEON-pheno-temp-timeseries/temp/SAAT_30min.csv') 

# what are the column names?
names(temp)
##  [1] "domainID"            "siteID"              "horizontalPosition" 
##  [4] "verticalPosition"    "startDateTime"       "endDateTime"        
##  [7] "tempSingleMean"      "tempSingleMinimum"   "tempSingleMaximum"  
## [10] "tempSingleVariance"  "tempSingleNumPts"    "tempSingleExpUncert"
## [13] "tempSingleStdErMean" "finalQF"

First, let’s convert the date and time column to a date column named date, and the temperature measurements to a numeric column.

temp <- temp %>%
    mutate(date = as.Date(startDateTime),
           tempSingleMean = as.numeric(tempSingleMean))

We need the max and min temperatures to calculate growing degree days using the formula above. Here, we’ll create a new data frame called temp_max_min that has the date, site ID, and max and min temperatures for that day.

# create temp_max_min data frame
temp_max_min <- temp %>% 
  filter(!is.na(tempSingleMean)) %>% # remove NA values from temperature measurements
  group_by(date, siteID) %>% 
  # create max and min columns
  summarize(max = max(tempSingleMean), min = min(tempSingleMean)) 

head(temp_max_min, 3)

Next, we’ll actually calcuale growing degree days. We define a baseline temperature of 10 Celsius, then average the max and min columns and subtract the baseline temperature. Finally, we use mutate and case_when to convert negative GDD values to zero.

# define baseline temp
baseline_temp <- 10 # degrees Celsius
# calculate GDD and assign to new column
temp_max_min$GDD <- (temp_max_min$max + temp_max_min$min) / 2 - baseline_temp 

# cut off GDD at 0: negative values are assumed to be 0.
temp_max_min <- temp_max_min %>%  mutate(
    GDD = case_when(
      GDD < 0 ~ 0,
      GDD >= 0 ~ GDD))

We want to be able to group these results by year, since GDD resets at the start of each growing season. We create a new column year by formatting the date column to select just the year.

temp_max_min$year <- format(temp_max_min$date, "%Y") # make a new column with the year

Now we can calculate the cumulative growing degree days - the cumulative effect of temperature that the plant will feel. We group by year and site, since temperature measurements are specific to site and cumulative GDD should reset at the start of each year.

# calculate cumulative GDD for each site and year separately
temp_max_min <- temp_max_min %>% 
  group_by(siteID, year) %>% 
  arrange(date) %>% 
  mutate(GDD_cumulative = cumsum(GDD))

head(temp_max_min, 3)

To plot phenophase vs. GDD, we will join this temperature data with the plant_pheno data we had from before. First, we find which columns are overlapping to confirm that we want to join on those, and then we filter for only the year 2016 since the 2015 temperature data didn’t start until April.

# where is there an intersection of names?
sameName <- intersect(names(temp_max_min), names(plant_pheno))
sameName
## [1] "date"   "siteID"
# take just the year 2016 - it's a complete year
temp_max_min_2016 <- filter(temp_max_min, year == 2016)

# create a new data frame with the results of the join
plant_pheno_temp <- left_join(temp_max_min_2016, plant_pheno)
## Joining, by = c("date", "siteID")
head(plant_pheno_temp, 3)

Let’s plot phenophase intensity, coloured by species, vs. cumulative GDD. We can see a nice pattern - all the plants accumulate leaf cover as GDD accumulates.

ggplot(plant_pheno_temp, 
       aes(x = GDD_cumulative, 
           y = phenophaseIntensityMidpoint, 
           color = scientificName)) + 
  geom_point() 

Realistically we should only be looking at leaves coming out, not leaf loss in the fall - there’s probably a better model for leaf loss than GDD. We can filter out all data with GDD_cumulative \(< 1000\), since by that point all plants are at 95-100% cover.

plant_pheno_temp %>% 
  filter(GDD_cumulative <= 1000) %>% 
  ggplot(aes(x = GDD_cumulative, 
             y = phenophaseIntensityMidpoint, 
             color = scientificName)) +
   geom_point(alpha = 0.5) 

Here we get a hint of of the same information we saw earlier, namely that some species get their leaves earlier than others. We could fit this data to get a quantitative estimate of how leaf cover depends on GDD.

References

Battel, Bob. 2017. “Understanding growing degree-days.” https://www.canr.msu.edu/news/understanding{\_}growing{\_}degree{\_}days.

McEwan, Ryan W, M Keith Birchfield, Angela Schoergendorfer, and Mary A Arthur. 2009. “Leaf phenology and freeze tolerance of the invasive shrub Amur honeysuckle and potential native competitors.” The Journal of the Torrey Botanical Society 136 (2): 212–20. doi:10.3159/08-RA-109.1.

Miller, Perry, Will Lanier, and Stu Brandt. 2018. “Using Growing Degree Days to Predict Plant Stages.” Montguide, no. MT200103 AG. doi:10.1111/j.1744-7348.1991.tb0489.

Nugent, Jim. n.d. “Calculating growing degree days.” Michigan State University Extension, 1–2.