Introduction

Phenology is the study of periodic or cyclic natural phenomena, and this dataset contains observations of the seasonal cycles of plants at three NEON sites in the US: Blandy Experimental Farm (BLAN) and the Smithsonian Conservation Biology Institute (SCBI) in Virginia, and the Smithsonian Environmental Research Center (SERC) in Maryland.

In this tutorial, we’ll plot some phenology data and fit an oscillatory model to determine when different species get and lose their leaves.

Load and organize data

Clean up and merge the two data frames

The first part of this tutorial is based on a data cleanup from https://www.neonscience.org/neon-plant-pheno-data-r. There are two data files we’ll be combining: the first (phe_perindividual.csv) contains data for each individual organism observed such as its location and scientific name, and the second (phe_statusintensity.csv) contains observations of several phenophases and their intensity.

library(tidyverse)

# load the two data files 
ind <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_perindividual.csv')

status <- read_csv('NEON-pheno-temp-timeseries/pheno/phe_statusintensity.csv')

names(ind)
##  [1] "uid"                         "namedLocation"              
##  [3] "domainID"                    "siteID"                     
##  [5] "plotID"                      "decimalLatitude"            
##  [7] "decimalLongitude"            "geodeticDatum"              
##  [9] "coordinateUncertainty"       "elevation"                  
## [11] "elevationUncertainty"        "subtypeSpecification"       
## [13] "transectMeter"               "directionFromTransect"      
## [15] "ninetyDegreeDistance"        "sampleLatitude"             
## [17] "sampleLongitude"             "sampleGeodeticDatum"        
## [19] "sampleCoordinateUncertainty" "sampleElevation"            
## [21] "sampleElevationUncertainty"  "addDate"                    
## [23] "editedDate"                  "individualID"               
## [25] "taxonID"                     "scientificName"             
## [27] "identificationQualifier"     "taxonRank"                  
## [29] "growthForm"                  "vstTag"                     
## [31] "samplingProtocolVersion"     "measuredBy"                 
## [33] "identifiedBy"                "recordedBy"                 
## [35] "remarks"                     "dataQF"
names(status)
##  [1] "uid"                           "namedLocation"                
##  [3] "domainID"                      "siteID"                       
##  [5] "plotID"                        "date"                         
##  [7] "editedDate"                    "dayOfYear"                    
##  [9] "individualID"                  "taxonID"                      
## [11] "scientificName"                "growthForm"                   
## [13] "phenophaseName"                "phenophaseStatus"             
## [15] "phenophaseIntensityDefinition" "phenophaseIntensity"          
## [17] "samplingProtocolVersion"       "measuredBy"                   
## [19] "recordedBy"                    "remarks"                      
## [21] "dataQF"

We’ll remove the UID field which uniquely identifies each row – we don’t need to uniquely identify entries.

ind <- select(ind, -uid)
status <- select(status, -uid)

Next, we remove duplicates that result from stacking tables over many months – we don’t need duplicate observations from the same date, tree, and site.

ind_noD <- distinct(ind)
nrow(ind) - nrow(ind_noD) # how many rows were removed
## [1] 245
status_noD <- distinct(status)
nrow(status) - nrow(status_noD) # how many rows were removed
## [1] 1599

Some of the columns in status_noD that have the same name as those in ind_noD should be renamed so that we can join the two tables and keep those fields separate. When using rename(), remember that each new column name is written first, followed by the previous name (i.e. new_name = old_name).

# where is there an intersection of names?
sameName <- intersect(names(status_noD), names(ind_noD))
sameName
##  [1] "namedLocation"           "domainID"               
##  [3] "siteID"                  "plotID"                 
##  [5] "editedDate"              "individualID"           
##  [7] "taxonID"                 "scientificName"         
##  [9] "growthForm"              "samplingProtocolVersion"
## [11] "measuredBy"              "recordedBy"             
## [13] "remarks"                 "dataQF"
# rename status editedDate
status_noD <- rename(status_noD, editedDateStat = editedDate, 
        measuredByStat = measuredBy, recordedByStat = recordedBy, 
        samplingProtocolVersionStat = samplingProtocolVersion, 
        remarksStat = remarks, dataQFStat = dataQF)

Next, we’ll convert the date column to R’s Date object type. This makes R correctly interpret dates when performing operations down the line.

# convert column to date class
ind_noD$editedDate <- as.Date(ind_noD$editedDate)
str(ind_noD$editedDate)
##  Date[1:1557], format: "2015-07-22" "2015-07-22" "2015-07-22" "2015-07-22" "2015-07-22" ...
status_noD$date <- as.Date(status_noD$date)
str(status_noD$date)
##  Date[1:85693], format: "2015-06-25" "2015-06-25" "2015-06-25" "2015-06-25" "2015-06-25" ...
# retain only the max of the date for each individualID
ind_last <- ind_noD %>%
    group_by(individualID) %>%
    filter(editedDate == max(editedDate))

# oh wait, duplicate dates, retain only one
ind_lastnoD <- ind_last %>%
    group_by(editedDate, individualID) %>%
    filter(row_number() == 1)

Remove these two columns that will be duplicated in the ind_noD data frame.

status_noD <- select(status_noD, -taxonID, -scientificName)

Join the dataframes:

# Create a new dataframe "phe_ind" with all the data from status and some from ind_lastnoD
phe_ind <- left_join(status_noD, ind_lastnoD)
## Joining, by = c("namedLocation", "domainID", "siteID", "plotID", "individualID", "growthForm")
glimpse(phe_ind)
## Observations: 85,693
## Variables: 47
## $ namedLocation                 <chr> "BLAN_061.phenology.phe", "BLAN_...
## $ domainID                      <chr> "D02", "D02", "D02", "D02", "D02...
## $ siteID                        <chr> "BLAN", "BLAN", "BLAN", "BLAN", ...
## $ plotID                        <chr> "BLAN_061", "BLAN_061", "BLAN_06...
## $ date                          <date> 2015-06-25, 2015-06-25, 2015-06...
## $ editedDateStat                <date> 2015-07-22, 2015-07-22, 2015-07...
## $ dayOfYear                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ individualID                  <chr> "NEON.PLA.D02.BLAN.06286", "NEON...
## $ growthForm                    <chr> "Deciduous broadleaf", "Deciduou...
## $ phenophaseName                <chr> "Open flowers", "Increasing leaf...
## $ phenophaseStatus              <chr> "no", "no", "no", "no", "no", "n...
## $ phenophaseIntensityDefinition <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ phenophaseIntensity           <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ samplingProtocolVersionStat   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ measuredByStat                <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ recordedByStat                <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarksStat                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ dataQFStat                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decimalLatitude               <dbl> 39.05963, 39.05963, 39.05963, 39...
## $ decimalLongitude              <dbl> -78.07385, -78.07385, -78.07385,...
## $ geodeticDatum                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ coordinateUncertainty         <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ elevation                     <dbl> 183, 183, 183, 183, 183, 183, 18...
## $ elevationUncertainty          <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ subtypeSpecification          <chr> "primary", "primary", "primary",...
## $ transectMeter                 <dbl> 506, 491, 491, 498, 506, 476, 50...
## $ directionFromTransect         <chr> "Right", "Left", "Left", "Right"...
## $ ninetyDegreeDistance          <dbl> 1.0, 0.5, 0.5, 2.0, 1.0, 2.0, 0....
## $ sampleLatitude                <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleLongitude               <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleGeodeticDatum           <chr> "WGS84", "WGS84", "WGS84", "WGS8...
## $ sampleCoordinateUncertainty   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleElevation               <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sampleElevationUncertainty    <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ addDate                       <date> 2017-03-06, 2016-04-20, 2017-02...
## $ editedDate                    <date> 2017-04-07, 2016-05-09, 2017-03...
## $ taxonID                       <chr> "LOMA6", "RHDA", "LOMA6", "LOMA6...
## $ scientificName                <chr> "Lonicera maackii (Rupr.) Herder...
## $ identificationQualifier       <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ taxonRank                     <chr> "species", "species", "species",...
## $ vstTag                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ samplingProtocolVersion       <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ measuredBy                    <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ identifiedBy                  <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ recordedBy                    <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarks                       <chr> "dropped in order to have 30 LOM...
## $ dataQF                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...

Dealing with NAs

Some columns look like they might only have NAs. We can check the proportion of NAs by first summing the number of NA values per column with map_df() from the purrr package, followed by standard dplyr/tidyr operations to reshape the data frame.

map_df works by performing an operation on each of the columns in its input data frame and returning a data frame as output. The ~ character is shorthand for ‘function’, and tells R that the . within is.na() refers to individual columns and not the entire input data frame (as would otherwise be the case with the %>% operator).

# Some columns look like they might only have NAs. Check this:
phe_ind %>% 
  map_df(~ sum(is.na(.))) %>% # counts NA values in each column
  gather(column, values) %>% # reshapes data frame
  mutate(proportion_na = values / nrow(phe_ind)) %>% 
  arrange(desc(proportion_na)) # arrange in descending order of proportion_na

Now that we’ve identified which columns are only NA, let’s remove them from our dataset. dplyr offers a helpful function called select_if() that allows us to only keep columns satisfying a certain condition. select_if is an example of a scoped function; scoped functions apply operations on all variables within a specified subset. Where select requires us to specify column names to keep, select_if allows us to instead retain or remove columns based on a condition – which in our case pertains to whether the columns contain only NAs.

# first, define a function that checks whether not all elements are NA
not_all_na <- function(x) {
  return(!all(is.na(x))) # if all of the vector is just NA, return FALSE
}

# select_if can now use the above function
phe_ind <- phe_ind %>% 
  select_if(not_all_na)

glimpse(phe_ind)
## Observations: 85,693
## Variables: 33
## $ namedLocation                 <chr> "BLAN_061.phenology.phe", "BLAN_...
## $ domainID                      <chr> "D02", "D02", "D02", "D02", "D02...
## $ siteID                        <chr> "BLAN", "BLAN", "BLAN", "BLAN", ...
## $ plotID                        <chr> "BLAN_061", "BLAN_061", "BLAN_06...
## $ date                          <date> 2015-06-25, 2015-06-25, 2015-06...
## $ editedDateStat                <date> 2015-07-22, 2015-07-22, 2015-07...
## $ individualID                  <chr> "NEON.PLA.D02.BLAN.06286", "NEON...
## $ growthForm                    <chr> "Deciduous broadleaf", "Deciduou...
## $ phenophaseName                <chr> "Open flowers", "Increasing leaf...
## $ phenophaseStatus              <chr> "no", "no", "no", "no", "no", "n...
## $ phenophaseIntensityDefinition <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ phenophaseIntensity           <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ measuredByStat                <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ recordedByStat                <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarksStat                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decimalLatitude               <dbl> 39.05963, 39.05963, 39.05963, 39...
## $ decimalLongitude              <dbl> -78.07385, -78.07385, -78.07385,...
## $ elevation                     <dbl> 183, 183, 183, 183, 183, 183, 18...
## $ subtypeSpecification          <chr> "primary", "primary", "primary",...
## $ transectMeter                 <dbl> 506, 491, 491, 498, 506, 476, 50...
## $ directionFromTransect         <chr> "Right", "Left", "Left", "Right"...
## $ ninetyDegreeDistance          <dbl> 1.0, 0.5, 0.5, 2.0, 1.0, 2.0, 0....
## $ sampleGeodeticDatum           <chr> "WGS84", "WGS84", "WGS84", "WGS8...
## $ addDate                       <date> 2017-03-06, 2016-04-20, 2017-02...
## $ editedDate                    <date> 2017-04-07, 2016-05-09, 2017-03...
## $ taxonID                       <chr> "LOMA6", "RHDA", "LOMA6", "LOMA6...
## $ scientificName                <chr> "Lonicera maackii (Rupr.) Herder...
## $ identificationQualifier       <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ taxonRank                     <chr> "species", "species", "species",...
## $ measuredBy                    <chr> "cVPbPdjHNiEVZ3Vlm83FuXHus5z3id4...
## $ identifiedBy                  <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ recordedBy                    <chr> "UPKVQ90CmewX9vOGBMiPV2gMtRUi+WU...
## $ remarks                       <chr> "dropped in order to have 30 LOM...

Subsetting the dataset

Let’s look at just the “Leaves” phenophase for “Deciduous broadleaf” plants.

plant_pheno <- phe_ind %>% 
  filter(phenophaseName == "Leaves" & growthForm == "Deciduous broadleaf")

Many of the columns aren’t that relevant for us, but some that we’re definitely interested in are date, phenophaseIntensity, and scientificName. Let’s take a look at what kind of factors we have in the last two columns.

plant_pheno %>% 
  count(scientificName)
plant_pheno %>% 
  count(phenophaseIntensity)

These are 7 species of tree / shrub in this subsetted dataset. Feel free to look up what the common names are for each of these; for example, Liriodendron tulipifera, the tulip tree, can be found along the US east coast as well as in Southern Ontario.