#' ---
#' title: "Remote Sensing Data Processing"
#' subtitle: "NEON airborne remote sensing data exploration"
#' author: "Yingying Xie"
#' date: "March 3, 2020"
#' output: 
#'   html_document:
#'     df_print: paged
#'   pdf_document:
#'     fig_width: 5 
#'     fig_height: 3 
#' ---
#' 
## ----setup, include=FALSE------------------------------------------------------------------------------------------------------------------------------
#knitr::opts_chunk$set(cache = TRUE)

#' 
#' # NEON Airborne Observation Platform (AOP) data
#' 
#' Remote sensing (AOP) - Data collected by the airborne observation platform, e.g. LIDAR, surface reflectance
#' 
#' There are 29 AOP data products available from NEON data portal. We will use six of them:
#' * Ecosystem structure (canopy height)
#' * Biomass
#' * Water indices
#' * Elevation
#' * Slope and Aspect
#' * Vegetation indices
#' 
#' The spatial resolution of these AOP remote sensing data is 1 meter. The data were generated into tiles of 1*1km, which are available at NEON data portal (https://data.neonscience.org/home). 
#' 
#' To explore information of these data products (29 AOP products from 53 sites), go to this web site: https://data.neonscience.org/data-products/explore. 
#' 
#' 
## ---- warning=FALSE, message=FALSE---------------------------------------------------------------------------------------------------------------------
# load packages
library(sf)
library(rgdal)
library(raster)
library(rasterVis)
library(tidyverse)

#' 
#' ## Explore AOP data
#' 
#' There are 81 NEON sites with multiple plots, flux tower observations, and other observation settings. You may check the site information at this web site: https://www.neonscience.org/field-sites/field-sites-map/list.
#' 
#' In this practice, we will examine three sites: 
#' * Harvard Forest - HARV
#' * Wind River Experimental Forest - WREF
#' * Yellowstone Northern Range (Frog Rock) - YELL
#' 
#' One tile data of six AOP products at these three sites have been downloaded from NEON data portal. You should access the data through the link on Canvas and download them to your local disk. All data files are Tiff files.
#' 
#' Let's first explore the data at Harvard Forest site.
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# read AOP data
# make sure that you change the path to your file directory
harv_ch2017=raster("09_data/HARV_2017/NEON_D01_HARV_DP3_732000_4713000_CHM.tif")

# check metadata
harv_ch2017

# plot it
plot(harv_ch2017)

#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# let's use ggplot to graph the data, so we can have more management on the graph
# convert raster data to a data frame
df_harv_ch2017=as.data.frame(harv_ch2017,xy=T,long=T)

# plot it using ggplot
ggplot(df_harv_ch2017)+
  geom_raster(aes(x=x,y=y,fill=value))+
  # use a color scheme with gradients defined by four colors
  scale_fill_gradientn(
    colours=c("orange","yellow","green","darkgreen"),
    name="Canopy Height")+
  coord_equal()

#' 
#' From this plot, we may see the general spatial pattern of canopy height in this 1*1km tile. The area with brown color might be the bare land with no or very low amount of vegetation (low canopy height), and the area with green color represent forested area (high canopy height).
#' 
#' Now let's check the biomass data using the same way shown above.
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
harv_bm2017=raster("09_data/HARV_2017/NEON_D01_HARV_DP3_732000_4713000_Biomass.tif")

harv_bm2017

df_harv_bm2017=as.data.frame(harv_bm2017,xy=T,long=T)

ggplot(df_harv_bm2017)+
  geom_raster(aes(x=x,y=y,fill=value))+
  # use a sequential color scheme "YlGn" 
  scale_fill_distiller( palette = "YlGn", direction = 1, name="Biomass")+ 
  coord_equal()

#' 
#' Apparently, most areas have a high amount of biomass as the observation was done in summer time, that's why we see a lot of darkgreen in the plot.
#' 
#' What about the topography in this region?
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
harv_el2017=raster("09_data/HARV_2017/NEON_D01_HARV_DP3_732000_4713000_DSM.tif")

harv_el2017

df_harv_el2017=as.data.frame(harv_el2017,xy=T,long=T)

ggplot(df_harv_el2017)+
  geom_raster(aes(x=x,y=y,fill=value))+
  # use a terrain color scheme for visualizing topography
  scale_fill_gradientn( colours=terrain.colors(100),
    name="Elevation (m)")+
  coord_equal()

#' 
#' It seems the northwestern area has high elevation and the southeastern area is relatively low in elevation.
#' 
#' Now it's your turn to explore other AOP data at Harvard Forest. Refer to the code above to check the following data. Make the plots and briefly describe the information shown in the plots.
#' 
#' 1. Vegetation index - NDVI (Normalized Difference Vegetation Index)
#' 
#' The Normalized Difference Vegetation Index (NDVI) is one of the most widely used indicator that describes the greenness - the relative density and health of vegetation, based on red and near-infrared light waves reflected by land surfaces. 
#' 
#' NDVI values range from -1.0 to +1.0, but the common range of NDVI values is 0 to 1. Areas of barren rock, sand, or snow usually show very low NDVI values (for example, 0.1 or less). Sparse vegetation such as shrubs and grasslands or senescing crops may result in moderate NDVI values (approximately 0.2 to 0.5). High NDVI values (approximately 0.6 to 0.9) correspond to dense vegetation such as that found in temperate and tropical forests or crops at their peak growth stage. 
#' 
## ---- eval=F-------------------------------------------------------------------------------------------------------------------------------------------
## # let's start from here
## harv_ndvi2017=raster("09_data/HARV_2017/VegIndices/NEON_D01_HARV_DP3_732000_4713000_NDVI.tif")

#' 
#' Write your code below:
#' 
#' ***
#' ***
#' ***
#' 
#' 
#' 
#' 
df_harv_ndvi2017=as.data.frame(harv_ndvi2017,xy=T,long=T)

ggplot(df_harv_ndvi2017)+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_distiller( palette = "YlGn", direction = 1, name="NDVI")+
  coord_equal()
#' 
#' What do you know about the forests from NDVI values shown in your graph?
#' 
#' 
#' 2. Water index - NDWI (Normalized Difference Water Index)
#' 
#' The Normalized Difference Water Index (NDWI) is a remote sensing derived index estimating the leaf water content at canopy level. NDWI is known to be strongly related to the plant water content. It is therefore a very good proxy for plant water stress.
#' 
#' The NDWI value varies between -1 to +1, depending on the leaf water content but also on the vegetation type and cover. High values of NDWI correspond to high vegetation water content and to high vegetation fraction cover. Low NDWI values correspond to low vegetation water content and low vegetation fraction cover. In period of water stress, NDWI will decrease. 
#' 
## ---- eval=F-------------------------------------------------------------------------------------------------------------------------------------------
## # let's start from here
## harv_ndwi2017=raster("09_data/HARV_2017/WaterIndices/NEON_D01_HARV_DP3_732000_4713000_NDWI.tif")

#' 
#' Write your code below:
#' 
#' ***
#' ***
#' ***
#' 
#' 
#' 
#' 
#' 
df_harv_ndwi2017=as.data.frame(harv_ndwi2017,xy=T,long=T)

ggplot(df_harv_ndwi2017)+
  geom_raster(aes(x=x,y=y,fill=value))+
  # as NDWI varis from -1 to 1, a diverging color scheme is better
  scale_fill_distiller( palette = "RdYlBu", direction = 1, name="NDWI")+ 
  coord_equal()
#' 
#' What do you know about the forests from NWVI values shown in your graph?
#' 
#' 
#' 
#' ## Extract data for observatory plots
#' 
#' 
#' At each site, we are interested in one observatory plots representing local dominant vegetation type. For example, dominant vegetation type at Harvard Forest is deciduous forest. They are relatively small sites (40m*40m) to observe specific biolgoical and ecological processes. To extract data for the plots, we need to use the spaital information of the plots from the spatial data provided by NEON. 
#' 
#' By exploring the site information of Harvard Forest at https://www.neonscience.org/field-sites/field-sites-map/HARV. We determined two plots that we are interested in: HARV_036, and HARV_045.
#' 
#' 
## ---- results=F----------------------------------------------------------------------------------------------------------------------------------------
# read NEON plots shapefile
plot_polygons <- read_sf("09_data/All_NEON_TOS_Plots_V7/All_NEON_TOS_Plot_Polygons.shp",  quiet = T)
plot_polygons
colnames(plot_polygons)

# filter plot polygon for two representative plots: HARV_036 and HARV_045
HARV_plots=filter(plot_polygons,plotID %in% c("HARV_036","HARV_045"))

HARV_plots

#' 
#' 
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# convert datasets for visualization
HARV_plots=st_transform(HARV_plots,crs(harv_ch2017))


# plot the tile and the plots
ggplot(df_harv_ch2017)+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_gradientn(
    colours=c("orange","yellow","green","darkgreen"),
    name="Canopy Height")+
  geom_sf(data=HARV_plots,fill=NA,color="black")+
  xlab("longitude")+ylab("latitude")

#' 
#' To better examine the detailed information for the plots, we may extract the remote sensing data for the plot area.
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# Check plots information
HARV_plots$plotID

# crop canopy height data based on polygon of one plot
harv_036=crop(harv_ch2017,extent(HARV_plots[1,])) # first polygon element of HARV_plots is for HARV_036

harv_036

#' 
#' 
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# plot canopy height for HARV_036 plot area
df_harv_036=as.data.frame(harv_036,xy=T,long=T)

ggplot(df_harv_036)+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_gradientn(
    colours=c("orange","yellow","green","darkgreen"),
    name="Canopy Height")+
  coord_equal()

#' 
#' From the graph, you can see that this plot is 40 by 40 meters. Each pixel in the graph represent 1 by 1 meter area, as the spatial resolution of the data is 1 meter.
#' 
#' Now it's your turn to extract and graph the canopy height data for HARV_045 plot.
#' 
#' Write your code below:
#' 
#' ***
#' ***
#' ***
#' 
#' 
#' 
#' 
#' 
harv_045=crop(harv_ch2017,extent(HARV_plots[2,])) 

df_harv_045=as.data.frame(harv_045,xy=T,long=T)

ggplot(df_harv_045)+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_gradientn(
    colours=c("orange","yellow","green","darkgreen"),
    name="Canopy Height")+
  coord_equal()
#' 
#' 
#' 
#' ## Stack and manage data for multiple products and plots
#' 
#' As there are remote sensing data from 9 AOP products of 3 sites, data loading and processing can be inconvenient when we read each data to one object, which lead to too many data objects. Stacking similar data into a RasterStack or RasterBrick object with multiple raster layers would be more efficient for data management and processing. 
#' 
#' ### Create multi-layer data
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
#generate a list of input rasters that have the same extent and spaital resolution
#pattern = "*.tif$" - filters for main raster files 
files <- list.files("09_data/HARV_2017" , pattern = "*.tif$")
veg <- list.files("09_data/HARV_2017/VegIndices" , pattern = "*.tif$")   # Vegetation indices

#create a raster stack from the input raster files 
s1 <- raster::stack(paste0("09_data/HARV_2017/", files))
s2 <- raster::stack(paste0("09_data/HARV_2017/VegIndices/", veg))

# simplify the name for each layer
names=substr(names(s1),34,nchar(names(s1)))
names(s1)=names

names=substr(names(s2),34,nchar(names(s2)))
names(s2)=names


#' 
#' 
#' ### Export plot level data
#' 
#' As we have extracted data for observatory plots, we want to export the plot level data to a data file, so we can work on them in the future without extracting them again.
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
#write the output raster to a grid file
#although you can also output to a GeoTiff file, a grid file preserves the names for each layer, so when you read the data next time, you still know what data each layer is
writeRaster(s1, filename = "09_data/HARV_2017/HARV_2017_BioTopo.grd", format="raster", overwrite=TRUE)
writeRaster(s2, filename = "09_data/HARV_2017/HARV_2017_VegIndices.grd", format="raster", overwrite=TRUE)

#' 
## ---- results=F----------------------------------------------------------------------------------------------------------------------------------------
# you can test it using the code below
r=brick(("09_data/HARV_2017/HARV_2017_BioTopo.grd"))

r

#' 
#' 
#' Now it is your turn to stack all water indices layers to a RasterStack object, then export it to a grid file.
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# let's start from here
water <- list.files("09_data/HARV_2017/WaterIndices" , pattern = "*.tif$")   # Water indices

#' 
#' Write your code below:
#' 
#' ***
#' ***
#' ***
#' 
#' 
#' 
#' 
s3 <- raster::stack(paste0("09_data/HARV_2017/WaterIndices/", water))

names=substr(names(s3),34,nchar(names(s3)))
names(s3)=names

writeRaster(s3, filename = "09_data/HARV_2017/HARV_2017_WaterIndices.grd", format="raster", overwrite=TRUE)

#' 
#' While we have multi-layer remote sensing data, it is easier to extract plot level data for all layers from multiple data products
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# extract plot data
harv_036=crop(r,extent(HARV_plots[1,])) 
harv_045=crop(r,extent(HARV_plots[2,])) 

# check plot data
harv_036


#' 
#' ## Compare plots
#' 
#' Now we can dig details for the observatory plots. For example, although both the two plots represent deciduous forest dominant vegetation, are there any differences in their biological and ecological status? 
#' 
#' To visualize the differences between two plots, we need to graph them together using a same color scheme.
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# convert raster data to data frame, then combine data of two plots
df_plots=rbind(as.data.frame(harv_036,xy=T,long=T), as.data.frame(harv_045,xy=T,long=T))
head(df_plots)
dim(df_plots)

# add plotID column
df_plots$plotID=c(rep("HARV_036",40*40*6),rep("HARV_045",40*40*6))


#' 
## ---- warning=FALSE------------------------------------------------------------------------------------------------------------------------------------
# compare canopy height
ggplot(filter(df_plots,layer=="CHM"))+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_gradientn(
    colours=c("orange","yellow","green","darkgreen"),
    name="Canopy Height")+
  facet_wrap(~plotID, scale="free_x")

#' 
## ---- message=FALSE------------------------------------------------------------------------------------------------------------------------------------
# compare the histogram of canopy height
meanch=filter(df_plots,layer=="CHM") %>%
  group_by(plotID) %>% 
  summarise(mean(value)) %>% 
  as.data.frame()


ggplot(data=filter(df_plots,layer=="CHM"),aes(x=value, fill=plotID)) + 
  geom_histogram(position="identity",alpha=0.5)+
  # add mean value
  geom_vline(xintercept=as.numeric(meanch[,2]),linetype=2)+                           
  xlab("Canopy Height (m)")

#' 
#' 
#' 
## ---- warning=FALSE------------------------------------------------------------------------------------------------------------------------------------
# compare biomass
ggplot(filter(df_plots,layer=="Biomass"))+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_distiller( palette = "YlGn", direction = 1, name="Biomass")+ 
  facet_wrap(~plotID, scale="free_x")

#' 
#' 
## ---- message=FALSE------------------------------------------------------------------------------------------------------------------------------------
# compare the histogram of biomass
ggplot(data=filter(df_plots,layer=="Biomass"),aes(x=value, fill=plotID)) + 
  geom_histogram(position="identity",alpha=0.5)+
  xlab("Biomass")

#' 
#' 
#' What about the differences of all five water indices between two plots?
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
r1=brick("09_data/HARV_2017/HARV_2017_WaterIndices.grd")

# extract plot data
harv_036=crop(r1,extent(HARV_plots[1,])) 
harv_045=crop(r1,extent(HARV_plots[2,])) 


# convert raster data to data frame, then combine data of two plots
df_plots1=rbind(as.data.frame(harv_036,xy=T,long=T), as.data.frame(harv_045,xy=T,long=T))
head(df_plots1)
dim(df_plots1)

# add plotID column
df_plots1$plotID=c(rep("HARV_036",40*40*5),rep("HARV_045",40*40*5))

#' 
#' 
## ---- message=FALSE------------------------------------------------------------------------------------------------------------------------------------
# compare all five water indices
ggplot(df_plots1)+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_distiller( palette = "RdYlBu", direction = 1, name="Water Indices")+
  facet_grid(layer~plotID, scale="free_x")

#' 
#' 
## ---- message=FALSE------------------------------------------------------------------------------------------------------------------------------------
# compare all five water indices in histograms
ggplot(data=df_plots1,aes(x=value, fill=plotID)) + 
  geom_histogram(position="identity",alpha=0.5)+
  facet_wrap(~layer, scale="free_x")+
  theme(legend.position=c(0.8, 0.25))+
  xlab("Water Indices")

#' 
#' - Question: Can you do similar comparisons for Vegetation Indices?
#' 
#' - Question: Other than histograms, what else types of graphs can we use for comparisons between observatory plots?
#' 
#' 
#' ## Examine relationships
#' 
#' One of main goals in monitoring biological and ecological processes at NEON sites is to examine the relationships between environmental conditions and these processes, so that we can understand how enviornmental changes affect ecosystems and predict how ecosystems respond to future changes.
#' 
#' Here we would like to examine the relationships between topography and biological status of deciduous forests, such as canopy height, water content, etc.
#' 
#' We have seen the biomass and water indices of two plots. What about topography (e.g. elevation, slope and aspect)?
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# graph elevation, slope and aspect of two plots

# elevation
ggplot(filter(df_plots,layer =="DSM"))+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_gradientn( colours=terrain.colors(100),
    name="Elevation (m)")+
  facet_wrap(~plotID, scale="free_x")

#' 
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# slope
ggplot(filter(df_plots,layer =="slope"))+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_distiller( palette = "YlGnBu", direction = 1, name="slope")+ 
  facet_wrap(~plotID, scale="free_x")

#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# aspect
ggplot(filter(df_plots,layer =="aspect"))+
  geom_raster(aes(x=x,y=y,fill=value))+
  scale_fill_distiller( palette = "YlOrRd", direction = 1, name="aspect")+ 
  facet_wrap(~plotID, scale="free_x")

#' 
#' 
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# in order to make scatter plots, it would be better to convert the data frame from long format to wide format
df_plots_w=df_plots %>% 
  spread(layer, value)

# combine water indices 
df_plots_w=left_join(df_plots_w,spread(df_plots1,layer,value))

head(df_plots_w)

#' 
#' 
#' We can examine the relationships when the data frame is ready. As we have two plots at different locations, the site effect may lead to differences in the relationships. So we will group the data of two plots and use two different colors in scatter plots.
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# relationship between elevation and canopy height
ggplot(df_plots_w,aes(x=DSM, y=CHM,col=plotID))+
  geom_point()+
  geom_smooth(method="lm")+
  xlab("elevation")+ylab("canopy height")
  

#' 
#' - Question: What is the relationship between elevation and canopy height at the two plots of deciduous forest? Any difference in relationship between two plots?
#' 
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
# relationship between elevation and biomass
ggplot(df_plots_w,aes(x=DSM, y=Biomass,col=plotID))+
  geom_point()+
  geom_smooth(method="lm")+
  xlab("elevation")+ylab("biomass")
  

#' 
#' 
#' - Question: What is the relationship between elevation and biomass at the two plots of deciduous forest? Any difference in relationship between two plots?
#' 
#' 
#' To quantify the relationships, you may develop simple linear regression to see the coefficient, R-square value and p-value.
#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
lm1=lm(CHM~DSM,data=filter(df_plots_w,plotID=="HARV_036"))
summary(lm1)

#' 
## ------------------------------------------------------------------------------------------------------------------------------------------------------
lm2=lm(CHM~DSM,data=filter(df_plots_w,plotID=="HARV_045"))
summary(lm2)

#' 
#' 
#' It is your turn to examine the relationship between elevation and NDWI water indices. What do you find?
#' 
#' Write your code below:
#' 
#' ***
#' ***
#' ***
#' 
#' 
#' 
#' 
# relationship between elevation and NDWI
ggplot(df_plots_w,aes(x=DSM, y=NDWI,col=plotID))+
  geom_point()+
  geom_smooth(method="lm")+
  xlab("elevation")+ylab("NDWI")

lm3=lm(NDWI~DSM, data=df_plots_w)
summary(lm3)

#' 
#' 
#' 
#' If you are curious, you may also examine relationships between any other variables. It would be better to have a hypothesis, so that you can test your hypothesis during examination. Be careful when you make conclusions of what you find, as correlation does not necessarily mean causation.
#' 
#'