---
title: 'Lab 6 Overview: Small Mammals Data Management and Analysis'
author: "Marguerite Mauritz for Ecosystem Ecology BIOL 4466/5301"
date: "10/13/2020"
output:
  html_document:
     theme: spacelab 
     toc: true
     toc_depth: 2
     number_sections: true
     toc_float:
       collapsed: true
       smooth_scroll: true
---

# Instructions:  

* Watch the two Data in Ecology videos from Sarah McCord - this will give useful bigger picture background.  
* Download data and metadata files, Lab6_Description.html, Lab6_Working.Rmd and put them in a Lab6 folder.  
* Read through the Lab6_Description.html
* The Lab6_Working.Rmd has all the background removed, and you will create some of your own chunks to read in the data, check the files, and make some graphs. The Lab6_Working.Rmd has minimal code; if you get stuck, the background file contains code.  
* This lab includes: 
    + reflection on data collection  
    + organising data from raw to analysis-ready in Excel and R  
    + working with NEON metadata, protocols, and data to understand a dataset  
    + making graphs with small mammal data  

# Introduction

Data skills are at the core of ecological research and many other professions. The data we collect ourselves or gather from public data repositories, networks, and other sources often requires organisation before it can be analysed. Knowledge of how to structure data and what types of data formats you might encounter is a crucial skill! This lab builds skills for data management, spreadsheet organisation, and analysis. In this lab you will examine the occurrence of small mammals (ANIMALS!) in two habitat types using mark-recapture data from the National Ecological Observatory Network (NEON).

# NEON

The National Ecological Observatory Network is a US-wide project funded by the National Science Foundation (NSF) to collect thirty years of key ecological data across the major ecosystem types in the US. The purpose of NEON is to provide standardized data to the scientific community, on a subset of important ecological indicators. All data is available to the public and NEON has invested in tools and training to develop essential skills for large-scale and integrated ecological studies. Learn more at the [NEON website](https://www.neonscience.org) and in this [intro video](https://www.youtube.com/watch?v=39YrzpxVRF8&feature=youtu.be).  

# Small Mammal Research

Small mammals are widespread and important parts of ecosystems, even though we often think of them as pests when they live unchecked in human environments. Ecologically, small mammals play a role as grazers, predators, and insectivores. They can keep pest populations in check by eating grubs, they cycle nutrients, and they disperse seeds. They are also an essential foundation of many food webs as a food source for many larger animals: hawks, owls, snakes, foxes, coyotes, wolves. Small mammals have rapid lifecycles and respond quickly to environmental changes, making them useful indicators of general ecosystem health. Small mammals are also an important link in the zoonotic disease cycle, so their health and pathogen load have implications for many other animals. These two videos discuss trapping methods and reasons to care about small mammals.  

* National Park Service. From Field to Lab: [Small Mammal Monitoring in Denali National Park:](https://youtu.be/KvGvS8pApFE) (1:32 - 2:30 highlights small mammal trapping/handling techniques)

* University of Oxford. [The Laboratory with Leaves (Part 10): Small Mammals:](https://youtu.be/bIjva3pa2YA) (This video provides context for why small mammal monitoring is important to ecology in general).

**Don't try this at home. Animal research is always done with animal welfare review and permission to ensure safe and humane treatment of animals and to minimize harm to the lowest possible extent necessary for the research!**  

# Core Skills

This lab focuses on skills for:  

* Spreadsheet organisation for efficient analysis and data transfer from field to analysis  
* Importing data to R, conducting quality control, and preparing data for analysis  
* Conducting analysis using a large, open-access dataset to address a specific research question  

# R Skills

* Build on dplyr, ggplot2, lubridate skills  
* Create your own R chunks using skills developed in previous labs  
* Import data  
* Use R for data QC and checking formats  
* Make graphs to illustrate results  

# Metadata

Metadata is the data that explains the data. Metadata is critical to well documented data sets, to enhance the shareability, inter-operability, longevity, and preservation of the data.  
In short, metadata allows someone else who has never worked with your dataset to understand the data without needing you to be there.  
Good metadata describes what each data column contains, what abbreviations stand for, measurement units, how missing data values are recorded, the time interval and location of measurement, the methods for data collection, instrument calibration or accuracy, who collected the data, and whether data collection is ongoing or completed.  
![Ideal Data from Sarah McCord](Pic_DataIdeal.png)

# Adaptation

The data is based on McNeil and Jones [^1] and the data skills build on a Data Carpentry module by Bahlai and Teal [^2]. This lab has been adapted for R from Hernández-Pacheco 2018 [^3].

# The Data Sets

## Data Carpentry:
**small_mammal_community.xls** – Subset of the small mammal data from southern Arizona addressing the effects of rodents and ants on the plant community. This .xls contains two years of small mammal community data with multiple data table formats. This file should be used to identify common errors in formatting data tables and re-organize the data as recommended.  
 
## The National Ecological Observatory Network: 
**Abbreviated NEON Small Mammal Trapping Protocols.docx** - An abbreviated version of the Small Mammal Trapping Protocol to highlight the methods used to trap, record, mark, and release the animals.  
**NEONSmallMammal_SCBI_BlankDataSheet.pdf** - Field data collection sheet.    
**NEON.D02.SCBI.DP1.10072.001_variables.csv** – Metadata file for NEON small mammal data (DP1.10072.001) describing the variable names.  
**NEON.D02.SCBI.DP1.10072.001.readme.txt** – Metadata file for the NEON small mammal data (DP1.10072.001) providing more information on the data product.  
**NEON.D02.SCBI.DP1.10072.001.mam_pertrapnight.072014to052015.csv** – This file is a NEON small mammal trapping data file from July 2014 to May 2015 at the SCBI Site which can be found on the NEON Field Sites list and [here](https://www.neonscience.org/field-sites/field-sites-map/SCBI). 

# Exercise 1: Reflection

* If you have collected data before, how was it collected? (Eg: Paper sheets, Tablet, Data Logger)  
* How was that data transferred to a more permanent format?  
* Has anyone taught or trained you how to collect data?  
* If you have set up your own data collection sheets, how did you 'know' what to do? (Eg: followed demonstration by someone else, got given a protocol, lab has an existing procedure, discussion with others)  
* Does your data have a back-up system in place?  
* Have you ever archived data in a public database? Do you plan to if you are currently collecting data? Did you know that *any* data collector affiliated with a scientific project can archive their data?  
* If you were to use data from a public database, what are a few minimum things you would need to know before trusting the data?  

# Exercise 2: Format data in spreadsheets for effective use (in Excel)

![Data Transfer](Pic_Data.png)
**Do the following:**  

* Open the survey data file on the small mammal community in southern Arizona in Excel  
* You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014. Now you’re the person in charge of this project and you want to be able to start analyzing the data  
* Identify what is wrong with these spreadsheets.  
* Discuss the steps you would need to take to clean up the 2013 and 2014 tabs, and to put them all together in one spreadsheet.  
* Create a header for this new spreadsheet and organize the data following the discussed rules.  
* Discuss what type each column is: numeric (dbl in dplyr), integer (int in dplyr), or character string (chr in dplyr)?  

**Important:** Do not forget the first piece of advice: create a new file (or tab) for the cleaned data, never modify your original (raw) data.  
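
To make the column types above concrete, here is a minimal sketch using a toy table. The values and column names are invented for illustration and are not taken from the Arizona data file.

```r
library(dplyr)

# Toy data frame with invented values; column names are hypothetical
toy_types <- tibble(
  Plot = c(1L, 2L, 3L),               # integer: whole-number IDs
  Weight_grams = c(40.5, 38.2, NA),   # double: measurements with decimals
  Species = c("DM", "DO", "DM")       # character: text labels
)

# glimpse() prints <int>, <dbl>, <chr> next to each column
glimpse(toy_types)

# sapply() with class() gives the same information in base R
col_classes <- sapply(toy_types, class)
col_classes
```

Deciding on one type per column before you import makes it much easier to spot problems later, for example a weight column that arrives as character because one cell contains text.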

# Exercise 3: Carry out basic quality control and assurance on your spreadsheet (in R)

*Quality Assurance* refers to techniques and processes that ensure data are collected in a correct way  

*Quality Control* refers to techniques and processes that ensure the collected data are up to standards and good for analysis  

![Data Quality Assurance and Quality Control Cycle by Sarah McCord](Pic_McCord_DataCycle.png)

### You will create your own R chunks
### Load libraries: `readxl`, `dplyr`,`ggplot2`,`lubridate`
```{r, message=FALSE}
library(readxl)
library(dplyr)
library(ggplot2)
library(lubridate)
```

### Import the data

* Call it dat.format  
* If using `read.csv()` then specify NA with `na.strings=` and if using `read_excel()` then specify with `na=`. Remember to put a character in quotes, eg: `na="NA"`, or `na.strings="NA"`  
* NOTE: `read_excel()` will format the date column to a date format automatically. If the dates are in mm/dd/yy format in Excel, they should be interpreted correctly when importing.   

```{r}
dat.format <- read_excel("formatted_small_mammal_comm-1707.xlsx", na="NA")
dat.format %>% glimpse
```
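
As a rough illustration of what the NA argument does, the sketch below reads a tiny CSV written inline. The values are invented, and a character column is used on purpose because blank cells in numeric columns are read as NA regardless of `na.strings=`.

```r
# A tiny CSV written inline (invented values): Sex is "M", "NA", or blank
csv_text <- "Species,Sex
DM,M
DO,NA
PP,
"

# Only the string "NA" is treated as missing; the blank cell stays an empty string
toy_na1 <- read.csv(text = csv_text, na.strings = "NA")

# Both "NA" and blank cells are treated as missing
toy_na2 <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Compare how many missing values each import found
n_missing1 <- sum(is.na(toy_na1$Sex))
n_missing2 <- sum(is.na(toy_na2$Sex))
c(n_missing1, n_missing2)
```

Checking the number of NA values right after import is a quick way to confirm that the missing-value codes in your file were recognised.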

### Is the imported data correct and ready for analysis?

* Use glimpse to check on the data  
* Use `arrange()` in a pipe to sort the data by the Weight_grams column  
* The `summary()` command is also very handy for quick data checks. Use `summary()` on the data  
* Look at all the output in the Rmd work space, notice anything strange?  
```{r}
dat.format %>% arrange(Weight_grams)
```

```{r}
summary(dat.format)
```

**Answer:**  

* What are the date max and min?  
* Do the Plot IDs make sense?  
* The weights?    
* Does the output match your expectations? Why or why not? 

### Change the character format to a factor:  
*Notice that Species, Sex, and Calibrated_Scale are characters and so summary doesn't really give us any information other than the class and length.*

* Use `as.factor(VARIABLE)`, `mutate()` and a pipe to change the column formats of Species, Sex, Calibrated_Scale; keep the column names the same    
* Remember that to retain the changes in the dataframe you need to assign them to the dataframe (ie: dat.format <- )   
* Use `summary(dat.format)` again  

**Answer:**  

* What additional information do you get from changing Species, Sex, and Calibrated_Scale to a factor?  

```{r}
dat.format <- dat.format %>%
  mutate (Species = factor(Species),
          Sex = factor(Sex),
          Calibrated_Scale = factor(Calibrated_Scale))

summary(dat.format)
```

### Another way to view factor information. 
* When a factor column contains many levels, it can also be useful to use the `levels()` command. The `levels()` command is from base R and expects a single column (a vector) rather than a whole dataframe, so it cannot be applied directly to the dataframe in a pipe. Instead, reference the intended column using the data$column syntax. Look at the levels like this:
```{r, echo=TRUE, eval=FALSE}
levels(dat.format$Species)
```
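
Because `levels()` needs a single column, one pipe-friendly option is to extract the column first with `pull()` from dplyr. The sketch below uses a toy factor with invented species codes, not the lab data.

```r
library(dplyr)

# Toy data with invented species codes
toy_species <- tibble(Species = factor(c("DM", "DO", "DM", "PP")))

# Base R: reference the column with $
levels(toy_species$Species)

# Pipe-friendly equivalent: pull() extracts the column as a vector first
species_levels <- toy_species %>% pull(Species) %>% levels()
species_levels
```

Both versions return the same character vector of levels, sorted alphabetically by default.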

### Graph the distribution of weights by species  
**If everything looks good, great! If you noticed errors then you would have to either fix them in the clean data sheet (NOT THE RAW) or come up with code to do it in R.**  

* Use `geom_boxplot()` to graph the weights for each species captured  
* NO CODE?! You can do it. Think about what should be in the x- and y-axis and then add a geom_boxplot() instead of geom_point() or geom_line() like we've done before. [Look here to see how boxplot works](https://ggplot2.tidyverse.org/reference/geom_boxplot.html)  
* Recall that a boxplot shows the median, lower 25th and upper 75th percentile of the data  
* Don't forget titles and axis labels.  
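
If you want to see the general pattern before writing your own chunk, here is a minimal sketch on a different dataset: `PlantGrowth`, which ships with R. It is not the small mammal data, so you still need to choose your own x and y columns and labels.

```r
library(ggplot2)

# PlantGrowth ships with R: dried plant weight for three treatment groups
p_box <- ggplot(PlantGrowth, aes(x = group, y = weight)) +
  geom_boxplot() +
  labs(title = "Plant weight by treatment group",
       x = "Treatment group", y = "Dried weight")
p_box

# The box edges and middle line correspond approximately to these quantiles
box_stats <- tapply(PlantGrowth$weight, PlantGrowth$group,
                    quantile, probs = c(0.25, 0.5, 0.75))
box_stats
```

The categorical variable goes on the x-axis and the measured variable on the y-axis. The quantiles printed by `box_stats` let you check the box positions against the numbers.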

**Answer:**  

* Briefly describe the weights of each species. Which species are most variable, which are the heaviest and lightest?  
* How is your ability to make conclusions affected by the amount of metadata given with this data file?  

# Exercise 4: Use NEON data to examine small mammal species patterns by habitat type

## Explore Data Features

### Get oriented to the data

* Read the data collection protocol (abbreviated version)  
* Look at the field data collection sheet  
* Look at the metadata file  
* Look at the variable descriptions  

**Answer the following:**

* How long are the traps deployed during each sampling bout?  
* How often are the traps checked during each sampling bout?  
* What are three pieces of information that get recorded for each animal captured?  
* With the combination of protocols, field collection sheet, and metadata of the digital files do you feel you have a good enough basis to join the field sampling team and help them collect data? Why or why not? (*Of course in reality you'd also need some training on how to safely handle small mammals because they BITE*)  
* What do you find helpful or confusing about the metadata files (eg: variable descriptions, units, enough background)?  

## Use the data to examine small mammal species distributions by habitat type
### Import data
* Use `read.csv()` with `na.strings=c("","NA")` to tell R that cells with 'NA' or blank cells should all be read as NA  
* Call it dat.neon  
```{r}
dat.neon <- read.csv("NEON.D02.SCBI.DP1.10072.001.mam_pertrapnight.072014to052015.csv",
                     na.strings=c("","NA"))

dat.neon %>% glimpse
```

### Convert date to useable date format  
* Look at the format of collectDate and determine a suitable lubridate package function: `ymd()`, `mdy()`, `dmy()`?   
* Use `mutate()` to create a NEW date column, don't overwrite collectDate  
* In the same `mutate()` also create a month and year column from date (hint: we've used the functions called `month()` and `year()` before to extract that information from a date-object formatted date column)  
* Use `glimpse()` to assure yourself it worked (check column type and glimpse the new column content)  
```{r}
dat.neon <- dat.neon %>%
  mutate(date = mdy(collectDate),
         month = month(date),
         year = year(date))

dat.neon %>% glimpse

```
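
The choice of lubridate function depends on the order of the date parts in the text. The sketch below uses invented date strings (not values from collectDate) to show how the three parsers differ and what `month()` and `year()` return.

```r
library(lubridate)

# Invented date strings in three common layouts, all meaning 1 July 2014
d_mdy <- mdy("07/01/2014")   # month/day/year
d_ymd <- ymd("2014-07-01")   # year-month-day
d_dmy <- dmy("01/07/2014")   # day/month/year

# All three parse to the same Date
d_mdy == d_ymd & d_ymd == d_dmy

# month() and year() extract components from a Date
month(d_mdy)
year(d_mdy)

# The wrong parser can silently swap day and month
wrong <- mdy("01/07/2014")   # read as 7 January 2014, not 1 July
month(wrong)
```

This is why it is worth using `glimpse()` or `summary()` on the new date column: a wrong parser does not always produce an error.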


### Do the traps always catch a mammal?
* Look at the levels of trapStatus to see the conditions in which a trap can be found in the morning  

```{r}
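# factor() makes levels() work even if read.csv() imported trapStatus as character (the default since R 4.0)
levels(factor(dat.neon$trapStatus))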
```

### Filter the dat.neon data to contain only successful captures
* Create a new object dat.neon.filter so we retain the original too  
* Use `filter()` to select only the trapStatus '5 - capture'  
* Proceed with the filtered data!  
```{r}
dat.neon.filter <- dat.neon %>%
  filter(trapStatus == "5 - capture")
```

### Graphically examine the land cover types (nlcdClass) that were sampled:
```{r, echo=TRUE, eval=TRUE}

ggplot(dat.neon.filter,aes(decimalLongitude,decimalLatitude,colour=factor(nlcdClass)))+
  geom_label(aes(label=plotID))+
  labs(title="Plot Sampling by Land Cover",x="Longitude (decimal degrees)",y="Latitude (decimal degrees)")+
  xlim(-78.2,-78.1)

```

### What species were captured?
* Use `levels()` to display the scientific names of all sampled species.  
* Use `levels()` to also display the four letter abbreviation (taxonID)  

**Answer:**  

* Which species is denoted by each four letter abbreviation?  
* What is the common name of each species?  

```{r}
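# factor() makes levels() work even if read.csv() imported these columns as character (the default since R 4.0)
levels(factor(dat.neon.filter$scientificName))
levels(factor(dat.neon.filter$taxonID))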
```

### What dates were sampled?
* `levels()` does not work for date objects, but you can use `unique()` the same way.  

**Answer:**  

* Write the range of days for each sample year and month.

```{r}
unique(dat.neon$date)
```

### Select only unique individuals for each sample month
In two steps:  

1. Create dat.catch:  
    + Use `filter()` to remove all tagIDs that are NA. (hint: `is.na()` allows you to search for NA values in R and ! indicates 'not' so `!is.na()` can be used to exclude NA values)  
    + Then use `distinct(year,month,plotID,tagID,.keep_all=TRUE)` to remove all individuals that were recaptured in each sample month and at each plot  
    + Name this filtered dataframe dat.catch  
2. Create dat.numbers:  
    + Use dat.catch to calculate the number of unique individuals caught for each year, month, taxonID, and nlcdClass  
    + Use `group_by()` and `summarise(count=n())` to calculate the number of individuals. The parentheses in `n()` stay empty.    
    + Call this summarised dataframe dat.numbers  

```{r}
# remove tagIDs with NA and select only distinct captures in each year, month, plotID, and tag ID. Keep all data columns. 
dat.catch <- dat.neon.filter %>% 
  filter(!is.na(tagID))%>%
  distinct(year, month, plotID, tagID, .keep_all=TRUE)

# Calculate the number of unique individuals captured for each year, month, species, and habitat
dat.numbers <- dat.catch %>%
  group_by(year,month,taxonID,nlcdClass) %>%
  summarise(count=n())
```
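
To see what the two steps do, here is a minimal sketch on a toy table. All values, tag IDs, and taxon codes are invented, and the `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Invented capture records: tag "A" is caught twice in July and again in August
toy_caps <- tibble(
  year    = c(2014, 2014, 2014, 2014, 2014),
  month   = c(7, 7, 7, 8, 8),
  plotID  = c("P1", "P1", "P1", "P1", "P1"),
  taxonID = c("SPA", "SPA", "SPB", "SPA", NA),
  tagID   = c("A", "A", "B", "A", NA)
)

# Step 1: drop untagged rows, then keep one row per individual per month and plot
toy_catch <- toy_caps %>%
  filter(!is.na(tagID)) %>%
  distinct(year, month, plotID, tagID, .keep_all = TRUE)

# Step 2: count unique individuals per year, month, and species
toy_numbers <- toy_catch %>%
  group_by(year, month, taxonID) %>%
  summarise(count = n(), .groups = "drop")
toy_numbers

# For comparison: distinct() on tagID alone keeps "A" once overall,
# so its August capture would be lost
n_tag_only <- toy_caps %>%
  filter(!is.na(tagID)) %>%
  distinct(tagID, .keep_all = TRUE) %>%
  nrow()
n_tag_only
```

Tag "A" is counted once in July and once in August, which is what we want for monthly abundance. Including year, month, and plotID in `distinct()` is what preserves the August record.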


### Graph species abundance by habitat type
* Use dat.numbers to graph the number of unique individuals observed in each month  
* Use `geom_col(position="dodge",width=0.5)` to make a bar graph  
* Use `facet_grid()` to show the habitat types in rows and the two sample years in columns  
```{r}
ggplot(dat.numbers, aes(month, count, fill=taxonID))+
  geom_col(position="dodge",width=0.5)+
  facet_grid(nlcdClass~year)+
  labs(title="Species abundance by habitat type at NEON SCBI",x="Month",y="Count of unique individuals")
```


* You can add `scales="free_y"` inside the `facet_grid()` command to allow the y-axes to scale to the maximum value in the figures. Just be aware that the y-axes of the panels then differ, which can be deceptive at a quick glance! 

```{r}
ggplot(dat.numbers, aes(month, count, fill=taxonID))+
  geom_col(position="dodge",width=0.5)+
  facet_grid(nlcdClass~year, scales="free_y")+
  labs(title="Species abundance by habitat type at NEON SCBI (note y-axis scale)",x="Month",y="Count of unique individuals")
```

**Answer:**  

* Why did we have to do the filtering step in dat.catch? (*Hint: think about how our species count would be affected by recapturing the same individual over multiple sampling days.*)  
* Describe the species distributions, patterns, and abundance by habitat type.  
* Which habitat type has higher species diversity?  
* What time periods had the highest numbers of individuals captured?  
* Why might it be deceptive to compare data across two graphs that have different axis scales?  
* When might it be beneficial to compare data across two graphs that have different axis scales?


# References
[^1]: McNeil, J., Jones, M. A. (2018). Data Management using NEON Small Mammal Data with Accompanying Lesson on Mark Recapture Analysis. NEON - National Ecological Observatory Network, QUBES. doi:10.25334/Q4XH5S

[^2]: Christie Bahlai and Tracy Teal (eds): “Data Carpentry: Data Organization in Spreadsheets Ecology Lesson.” Version 2017.04.0, April 2017, http://www.datacarpentry.org/spreadsheet-ecology-lesson/, https://doi.org/10.5281/zenodo.570047

[^3]: Hernández-Pacheco, R. H. (2018). More In Depth Spreadsheet Management Adaptation of Data Management using NEON Small Mammal Data. NEON Faculty Mentoring Network, QUBES Educational Resources. doi:10.25334/Q44X4D