---
title: 'Lab 6: Small Mammals Data Management and Analysis'
author: "YOUR NAME for Ecosystem Ecology BIOL 4466/5301"
date: "10/13/2020"
output:
  html_document:
     theme: spacelab 
     toc: true
     toc_depth: 2
     number_sections: true
     toc_float:
       collapsed: true
       smooth_scroll: true
---

# Instructions:  

* Watch the two Data in Ecology videos from Sarah McCord - this will give useful bigger picture background.  
* Download data and metadata files, Lab6_Description.html, Lab6_Working.Rmd and put them in a Lab6 folder.  
* Read through the Lab6_Description.html.  
* The Lab6_Working.Rmd has all the background removed, and you will create some of your own chunks to read in the data, check the files, and make some graphs. The Lab6_Working.Rmd has minimal code; if you get stuck, the background file contains code.  
* This lab includes: 
    + reflection on data collection  
    + organising data from raw to analysis-ready in Excel and R  
    + working with NEON metadata, protocols, and data to understand a dataset  
    + making graphs with small mammal data  


# Core Skills

This lab focuses on skills for:  

* Spreadsheet organisation for efficient analysis and data transfer from field to analysis  
* Importing data into R, conducting quality control, and preparing data for analysis  
* Conducting analysis using a large, open-access dataset to address a specific research question  

# R Skills

* Build on dplyr, ggplot2, lubridate skills  
* Create your own R chunks using skills developed in previous labs  
* Import data  
* Use R for data QC and checking formats  
* Make graphs to illustrate results  


# Adaptation

The data is based on McNeil and Jones [^1] and the data skills build on a Data Carpentry module by Bahlai and Teal [^2]. This lab has been adapted for R from Hernández-Pacheco 2018 [^3].

# The Data Sets

## Data Carpentry:
**small_mammal_community.xls** – Subset of the small mammal data from southern Arizona addressing the effects of rodents and ants on the plant community. This .xls contains two years of small mammal community data with multiple data table formats. This file should be used to identify common errors in formatting data tables and re-organize the data as recommended.  
 
## The National Ecological Observatory Network: 
**Abbreviated NEON Small Mammal Trapping Protocols.docx** - An abbreviated version of the Small Mammal Trapping Protocol to highlight the methods used to trap, record, mark, and release the animals.  
**NEONSmallMammal_SCBI_BlankDataSheet.pdf** - Field data collection sheet.    
**NEON.D02.SCBI.DP1.10072.001_variables.csv** – Metadata file for NEON small mammal data (DP1.10072.001) describing the variable names.  
**NEON.D02.SCBI.DP1.10072.001.readme.txt** – Metadata file for the NEON small mammal data (DP1.10072.001) providing more information on the data product.  
**NEON.D02.SCBI.DP1.10072.001.mam_pertrapnight.072014to052015.csv** – This file is a NEON small mammal trapping data file from July 2014 to May 2015 at the SCBI Site which can be found on the NEON Field Sites list and [here](https://www.neonscience.org/field-sites/field-sites-map/SCBI). 

# Exercise 1: Reflection

* If you have collected data before, how was it collected? (Eg: Paper sheets, Tablet, Data Logger)  
* How was that data transferred to a more permanent format?  
* Has anyone taught or trained you how to collect data?  
* If you have set up your own data collection sheets, how did you 'know' what to do? (Eg: followed demonstration by someone else, got given a protocol, lab has an existing procedure, discussion with others)  
* Does your data have a back-up system in place?  
* Have you ever archived data in a public database? Do you plan to if you are currently collecting data? Did you know that *any* data collector affiliated with a scientific project can archive their data?  
* If you were to use data from a public database, what are a few minimum things you would want to feel certain about?  

# Exercise 2: Format data in spreadsheets for effective use (in Excel)

**Do the following:**  

* Open the survey data file on the small mammal community in southern Arizona in Excel  
* You can see that there are two tabs. Two field assistants conducted the surveys, one in 2013 and one in 2014. Now you’re the person in charge of this project and you want to be able to start analyzing the data  
* Identify what is wrong with these spreadsheets.  
* Discuss the steps you would need to take to clean up the 2013 and 2014 tabs, and to put them all together in one spreadsheet.  
* Create a header for this new spreadsheet and organize the data following the discussed rules.  
* Discuss what type each column is: numeric (dbl in dplyr), integer (int in dplyr), or character string (chr in dplyr)?  

**Important:** Do not forget the first piece of advice: create a new file (or tab) for the cleaned data, never modify your original (raw) data.  

# Exercise 3: Carry out basic quality control and assurance on your spreadsheet (in R)

*Quality Assurance* refers to techniques and processes that ensure data are collected correctly  

*Quality Control* refers to techniques and processes that ensure the collected data are up to standard and suitable for analysis  


### You will create your own R chunks
### Load libraries: `readxl`, `dplyr`,`ggplot2`,`lubridate`
```{r, load libraries, message=FALSE}

```

### Import the data

* Call it dat.format  
* If using `read.csv()`, specify the NA values with `na.strings=`; if using `read_excel()`, specify them with `na=`. Remember to put the character string in quotes, eg: `na="NA"` or `na.strings="NA"`  
* NOTE: `read_excel()` converts date columns to a date format automatically. If the dates are stored as dates in Excel (eg: in mm/dd/yy format), they should be interpreted correctly when importing.  
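
As a rough illustration of how the NA argument behaves, the sketch below reads a small made-up table from text instead of a file. The column names and values are hypothetical stand-ins, not the lab data, and `"NULL"` is just an example of a placeholder a field sheet might contain.

```r
# Toy CSV text standing in for a real file (hypothetical values)
csv_text <- "Plot,Species,Weight_grams
1,DM,40
2,PF,
3,DM,NULL"

# na.strings lists every string that should be read as NA
toy <- read.csv(text = csv_text, na.strings = c("", "NA", "NULL"))

str(toy)                      # Weight_grams is read as a number, not text
sum(is.na(toy$Weight_grams))  # counts the missing weights
```

Without `"NULL"` in `na.strings`, the Weight_grams column would be read as character, which is one of the things a quick check of the column types can reveal.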

```{r, import Arizona small mammal data}
dat.format <- 

```

### Is the imported data correct and ready for analysis?

* Use `glimpse()` to check on the data  
* Use `arrange()` in a pipe to sort the data by the Weight_grams column  
* The `summary()` command is also very handy for quick data checks. Use `summary()` on the data  
* Look at all the output in the Rmd work space, notice anything strange?  
```{r, Arizona mammals: glimpse data and then view arranged by weight}

```

```{r, Arizona small mammals: use summary to check the data}

```

**Answer:**  

* What are the date max and min?  
* Do the Plot IDs make sense?  
* The weights?    
* Does the output match your expectations? Why or why not? 

### Change the character format to a factor:  
*Notice that Species, Sex, and Calibrated_Scale are characters, so summary doesn't really give us any information other than the class and length.*

* Use `as.factor(VARIABLE)`, `mutate()` and a pipe to change the column formats of Species, Sex, and Calibrated_Scale; keep the column names the same  
* Remember that to retain the changes in the dataframe you need to assign them to the dataframe (ie: `dat.format <- `)  
* Use `summary(dat.format)` again to see what changed  

**Answer:**  

* What additional information do you get from changing Species, Sex, and Calibrated_Scale to a factor?  
```{r, Arizona small mammals: change variables to factors}
dat.format <- dat.format %>%
  
  
```
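
The pattern described above can be sketched on a small made-up dataframe. The object `toy` and its values are hypothetical; only the column names mirror the ones used in this lab.

```r
library(dplyr)

# Hypothetical toy data, not the Arizona dataset
toy <- data.frame(
  Species = c("DM", "PF", "DM", "DO"),
  Sex = c("F", "M", "M", "F"),
  Weight_grams = c(40, 8, 44, 52),
  stringsAsFactors = FALSE
)

summary(toy$Species)  # character column: only length, class, and mode

# Overwrite each column with its factor version, keeping the same names
toy <- toy %>%
  mutate(Species = as.factor(Species),
         Sex = as.factor(Sex))

summary(toy$Species)  # factor column: number of rows per level
```

Because the result is assigned back to `toy`, the change is kept in the dataframe rather than only printed.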

### Another way to view factor information. 
* When a factor column contains many levels, it can also be useful to use the `levels()` command. The `levels()` command is from base R and expects a single column (a vector) rather than a whole dataframe, so it is not used in a pipe the way dplyr verbs are. Instead, reference the intended column using the `data$column` syntax. Look at the levels like this:
```{r, Arizona mammals: check levels}

```
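
A minimal sketch of the `data$column` syntax, using a hypothetical dataframe `toy` with made-up species codes:

```r
# Hypothetical toy data with one factor column and one character column
toy <- data.frame(Species = factor(c("DM", "PF", "DM", "DO")),
                  Sex = c("F", "M", "M", "F"),
                  stringsAsFactors = FALSE)

levels(toy$Species)   # the distinct categories, in alphabetical order
nlevels(toy$Species)  # how many categories there are
levels(toy$Sex)       # NULL, because Sex is still a character column
```

The last line shows why the conversion to a factor matters: `levels()` has nothing to report for a plain character column.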

### Graph the distribution of weights by species  
**If everything looks good, great! If you noticed errors then you would have to either fix them in the clean data sheet (NOT THE RAW) or come up with code to do it in R.**  

* Use `geom_boxplot()` to graph the weights for each species captured  
* NO CODE?! You can do it. Think about what should be on the x- and y-axis and then add a `geom_boxplot()` instead of `geom_point()` or `geom_line()` like we've done before. [Look here to see how boxplot works](https://ggplot2.tidyverse.org/reference/geom_boxplot.html)  
* Recall that a boxplot shows the median, the 25th percentile (lower edge of the box), and the 75th percentile (upper edge of the box) of the data  
* Don't forget titles and axis labels.  

```{r, Arizona small mammals: boxplots}

```
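
To connect the box to the numbers it summarises, the sketch below computes the three percentiles for one made-up set of weights with base R's `quantile()`. The vector `toy_weights` is hypothetical and stands in for the weights of a single species.

```r
# Hypothetical weights (grams) for one species
toy_weights <- c(40, 42, 44, 47, 51)

# 25th percentile, median, and 75th percentile: the three lines of the box
box_values <- quantile(toy_weights, probs = c(0.25, 0.5, 0.75))
box_values
```

A tall box means the middle half of the weights is spread out, which is one way to judge which species are most variable.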

**Answer:**  

* Briefly describe the weights of each species. Which species are most variable, which are the heaviest and lightest?  
* How is your ability to make conclusions affected by the amount of metadata given with this data file?  

# Exercise 4: Use NEON data to examine small mammal species patterns by habitat type

## Explore Data Features

### Get oriented to the data

* Read the data collection protocol (abbreviated version)  
* Look at the field data collection sheet  
* Look at the metadata file  
* Look at the variable descriptions  

**Answer the following:**

* How long are the traps deployed during each sampling bout?  
* How often are the traps checked during each sampling bout?  
* What are three pieces of information that get recorded for each animal captured?  
* With the combination of protocols, field collection sheet, and metadata of the digital files do you feel you have a good enough basis to join the field sampling team and help them collect data? Why or why not? (*Of course in reality you'd also need some training on how to safely handle small mammals because they BITE*)  
* What do you find helpful or confusing about the metadata files (eg: variable descriptions, units, enough background)?  

## Use the data to examine small mammal species distributions by habitat type
### Import data
* Use `read.csv()` with `na.strings=c("","NA")` to tell R that cells with 'NA' or blank cells should all be read as NA  
* Use `glimpse()` to check the import  
* Call it dat.neon  
```{r, read in NEON data}
dat.neon <- 

dat.neon %>% glimpse
```

### Convert date to useable date format  
* Look at the format of collectDate and determine a suitable lubridate package function: `ymd()`, `mdy()`, `dmy()`?  
* Use `mutate()` to create a NEW date column, don't overwrite collectDate  
* In the same `mutate()` also create a month and year column from date (hint: we've used the functions called month() and year() before to extract that information from a date-object formatted date column)  
* Use `glimpse()` to assure yourself it worked (check column type and glimpse the new column content)  
```{r, NEON convert to date format}
dat.neon <- dat.neon %>%
  

```
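
The sketch below shows the general shape of this step on made-up dates. It assumes year-month-day text purely for illustration; check the actual format of collectDate before choosing between `ymd()`, `mdy()`, and `dmy()`. The dataframe `toy` is hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; the date format here is an assumption
toy <- data.frame(collectDate = c("2014-07-15", "2014-08-02", "2015-05-20"),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(date = ymd(collectDate),  # new Date column, collectDate is kept
         month = month(date),      # month number taken from the new column
         year = year(date))        # four-digit year

glimpse(toy)
```

Columns created earlier in a `mutate()` call can be used later in the same call, which is why `month()` and `year()` can refer to `date` here.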


### Do the traps always catch a mammal?
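* Look at the levels of trapStatus to see the conditions in which a trap can be found in the morning. If trapStatus was imported as a character column, `levels()` returns NULL; wrap the column in `factor()` first or use `unique()` instead  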

```{r, NEON data: levels of trap status}

```

### Filter the dat.neon data to contain only successful captures
* Create a new object dat.neon.filter so we retain the original too  
* Use `filter()` to select only the trapStatus '5 - capture'  
* Proceed with the filtered data!  
```{r, NEON data: filter for only successful captures}
dat.neon.filter <- 
  
```

### Graphically examine the land cover types (nlcdClass) that were sampled:
```{r, NEON data: graph lat/lon of sampling locations, echo=TRUE, eval=TRUE}

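# Label each plot at its coordinates, coloured by land cover class
ggplot(dat.neon.filter, aes(decimalLongitude, decimalLatitude, colour = factor(nlcdClass))) +
  geom_label(aes(label = plotID)) +
  labs(title = "Plot Sampling by Land Cover",
       x = "Longitude (decimal degrees)", y = "Latitude (decimal degrees)") +
  xlim(-78.2, -78.1)  # restrict the longitude range that is shown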

```

### What species were captured?
* Use `levels()` to display the scientific names of all sampled species.  
* Use `levels()` to also display the four letter abbreviation (taxonID)  

**Answer:**  

* Which species is denoted by each four-letter abbreviation?  
* What is the common name of each species?  
```{r, NEON data: look at levels for four letter code and scientific name of species sampled}

```

### What dates were sampled?
* `levels()` does not work for date objects, but you can use `unique()` the same way.  

**Answer:**  

* Write the range of days for each sample year and month.
```{r, NEON data: view sampling dates}
unique(dat.neon$date)
```

### Select only unique individuals for each sample month
In two steps:  

1. Build dat.catch:
    + Use `filter()` to remove all rows where tagID is NA. (hint: `is.na()` allows you to search for NA values in R and `!` indicates 'not', so `!is.na()` can be used to exclude NA values)  
    + Then use `distinct(year, month, plotID, tagID, .keep_all=TRUE)` (with whatever names you gave your year and month columns) to keep each individual only once per sample month and plot, which removes recaptures within the same month and plot  
    + Name this filtered dataframe dat.catch  
2. Build dat.numbers:
    + Use dat.catch to calculate the number of unique individuals caught for each year, month, taxonID, and nlcdClass  
    + Use `group_by()` and `summarise(count=n())` to calculate the number of individuals. The parentheses of `n()` stay empty.  
    + Call this summarised dataframe dat.numbers  
```{r, NEON data: filter out tag ID that are NA and keep only distinct tagIDs then count individuals}
# remove tagIDs with NA and select only distinct captures in each year, month, plotID, and tag ID. Keep all data columns. 
dat.catch <- dat.neon %>%

# Calculate the number of unique individuals captured for each year, month, species, and habitat
dat.numbers <- dat.catch %>%
  
```
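
The two steps can be sketched on a small made-up capture table. Everything in `toy` is hypothetical, including the species codes `SP1` and `SP2`, the tag IDs, and the habitat labels; the column names follow the ones used in this lab and assume the year and month columns are called `year` and `month`.

```r
library(dplyr)

# Hypothetical captures: tag A1 is caught twice in July and again in August
toy <- data.frame(
  year = c(2014, 2014, 2014, 2014, 2014, 2014),
  month = c(7, 7, 7, 7, 8, 8),
  plotID = c("P1", "P1", "P1", "P2", "P1", "P1"),
  nlcdClass = c("forest", "forest", "forest", "grass", "forest", "forest"),
  taxonID = c("SP1", "SP1", "SP1", "SP2", "SP1", NA),
  tagID = c("A1", "A1", "B2", "C3", "A1", NA),
  stringsAsFactors = FALSE
)

# Step 1: drop rows without a tag, then keep one row per tag per month and plot
toy_catch <- toy %>%
  filter(!is.na(tagID)) %>%
  distinct(year, month, plotID, tagID, .keep_all = TRUE)

# Step 2: count the remaining rows in each group
toy_numbers <- toy_catch %>%
  group_by(year, month, taxonID, nlcdClass) %>%
  summarise(count = n()) %>%
  ungroup()

toy_numbers
```

In this toy table the July recapture of A1 is dropped, while its August capture is kept because it falls in a different sample month.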


### Graph species abundance by habitat type
* Use dat.numbers to graph the number of unique individuals observed in each month  
* Use `geom_col(position="dodge",width=0.5)` to make a bar graph  
* Use `facet_grid()` to show the habitat types in rows and the two sample years in columns  
```{r, NEON data: graph number of individuals for each month and species}
ggplot(dat.numbers, aes(VARIABLE, VARIABLE, fill=VARIABLE))+
  geom_col(position="dodge",width=0.5)+
  facet_grid(VARIABLE~VARIABLE)+
  labs()
```


* You can add `scales="free_y"` inside the `facet_grid()` command to let the y-axis of each row of panels scale to its own maximum value. Be aware that the panels then have different axes, which can be deceptive at a quick glance! 

```{r, NEON data: add scales="free_y" to facet_grid}


```
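
To see what the option changes, the sketch below builds the same faceted bar graph twice from made-up counts, once with shared y-axes and once with `scales="free_y"`. The dataframe `toy_numbers` and its column names (`habitat`, `count`) are hypothetical and deliberately simpler than dat.numbers.

```r
library(ggplot2)

# Hypothetical counts: forest numbers are roughly ten times grassland numbers
toy_numbers <- data.frame(
  year = rep(c(2014, 2015), each = 4),
  month = rep(c(7, 8), times = 4),
  habitat = rep(rep(c("forest", "grassland"), each = 2), times = 2),
  count = c(30, 25, 3, 2, 28, 31, 4, 1)
)

base_plot <- ggplot(toy_numbers, aes(factor(month), count)) +
  geom_col(width = 0.5) +
  labs(x = "Month", y = "Number of individuals")

# Shared y-axis: grassland bars look tiny next to forest bars
p_fixed <- base_plot + facet_grid(habitat ~ year)

# Free y-axis: each row of panels gets its own y-axis range
p_free <- base_plot + facet_grid(habitat ~ year, scales = "free_y")

p_fixed
p_free
```

With the free scale the grassland bars become readable, but bars of the same height in different rows no longer represent the same number of animals.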

**Answer:**  

* Why did we have to do the filtering step in dat.catch? (*Hint: think about how our species count would be affected by recapturing the same individual over multiple sampling days.*)  
* Describe the species distributions, patterns, and abundance by habitat type.  
* Which habitat type has higher species diversity?  
* What time periods had the highest numbers of individuals captured?  
* Why might it be deceptive to compare data across two graphs that have different axis scales?  
* When might it be beneficial to compare data across two graphs that have different axis scales?  


# References
[^1]: McNeil, J., Jones, M. A. (2018). Data Management using NEON Small Mammal Data with Accompanying Lesson on Mark Recapture Analysis. NEON - National Ecological Observatory Network, QUBES. doi:10.25334/Q4XH5S

[^2]: Christie Bahlai and Tracy Teal (eds): “Data Carpentry: Data Organization in Spreadsheets Ecology Lesson.” Version 2017.04.0, April 2017, http://www.datacarpentry.org/spreadsheet-ecology-lesson/, https://doi.org/10.5281/zenodo.570047

[^3]: Hernández-Pacheco, R. H. (2018). More In Depth Spreadsheet Management Adaptation of Data Management using NEON Small Mammal Data. NEON Faculty Mentoring Network, QUBES Educational Resources. doi:10.25334/Q44X4D