1) Before continuing with this lesson, please first review the associated documents: “RHerringBackground.docx” and “RHerringBackground.mp4”
2) Students with less experience in RStudio can complete this lesson either by pressing the “Knit” button located in the toolbar above, or by simply opening the attached “WalkersDamHerring.html” file
3) For those wishing to work within the R Markdown script, code lines may be executed one at a time by pressing Ctrl+Enter while the cursor is on that line. Alternatively, execute an entire code chunk using the green “play” button in the top right of the chunk
4) Estimated Time For Completion: 20 minutes
As previously mentioned in RHerringBackground.docx, river herring migrate up the Chickahominy River in Virginia each spring to reproduce. During this migration, these fish are subject to increased rates of mortality from a number of factors. Although river herring currently exist at historically low abundances, many of the methods used to monitor their abundance actually result in additional mortality. In an effort to reduce this undesirable side effect, passive sampling methods such as electronic fish counters are being implemented to alleviate some of the unnecessary river herring mortality. However, as these methods and gears are relatively novel, the efficacy of such techniques is unknown. This is where you come in.
In this lesson, you will use data that was collected at the Walker’s Fishway in 2020 by fisheries biologists from the Virginia Department of Wildlife Resources. Your objectives in this lesson are:
1) Determine whether the fish counts derived from the electronic counter at the Walker’s Fishway on the Chickahominy River are statistically valid.
2) Plot the daily migration of river herring that passed through the fishway.
3) Answer the questions at the end of this lesson to test your acquired skills.
To help you accomplish this task, you will be given two datasets. The first dataset, “TrapDataV2.xlsx”, consists of 212 observations in which the electronic count was directly compared to the number of fish that actually passed through the fishway. Each pair of counts covers the same time interval, and the counts are split into the columns SRCount (the electronic count) and TrapCount (the number of fish physically observed to have passed). The second dataset, “DailyCounts_20.xlsx”, contains two columns: River Herring, the number of river herring counted by the electronic fish counter on each day it was running, and Date, the date on which that count occurred. As the fisheries biologist working on this issue, your analysis and interpretation of these data will determine whether active sampling techniques are reduced or continued, and therefore whether the incidental mortality associated with monitoring migrating river herring populations is reduced or accepted.
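If the “Trap” and “Counts” objects used throughout this lesson are not already created by an earlier setup chunk, a minimal sketch of how the two spreadsheets could be read in with the readxl package is shown below (the file paths are assumptions; adjust them to wherever you saved the files):

library(readxl)
#Load the readxl package, which provides read_excel() for importing .xlsx files
Trap <- read_excel("TrapDataV2.xlsx")
#Import the 212 paired electronic vs. trap observations as "Trap"
Counts <- read_excel("DailyCounts_20.xlsx")
#Import the daily electronic counts and their dates as "Counts"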
Let’s start this lesson by visualizing the data we aim to test. To do so, we’ve used the ggplot function to construct a line plot of the Electronic Fish Counts (experimental) versus the Trap Fish Counts (control).
ggplot(data=Trap)+
#Call the ggplot command, and associate the command with the Trap dataset
geom_line(size = 1.2, alpha = 0.7, aes(y=SRCount,x= Obs,color="Electronic")) +
#Create a new line with a thickness of 1.2 and a transparency of 0.7. Visually, we want the Y axis of this line to be defined by the electronic count ("SRCount"), and the X axis to be defined by the observation number ("Obs")
geom_line(size = 1.2, alpha = 0.7, aes(y=TrapCount,x= Obs,color="Trap")) +
#Similar to the previous command, but completed for the physical fish count "TrapCount"
scale_color_manual(values = c(
'Electronic' = 'darkred',
'Trap' = 'steelblue'))+
#Here we specify the colors for each of the created lines
labs(color = 'Count Method')+
#This allowed us to generate a legend for the chart, as defined by the color of our lines
ggtitle("Fish Count Comparisons - Electronic Fish Counter vs. Trap Observations ")+
#We gave the visualization a Title
xlab("Interval")+
ylab("Total Fish Counted")
Although our visualization shows that the electronic counts and the physical fish counts appear to be correlated, we also see instances in which the two methods have noticeable disparities between the counts. While this graphic is useful for basic inferences, it does not allow us to make statistical inferences about whether the electronic counts agree with the trap counts. To do this, more sophisticated quantitative techniques are required.
Note: You can isolate plot outputs for higher quality rendering by clicking “Show in New Window”, which is located in the top-right portion of the plot’s output frame.
In order to perform many statistical tests, your data must meet a number of assumptions for that test. For some of the most common tests, such as simple linear regression, Pearson correlation, and t-tests, one of the primary assumptions is that the data are normally distributed. These types of tests are referred to as parametric statistics, and they only work properly when the data can be considered normal. As data deviates further from normality, these tests become increasingly biased and, at some point, lose all validity. Before we test our own data for normality, let’s visualize what normal data looks like. Perfectly normal data resembles a bell-shaped distribution curve and, consequently, has identical values for the mean, median, and mode. In lines 82 through 89, we use R to visually display a perfectly normal dataset.
x = seq(-15, 15, by=0.1)
# We generate a sequence of x values from negative 15 to positive 15, by an interval of 0.1
y = dnorm(x, mean(x), sd(x))
# We compute the normal density (y) at each x value, using the mean and standard deviation of the x sequence
plot(x, y)
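As a quick numeric check of the claim that the mean and median agree for normal data, we can simulate a large normal sample and compare the two (this check is a small addition to the lesson; the sample size and parameters are arbitrary):

set.seed(1)
#Fix the random seed so the simulation is reproducible
sim <- rnorm(10000, mean = 0, sd = 5)
#Draw 10,000 values from a normal distribution
mean(sim)
median(sim)
#Both values land very close to 0, as expected for normally distributed data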
Now that we know what a normal distribution visually looks like, we can perform a similar visualization to assess our own variables.
ggdensity(Trap$SRCount,
#Creating a density plot (using the ggdensity function from the ggpubr package) with the SRCount column of the Trap data
xlab= "Electronic Fish Count During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Electronic Fish Counted by Trap Session")
#Labeling our x axis, y axis, and title
ggdensity(Trap$TrapCount,
#Creating a density plot using the TrapCount column of the Trap data
xlab= "Number of Captured Fish During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Fish Captured by Trap Session")
For both variables, we see that our data is strongly positively skewed, or skewed to the right. As both variables are derived from count data, this is expected, since counts are often not continuous in nature. We can therefore visually infer that our variables are not normally distributed and cannot currently be assessed using parametric statistics. We can validate this inference statistically using a Shapiro-Wilk test for normality, as performed below (Lines 111 through 114). To interpret the results of this test, remember that the NULL HYPOTHESIS of this test states that our assessed variable IS NORMALLY DISTRIBUTED. To accept, or “fail to reject”, this hypothesis, which is what we would like to do, the resulting p-value must be greater than 0.05. For both of our variables, the p-value is much smaller than 0.05 (< 2.2e-16), so we reject the NULL and confirm our visual inference that the data are indeed not normal.
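The code chunk that produced the output below is not echoed in this rendering, but it amounts to calling the shapiro.test() function on each raw count column:

shapiro.test(Trap$SRCount)
#Shapiro-Wilk test of normality for the electronic counts
shapiro.test(Trap$TrapCount)
#Shapiro-Wilk test of normality for the physical trap counts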
##
## Shapiro-Wilk normality test
##
## data: Trap$SRCount
## W = 0.63937, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: Trap$TrapCount
## W = 0.55246, p-value < 2.2e-16
So what can be done if our data is not normal? Is it useless? Probably not. One of the more common techniques used to address non-normality involves transforming the non-normal variables, which adjusts the shape of each variable’s distribution. In our case, we have already observed that both of our variables are positively skewed. For positively skewed distributions, a log transformation is perhaps the most common technique. It works by replacing each original value with the logarithm of that value. R allows us to complete this task quickly, which we will do below:
Trap %>%
#Start with the Trap data and pass it to the next command - this is known as piping
mutate(ElectronicLog = log10(SRCount))%>%
#Using mutate, which creates a new data column, we name that column "ElectronicLog" and calculate it by taking the log (base 10) of the current electronic counts (SRCount)
mutate(TrapLog = log10(TrapCount)) -> TrapTransformed
#We then pipe those results to the next command, which performs the same transformation on the trap counts, and store the result in a new data frame called "TrapTransformed"
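If you would like to confirm that the transformation worked before moving on, you can inspect the first few rows of the original and log-transformed columns (a quick optional check, not part of the original analysis):

head(TrapTransformed[, c("SRCount", "ElectronicLog", "TrapCount", "TrapLog")])
#Display the raw counts next to their log-transformed values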
We will now follow that transformation by replotting the distributions of our variables and re-running the Shapiro-Wilk test from above to see whether our data are now normal and can be analyzed using parametric statistics.
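The density plots below are drawn with ggdensity() just as before; the Shapiro-Wilk output shown beneath each plot comes from calls like these, applied to the new log-transformed columns:

shapiro.test(TrapTransformed$ElectronicLog)
#Re-test the log-transformed electronic counts for normality
shapiro.test(TrapTransformed$TrapLog)
#Re-test the log-transformed trap counts for normality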
ggdensity(TrapTransformed$ElectronicLog,
xlab= "Electronic Fish Count During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Electronic Fish Counted by Trap Session")
##
## Shapiro-Wilk normality test
##
## data: TrapTransformed$ElectronicLog
## W = 0.97743, p-value = 0.001755
ggdensity(TrapTransformed$TrapLog,
xlab= "Fish Trapped During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Trapped Fish Counted by Trap Session")
##
## Shapiro-Wilk normality test
##
## data: TrapTransformed$TrapLog
## W = 0.97856, p-value = 0.002552
It is both visually and statistically evident (both p-values increased) that this transformation greatly improved the normality of our variables. However, we still fall short of the threshold (p-value > 0.05) needed to perform a parametric test without introducing an unknown level of bias.
Although many other data transformation techniques exist, let’s just pretend we’ve tried them all, without any of the transformations resulting in a normal distribution for our variables. What other options do we have? One common alternative is to bypass the assumption of normality by using a nonparametric statistical test. While parametric tests are often more powerful, particularly in studies with a limited sample size, nonparametric alternatives exist for nearly every parametric test. In our case, we have many samples in our data, and we can therefore properly implement the Paired Samples Wilcoxon Signed Rank Test, a nonparametric alternative to the more common parametric paired t-test.
Similar to the t-test, the Paired Samples Wilcoxon Signed Rank Test pairs the values of the two variables, as they are technically the same fish being measured twice (electronically and physically). The test then takes the absolute difference between each pair and ranks those differences in order from least different to most different. The smallest difference (in theory, a difference of 0) is assigned a rank of 1. If multiple pairs have the same difference, as happens in our data where several pairs match exactly, each of those pairs is assigned the average of the ranks they span. As an example, suppose three observation intervals each match exactly. These pairs would occupy the 1st, 2nd, and 3rd ranks, but each would receive a rank of 2 (the average of ranks 1, 2, and 3).
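R’s rank() function applies this same tie-averaging rule by default, so the example above can be reproduced directly (the numbers here are invented purely for illustration):

abs_diff <- c(0, 0, 0, 5, 7)
#Absolute differences for three perfectly matching pairs and two mismatched pairs
rank(abs_diff)
#Returns 2 2 2 4 5 - the three tied pairs share the average of ranks 1, 2, and 3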
Remember that each rank also carries a sign, depending on whether the difference between the pair is positive or negative. The test statistic, reported as V in RStudio, is the sum of the ranks assigned to the positive differences; in a test of two variables holding identical values, it would be reported as V = 0. In practical application, the V score is only indicative of the degree of variation when comparing tests with the same number of pairs. Because we know our variables already differ, and we are only interested in whether they differ statistically, we cannot use the V score to make any interpretation on its own.
For our analysis, the more useful output to focus on is the reported p-value of the test. The NULL hypothesis of the Paired Samples Wilcoxon Signed Rank Test states that the difference between the two variables IS ZERO, while the alternative hypothesis states that the difference IS NOT ZERO. To reject the null in favor of the alternative, we would need a p-value of less than 0.05. In this case, our p-value is 0.2501, which means we fail to reject our NULL HYPOTHESIS - the difference between the electronic fish counts and the physical trap counts is not statistically significant. The true power of R is revealed below, where the entire Paired Samples Wilcoxon Signed Rank Test is conducted almost instantaneously in Lines 151 to 155:
#We will use the test to determine if there is a statistical difference in either direction - we specify this by using the "two.sided" alternative hypothesis. Our test is paired, as the same subjects are being measured twice (once electronically, and once physically in the trap)
wilcox.test(Trap$SRCount, Trap$TrapCount, paired= TRUE, alternative = "two.sided")
##
## Wilcoxon signed rank test with continuity correction
##
## data: Trap$SRCount and Trap$TrapCount
## V = 6940.5, p-value = 0.2501
## alternative hypothesis: true location shift is not equal to 0
Now that we’ve statistically validated our electronic fish counts, let’s use the plotting functions we learned above to graph the daily total count of river herring detected by the electronic fish counter.
#Transform the Counts data into a dataframe to allow ggplot to recognize the "Date" column
DailyPassage.df <- as.data.frame(Counts)
#Pipe the dataframe into ggplot
DailyPassage.df %>%
ggplot()+
geom_line( size = 1.2, color = "darkred", aes(x=Date, y= `River Herring`))+
#Add our line showing river herring counts over time
ggtitle("Daily River Herring Passed - Electronic Count")+
#Name the plot
xlab("Date")+
#Name the X-axis
ylab("Daily Herring Counted") -> RiverHerring
#Name the Y-axis, and save the finished ggplot as "RiverHerring"
RiverHerring
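If you would like to keep a copy of this figure outside of the knitted document, ggsave() from ggplot2 can write the stored plot to disk (the file name and dimensions below are only examples):

ggsave("DailyRiverHerring2020.png", plot = RiverHerring, width = 8, height = 5)
#Save the "RiverHerring" plot as a PNG in the current working directory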
Now it is time to combine the knowledge you’ve gained from this lesson with the background information given in the other documents. You are presented with the following questions:
Question 1: Why is it important to check for normality before running parametric statistics?
Question 2: Pretend you are consulted for input on whether or not recreational fishing should be temporarily closed near the Walker’s Fishway in order to allow river herring to migrate without additional harassment. Which week(s) or month(s) would you choose to close, and how would you justify this decision?
Question 3: Let’s pretend that we were only interested in whether or not the electronic fish counts overestimated river herring migration. We test this by manipulating our Paired Samples Wilcoxon Test as seen below. Does the electronic counter overestimate fish at a statistically significant threshold?
#Now we will use the test to determine whether the electronic counts are statistically greater than the trap counts - we specify this by using the "greater" alternative hypothesis
wilcox.test(Trap$SRCount, Trap$TrapCount, paired= TRUE, alternative = "greater")
##
## Wilcoxon signed rank test with continuity correction
##
## data: Trap$SRCount and Trap$TrapCount
## V = 6940.5, p-value = 0.1251
## alternative hypothesis: true location shift is greater than 0
Question 4: In this lesson, we highlighted mortality reduction as the primary advantage to using an electronic fish counter over more invasive fish monitoring gears. Can you think of any other advantageous qualities?