Instructions

1) Before continuing with this lesson, please first review the associated documents: “RHerringBackground.docx” and “RHerringBackground.mp4”
2) This lesson can be utilized by students with less experience in R Studio by either pressing the “Knit” button located in the toolbar above, or by simply opening the attached “WalkersDamHerring.html” file
3) For those wishing to work within the R-Markdown script, code lines may be executed one at a time using ctrl+enter when the cursor is active on that line. Alternatively, execute entire code chunks using the green “play” button in the top right of each code chunk
4) Estimated Time For Completion: 20 minutes

Setting the Stage

As previously mentioned in RHerringBackground.docx, river herring migrate up the Chickahominy River in Virginia each spring in order to reproduce. During this migration, these fish are subject to increased rates of mortality from a number of factors. Although river herring currently exist at historically low abundances, many of the methods used to monitor their abundance actually result in additional mortality. In an effort to reduce this undesirable side effect, passive sampling methods such as electronic fish counters are being implemented to alleviate some of the unnecessary river herring mortality. However, as these methods and gears are relatively novel, the efficacy of such techniques is unknown. This is where you come in.
In this lesson, you will use data collected at the Walker’s Fishway in 2020 by fisheries biologists from the Virginia Department of Wildlife Resources. Your objectives in this lesson are:
1) Determine whether the fish counts derived from the electronic counter at the Walker’s Fishway on the Chickahominy River are statistically valid.
2) Plot the daily migration of river herring that passed through the fishway.
3) Answer the questions at the end of this lesson to test your acquired skills
To help you accomplish this task, you will be given two datasets. The first dataset, “TrapDataV2.xlsx”, consists of 212 observations in which the electronic count was directly compared to the number of fish that actually passed through the fishway. These fish counts occurred across the same time intervals, and are split by the columns SRCount (the electronic count) and TrapCount (the number of fish physically observed to have passed). The second dataset, “DailyCounts_20.xlsx”, is comprised of two columns: RiverHerring, the number of river herring counted by the electronic fish counter for each day that it was running, and Date, the day on which that count occurred. As the fisheries biologist working on this issue, your analysis and interpretation of these data will result in the reduction or continuation of active sampling techniques; either reducing or accepting the incidental mortality associated with monitoring populations of migrating river herring.

a) Check to see that you have the appropriate libraries loaded. Please note that if a library is not found, you may first need to install the source package by searching under the same name.
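As a sketch of that setup step, the packages below are assumptions based on the file formats (.xlsx) and plots (ggplot) used in this lesson; adjust the names to match your script’s setup chunk:

```r
# Package names are an assumption based on the file formats and plots
# used in this lesson; install any that are missing, then load them
pkgs <- c("readxl", "ggplot2")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)

library(readxl)   # reads the .xlsx datasets
library(ggplot2)  # builds the plots used throughout the lesson
```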

Visualize The Data

Let’s start this lesson by visualizing the data we aim to test. To do so we’ve used the ggplot function to construct a line plot of the Electronic Fish Counts (experimental) versus the Trap Fish Counts (Control).
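The plotting chunk itself is not reproduced in this document; a minimal sketch of such a line plot, assuming the comparison data load from “TrapDataV2.xlsx” into a data frame named Trap, might look like:

```r
library(readxl)
library(ggplot2)

# Load the 212 paired observations and add an interval index for the x-axis
Trap <- read_excel("TrapDataV2.xlsx")
Trap$Interval <- seq_len(nrow(Trap))

# One line per counting method, distinguished by color
ggplot(Trap, aes(x = Interval)) +
  geom_line(aes(y = SRCount, color = "Electronic Count")) +
  geom_line(aes(y = TrapCount, color = "Trap Count")) +
  labs(x = "Observation Interval", y = "Fish Counted", color = "Method")
```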

Although our visualization shows that the electronic counts and the physical fish counts appear to be correlated, we also see instances in which the methods have noticeable disparities between the counts. While this graphic is useful for basic inferences, it does not allow us to make statistical inferences as to whether the electronic counts are statistically equivalent to the physical counts. To do this, more sophisticated quantitative techniques are required.

Note: You can isolate plot outputs for higher quality rendering by clicking “Show in New Window”, which is located in the top-right portion of the plot’s output frame.

Parametric versus Non-Parametric Statistical Testing

Data Normality

In order to perform many statistical tests, your data must meet a number of assumptions for that test. For some of the most common tests performed, such as simple linear regression, Pearson correlation, and t-tests, one of the primary assumptions is that of data normality. These types of tests are referred to as Parametric Statistics, and only work properly if the data are approximately normal. As data become less normal, these tests increase in bias and, at some point, lose all validity. Before we test our own data for normality, let’s visualize what normal data look like. Perfectly normal data resemble a bell-shaped distribution curve and, consequently, have identical values for the mean, median, and mode. In lines 82 through 89 of the script, we use R to visually display a perfectly normal dataset.
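Since that chunk is not reproduced here, a minimal sketch of the same idea, simulating a (nearly) perfectly normal variable, is shown below; the sample size and seed are arbitrary choices:

```r
set.seed(42)                                 # make the simulation reproducible
NormalData <- rnorm(10000, mean = 0, sd = 1) # 10,000 draws from a normal curve

# The histogram approximates the bell shape, and the mean and median
# land almost exactly on top of one another
hist(NormalData, breaks = 50,
     main = "A Perfectly Normal Distribution", xlab = "Value")
abline(v = mean(NormalData), lwd = 2)        # mark the center of the curve
```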

Now that we know what a normal distribution visually looks like, we can perform a similar visualization to assess our own variables.
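A sketch of that visualization, again assuming the comparison data load into a data frame named Trap, could place the two histograms side by side:

```r
library(readxl)

# Plot the distribution of each count variable side by side
Trap <- read_excel("TrapDataV2.xlsx")
par(mfrow = c(1, 2))   # two plots in one row
hist(Trap$SRCount, main = "Electronic Counts", xlab = "Fish per Interval")
hist(Trap$TrapCount, main = "Trap Counts", xlab = "Fish per Interval")
par(mfrow = c(1, 1))   # restore the default layout
```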

For both variables, we see that our data are very positively skewed, or skewed to the right. As both variables are derived from count data, this is expected, as counts are often not continuous in nature. Therefore, we can visually infer that our variables are not normally distributed, and cannot currently be assessed using parametric statistics. We can validate this inference statistically by using a Shapiro-Wilk Test for normality, as performed below (lines 111 through 114 of the script). To interpret the results of this test, remember that the NULL HYPOTHESIS of this test states that our assessed variable IS NORMALLY DISTRIBUTED. In order to accept, or “fail to reject”, this hypothesis (which we wish to do), the resulting p-value must be greater than 0.05. For both of our variables, we see that our p-value is much smaller than 0.05 (< 2.2e-16); so we reject the NULL, and confirm our visual inference that the data are indeed not normal.
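The output that follows comes from R’s built-in shapiro.test() function; assuming the comparison data are loaded as Trap, the calls are simply:

```r
library(readxl)

# Shapiro-Wilk test for each count variable; the NULL hypothesis
# is that the tested variable is normally distributed
Trap <- read_excel("TrapDataV2.xlsx")
shapiro.test(Trap$SRCount)
shapiro.test(Trap$TrapCount)
```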

## 
##  Shapiro-Wilk normality test
## 
## data:  Trap$SRCount
## W = 0.63937, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  Trap$TrapCount
## W = 0.55246, p-value < 2.2e-16

Data Transformation

So what can be done if our data is not normal? Is it useless? Probably not. One of the more common techniques used to address data normality involves transforming the non-normalized variables, which adjusts the shape of that variable’s distribution. In our case, we have previously observed that both of our variables are positively skewed. For positively skewed distributions, a Log Transformation is perhaps the most common transformation technique used. This works by taking the original value, and replacing it with the logarithm of that value. R allows us to complete this task quickly, which we will do below:
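The column names ElectronicLog and TrapLog below match the Shapiro-Wilk output later in the lesson; the use of log1p() (that is, log(x + 1)) is an assumption made here to guard against intervals with zero fish, since log(0) is undefined, and the original chunk may use a plain log() instead:

```r
library(readxl)

# Create a transformed copy of the data; log1p(x) computes log(x + 1),
# which keeps zero counts defined (log(0) would be -Inf)
Trap <- read_excel("TrapDataV2.xlsx")
TrapTransformed <- Trap
TrapTransformed$ElectronicLog <- log1p(Trap$SRCount)
TrapTransformed$TrapLog <- log1p(Trap$TrapCount)
```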

We will now follow that transformation by replotting the distribution of our variables, and re-running the Shapiro-Wilk Test from above to see if our data is normal, and can be analyzed using parametric statistics.

## 
##  Shapiro-Wilk normality test
## 
## data:  TrapTransformed$ElectronicLog
## W = 0.97743, p-value = 0.001755

## 
##  Shapiro-Wilk normality test
## 
## data:  TrapTransformed$TrapLog
## W = 0.97856, p-value = 0.002552

While it is both visually and statistically (both p-values increased) evident that this transformation greatly improved the normality of our variables, we still fall short of the threshold (p-value > 0.05) required to perform a parametric test without introducing an unknown level of bias.

Nonparametric Alternatives

Although many other data transformation techniques exist, let’s just pretend we’ve tried them all – without any of the transformations resulting in a normal distribution for our variables. What other options do we have? One common alternative is to bypass the assumption of normality by using a nonparametric statistical test. While parametric tests are often more powerful, particularly in studies with a limited sample size, nonparametric alternatives exist for every common parametric test. In our case, we have many samples in our data, and therefore we can properly implement the Paired Samples Wilcoxon Signed Rank Test, a nonparametric alternative to the more common parametric paired t-test.

Similar to the t-test, the Paired Samples Wilcoxon Signed Rank Test will pair the values of the two variables – as they are technically the same fish being measured twice (electronically and physically). The test then takes the absolute difference between each pair, and ranks the pairs in order from the least different to the most different. The least different pair (in theory, a difference of 0) is assigned the rank of 1. If multiple pairs have the same difference, as in our situation where multiple pairs completely match, all of those pairs are assigned the average of the ranks they span. As an example, let’s say we have 3 observation intervals that each completely match. These pairs would occupy the 1st, 2nd, and 3rd ranks – but would all receive a rank of 2 (the average of 1, 2, and 3).
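This tie-handling rule can be seen directly with R’s rank() function, whose default averaging behavior is what the Wilcoxon test uses (the five differences below are hypothetical values for illustration):

```r
# Five hypothetical absolute differences; the first three pairs tie at 0
AbsDiff <- c(0, 0, 0, 4, 7)
rank(AbsDiff)  # → 2 2 2 4 5: the tied pairs share the average of ranks 1, 2, 3
```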

Remember that the signed rank can also be reported as a negative value, as the variation between the variables is either positive or negative. The test statistic, reported as V in R Studio, is the sum of the positive signed ranks, and in a test of two variables with identical values would be reported as V = 0. In practical application, the V score is only indicative of the degree of variation when comparing tests with the same number of pairs. Because we know our variables already differ, and we are only interested in whether they are statistically different, we cannot use the V score to make any interpretation.

For our analysis, the more useful output to focus on is the reported p-value of the test. The alternative hypothesis of the Paired Samples Wilcoxon Signed Rank Test states that the difference between the two variables IS NOT ZERO. In order to reject the NULL HYPOTHESIS in favor of this alternative, we would require a p-value of < 0.05. In this case, our p-value is 0.2501, which means that we fail to reject our NULL HYPOTHESIS - that the difference between the electronic fish counts and the physical trap counts is statistically insignificant. The true power of R is revealed below, in which the entire Paired Samples Wilcoxon Signed Rank Test is conducted almost instantaneously (lines 151 to 155 of the script):
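The test behind the output that follows is a single call to R’s wilcox.test(); assuming the comparison data are loaded as Trap, it is simply:

```r
library(readxl)

# Paired Samples Wilcoxon Signed Rank Test; paired = TRUE tells R that the
# two columns are the same intervals measured twice
Trap <- read_excel("TrapDataV2.xlsx")
wilcox.test(Trap$SRCount, Trap$TrapCount, paired = TRUE)
```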

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Trap$SRCount and Trap$TrapCount
## V = 6940.5, p-value = 0.2501
## alternative hypothesis: true location shift is not equal to 0

Plotting Daily River Herring Passage

Now that we’ve statistically validated our electronic fish counts, let’s use the plotting functions we learned above to graph the daily total count of river herring detected by the electronic fish counter.
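A sketch of that plot, assuming the daily data load into a data frame named Daily with the two columns described earlier (the exact column name RiverHerring is an assumption; check it with names(Daily)):

```r
library(readxl)
library(ggplot2)

# Daily electronic counts for the 2020 season; the column names RiverHerring
# and Date follow the dataset description earlier in the lesson
Daily <- read_excel("DailyCounts_20.xlsx")

ggplot(Daily, aes(x = Date, y = RiverHerring)) +
  geom_line() +
  labs(x = "Date (2020)", y = "River Herring Counted",
       title = "Daily River Herring Passage, Walker’s Fishway")
```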

Questions

Now it is time to combine the knowledge you’ve gained from this lesson with the background information given in the other documents. You are presented with the following questions:

Question 1: Why is it important to check for normality before running parametric statistics?

Question 2: Pretend you are consulted for input on whether or not recreational fishing should be temporarily closed near the Walker’s Fishway in order to allow river herring to migrate without additional harassment. Which week(s) or month(s) would you choose to close, and how would you justify this decision?

Question 3: Let’s pretend that we were only interested in whether or not the electronic fish counts overestimated river herring migration. We test this by manipulating our Paired Samples Wilcoxon Test as seen below. Does the electronic counter overestimate fish at a statistically significant threshold?
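The manipulation in question uses the alternative argument of wilcox.test(); a sketch of the one-sided call that produces output like the block that follows:

```r
library(readxl)

# alternative = "greater" asks only whether the electronic counts (SRCount)
# tend to exceed the physical trap counts (TrapCount)
Trap <- read_excel("TrapDataV2.xlsx")
wilcox.test(Trap$SRCount, Trap$TrapCount, paired = TRUE,
            alternative = "greater")
```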

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  Trap$SRCount and Trap$TrapCount
## V = 6940.5, p-value = 0.1251
## alternative hypothesis: true location shift is greater than 0

Question 4: In this lesson, we highlighted mortality reduction as the primary advantage to using an electronic fish counter over more invasive fish monitoring gears. Can you think of any other advantageous qualities?