1) Before continuing with this lesson, please first review the associated documents: “RHerringBackground.docx” and “RHerringBackground.mp4”
2) Students with less experience in RStudio can complete this lesson either by pressing the “Knit” button located in the toolbar above, or by simply opening the attached “WalkersDamHerring.html” file
3) For those wishing to work within the R Markdown script, code lines may be executed one at a time by pressing Ctrl+Enter while the cursor is on that line. Alternatively, execute an entire code chunk using the green “play” button in the top right of the chunk
4) Estimated Time For Completion: 20 minutes
As previously mentioned in RHerringBackground.docx, river herring migrate up the Chickahominy River in Virginia each spring to reproduce. During this migration, these fish are subject to increased rates of mortality from a number of factors. Although river herring currently exist at historically low abundances, many of the methods used to monitor their abundance actually result in additional mortality. In an effort to reduce this undesirable side effect, passive sampling methods such as electronic fish counters are being implemented to alleviate some of the unnecessary river herring mortality. However, as these methods and gears are relatively novel, the efficacy of such techniques is unknown. This is where you come in.
In this lesson, you will use data that was collected at the Walker’s Fishway in 2020 by fisheries biologists from the Virginia Department of Wildlife Resources. Your objectives in this lesson are:
1) Determine whether the fish counts derived from the electronic counter at the Walker’s Fishway on the Chickahominy River are statistically valid.
2) Plot the daily migration of river herring that passed through the fishway.
3) Answer the questions at the end of this lesson to test your acquired skills.
To help you accomplish this task, you will be given two datasets. The first dataset, “TrapDataV2.xlsx”, consists of 212 observations in which the electronic count was directly compared to the number of fish that actually passed through the fishway. Each pair of counts covers the same time interval, and the counts are split into the columns SRCount (the electronic count) and TrapCount (the number of fish physically observed to have passed). The second dataset, “DailyCounts_20.xlsx”, contains two columns: River Herring, the number of river herring counted by the electronic fish counter on each day it was running, and Date, the date on which that count occurred. As the fisheries biologist working on this issue, your analysis and interpretation of these data will determine whether active sampling techniques are reduced or continued, and therefore whether the incidental mortality associated with monitoring migrating river herring populations is reduced or accepted.
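If the “Trap” and “Counts” objects used throughout this lesson are not already created by an earlier setup chunk, a minimal sketch of how the two spreadsheets could be read in with the readxl package is shown below (the file paths are assumptions; adjust them to wherever you saved the files):

library(readxl)
#Load the readxl package, which provides read_excel() for importing .xlsx files
Trap <- read_excel("TrapDataV2.xlsx")
#Import the 212 paired electronic vs. trap observations as "Trap"
Counts <- read_excel("DailyCounts_20.xlsx")
#Import the daily electronic counts and their dates as "Counts"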
Let’s start this lesson by visualizing the data we aim to test. To do so, we’ve used the ggplot function to construct a line plot of the Electronic Fish Counts (experimental) versus the Trap Fish Counts (control).
ggplot(data=Trap)+
#Call the ggplot command, and associate the command with the Trap dataset
geom_line(size = 1.2, alpha = 0.7, aes(y=SRCount,x= Obs,color="Electronic")) +
#Create a new line with a thickness of 1.2 and a transparency of 0.7. Visually, we want the Y axis of this line to be defined by the electronic count ("SRCount"), and the X axis to be defined by the observation number ("Obs")
geom_line(size = 1.2, alpha = 0.7, aes(y=TrapCount,x= Obs,color="Trap")) +
#Similar to the previous command, but completed for the physical fish count "TrapCount"
scale_color_manual(values = c(
'Electronic' = 'darkred',
'Trap' = 'steelblue'))+
#Here we specify the colors for each of the created lines
labs(color = 'Count Method')+
#This allowed us to generate a legend for the chart, as defined by the color of our lines
ggtitle("Fish Count Comparisons - Electronic Fish Counter vs. Trap Observations ")+
#We gave the visualization a Title
xlab("Interval")+
ylab("Total Fish Counted")
Although our visualization shows that the electronic counts and the physical fish counts appear to be correlated, we also see instances in which the two methods have noticeable disparities between the counts. While this graphic is useful for basic inferences, it does not allow us to make statistical inferences about whether the electronic counts agree with the trap counts. To do this, more sophisticated quantitative techniques are required.
Note: You can isolate plot outputs for higher quality rendering by clicking “Show in New Window”, which is located in the top-right portion of the plot’s output frame.
In order to perform many statistical tests, your data must meet a number of assumptions for that test. For some of the most common tests, such as simple linear regression, Pearson correlation, and t-tests, one of the primary assumptions is that the data are normally distributed. These types of tests are referred to as parametric statistics, and they only work properly when the data can be considered normal. As data deviates further from normality, these tests become increasingly biased and, at some point, lose all validity. Before we test our own data for normality, let’s visualize what normal data looks like. Perfectly normal data resembles a bell-shaped distribution curve and, consequently, has identical values for the mean, median, and mode. In lines 82 through 89, we use R to visually display a perfectly normal dataset.
x = seq(-15, 15, by=0.1)
# We generate a sequence of x values from negative 15 to positive 15, by an interval of 0.1
y = dnorm(x, mean(x), sd(x))
# We compute the normal density (y) at each x value, using the mean and standard deviation of the x sequence
plot(x, y)
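As a quick numeric check of the claim that the mean and median agree for normal data, we can simulate a large normal sample and compare the two (this check is a small addition to the lesson; the sample size and parameters are arbitrary):

set.seed(1)
#Fix the random seed so the simulation is reproducible
sim <- rnorm(10000, mean = 0, sd = 5)
#Draw 10,000 values from a normal distribution
mean(sim)
median(sim)
#Both values land very close to 0, as expected for normally distributed data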
Now that we know what a normal distribution visually looks like, we can perform a similar visualization to assess our own variables.
ggdensity(Trap$SRCount,
#Creating a density plot (using the ggdensity function from the ggpubr package) with the SRCount column of the Trap data
xlab= "Electronic Fish Count During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Electronic Fish Counted by Trap Session")
#Labeling our x axis, y axis, and title
ggdensity(Trap$TrapCount,
#Creating a density plot using the TrapCount column of the Trap data
xlab= "Number of Captured Fish During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Fish Captured by Trap Session")
For both variables, we see that our data is strongly positively skewed, or skewed to the right. As both variables are derived from count data, this is expected, since counts are often not continuous in nature. We can therefore visually infer that our variables are not normally distributed and cannot currently be assessed using parametric statistics. We can validate this inference statistically using a Shapiro-Wilk test for normality, as performed below (Lines 111 through 114). To interpret the results of this test, remember that the NULL HYPOTHESIS of this test states that our assessed variable IS NORMALLY DISTRIBUTED. To accept, or “fail to reject”, this hypothesis, which is what we would like to do, the resulting p-value must be greater than 0.05. For both of our variables, the p-value is much smaller than 0.05 (< 2.2e-16), so we reject the NULL and confirm our visual inference that the data are indeed not normal.
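The code chunk that produced the output below is not echoed in this rendering, but it amounts to calling the shapiro.test() function on each raw count column:

shapiro.test(Trap$SRCount)
#Shapiro-Wilk test of normality for the electronic counts
shapiro.test(Trap$TrapCount)
#Shapiro-Wilk test of normality for the physical trap counts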
##
## Shapiro-Wilk normality test
##
## data: Trap$SRCount
## W = 0.63937, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: Trap$TrapCount
## W = 0.55246, p-value < 2.2e-16
So what can be done if our data is not normal? Is it useless? Probably not. One of the more common techniques used to address non-normality involves transforming the non-normal variables, which adjusts the shape of each variable’s distribution. In our case, we have already observed that both of our variables are positively skewed. For positively skewed distributions, a log transformation is perhaps the most common technique. It works by replacing each original value with the logarithm of that value. R allows us to complete this task quickly, which we will do below:
Trap %>%
#Start with the Trap data and pass it to the next command - this is known as piping
mutate(ElectronicLog = log10(SRCount))%>%
#Using mutate, which creates a new data column, we name that column "ElectronicLog" and calculate it by taking the log (base 10) of the current electronic counts (SRCount)
mutate(TrapLog = log10(TrapCount)) -> TrapTransformed
#We then pipe those results to the next command, which performs the same transformation on the trap counts, and store the result in a new data frame called "TrapTransformed"
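If you would like to confirm that the transformation worked before moving on, you can inspect the first few rows of the original and log-transformed columns (a quick optional check, not part of the original analysis):

head(TrapTransformed[, c("SRCount", "ElectronicLog", "TrapCount", "TrapLog")])
#Display the raw counts next to their log-transformed values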
We will now follow that transformation by replotting the distributions of our variables and re-running the Shapiro-Wilk test from above to see whether our data are now normal and can be analyzed using parametric statistics.
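The density plots below are drawn with ggdensity() just as before; the Shapiro-Wilk output shown beneath each plot comes from calls like these, applied to the new log-transformed columns:

shapiro.test(TrapTransformed$ElectronicLog)
#Re-test the log-transformed electronic counts for normality
shapiro.test(TrapTransformed$TrapLog)
#Re-test the log-transformed trap counts for normality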
ggdensity(TrapTransformed$ElectronicLog,
xlab= "Electronic Fish Count During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Electronic Fish Counted by Trap Session")
##
## Shapiro-Wilk normality test
##
## data: TrapTransformed$ElectronicLog
## W = 0.97743, p-value = 0.001755
ggdensity(TrapTransformed$TrapLog,
xlab= "Fish Trapped During Trap Session",
ylab= "Frequency of Occurrence",
main= "Distribution of Number of Trapped Fish Counted by Trap Session")
##
## Shapiro-Wilk normality test
##
## data: TrapTransformed$TrapLog
## W = 0.97856, p-value = 0.002552
It is both visually and statistically evident (both p-values increased) that this transformation greatly improved the normality of our variables. However, we still fall short of the threshold (p-value > 0.05) needed to perform a parametric test without introducing an unknown level of bias.
Although many other data transformation techniques exist, let’s just pretend we’ve tried them all, without any of the transformations resulting in a normal distribution for our variables. What other options do we have? One common alternative is to bypass the assumption of normality by using a nonparametric statistical test. While parametric tests are often more powerful, particularly in studies with a limited sample size, nonparametric alternatives exist for nearly every parametric test. In our case, we have many samples in our data, and we can therefore properly implement the Paired Samples Wilcoxon Signed Rank Test, a nonparametric alternative to the more common parametric paired t-test.
Similar to the t-test, the Paired Samples Wilcoxon Signed Rank Test pairs the values of the two variables, as they are technically the same fish being measured twice (electronically and physically). The test then takes the absolute difference between each pair and ranks those differences in order from least different to most different. The smallest difference (in theory, a difference of 0) is assigned a rank of 1. If multiple pairs have the same difference, as happens in our data where several pairs match exactly, each of those pairs is assigned the average of the ranks they span. As an example, suppose three observation intervals each match exactly. These pairs would occupy the 1st, 2nd, and 3rd ranks, but each would receive a rank of 2 (the average of ranks 1, 2, and 3).
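R’s rank() function applies this same tie-averaging rule by default, so the example above can be reproduced directly (the numbers here are invented purely for illustration):

abs_diff <- c(0, 0, 0, 5, 7)
#Absolute differences for three perfectly matching pairs and two mismatched pairs
rank(abs_diff)
#Returns 2 2 2 4 5 - the three tied pairs share the average of ranks 1, 2, and 3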
Remember that each rank also carries a sign, depending on whether the difference between the pair is positive or negative. The test statistic, reported as V in RStudio, is the sum of the ranks assigned to the positive differences; in a test of two variables holding identical values, it would be reported as V = 0. In practical application, the V score is only indicative of the degree of variation when comparing tests with the same number of pairs. Because we know our variables already differ, and we are only interested in whether they differ statistically, we cannot use the V score to make any interpretation on its own.
For our analysis, the more useful output to focus on is the reported p-value of the test. The NULL hypothesis of the Paired Samples Wilcoxon Signed Rank Test states that the difference between the two variables IS ZERO, while the alternative hypothesis states that the difference IS NOT ZERO. To reject the null in favor of the alternative, we would need a p-value of less than 0.05. In this case, our p-value is 0.2501, which means we fail to reject our NULL HYPOTHESIS - the difference between the electronic fish counts and the physical trap counts is not statistically significant. The true power of R is revealed below, where the entire Paired Samples Wilcoxon Signed Rank Test is conducted almost instantaneously in Lines 151 to 155:
#We will use the test to determine if there is a statistical difference in either direction - we specify this by using the "two.sided" alternative hypothesis. Our test is paired, as the same subjects are being measured twice (once electronically, and once physically in the trap)
wilcox.test(Trap$SRCount, Trap$TrapCount, paired= TRUE, alternative = "two.sided")
##
## Wilcoxon signed rank test with continuity correction
##
## data: Trap$SRCount and Trap$TrapCount
## V = 6940.5, p-value = 0.2501
## alternative hypothesis: true location shift is not equal to 0
Now that we’ve statistically validated our electronic fish counts, let’s use the plotting functions we learned above to graph the daily total count of river herring detected by the electronic fish counter.
#Transform the Counts data into a dataframe to allow ggplot to recognize the "Date" column
DailyPassage.df <- as.data.frame(Counts)
#Pipe the dataframe into ggplot
DailyPassage.df %>%
ggplot()+
geom_line( size = 1.2, color = "darkred", aes(x=Date, y= `River Herring`))+
#Add our line showing river herring counts over time
ggtitle("Daily River Herring Passed - Electronic Count")+
#Name the plot
xlab("Date")+
#Name the X-axis
ylab("Daily Herring Counted") -> RiverHerring
#Name the Y-axis, and save the finished ggplot as "RiverHerring"
RiverHerring
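If you would like to keep a copy of this figure outside of the knitted document, ggsave() from ggplot2 can write the stored plot to disk (the file name and dimensions below are only examples):

ggsave("DailyRiverHerring2020.png", plot = RiverHerring, width = 8, height = 5)
#Save the "RiverHerring" plot as a PNG in the current working directory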
Now it is time to combine the knowledge you’ve gained from this lesson with the background information given in the other documents. You are presented with the following questions:
Question 1: Why is it important to check for normality before running parametric statistics?
Question 2: Pretend you are consulted for input on whether or not recreational fishing should be temporarily closed near the Walker’s Fishway in order to allow river herring to migrate without additional harassment. Which week(s) or month(s) would you choose to close, and how would you justify this decision?
Question 3: Let’s pretend that we were only interested in whether or not the electronic fish counts overestimated river herring migration. We test this by manipulating our Paired Samples Wilcoxon Test as seen below. Does the electronic counter overestimate fish at a statistically significant threshold?
#Now we will use the test to determine whether the electronic counts are statistically greater than the trap counts - we specify this by using the "greater" alternative hypothesis
wilcox.test(Trap$SRCount, Trap$TrapCount, paired= TRUE, alternative = "greater")
##
## Wilcoxon signed rank test with continuity correction
##
## data: Trap$SRCount and Trap$TrapCount
## V = 6940.5, p-value = 0.1251
## alternative hypothesis: true location shift is greater than 0
Question 4: In this lesson, we highlighted mortality reduction as the primary advantage to using an electronic fish counter over more invasive fish monitoring gears. Can you think of any other advantageous qualities?