This week we are looking at a case study about hypoxia in the Chesapeake Bay. The data were collected over by the Chesapeake Bay Program from 2010 to 2019. These teaching materials have been adapted from work housed by Project EDDIE. The original teaching materials were created by doctors Annette Brickley, Kathy Browne, and Gabi Smalley.
“Aquatic ecosystems are home to a complex intersection of physical and biological factors and an intersection of natural and anthropogenic factors. In the Chesapeake Bay, low oxygen events have occurred periodically and may be connected with harmful algal blooms, fish kills, heavy flooding/runoff events, and warming temperatures. Careful monitoring of the system by the Chesapeake Bay Program since 1984 allows scientists and policymakers to evaluate the causes of the events and monitor improvements in the health of the ecosystem.”
“In this activity, you will analyze a time series of dissolved oxygen concentrations, looking for low oxygen events in the Chesapeake Bay.
“Water quality data are available from the Chesapeake Bay Program at https://www.chesapeakebay.net, collected since 1984. Data for a single station (CB4.1C) in the mesohaline region of the Chesapeake Bay, a few miles south of the Chesapeake Bay Bridge, are provided for 2010-2019.”
“The data available are:
for both surface and bottom depths at station CB4.1C. Some of the data will be used in this activity and the other data will be used for further analysis in the following activities.”
Let us start out by loading the data and looking at the structure of the data frame.
# since we have an XLSX (Excel) file, we will use the read_xlsx() function
df <- readxl::read_xlsx("chesapeake_bay_program_data_student.xlsx")
str(df)
## tibble [138 × 23] (S3: tbl_df/tbl/data.frame)
## $ date : POSIXct[1:138], format: "2010-01-13" "2010-02-15" ...
## $ time : POSIXct[1:138], format: "1899-12-31 10:25:00" "1899-12-31 11:33:00" ...
## $ DO_surface : num [1:138] 13.3 13.3 11.5 9.2 8.9 8.7 6.8 7.9 7 6.2 ...
## $ DO_bottom : num [1:138] 11 8.9 7.6 4.1 2 0.7 0.24 0.28 0.39 0.3 ...
## $ temp_surface : num [1:138] 0.2 0.8 6.1 13.6 16 23.4 25.6 27.1 27.6 26.6 ...
## $ temp_bottom : num [1:138] 2.8 2.6 3.4 7.3 14 17.3 19.8 24.8 25.8 26.3 ...
## $ salinity_surface : num [1:138] 9.33 10.37 10.24 5.94 8.44 ...
## $ salinity_bottom : num [1:138] 16.5 17.3 17.4 16.8 17.4 ...
## $ NO3_surface : num [1:138] 0.623 0.559 0.618 0.791 0.447 ...
## $ NO3_bottom : num [1:138] 0.107 0.158 0.157 0.195 0.118 ...
## $ chlor_surface : num [1:138] 8.39 6.25 14.65 18.42 13.73 ...
## $ chlor_bottom : num [1:138] 32.47 74.76 31.28 35.24 6.56 ...
## $ pheo_surface : num [1:138] 0.687 1.434 0.946 1.482 1.434 ...
## $ pheo_bottom : num [1:138] 9.4 34.64 13.04 22.07 6.36 ...
## $ POC_surface : num [1:138] 0.57 0.687 1.2 1.61 1.26 1.04 1.4 1.83 1.09 1.34 ...
## $ POC_bottom : num [1:138] 2.19 6.28 3.28 3.83 1.11 0.453 0.579 0.532 0.313 0.325 ...
## $ PO4_surface : num [1:138] 0.0033 0.0018 0.0024 0.0025 0.0019 0.0038 0.0011 0.0045 0.0055 0.0165 ...
## $ PO4_bottom : num [1:138] 0.0025 0.0027 0.0033 0.0145 0.0038 0.009 0.0416 0.054 0.102 0.0595 ...
## $ TSS_surface : num [1:138] 5.7 3 2.8 5.7 4.6 3.4 5.2 2.8 3.8 4 ...
## $ TSS_bottom : num [1:138] 27.8 50.5 15.2 54 13.6 5.2 6.8 9.2 6 3.8 ...
## $ density_surface : num [1:138] 1.01 1.01 1.01 1 1.01 ...
## $ density_bottom : num [1:138] 1.01 1.01 1.01 1.01 1.01 ...
## $ density_difference: num [1:138] 0.00573 0.00554 0.00585 0.00923 0.0072 ...
Last time, we observed that nearly all of the hypoxia events (i.e. when the dissolved oxygen concentration levels were below 2 ppt) occurred in deeper waters. For our practice with hypothesis testing, we will need a categorical variable. Here is one way to create a two-category variable by binning the data.
# creating a categorical variable that indicate TRUE or FALSE for hypoxia
df <- df |>
mutate(hypoxia_bool = ifelse(DO_bottom < 2, "hypoxic", "healthy"))
It is impractical for us to measure a quantity at Chesapeake Bay station at every moment in time, but we can compute confidence intervals to gain understanding of the population statistics.
Let us start with the surface temperature.
# build bootstrap distribution
bootstrap_distribution <- df %>%
specify(response = temp_surface) %>%
generate(reps = 1337, type = "bootstrap") %>%
calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
shade_ci(endpoints = bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile"),
color = "#DAA900",
fill = "#002856")
# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 15.0 18.0
We are 95 percent confident that the true population mean for the surface temperature is between 15 and 18 degrees Celsius.
Next, let us look at the bottom-water temperature.
# build bootstrap distribution
bootstrap_distribution <- df %>%
specify(response = temp_bottom) %>%
generate(reps = 1337, type = "bootstrap") %>%
calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
shade_ci(endpoints = bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile"),
color = "#DAA900",
fill = "#002856")
# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 14.2 17.0
We are 95 percent confident that the true population mean for the bottom-water temperature is between 14 and 17 degrees Celsius.
predictor variable: temp_surface
response variable: DO_bottom
(i.e. we will stick to
the bottom DO today)
null hypothesis: the average surface temperatures are the same regardless of hypoxia events
alternative hypothesis: the average surface temperatures are different when hypoxia is present
\[\begin{array}{rrcl} H_{0}: \mu_{x} - \mu_{y} & = & 0 \\ H_{a}: \mu_{x} - \mu_{y} & \neq & 0 \\ \end{array}\]
df |>
ggplot(aes(x = hypoxia_bool, y = temp_surface)) +
geom_boxplot(aes(fill = hypoxia_bool),
color = "black") +
labs(title = "Chesapeake Bay Temperature Difference",
subtitle = "",
caption = "Data Source: Chesapeake Bay Program",
x = "classification",
y = "temperature (degrees Celsius)") +
theme_minimal()
From our studies, let us go toward computing the p-value.
# build a null distribution
null_distribution <- df %>%
specify(formula = temp_surface ~ hypoxia_bool) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# compute the observed difference in means
obs_diff_means <- df %>%
specify(formula = temp_surface ~ hypoxia_bool) %>%
calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# visualize the null distribution and shade in the p-value
visualize(null_distribution, bins = 10) +
shade_p_value(obs_stat = obs_diff_means, direction = "two_sided")
# compute the p-value directly
null_distribution %>%
get_p_value(obs_stat = obs_diff_means, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Since the p-value < 0.05, we reject the null hypothesis that the average surface temperatures are the same regardless of hypoxia events.
predictor variable: temp_bottom
response variable: DO_bottom
(i.e. we will stick to
the bottom DO today)
null hypothesis: the average bottom temperatures are the same regardless of hypoxia events
alternative hypothesis: the average bottom temperatures is greater when hypoxia is present
\[\begin{array}{rrcl} H_{0}: \mu_{x} - \mu_{y} & = & 0 \\ H_{a}: \mu_{x} - \mu_{y} & > & 0 \\ \end{array}\]
df |>
ggplot(aes(x = hypoxia_bool, y = temp_bottom)) +
geom_boxplot(aes(fill = hypoxia_bool),
color = "black") +
labs(title = "Chesapeake Bay Temperature Difference",
subtitle = "",
caption = "Data Source: Chesapeake Bay Program",
x = "classification",
y = "temperature (degrees Celsius)") +
theme_minimal()
From our studies, let us go toward computing the p-value.
# build a null distribution
null_distribution <- df %>%
specify(formula = temp_bottom ~ hypoxia_bool) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# compute the observed difference in means
obs_diff_means <- df %>%
specify(formula = temp_bottom ~ hypoxia_bool) %>%
calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# visualize the null distribution and shade in the p-value
visualize(null_distribution, bins = 10) +
shade_p_value(obs_stat = obs_diff_means, direction = "two_sided")
# compute the p-value directly
null_distribution %>%
get_p_value(obs_stat = obs_diff_means, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Since the p-value < 0.05, we reject the average bottom temperatures are the same regardless of hypoxia events.
When we want to analyze what might affect a response variable with several predictor variables, we can employ ANOVA (analysis of variance).
# wrangling with the data set to keep only the response variable ...
# and the possible predictor variables
df_for_anova <- df |>
select(-date) |> # date and time data are difficult at the Bio 18 level
select(-time) |>
select(-DO_surface) |> #surface and bottom dissolved oxygen levels were too related to be helpful
select(-hypoxia_bool)
# perform ANOVA: DO_bottom 'explained by' all of the other variables
summary(aov(DO_bottom ~ .,
data = df_for_anova))
## Df Sum Sq Mean Sq F value Pr(>F)
## temp_surface 1 1615.2 1615.2 1187.459 < 2e-16 ***
## temp_bottom 1 1.2 1.2 0.909 0.3425
## salinity_surface 1 57.9 57.9 42.542 1.8e-09 ***
## salinity_bottom 1 0.1 0.1 0.053 0.8186
## NO3_surface 1 1.9 1.9 1.416 0.2364
## NO3_bottom 1 24.6 24.6 18.061 4.3e-05 ***
## chlor_surface 1 0.9 0.9 0.627 0.4301
## chlor_bottom 1 0.0 0.0 0.016 0.9007
## pheo_surface 1 0.8 0.8 0.593 0.4426
## pheo_bottom 1 0.1 0.1 0.083 0.7737
## POC_surface 1 0.4 0.4 0.286 0.5936
## POC_bottom 1 2.3 2.3 1.676 0.1980
## PO4_surface 1 0.0 0.0 0.015 0.9019
## PO4_bottom 1 0.2 0.2 0.132 0.7166
## TSS_surface 1 1.0 1.0 0.705 0.4029
## TSS_bottom 1 0.2 0.2 0.122 0.7280
## density_surface 1 6.1 6.1 4.479 0.0364 *
## density_bottom 1 4.1 4.1 2.999 0.0859 .
## Residuals 118 160.5 1.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
The differences in water-bottom dissolved oxygen concentration levels seem to be explained by (note the asterisks correspond to significance levels)
For the pollutant that is assigned to your table, complete the following tasks
Compute sample statistics (mean, median, standard deviation) for the numerical variables both for surface and bottom readings.
# in these instructor solutions, I am going to focus on
# nitrates readings
df |>
summarize(xbar = mean(NO3_surface, na.rm = TRUE),
median = median(NO3_surface, na.rm = TRUE),
s = sd(NO3_surface, na.rm = TRUE))
## # A tibble: 1 × 3
## xbar median s
## <dbl> <dbl> <dbl>
## 1 0.247 0.158 0.261
# in these instructor solutions, I am going to focus on
# water-bottom nitrates readings, since that metric is
# probably the most significant
df |>
summarize(xbar = mean(NO3_bottom, na.rm = TRUE),
median = median(NO3_bottom, na.rm = TRUE),
s = sd(NO3_bottom, na.rm = TRUE))
## # A tibble: 1 × 3
## xbar median s
## <dbl> <dbl> <dbl>
## 1 0.0676 0.0366 0.0834
On the same ggplot
, use a couple of
geom_line()
layers to show the pollutant over time, but
both for surface and bottom readings.
df |>
ggplot() +
geom_line(aes(x = date, y = NO3_surface), color = "blue") +
geom_line(aes(x = date, y = NO3_bottom), color = "purple") +
labs(title = "Chesapeake Bay Oxygen Levels",
subtitle = "surface levels in blue, bottom levels in purple",
caption = "Data Source: Chesapeake Bay Program",
x = "date",
y = "nitrate concentration (mg/L)") +
theme_minimal()
Describe the graph. What observations do you have?
The nitrate levels seem to be lower when we are at the bottom of the bay.
Construct a scatterplot with the following specifications:
DO_bottom
df |>
ggplot(aes(x = NO3_bottom, y = DO_bottom)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Does water density affect oxygen levels?",
subtitle = paste("Correlation: ",
round(cor(df$DO_bottom, df$NO3_bottom,
use = "pairwise.complete.obs"), 4)),
caption = "Data Source: Chesapeake Bay Program",
x = "nitrate concentration (mg/L)",
y = "dissolved oxygen concentration (mg/L)") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
Describe the correlation between the variables.
The nitrate concentration and the dissolved oxygen concentration data at the water bottom are virtually uncorrelated.
Use a linear regression model to predict the water-bottom dissolved oxygen concentration at a reasonable level for your pollutant.
# build a linear model
lin_fit <- lm(DO_bottom ~ NO3_bottom, data = df)
# make a prediction (chose input value "near" sample statistics)
predict(lin_fit, newdata = data.frame(NO3_bottom = 0.04))
## 1
## 3.840552
When the nitrate concentration is 0.04 mg/L, we do not predict hypoxic conditions.
visualize
function to visualize the confidence
intervalvisualize
function to visualize the confidence
interval# build bootstrap distribution
bootstrap_distribution <- df %>%
specify(response = NO3_surface) %>%
generate(reps = 1337, type = "bootstrap") %>%
calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
shade_ci(endpoints = bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile"),
color = "#DAA900",
fill = "#002856")
# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.207 0.292
We are 95 percent confident that the true population mean for the nitrate concentration at the water surface is between 0.2039 and 0.2926 mg/L.
# build bootstrap distribution
bootstrap_distribution <- df %>%
specify(response = NO3_bottom) %>%
generate(reps = 1337, type = "bootstrap") %>%
calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
shade_ci(endpoints = bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile"),
color = "#DAA900",
fill = "#002856")
# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.0542 0.0822
We are 95 percent confident that the true population mean for the nitrate concentration at the water bottom is between 0.0548 and 0.0819 mg/L.
hypoxia_bool
In words, clearly define a null hypothesis and an alternative hypothesis toward your hypothesis test
Create a boxplot to show the numerical variable versus the categorical variable.
df |>
ggplot(aes(x = hypoxia_bool, y = NO3_bottom)) +
geom_boxplot(aes(fill = hypoxia_bool),
color = "black") +
labs(title = "Chesapeake Bay Nitrates",
subtitle = "",
caption = "Data Source: Chesapeake Bay Program",
x = "classification",
y = "nitrate concentration (mg/L)") +
theme_minimal()
Build a null distribution and visualize
the null
distribution.
# build a null distribution
null_distribution <- df %>%
specify(formula = NO3_bottom ~ hypoxia_bool) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# compute the observed difference in means
obs_diff_means <- df %>%
specify(formula = NO3_bottom ~ hypoxia_bool) %>%
calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# visualize the null distribution and shade in the p-value
visualize(null_distribution, bins = 10) +
shade_p_value(obs_stat = obs_diff_means, direction = "two_sided")
# compute the p-value directly
null_distribution %>%
get_p_value(obs_stat = obs_diff_means, direction = "two_sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.048
Finally, interpret the p-value.
Since the p-value is less than 0.05, we reject the null hypothesis of equal nitrate concentration levels regardless of hypoxia presence.