About the Data

This week we are looking at a case study about hypoxia in the Chesapeake Bay. The data were collected over by the Chesapeake Bay Program from 2010 to 2019. These teaching materials have been adapted from work housed by Project EDDIE. The original teaching materials were created by doctors Annette Brickley, Kathy Browne, and Gabi Smalley.

“Aquatic ecosystems are home to a complex intersection of physical and biological factors and an intersection of natural and anthropogenic factors. In the Chesapeake Bay, low oxygen events have occurred periodically and may be connected with harmful algal blooms, fish kills, heavy flooding/runoff events, and warming temperatures. Careful monitoring of the system by the Chesapeake Bay Program since 1984 allows scientists and policymakers to evaluate the causes of the events and monitor improvements in the health of the ecosystem.”

“In this activity, you will analyze a time series of dissolved oxygen concentrations, looking for low oxygen events in the Chesapeake Bay.

“Water quality data are available from the Chesapeake Bay Program at https://www.chesapeakebay.net, collected since 1984. Data for a single station (CB4.1C) in the mesohaline region of the Chesapeake Bay, a few miles south of the Chesapeake Bay Bridge, are provided for 2010-2019.”

“The data available are:

for both surface and bottom depths at station CB4.1C. Some of the data will be used in this activity and the other data will be used for further analysis in the following activities.”

Let us start out by loading the data and looking at the structure of the data frame.

# since we have an XLSX (Excel) file, we will use the read_xlsx() function
df <- readxl::read_xlsx("chesapeake_bay_program_data_student.xlsx")

str(df)
## tibble [138 × 23] (S3: tbl_df/tbl/data.frame)
##  $ date              : POSIXct[1:138], format: "2010-01-13" "2010-02-15" ...
##  $ time              : POSIXct[1:138], format: "1899-12-31 10:25:00" "1899-12-31 11:33:00" ...
##  $ DO_surface        : num [1:138] 13.3 13.3 11.5 9.2 8.9 8.7 6.8 7.9 7 6.2 ...
##  $ DO_bottom         : num [1:138] 11 8.9 7.6 4.1 2 0.7 0.24 0.28 0.39 0.3 ...
##  $ temp_surface      : num [1:138] 0.2 0.8 6.1 13.6 16 23.4 25.6 27.1 27.6 26.6 ...
##  $ temp_bottom       : num [1:138] 2.8 2.6 3.4 7.3 14 17.3 19.8 24.8 25.8 26.3 ...
##  $ salinity_surface  : num [1:138] 9.33 10.37 10.24 5.94 8.44 ...
##  $ salinity_bottom   : num [1:138] 16.5 17.3 17.4 16.8 17.4 ...
##  $ NO3_surface       : num [1:138] 0.623 0.559 0.618 0.791 0.447 ...
##  $ NO3_bottom        : num [1:138] 0.107 0.158 0.157 0.195 0.118 ...
##  $ chlor_surface     : num [1:138] 8.39 6.25 14.65 18.42 13.73 ...
##  $ chlor_bottom      : num [1:138] 32.47 74.76 31.28 35.24 6.56 ...
##  $ pheo_surface      : num [1:138] 0.687 1.434 0.946 1.482 1.434 ...
##  $ pheo_bottom       : num [1:138] 9.4 34.64 13.04 22.07 6.36 ...
##  $ POC_surface       : num [1:138] 0.57 0.687 1.2 1.61 1.26 1.04 1.4 1.83 1.09 1.34 ...
##  $ POC_bottom        : num [1:138] 2.19 6.28 3.28 3.83 1.11 0.453 0.579 0.532 0.313 0.325 ...
##  $ PO4_surface       : num [1:138] 0.0033 0.0018 0.0024 0.0025 0.0019 0.0038 0.0011 0.0045 0.0055 0.0165 ...
##  $ PO4_bottom        : num [1:138] 0.0025 0.0027 0.0033 0.0145 0.0038 0.009 0.0416 0.054 0.102 0.0595 ...
##  $ TSS_surface       : num [1:138] 5.7 3 2.8 5.7 4.6 3.4 5.2 2.8 3.8 4 ...
##  $ TSS_bottom        : num [1:138] 27.8 50.5 15.2 54 13.6 5.2 6.8 9.2 6 3.8 ...
##  $ density_surface   : num [1:138] 1.01 1.01 1.01 1 1.01 ...
##  $ density_bottom    : num [1:138] 1.01 1.01 1.01 1.01 1.01 ...
##  $ density_difference: num [1:138] 0.00573 0.00554 0.00585 0.00923 0.0072 ...

Last time, we observed that nearly all of the hypoxia events (i.e. when the dissolved oxygen concentration levels were below 2 ppt) occurred in deeper waters. For our practice with hypothesis testing, we will need a categorical variable. Here is one way to create a two-category variable by binning the data.

# creating a categorical variable that indicate TRUE or FALSE for hypoxia
df <- df |>
  mutate(hypoxia_bool = ifelse(DO_bottom < 2, "hypoxic", "healthy"))

Lecture: Suspecting Temperature

Confidence Intervals

It is impractical for us to measure a quantity at Chesapeake Bay station at every moment in time, but we can compute confidence intervals to gain understanding of the population statistics.

Let us start with the surface temperature.

# build bootstrap distribution
bootstrap_distribution <- df %>%
  specify(response = temp_surface) %>%
  generate(reps = 1337, type = "bootstrap") %>%
  calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
  shade_ci(endpoints = bootstrap_distribution |>
             get_ci(level = 0.95, type = "percentile"),
           color = "#DAA900", 
           fill = "#002856")

# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
  get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     15.0     18.0

We are 95 percent confident that the true population mean for the surface temperature is between 15 and 18 degrees Celsius.

Next, let us look at the bottom-water temperature.

# build bootstrap distribution
bootstrap_distribution <- df %>%
  specify(response = temp_bottom) %>%
  generate(reps = 1337, type = "bootstrap") %>%
  calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
  shade_ci(endpoints = bootstrap_distribution |>
             get_ci(level = 0.95, type = "percentile"),
           color = "#DAA900", 
           fill = "#002856")

# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
  get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     14.2     17.0

We are 95 percent confident that the true population mean for the bottom-water temperature is between 14 and 17 degrees Celsius.

Hypothesis Testing

  • predictor variable: temp_surface

  • response variable: DO_bottom (i.e. we will stick to the bottom DO today)

  • null hypothesis: the average surface temperatures are the same regardless of hypoxia events

  • alternative hypothesis: the average surface temperatures are different when hypoxia is present

\[\begin{array}{rrcl} H_{0}: \mu_{x} - \mu_{y} & = & 0 \\ H_{a}: \mu_{x} - \mu_{y} & \neq & 0 \\ \end{array}\]

df |>
  ggplot(aes(x = hypoxia_bool, y = temp_surface)) +
  geom_boxplot(aes(fill = hypoxia_bool),
               color = "black") +
  labs(title = "Chesapeake Bay Temperature Difference",
       subtitle = "",
       caption = "Data Source: Chesapeake Bay Program",
       x = "classification",
       y = "temperature (degrees Celsius)") +
  theme_minimal()

From our studies, let us go toward computing the p-value.

# build a null distribution
null_distribution <- df %>% 
  specify(formula = temp_surface ~ hypoxia_bool) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# compute the observed difference in means
obs_diff_means <- df %>% 
  specify(formula = temp_surface ~ hypoxia_bool) %>% 
  calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# visualize the null distribution and shade in the p-value
visualize(null_distribution, bins = 10) + 
  shade_p_value(obs_stat = obs_diff_means, direction = "two_sided")

# compute the p-value directly
null_distribution %>% 
  get_p_value(obs_stat = obs_diff_means, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Since the p-value < 0.05, we reject the null hypothesis that the average surface temperatures are the same regardless of hypoxia events.


  • predictor variable: temp_bottom

  • response variable: DO_bottom (i.e. we will stick to the bottom DO today)

  • null hypothesis: the average bottom temperatures are the same regardless of hypoxia events

  • alternative hypothesis: the average bottom temperatures is greater when hypoxia is present

\[\begin{array}{rrcl} H_{0}: \mu_{x} - \mu_{y} & = & 0 \\ H_{a}: \mu_{x} - \mu_{y} & > & 0 \\ \end{array}\]

df |>
  ggplot(aes(x = hypoxia_bool, y = temp_bottom)) +
  geom_boxplot(aes(fill = hypoxia_bool),
               color = "black") +
  labs(title = "Chesapeake Bay Temperature Difference",
       subtitle = "",
       caption = "Data Source: Chesapeake Bay Program",
       x = "classification",
       y = "temperature (degrees Celsius)") +
  theme_minimal()

From our studies, let us go toward computing the p-value.

# build a null distribution
null_distribution <- df %>% 
  specify(formula = temp_bottom ~ hypoxia_bool) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# compute the observed difference in means
obs_diff_means <- df %>% 
  specify(formula = temp_bottom ~ hypoxia_bool) %>% 
  calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# visualize the null distribution and shade in the p-value
visualize(null_distribution, bins = 10) + 
  shade_p_value(obs_stat = obs_diff_means, direction = "two_sided")

# compute the p-value directly
null_distribution %>% 
  get_p_value(obs_stat = obs_diff_means, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Since the p-value < 0.05, we reject the average bottom temperatures are the same regardless of hypoxia events.

(optional) ANOVA

When we want to analyze what might affect a response variable with several predictor variables, we can employ ANOVA (analysis of variance).

# wrangling with the data set to keep only the response variable ...
# and the possible predictor variables
df_for_anova <- df |>
  select(-date) |> # date and time data are difficult at the Bio 18 level
  select(-time) |>
  select(-DO_surface) |> #surface and bottom dissolved oxygen levels were too related to be helpful
  select(-hypoxia_bool)
# perform ANOVA: DO_bottom 'explained by' all of the other variables
summary(aov(DO_bottom ~ .,
            data = df_for_anova))
##                   Df Sum Sq Mean Sq  F value  Pr(>F)    
## temp_surface       1 1615.2  1615.2 1187.459 < 2e-16 ***
## temp_bottom        1    1.2     1.2    0.909  0.3425    
## salinity_surface   1   57.9    57.9   42.542 1.8e-09 ***
## salinity_bottom    1    0.1     0.1    0.053  0.8186    
## NO3_surface        1    1.9     1.9    1.416  0.2364    
## NO3_bottom         1   24.6    24.6   18.061 4.3e-05 ***
## chlor_surface      1    0.9     0.9    0.627  0.4301    
## chlor_bottom       1    0.0     0.0    0.016  0.9007    
## pheo_surface       1    0.8     0.8    0.593  0.4426    
## pheo_bottom        1    0.1     0.1    0.083  0.7737    
## POC_surface        1    0.4     0.4    0.286  0.5936    
## POC_bottom         1    2.3     2.3    1.676  0.1980    
## PO4_surface        1    0.0     0.0    0.015  0.9019    
## PO4_bottom         1    0.2     0.2    0.132  0.7166    
## TSS_surface        1    1.0     1.0    0.705  0.4029    
## TSS_bottom         1    0.2     0.2    0.122  0.7280    
## density_surface    1    6.1     6.1    4.479  0.0364 *  
## density_bottom     1    4.1     4.1    2.999  0.0859 .  
## Residuals        118  160.5     1.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

The differences in water-bottom dissolved oxygen concentration levels seem to be explained by (note the asterisks correspond to significance levels)

  • surface temperatures
  • surface salinity levels
  • surface density levels

Exercises: Explore a Pollutant

For the pollutant that is assigned to your table, complete the following tasks

Compute sample statistics (mean, median, standard deviation) for the numerical variables both for surface and bottom readings.

# in these instructor solutions, I am going to focus on 
# nitrates readings
df |>
  summarize(xbar = mean(NO3_surface, na.rm = TRUE),
            median = median(NO3_surface, na.rm = TRUE),
            s = sd(NO3_surface, na.rm = TRUE))
## # A tibble: 1 × 3
##    xbar median     s
##   <dbl>  <dbl> <dbl>
## 1 0.247  0.158 0.261
# in these instructor solutions, I am going to focus on 
# water-bottom nitrates readings, since that metric is
# probably the most significant
df |>
  summarize(xbar = mean(NO3_bottom, na.rm = TRUE),
            median = median(NO3_bottom, na.rm = TRUE),
            s = sd(NO3_bottom, na.rm = TRUE))
## # A tibble: 1 × 3
##     xbar median      s
##    <dbl>  <dbl>  <dbl>
## 1 0.0676 0.0366 0.0834

On the same ggplot, use a couple of geom_line() layers to show the pollutant over time, but both for surface and bottom readings.

df |>
  ggplot() +
  geom_line(aes(x = date, y = NO3_surface), color = "blue") +
  geom_line(aes(x = date, y = NO3_bottom), color = "purple") +
  labs(title = "Chesapeake Bay Oxygen Levels",
       subtitle = "surface levels in blue, bottom levels in purple",
       caption = "Data Source: Chesapeake Bay Program",
       x = "date",
       y = "nitrate concentration (mg/L)") +
  theme_minimal()

Describe the graph. What observations do you have?

The nitrate levels seem to be lower when we are at the bottom of the bay.

Correlation and Regression

Construct a scatterplot with the following specifications:

  • predictor variable: your pollutant readings for the water bottom
  • response variable: DO_bottom
  • add a layer to show a linear regression line
  • place the numerical correlation somewhere on the graph
df |>
  ggplot(aes(x = NO3_bottom, y = DO_bottom)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Does water density affect oxygen levels?",
       subtitle = paste("Correlation: ",
                        round(cor(df$DO_bottom, df$NO3_bottom,
                                  use = "pairwise.complete.obs"), 4)),
       caption = "Data Source: Chesapeake Bay Program",
       x = "nitrate concentration (mg/L)",
       y = "dissolved oxygen concentration (mg/L)") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Describe the correlation between the variables.

The nitrate concentration and the dissolved oxygen concentration data at the water bottom are virtually uncorrelated.

Use a linear regression model to predict the water-bottom dissolved oxygen concentration at a reasonable level for your pollutant.

# build a linear model
lin_fit <- lm(DO_bottom ~ NO3_bottom, data = df)

# make a prediction (chose input value "near" sample statistics)
predict(lin_fit, newdata = data.frame(NO3_bottom = 0.04))
##        1 
## 3.840552

When the nitrate concentration is 0.04 mg/L, we do not predict hypoxic conditions.

Confidence Intervals

  • build a 95 percent confidence interval for the water surface quantities of your pollutant
  • use the visualize function to visualize the confidence interval
  • build a 95 percent confidence interval for the water bottom quantities of your pollutant
  • use the visualize function to visualize the confidence interval
  • describe each confidence interval
# build bootstrap distribution
bootstrap_distribution <- df %>%
  specify(response = NO3_surface) %>%
  generate(reps = 1337, type = "bootstrap") %>%
  calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
  shade_ci(endpoints = bootstrap_distribution |>
             get_ci(level = 0.95, type = "percentile"),
           color = "#DAA900", 
           fill = "#002856")

# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
  get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.207    0.292

We are 95 percent confident that the true population mean for the nitrate concentration at the water surface is between 0.2039 and 0.2926 mg/L.

# build bootstrap distribution
bootstrap_distribution <- df %>%
  specify(response = NO3_bottom) %>%
  generate(reps = 1337, type = "bootstrap") %>%
  calculate(stat = "mean")
# visualize confidence interval with the bootstrap distribution
visualize(bootstrap_distribution) +
  shade_ci(endpoints = bootstrap_distribution |>
             get_ci(level = 0.95, type = "percentile"),
           color = "#DAA900", 
           fill = "#002856")

# computing the confidence interval directly from the bootstrap distribution
bootstrap_distribution |>
  get_ci(level = 0.95, type = "percentile")
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   0.0542   0.0822

We are 95 percent confident that the true population mean for the nitrate concentration at the water bottom is between 0.0548 and 0.0819 mg/L.

Hypothesis Testing

  • predictor variable: hypoxia_bool
  • response variable: your pollutant readings for the water bottom

In words, clearly define a null hypothesis and an alternative hypothesis toward your hypothesis test

  • null hypothesis: nitrate levels are the same regardless of hypoxic water conditions
  • alternative hypothesis: nitrate levels are lower when hypoxia is present

Create a boxplot to show the numerical variable versus the categorical variable.

df |>
  ggplot(aes(x = hypoxia_bool, y = NO3_bottom)) +
  geom_boxplot(aes(fill = hypoxia_bool),
               color = "black") +
  labs(title = "Chesapeake Bay Nitrates",
       subtitle = "",
       caption = "Data Source: Chesapeake Bay Program",
       x = "classification",
       y = "nitrate concentration (mg/L)") +
  theme_minimal()

Build a null distribution and visualize the null distribution.

# build a null distribution
null_distribution <- df %>% 
  specify(formula = NO3_bottom ~ hypoxia_bool) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# compute the observed difference in means
obs_diff_means <- df %>% 
  specify(formula = NO3_bottom ~ hypoxia_bool) %>% 
  calculate(stat = "diff in means", order = c("hypoxic", "healthy"))
# visualize the null distribution and shade in the p-value
visualize(null_distribution, bins = 10) + 
  shade_p_value(obs_stat = obs_diff_means, direction = "two_sided")

# compute the p-value directly
null_distribution %>% 
  get_p_value(obs_stat = obs_diff_means, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.048

Finally, interpret the p-value.

Since the p-value is less than 0.05, we reject the null hypothesis of equal nitrate concentration levels regardless of hypoxia presence.