---
title: "Hypoxia Case Study (Day 1)"
author: '[your name]'
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  word_document:
    toc: yes
  html_document:
    theme: cerulean
    toc: yes
  pdf_document:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library("tidyverse")
```

# About the Data

This week we are looking at a case study about hypoxia in the Chesapeake Bay.  The data were collected over by the Chesapeake Bay Program from 2010 to 2019.  These teaching materials have been adapted from work housed by Project EDDIE.  The original teaching materials were created by doctors Annette Brickley, Kathy Browne, and Gabi Smalley.

"Aquatic ecosystems are home to a complex intersection of physical and biological factors and an intersection of natural and anthropogenic factors. In the Chesapeake Bay, low oxygen events have occurred periodically and may be connected with harmful algal blooms, fish kills, heavy flooding/runoff events, and warming temperatures. Careful monitoring of the system by the Chesapeake Bay Program since 1984 allows scientists and policymakers to evaluate the causes of the events and monitor improvements in the health of the ecosystem."

"In this activity, you will analyze a time series of dissolved oxygen concentrations, looking for low oxygen events in the Chesapeake Bay. 

* When oxygen concentrations fall below 2 mg/L, conditions are called *hypoxic*.
* When concentrations are approximately 0 mg/L, conditions are *anoxic*. 

"Water quality data are available from the Chesapeake Bay Program at https://www.chesapeakebay.net, collected since 1984. Data for a single station (CB4.1C) in the mesohaline region of the Chesapeake Bay, a few miles south of the Chesapeake Bay Bridge, are provided for 2010-2019."

"The data available are: 

* sample date and time
* dissolved oxygen concentration (DO)
* water temperature
* salinity
* nitrate concentration (NO3-)
* phosphate concentration (PO4-3)
* chlorophyll concentration (ChlA)
* pheophytin concentration (Pheo)
* particulate organic carbon (POC)
* total suspended solids (TSS) 

for both surface and bottom depths at station CB4.1C. Some of the data will be used in this activity and the other data will be used for further analysis in the following activities."

# Lecture: Suspecting Density

Let us start out by loading the data and looking at the structure of the data frame.

```{r}
# since we have an XLSX (Excel) file, we will use the read_xlsx() function
df <- readxl::read_xlsx("chesapeake_bay_program_data_student.xlsx")

str(df)
```

## Exploratory Data Analysis

We can compute sample statistics (mean, median, standard deviation) for the dissolved oxygen.  Let us do so for both the surface and bottom readings.

```{r}
df |>
  summarize(xbar = mean(DO_surface, na.rm = TRUE),
            median = median(DO_surface, na.rm = TRUE),
            s = sd(DO_surface, na.rm = TRUE))
```

```{r}
df |>
  summarize(xbar = mean(DO_bottom, na.rm = TRUE),
            median = median(DO_bottom, na.rm = TRUE),
            s = sd(DO_bottom, na.rm = TRUE))
```

Next, we will look at some line graphs of the data.  Observe how the data are introduced in layers in the code.

```{r}
df |>
  ggplot() +
  geom_line(aes(x = date, y = DO_surface), color = "blue") +
  geom_line(aes(x = date, y = DO_bottom), color = "purple") +
  geom_hline(yintercept = 2, color = "red") + #only pertinent for this graph
  labs(title = "Chesapeake Bay Oxygen Levels",
       subtitle = "surface levels in blue, bottom levels in purple",
       caption = "Data Source: Chesapeake Bay Program",
       x = "date",
       y = "dissolved oxygen concentration (mg/L)") +
  theme_minimal()
```

Describe the plot.  What observations do you have about the data?



## Probability Calculations

To include some of our earlier studies about probability, let us assume that the dissolved oxygen levels are normally distributed.  What is the probability that a randomly selected surface reading shows hypoxia?

```{r}
pnorm(2, #i.e. hypoxic when dissolved oxygen < 2 milligrams per liter
      mean(df$DO_surface, na.rm = TRUE),
      sd(df$DO_surface, na.rm = TRUE))
```

What is the probability that a randomly selected bottom reading shows hypoxia?

```{r}
pnorm(2, #i.e. hypoxic when dissolved oxygen < 2 milligrams per liter
      mean(df$DO_bottom, na.rm = TRUE),
      sd(df$DO_bottom, na.rm = TRUE))
```

Hereafter, we will focus on the water-bottom oxygen readings.

This data set does not have an obvious choice of a categorical variable for us to practice hypothesis testing, so here is a quick bit of code to `bin` numerical data into a binary categorical variable.

```{r}
# creating a categorical variable that indicate TRUE or FALSE for hypoxia
df <- df |>
  mutate(hypoxia_bool = ifelse(DO_bottom < 2, "hypoxic", "healthy"))
```

## Correlation and Regression

We will move toward trying to find a predictor variable that explains well the differences in dissolved oxygen.  With two numerical variables, we should employ a scatterplot.

```{r}
df |>
  ggplot(aes(x = density_bottom, y = DO_bottom)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Does water density affect oxygen levels?",
       subtitle = paste("Correlation: ",
                        round(cor(df$DO_bottom, df$density_bottom,
                                  use = "pairwise.complete.obs"), 4)),
       caption = "Data Source: Chesapeake Bay Program",
       x = "water density (kg/L)",
       y = "dissolved oxygen concentration (mg/L)") +
  theme_minimal()
```

Describe the correlation between the variables.

The water density and dissolved oxygen water-bottom data are slightly and positively correlated.

For practice, here we predict the dissolved oxygen concentration when the water-bottom density is 1.018 kg/L.

```{r}
# build a linear model
lin_fit <- lm(DO_bottom ~ density_bottom, data = df)

# make a prediction
predict(lin_fit, newdata = data.frame(density_bottom = 1.018))
```

At a water density level of 1.018 kg/L, our model does not imply hypoxic conditions.


# Exercise: Suspecting Salinity

## Exploratory Data Analysis

Compute sample statistics (mean, median, standard deviation) for the `salinity_bottom` numerical variable.

```{r}
df |>
  summarize(xbar = mean(salinity_bottom, na.rm = TRUE),
            median = median(salinity_bottom, na.rm = TRUE),
            s = sd(salinity_bottom, na.rm = TRUE))
```

On the same `ggplot`, use a couple of `geom_line()` layers to show the salinity over time, but both for surface and bottom readings.

```{r}
df |>
  ggplot() +
  geom_line(aes(x = date, y = salinity_surface), color = "blue") +
  geom_line(aes(x = date, y = salinity_bottom), color = "purple") +
  labs(title = "Chesapeake Bay Salinity Levels",
       subtitle = "surface levels in blue, bottom levels in purple",
       caption = "Data Source: Chesapeake Bay Program",
       x = "date",
       y = "salinity (ppt)") +
  theme_minimal()
```

Describe the graph.  What observations do you have?



## Probability Calculations

Assume a normal distribution and compute the probability that a randomly selected water-bottom observation took place in freshwater (i.e. salinity < 35 ppt).

```{r}
pnorm(35, mean(df$salinity_bottom), sd(df$salinity_bottom))
```

Assume a normal distribution and compute the probability that a randomly selected water-bottom observation took place in saltwater (i.e. salinity > 35 ppt).

```{r}
1 - pnorm(35, mean(df$salinity_bottom), sd(df$salinity_bottom))
```

Use the above calculations to make a quick observation about the Chesapeake Bay.

Chesapeake Bay is a freshwater body of water.

## Correlation and Regression

Construct a scatterplot with the following specifications:

* predictor variable: `salinity_bottom`
* response variable: `DO_bottom`
* add a layer to show a linear regression line
* place the numerical correlation somewhere on the graph

```{r}
df |>
  ggplot(aes(x = salinity_bottom, y = DO_bottom)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Does salinity affect oxygen levels?",
       subtitle = paste("Correlation: ",
                        round(cor(df$DO_bottom, df$salinity_bottom,
                                  use = "pairwise.complete.obs"), 4)),
       caption = "Data Source: Chesapeake Bay Program",
       x = "salinity (ppt)",
       y = "dissolved oxygen concentration (mg/L)") +
  theme_minimal()
```

Describe the correlation between the variables.

The salinity and dissolved oxygen water-bottom data are virtually uncorrelated.

Use a linear regression model to predict the water-bottom dissolved oxygen concentration when the water-bottom salinity is 18 ppt.

```{r}
# build a linear model
lin_fit <- lm(DO_bottom ~ salinity_bottom, data = df)

# make a prediction
predict(lin_fit, newdata = data.frame(salinity_bottom = 18))
```


# Exercise: Suspecting Temperature

## Exploratory Data Analysis

Compute sample statistics (mean, median, standard deviation) for the `temp_surface` and `temp_bottom` numerical variables (units: degrees Celsius)

```{r}
df |>
  summarize(xbar = mean(temp_surface, na.rm = TRUE),
            median = median(temp_surface, na.rm = TRUE),
            s = sd(temp_surface, na.rm = TRUE))
```


```{r}
df |>
  summarize(xbar = mean(temp_bottom, na.rm = TRUE),
            median = median(temp_bottom, na.rm = TRUE),
            s = sd(temp_bottom, na.rm = TRUE))
```

On the same `ggplot`, use a couple of `geom_line()` layers to show the water temperature over time, but both for surface and bottom readings.

```{r}
df |>
  ggplot() +
  geom_line(aes(x = date, y = temp_surface), color = "blue") +
  geom_line(aes(x = date, y = temp_bottom), color = "purple") +
  labs(title = "Chesapeake Bay Density Levels",
       subtitle = "surface levels in blue, bottom levels in purple",
       caption = "Data Source: Chesapeake Bay Program",
       x = "date",
       y = "temperature (degrees Celsius)") +
  theme_minimal()
```

Describe the graph.  What observations do you have?




## Correlation and Regression

Construct a scatterplot with the following specifications:

* predictor variable: `temp_bottom`
* response variable: `DO_bottom`
* add a layer to show a linear regression line
* place the numerical correlation somewhere on the graph

```{r}
df |>
  ggplot(aes(x = temp_bottom, y = DO_bottom)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Does density affect oxygen levels?",
       subtitle = paste("Correlation: ",
                        round(cor(df$DO_bottom, df$temp_bottom,
                                  use = "pairwise.complete.obs"), 4)),
       caption = "Data Source: Chesapeake Bay Program",
       x = "temperature (degrees Celsius)",
       y = "dissolved oxygen concentration (mg/L)") +
  theme_minimal()
```

Describe the correlation between the variables.

The temperature and dissolved oxygen water-bottom data are highly and positively correlated.

Use a linear regression model to predict the water-bottom dissolved oxygen concentration when the water-bottom temperature is 18 degrees Celsius.

```{r}
# build a linear model
lin_fit <- lm(DO_bottom ~ temp_bottom, data = df)

# make a prediction
predict(lin_fit, newdata = data.frame(temp_bottom = 18))
```


# Exercise: Suspecting Stratification

## Exploratory Data Analysis

Compute sample statistics (mean, median, standard deviation) for the `density_difference` numerical variable (units: kg/L)

```{r}
df |>
  summarize(xbar = mean(density_difference, na.rm = TRUE),
            median = median(density_difference, na.rm = TRUE),
            s = sd(density_difference, na.rm = TRUE))
```

On a `ggplot`, use a `geom_line()` layer to show the density over time.

```{r}
df |>
  ggplot() +
  geom_line(aes(x = date, y = density_difference), color = "blue") +
  labs(title = "Chesapeake Bay Density Difference",
       subtitle = "",
       caption = "Data Source: Chesapeake Bay Program",
       x = "date",
       y = "density difference (kg/L)") +
  theme_minimal()
```


## Correlation and Regression

Construct a scatterplot with the following specifications:

* predictor variable: `density_difference`
* response variable: `DO_bottom`
* add a layer to show a linear regression line
* place the numerical correlation somewhere on the graph

```{r}
df |>
  ggplot(aes(x = density_difference, y = DO_bottom)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Does density difference affect oxygen levels?",
       subtitle = paste("Correlation: ",
                        round(cor(df$DO_bottom, df$density_difference,
                                  use = "pairwise.complete.obs"), 4)),
       caption = "Data Source: Chesapeake Bay Program",
       x = "density difference (kg/L)",
       y = "dissolved oxygen concentration (mg/L)") +
  theme_minimal()
```

Describe the correlation between the variables.

The density and dissolved oxygen water-bottom data are slightly and negatively uncorrelated.

Use a linear regression model to predict the water-bottom dissolved oxygen concentration when the water-bottom density is 0.0018 kg/L.

```{r}
# build a linear model
lin_fit <- lm(DO_bottom ~ density_difference, data = df)

# make a prediction
predict(lin_fit, newdata = data.frame(density_difference = 0.0018))
```