Today’s investigation. Human-animal interactions can be defining moments in life. However, when resources become limited, these interactions develop into conflicts where humans compete and often displace and re-shape wild populations. This problem has motivated conservation biologists to study the increasing frequency of human-wildlife conflicts. A fundamental step in studying wild populations from a conservation perspective involves the description of their size structure and growth. Today, we will explore how to study some of these population parameters employing the normal distribution, a statistical tool widely employed to describe many biological variables, using a dataset of round stingrays from CSULB Shark Lab.

Introduction

In this lab, we will examine the normal distribution and its Z-score standardization to describe the size structure of the round stingray population of Seal Beach, California (Figure 1). The increasing human population of southern California, coupled with the high population density of stingrays, has developed into a human-wildlife conflict resulting in many reported injuries. To maintain a balance between healthy fish populations and human recreation, it is crucial to understand the distribution of individuals in space.

Today, we will test whether the size structure of round stingrays in Southern California follows a normal distribution or whether it is shifted towards a particular size class. For this, we will use data from CSULB Shark Lab’s surveys led by Dr. Chris Lowe (Figure 1). In these surveys, Dr. Lowe and colleagues used a large fishing seine (Figure 1, left panel) to capture and measure the body size of live round stingrays. So, let’s explore the normal distribution and how it is a useful statistical tool for describing the proportion of sizes observed in this round stingray population.

Figure 1. Round stingray Urobatis halleri collection and sampling. Stingrays were collected at Seal Beach, California. A 30 m long by 4.5 m tall seine was used to collect stingrays (left panel). The disc width of collected stingrays was measured (right panel). Images: Chris Lowe.

Upon completion of this lab, you should be able to:

Describe the properties of the normal distribution;
Carry out the Z-standardization;
Estimate the area under the normal curve (probability) for a particular range of values.

Worked example

So far, we have seen several probability distributions. In Chapter 4, we introduced the population and sampling distributions, and in Chapter 5 we discussed the null probability distribution. If we look closer, all these probability distributions have a bell shape centered on a mean value.

Let’s explore this type of distribution commonly observed in biological data.

1. Defining the normal distribution

The normal distribution is a continuous probability distribution (see Chapter 4), meaning it describes the probability distribution of a continuous variable. It is symmetric and centered on its mean value. That is, the further a value is from the mean, the lower the probability density of observations (Figure 2).

Say we are interested in describing the diastolic blood pressure of a hypothetical human population of 20,000 adults. After collecting the data, we observe the following distribution of values:

Figure 2. Frequency distribution of diastolic blood pressure for a theoretical human population (n = 20,000; left panel) and the probability density curve (right panel).

These data have a mean = 70 mmHg and a standard deviation = 10 mmHg. Now, say we want to know how well these data fit a normal distribution. So, let’s fit a normal distribution to these data (mean = 70, standard deviation = 10):

Figure 3. Fitted normal distribution (red curve) to the observed data (gray curve).

In this case, the fitted normal distribution follows the observed data closely and thus we can be confident that our data are approximately normally distributed (Figure 3). This brings us to an important aspect of the normal distribution: a normal distribution is described by two parameters, the mean $\mu$ (location) and the standard deviation $\sigma$ (spread). As we discussed in Chapter 4, the y-axis shows a probability density (not a count). Thus, to get the probability that diastolic blood pressure falls within a range of values, we need to calculate the area under the curve over that range (integration through calculus).
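The area under the curve can be checked numerically. The sketch below assumes the example's mean of 70 mmHg and standard deviation of 10 mmHg, and an arbitrarily chosen range of 65 to 75 mmHg (the range is an illustration, not part of the original example). It integrates the density function dnorm() with base R's integrate() and compares the result with the cumulative probabilities from pnorm().

```r
# Parameters taken from the blood pressure example
mu <- 70
sigma <- 10

# Illustrative range (an assumption made for this sketch)
lower <- 65
upper <- 75

# Area under the normal density between lower and upper, by numerical integration
area_integrated <- integrate(dnorm, lower = lower, upper = upper,
                             mean = mu, sd = sigma)$value

# The same probability from the cumulative distribution function
area_pnorm <- pnorm(upper, mean = mu, sd = sigma) -
  pnorm(lower, mean = mu, sd = sigma)

area_integrated
area_pnorm
```

Both approaches should return the same probability, which is why in practice we can use pnorm() instead of doing the calculus ourselves.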

There are common features in any normal distribution. It is continuous and symmetric, it has a single mode and the probability density is highest at the mean. In other words, the mean, the median, and the mode, are all equal to each other.

Because of this, we can describe the area under the curve of a normal distribution using the location and the spread ($\mu$ and $\sigma$):

In a normal distribution, 68.3% of the values are within ± 1 standard deviation of the mean, while about 95% of them are within ± 2 standard deviations of the mean (more precisely, 95.4%; exactly 95% fall within ± 1.96 standard deviations). That is, a randomly chosen observation drawn from a normal distribution has a 68.3% chance of falling between $\mu - \sigma$ and $\mu + \sigma$. Similarly, there is about a 95% chance that a randomly chosen observation falls within 2 standard deviations of the mean. For our example, 68.3% of the individuals have an expected diastolic blood pressure between 60 and 80 mmHg (mean ± 1 SD) and approximately 95% have an expected diastolic blood pressure between 50 and 90 mmHg (mean ± 2 SD).
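These percentages can be reproduced in R. The following is a minimal sketch using the standard normal functions pnorm() and qnorm(), together with the mean and standard deviation of the blood pressure example; the variable names are ours.

```r
# Parameters taken from the blood pressure example
mu <- 70
sigma <- 10

# Probability of falling within 1 and 2 standard deviations of the mean
p_1sd <- pnorm(1) - pnorm(-1)   # about 0.683
p_2sd <- pnorm(2) - pnorm(-2)   # about 0.954

# Number of standard deviations that encloses exactly 95% of the values
z_95 <- qnorm(0.975)            # about 1.96

# Corresponding blood pressure ranges (mmHg)
range_1sd <- c(mu - sigma, mu + sigma)          # 60 80
range_2sd <- c(mu - 2 * sigma, mu + 2 * sigma)  # 50 90

p_1sd
p_2sd
z_95
```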

2. Estimating the standard normal distribution

The standard normal distribution is a normal distribution of mean = 0 and standard deviation = 1 (Figure 4). A random variable with a standard normal distribution is called Z (or Z-score). A Z-score gives us an idea of how far from the mean a data point is.

Figure 4. The standard normal distribution.

This standardization of the normal distribution is useful because it allows us to (a) calculate the probability of a Z-score occurring and (b) compare two Z-scores from different normal distributions. Any normally distributed variable can be standardized to mean = 0 and standard deviation = 1 with the following equation:

\[ \begin{aligned} Z=\frac{Y-\mu}{\sigma},\\\\ \end{aligned} \]

where Z is a standard normal random variable, Y is an observation, $\mu$ is the mean, and $\sigma$ is the standard deviation.
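The equation translates directly into R. Below is a minimal sketch; the helper name z_score() is our own and the blood pressure values are chosen for illustration.

```r
# Standardize observations y given a known mean and standard deviation
# (z_score is a helper defined for this sketch, not a built-in R function)
z_score <- function(y, mu, sigma) {
  (y - mu) / sigma
}

# Illustrative diastolic blood pressure values (mmHg), with mu = 70 and sigma = 10
dbp <- c(40, 70, 85, 100)
z <- z_score(dbp, mu = 70, sigma = 10)
z  # -3.0  0.0  1.5  3.0
```

Note that the units cancel out in the division, so a Z-score is a unitless number of standard deviations.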

(a) Probability of a Z-score occurring:

A value randomly sampled from the standard normal distribution has a 95% chance of falling between -1.96 and 1.96 (Figure 4). A statistical table helps us estimate the area under the curve, that is, the probability of obtaining a value within a given range (Figure 4). Following our example, say that a new drug for blood pressure is being used. However, this drug has severe adverse effects in individuals with a diastolic blood pressure lower than 40 mmHg or higher than 100 mmHg. So, what is the probability that an adult has a diastolic blood pressure (DBP) lower than 40 mmHg or higher than 100 mmHg?

Here, we follow the addition rule for mutually exclusive events:

\[ \begin{aligned} Pr[DBP < 40\ mmHg\ or\ DBP>100\ mmHg]&=Pr[DBP<40\ mmHg]+Pr[DBP>100\ mmHg] \end{aligned} \]

To solve this, the first step is to standardize 100 mmHg to a Z-score:

\[ \begin{aligned} Z=\frac{100\ mmHg-70\ mmHg}{10\ mmHg}=3.\\\\ \end{aligned} \]

That is, 100 mmHg occurs at 3 standard deviations above the mean (note that the units cancel out, so Z has no units). To know the probability of sampling a value equal to or greater than 3, we use the statistical table (statsexamples.com), which indicates that the probability of obtaining a Z-score equal to or greater than 3 is 1 - 0.9987 = 0.0013. In other words, 0.13% of adults have a diastolic blood pressure higher than 100 mmHg.

Similarly for 40 mmHg, we standardize it and obtain the area under the curve using the statistical table:

\[ \begin{aligned} Z=\frac{40\ mmHg-70\ mmHg}{10\ mmHg}=-3.\\\\ \end{aligned} \]

In this case, 40 mmHg occurs at 3 standard deviations below the mean. Keep in mind that the normal distribution is symmetrical and thus, the probability of being more than 3 standard deviations below the mean is the same as that of being more than 3 standard deviations above it. From the statistical table, the probability of sampling a value equal to or less than -3 is therefore 0.0013.

Thus,

\[ \begin{aligned} Pr[DBP < 40\ mmHg\ or\ DBP>100\ mmHg]&=Pr[DBP<40\ mmHg]+Pr[DBP>100\ mmHg]\\\\ &=0.0013+0.0013\\\\ &=0.0026. \end{aligned} \]

For our example, 0.26% of adults could suffer severe adverse effects from the new drug.
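The same calculation can be done in R without a statistical table. This is a minimal sketch using pnorm() with the mean and standard deviation of the example; the variable names are ours.

```r
# Parameters taken from the blood pressure example
mu <- 70
sigma <- 10

# Lower tail: probability of a diastolic blood pressure below 40 mmHg
p_low <- pnorm(40, mean = mu, sd = sigma)

# Upper tail: probability of a diastolic blood pressure above 100 mmHg
p_high <- pnorm(100, mean = mu, sd = sigma, lower.tail = FALSE)

# Addition rule for mutually exclusive events
p_total <- p_low + p_high
round(p_total, 4)
```

R returns about 0.0027 rather than 0.0026. The small difference arises because the table rounds each tail probability (about 0.00135) to 0.0013 before the two are added.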

(b) Comparing two Z-scores from different normal distributions:

Because Z-scores are unitless, standardization allows us to compare observations from different normal distributions directly. Take an observation with a Z-score of 2.7 in one population and another with a Z-score of 3.1 in a second hypothetical human population. With no other information, we know that the second observation lies 0.4 standard deviations farther from its population mean than the first one (3.1 - 2.7). Hence, we can use standardization to compare how extreme observations are relative to the spread of their own distributions.

3. Estimating the sampling distribution using the normal distribution

In Chapter 4, we studied the population and sampling distributions. The former is the whole set of values we are interested in and the latter is the probability distribution of all values for a sample statistic we might obtain when sampling the population. As we usually do not have data for all individuals in a population, we need to estimate the sampling distribution of the statistic of interest. The normal distribution is perfect for that! That is, if variable Y is normally distributed in the population (in our case the diastolic blood pressure), then the distribution of sample means $\overline{y}$ is also normal (review the central limit theorem in Info-Box!).

In Chapter 4, we also explained that $\overline{y}$ is an unbiased estimate of $\mu$. Thus, for our example we already know that the sampling distribution of $\overline{y}$ should be centered on 70 mmHg. The second parameter is the spread of the curve, which is the error around the mean (or the precision of $\overline{y}$) when estimating the sampling distribution: the standard error, $SE_{\overline{y}}$.

Say we are estimating the distribution from a sample of n = 100 individuals. The standard error is: \[ \begin{aligned} SE_{\overline{y}}&=\frac{\sigma}{\sqrt{n}}\\\\ &=\frac{10\ mmHg}{\sqrt{100}}=1\ mmHg \end{aligned} \]

From this equation, we can see that the spread of the sampling distribution depends on the sample size, n: the larger the sample, the smaller the standard error. For our example, the normal sampling distribution for $\overline{y}$ given $\mu$ and $\sigma$ is:

Figure 5. Sampling distribution of ȳ.

With this information, we can estimate the probability of randomly choosing a sample with a mean in a given range (remember this is a continuous distribution) using the standard normal distribution. Of course, we would need to know the true mean ($\mu$) for this.

For our example, say that we want to estimate the probability of drawing a sample of 100 individuals with a mean greater than 74 mmHg, Pr[$\overline{y}>74\ mmHg$].

First, we estimate the Z-score: \[ \begin{aligned} Z=\frac{\overline{y}-\mu}{SE_{\overline{y}}}=\frac{74-70}{1}=4 \end{aligned} \]

From a Z distribution statistical table (statsexamples.com), we find that the probability of obtaining a value equal to or greater than 4 is 0.00003. So, about 0.003% of samples of size n = 100 in this population will result in a $\overline{y}$ equal to or greater than 74 mmHg.
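The standard error, the Z-score, and the tail probability can also be computed in R. The sketch below uses the values of the example; the variable names are ours.

```r
# Parameters taken from the blood pressure example
mu <- 70      # population mean (mmHg)
sigma <- 10   # population standard deviation (mmHg)
n <- 100      # sample size

# Standard error of the sample mean
se <- sigma / sqrt(n)

# Z-score for a sample mean of 74 mmHg
z <- (74 - mu) / se

# Probability of a sample mean equal to or greater than 74 mmHg
p_mean_above_74 <- pnorm(z, lower.tail = FALSE)

se               # 1
z                # 4
p_mean_above_74  # about 0.00003
```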

Info-Box! The larger our samples, the more closely the distribution of sample means approaches a normal distribution. The central limit theorem states that the mean of a large number of measurements randomly sampled from a population is approximately normally distributed, even if the measurement itself is not normally distributed in the population.

For example, in Chapter 4 we estimated the distribution of age at delivery of rhesus macaque females from a sample n = 500 which clearly is not normally distributed (Chapter 4, Figure 4, see below). However, when estimating the sampling distribution of the mean age at delivery from a simulation of 10,000 samples, the resulting distribution is approximately normal.
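The central limit theorem can be illustrated with a short simulation. The sketch below does not use the rhesus macaque data. Instead, it assumes a hypothetical, strongly right-skewed population drawn from an exponential distribution, and a sample size of 100 chosen for illustration.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical right-skewed population (an assumption for this sketch)
population <- rexp(20000, rate = 1 / 5)

# Means of 10,000 random samples of n = 100 individuals each
n <- 100
sample_means <- replicate(10000, mean(sample(population, size = n)))

# The sampling distribution is centered on the population mean...
mean(population)
mean(sample_means)

# ...and its spread is close to the standard error, sigma / sqrt(n)
sd(population) / sqrt(n)
sd(sample_means)

# Proportion of sample means within 1 SD of their mean (expect about 0.683)
within_1sd <- mean(abs(sample_means - mean(sample_means)) <= sd(sample_means))
within_1sd
```

A histogram of the simulated means, for example with hist(sample_means), should look bell-shaped even though the population itself is skewed.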

Materials and Methods

R and RStudio
R packages ggplot2 and tidyverse
ray.csv

Today’s activity, “Size structure of round stingrays”, is organized into one main exercise with multiple challenges to describe the size structure of round stingrays using the normal distribution. These exercises will also motivate inferences about the usefulness of the Z-score standardization to describe biological parameters.

Size structure of round stingrays

Research question 1: What is the size structure of the population of round stingrays in Seal Beach?

1. Import the data

Let’s import the “ray” dataset to RStudio and explore it.
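One possible way to import and explore the data is sketched below. It assumes that ray.csv is saved in the current working directory; adjust the path if your file is stored elsewhere.

```r
# Assumption: ray.csv is saved in the current working directory
ray_path <- "ray.csv"

if (file.exists(ray_path)) {
  ray <- read.csv(ray_path)  # read the file into a data frame
  str(ray)                   # variable names and types
  dim(ray)                   # number of observations (rows) and variables (columns)
} else {
  message("ray.csv not found; check the working directory with getwd()")
}
```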

Questions:

How many variables and observations does “ray” have?
Considering the research question, what are our variables of interest and what type of variable are they?

2. Estimate the distribution of body sizes.

Let’s first explore the size distribution of round stingrays based on disc width.

# summary of disc width
summary(ray$disc_width)

Challenge: Create two useful data visualizations for this data.

Questions:

Does the distribution of disc width for round stingray follow a normal distribution?
What statistics do we need to test whether the data follow a normal distribution?

3. Fit a normal distribution to the disc width data

Let’s first estimate the sample mean and standard deviation for disc width.

# mean disc width 
m <- mean(ray$disc_width)
m

# standard deviation of disc width
sd <- sd(ray$disc_width)
sd

Using the two statistics needed to define a normal distribution, let’s overlay one on the density plot in ggplot2 using the function stat_function(). The first argument, fun, is the function to draw, which in this case is the normal probability density (dnorm). It is followed by n, the number of points at which the curve is evaluated (set here to the sample size), and args, a list holding the mean and standard deviation that are passed to dnorm.

# load ggplot2 for plotting
library(ggplot2)

# plotting the probability density of disc width
p1 <- ggplot(ray, aes(x = disc_width)) +
  geom_density()

p1

# fitting a normal distribution
p2 <- p1 + stat_function(fun = dnorm, n = 2427, args = list(mean = m, sd = sd),colour="red") 

p2

Questions:

Assuming a normal distribution, what range of disc width values makes up 68.3% of the data?
Assuming a normal distribution, what range of disc width values makes up 95% of the data?
What is the standard error (precision) for the mean disc width?

4. Estimate the standard normal distribution for size.

From the equation in Step 2 of the Worked example, we can easily standardize disc width. For this, let’s create a new column in our data frame ray and add the standardized values. We use the “$” sign to create a new column. Let’s call this new column “z”.

# estimating Z for disc width
ray$z <- (ray$disc_width-m)/sd

# checking the new column in "ray"
head(ray)

# plotting the probability density of z
p3 <- ggplot(ray, aes(x = z)) +
  geom_density()

p3
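As a check on the standardization, the mean of the Z-scores should be 0 and their standard deviation should be 1. The sketch below uses simulated disc widths rather than the ray dataset (the values are hypothetical), and also shows that base R's scale() function gives the same result as the manual calculation.

```r
set.seed(42)  # make the simulation reproducible

# Simulated disc widths in cm (hypothetical values, not the ray data)
disc_width_example <- rnorm(50, mean = 15, sd = 3)

# Manual standardization, as in the equation for Z
z_manual <- (disc_width_example - mean(disc_width_example)) / sd(disc_width_example)

# Same standardization with base R's scale(), converted back to a plain vector
z_scale <- as.numeric(scale(disc_width_example))

all.equal(z_manual, z_scale)  # TRUE
mean(z_manual)                # effectively 0
sd(z_manual)                  # 1
```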

Now that we have the standard normal distribution for disc width, we can estimate the probability that a captured round stingray has a disc width within a particular range of values. Say we want to estimate the probability that a stingray has a disc width equal to or smaller than 13 cm, which is the maximum size for stingrays in the “small” size class.

Let’s learn how to do this in R. For this, we use the function pnorm() which gives us the probability of getting a value equal to or less than Y under the normal curve. The first argument of the function is Y (in our case this value is 13), followed by the mean and standard deviation.

# probability of getting a disc width equal or less than 13 cm
pnorm(13, mean = m, sd = sd)
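The function pnorm() can also give the probability of a value above a cutoff or between two cutoffs. The sketch below uses placeholder values for the mean and standard deviation (not the estimates from the ray dataset) and a hypothetical upper cutoff of 18 cm; only the 13 cm limit comes from the text above.

```r
# Placeholder parameters (assumptions; replace with the estimates from "ray")
m_example <- 15   # mean disc width (cm)
sd_example <- 3   # standard deviation of disc width (cm)

lower_cut <- 13   # maximum size of the "small" class (from the text)
upper_cut <- 18   # hypothetical cutoff, for illustration only

# Probability of a value equal to or less than the lower cutoff
p_below <- pnorm(lower_cut, mean = m_example, sd = sd_example)

# Probability of a value between the two cutoffs: difference of two areas
p_between <- pnorm(upper_cut, mean = m_example, sd = sd_example) -
  pnorm(lower_cut, mean = m_example, sd = sd_example)

# Probability of a value greater than the upper cutoff
p_above <- pnorm(upper_cut, mean = m_example, sd = sd_example, lower.tail = FALSE)

# The three probabilities cover the whole curve, so they add up to 1
p_below + p_between + p_above
```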

Question:

What is the probability that a randomly sampled stingray has a disc width equal to or smaller than 13 cm?

Stop, Think, Do: using the dataset “ray”, estimate the probability that a randomly sampled round stingray in Seal Beach is small, medium or large. Stop and review the variables in the dataset. Think about how to obtain a probability for a range of values under a normal curve. Do the analysis in R and present it!

Discussion questions

Describe the properties of the normal distribution.
Differentiate between the normal and the standard normal distributions.
How does sample size affect the Z-score?

Great Work!

The Normal Distribution

Chapter 7
