Today’s investigation. Chemical exposure through consumer products is one of many threats to human and environmental health. However, the toxic risks of many of these products remain unknown. One way to determine these risks is through controlled lab experiments in which cell lines are exposed to different treatments, defined by the concentrations of a toxic substance, and the effects of these treatments on gene expression are measured. Today, we will learn how to employ a one-way analysis of variance, or ANOVA, to compare more than two group means using gene expression data for gonadotropin-releasing hormone (GnRH) from GT1-7 cell lines generated by the CSULB Molecular and Ecotoxicology lab.


Introduction

In this lab, we will carry out a one-way analysis of variance to determine differences in gene expression of hypothalamic cells grown under different concentrations of triclosan. Triclosan is an antimicrobial agent added to consumer products (e.g., toothpaste, hand soap) to aid in fighting bacteria. Research has found triclosan to be toxic, and it has therefore been banned from numerous products that may be ingested; however, it is still found in many others, such as clothing. Since then, many studies have focused on quantifying triclosan's effects on human health. Dr. Erika Holland and colleagues were the first to show that triclosan may alter the transcription of gonadotropin-releasing hormone, which is involved in reproduction by regulating follicle-stimulating hormone and luteinizing hormone.

Today, we will test whether gene expression for gonadotropin-releasing hormone in hypothalamic cells exposed to different concentrations of triclosan is altered. For this, we will use data from CSULB Molecular and Ecotoxicology lab’s experiments led by Dr. Erika Holland (Figure 1). In her experiments, Dr. Holland grew cells in the lab under different treatments to compare gene expression across groups. So, let’s explore the analysis of variance and how it is a useful statistical tool for describing differences across group means.


Figure 1. Image of fluorescently stained DNA (blue) and protein (green; left panel) from the CSULB Molecular and Ecotoxicology lab's work on gene regulation (right panel). Images: Erika Holland.



Upon completion of this lab, you should be able to:

  1. Explain how a one-way ANOVA partitions variation into among-group and within-group components.
  2. Carry out a one-way ANOVA in R and interpret its output.
  3. Compare group means using the Tukey test.
  4. Evaluate the model assumptions of the ANOVA.





Worked example

In Chapter 8, we used the t-test to compare the means of two groups. However, many times we want to compare the means of more than two groups. For that, we use the analysis of variance, or ANOVA. In essence, the ANOVA tests whether individuals drawn from different groups are, on average, more different than individuals drawn from the same group.

Let’s first state the model assumptions:

  1. The measurements in each group represent a random sample from the corresponding population.
  2. The variable is normally distributed in each group.
  3. The variance is the same in all groups.


As seen in Chapter 8, the null hypothesis when testing for differences among group means is that all means are the same, while the alternative hypothesis states that at least one group mean is different from the others:

\(H_0\): There is no difference among group means (\(\mu_1\) = \(\mu_2\) = \(\mu_3\)).

\(H_A\): At least one group mean, \(\mu_i\), is different from the others.


Let’s decompose the ANOVA by first defining the F-ratio and then partitioning the steps to estimate it.

The variance ratio, or F-ratio, is estimated from the ratio between the amount of variation among group means (group mean square) and the amount of variation within groups (error mean square):


\[ \begin{aligned} F=\frac{group\ mean\ square}{error\ mean\ square}=\frac{MS_{groups}}{MS_{error}}.\\\\ \end{aligned} \]

If the null hypothesis is true, the group and error mean squares are similar and thus F should be close to 1. We do not expect it to be exactly 1 because of sampling variability; we are interested in whether the observed ratio is consistent with 1.
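As a quick sanity check, here is a small simulation sketch (the group sizes are arbitrary; aov() is introduced formally in the Materials and Methods) showing that when all groups share the same mean, the F-ratio tends to be near 1:

# simulating the F-ratio when the null hypothesis is true:
# three groups drawn from the same normal distribution
set.seed(1)
f_under_h0 <- replicate(1000, {
  y     <- rnorm(30)                        # 30 observations sharing one mean
  group <- rep(c("a", "b", "c"), each = 10) # three arbitrary groups
  summary(aov(y ~ group))[[1]][["F value"]][1]
})
mean(f_under_h0)  # near 1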


1. Estimating the ANOVA, step-by-step

Say we want to know how much irrigation our plants should receive to increase monthly survival. For this, we set up an experiment with three treatments: plants with weekly irrigation, plants with weekly partial irrigation, and plants with no irrigation.

See the results (mean, standard deviation, and sample size per treatment) in the table below:


Treatment           Survival   s       n
irrigation          0.87       0.047   10
partial irrigation  0.95       0.034    9
no irrigation       0.72       0.045   10


This summary table comes from the data of each plant replicate in the table below:


Treatment Survival
irrigation 0.87, 0.89, 0.78, 0.79, 0.88, 0.90, 0.89, 0.92, 0.90, 0.87
partial irrigation 0.96, 0.94, 0.98, 0.90, 0.95, 0.89, 0.98, 0.98, 0.93
no irrigation 0.69, 0.75, 0.74, 0.81, 0.70, 0.69, 0.71, 0.74, 0.76, 0.65


To estimate the F-ratio, we first need to calculate the total sum of squares, \(SS_{total}\), which is composed of both sources of variation: variation within groups and variation among groups:


\[ \begin{aligned} SS_{total}&=SS_{error}+SS_{groups}\\\\ &=\displaystyle\sum_{i}^{} \displaystyle\sum_{j}^{} (y_{ij}-\overline{y_i})^2+\displaystyle\sum_{i}^{} n_i(\overline{y_i}-\overline{y})^2 \end{aligned} \]


where \(i\) represents a group, \(j\) represents an individual within group \(i\), \(\overline{y_i}\) represents the mean of group \(i\), and \(\overline{y}\) represents the grand mean (the mean of all the data across groups). In our example, \(i\) represents a treatment, \(y_{ij}\) represents the monthly survival of plant \(j\) in treatment \(i\), and \(n_i\) represents the sample size of treatment \(i\).


  1. First step: estimate the grand mean, \(\overline{y}\).
\[ \begin{aligned} \overline{y}&=\frac{\displaystyle\sum_{i}^{} n_i\overline{y_i}}{n}\\\\ \end{aligned} \]


For our example, \[ \begin{aligned} \overline{y}&=\frac{10(0.87)+9(0.95)+10(0.72)}{29}=0.843\\\\ \end{aligned} \]

  2. Second step: estimate the group sum of squares, \(SS_{groups}\).


\[ \begin{aligned} SS_{groups}=\displaystyle\sum_{i}^{} n_i(\overline{y_i}-\overline{y})^2\\\\ \end{aligned} \] For our example, \[ \begin{aligned} SS_{groups}&=10[0.87-(0.843)]^2+9[0.95-(0.843)]^2+10[0.72-(0.843)]^2\\\\ &=0.262 \end{aligned} \]


  3. Third step: estimate the error sum of squares, \(SS_{error}\).


From the variance equation, we reformulate \(SS_{error}\) as:


\[ \begin{aligned} SS_{error}&=\displaystyle\sum_{i}^{} \displaystyle\sum_{j}^{} (y_{ij}-\overline{y_i})^2=\displaystyle\sum_{i}^{}s_i^2(n_i-1) \end{aligned} \]


For our example,

\[ \begin{aligned} SS_{error}&=(0.047)^2(10-1)+(0.034)^2(9-1)+(0.045)^2(10-1)\\\\ &=0.047 \end{aligned} \]


  4. Fourth step: estimate the group and error mean squares, \(MS_{groups}\) and \(MS_{error}\). \[ \begin{aligned} MS_{groups}=\frac{SS_{groups}}{df_{groups}},\\\\ MS_{error}=\frac{SS_{error}}{df_{error}}, \end{aligned} \]

where the \(df\) are the degrees of freedom, defined as the number of groups minus 1 for \(MS_{groups}\) and as the total number of observations minus the number of groups for \(MS_{error}\). In our example, \(df_{groups}=3-1=2\) and \(df_{error}=29-3=26\).


For our example, \[ \begin{aligned} MS_{groups}&=\frac{0.262}{2}=0.131,\\\\ MS_{error}&=\frac{0.047}{26}=0.002. \end{aligned} \]


  5. Fifth step: estimate the variance ratio, \(F\). \[ \begin{aligned} F&=\frac{0.131}{0.002}=65.5.\\\\ \end{aligned} \]

As discussed above, if the null hypothesis is true, \(F\) should be close to 1. If not, we expect the variation among groups to be higher than within groups and thus \(F>1\). The R sketch below reproduces these five steps.
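To make the steps concrete, here is a small R sketch (the object names are my own) that reproduces the calculations from the summary table:

# summary statistics from the treatment table above
n    <- c(10, 9, 10)            # sample sizes
ybar <- c(0.87, 0.95, 0.72)     # group means
s    <- c(0.047, 0.034, 0.045)  # group standard deviations

# step 1: grand mean
grand_mean <- sum(n * ybar) / sum(n)         # 0.843

# step 2: group sum of squares
ss_groups <- sum(n * (ybar - grand_mean)^2)  # ~0.262

# step 3: error sum of squares
ss_error <- sum(s^2 * (n - 1))               # ~0.047

# step 4: mean squares
df_groups <- length(n) - 1          # 2
df_error  <- sum(n) - length(n)     # 26
ms_groups <- ss_groups / df_groups  # ~0.131
ms_error  <- ss_error / df_error    # ~0.0018

# step 5: F-ratio
ms_groups / ms_error  # ~71.8 (the text's 65.5 comes from rounding MS_error to 0.002)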


2. Estimating the p-value

To estimate the p-value, we need the sampling distribution of the F-statistic under \(H_0\) (the F-distribution). When examining the F-distribution in a statistical table (statsexamples.com), the critical value corresponding to an area of 0.05 in the right tail of the curve for \(df = 2\) and \(26\) is 3.37. In other words, to the right of 3.37, the area under the curve is 0.05. As our F-ratio is larger than 3.37, it lies farther out in the right tail of the curve. Thus, we reject the null hypothesis; p-value < 0.05. In other words, at least one group mean is different from the others.
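We can obtain the same critical value, and the exact p-value, in R (a small sketch using the worked-example values):

# critical value: area of 0.05 in the right tail for df = 2 and 26
qf(0.95, df1 = 2, df2 = 26)  # ~3.37

# p-value for the observed F-ratio
pf(65.5, df1 = 2, df2 = 26, lower.tail = FALSE)  # far below 0.05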


3. Estimating the R-squared

Finally, we can estimate the contribution of group differences to the total variation in the data by dividing \(SS_{groups}\) by \(SS_{total}\), where \(SS_{total}=SS_{error}+SS_{groups}\).


For our example, \[ \begin{aligned} R^2&=\frac{SS_{groups}}{SS_{total}}=\frac{0.262}{0.309}=0.848. \end{aligned} \]


Thus, 84.8% of the variation in the data is explained by differences among treatment groups.
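Continuing with the objects from the R sketch above:

# R-squared: fraction of the total variation explained by treatment
ss_total <- ss_groups + ss_error  # ~0.309
ss_groups / ss_total              # ~0.848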



Materials and Methods


Today’s activity, Toxic substances and gene regulation, is organized into one main exercise describing how the toxic substance triclosan alters GnRH transcription, which is involved in reproduction by regulating follicle-stimulating hormone and luteinizing hormone. The exercise will also help us become familiar with R’s model output and model-assumption diagnostics and learn how to interpret them.


Toxic substances and gene regulation


Research question 1: Does triclosan suppress gene expression in hypothalamic cells?

1. Import the data

Let’s import the “gene” dataset to RStudio and explore it.
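One possible way to read in the data is sketched below; the file name gene.csv and the CSV format are assumptions, and your import may differ (e.g., via RStudio's Import Dataset menu).

# importing the data, assuming a CSV file named "gene.csv"
gene <- read.csv("gene.csv")

# exploring the data: variables, observations, and types
str(gene)
head(gene)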


Questions:

  1. How many variables and observations does “gene” have?
  2. Considering the research question, what are the variables of interest?
  3. How many groups will you compare?


2. Carry out a one-way analysis of variance

To carry out an analysis of variance in R, we use the function aov(). Its first argument is a formula of the form response ~ explanatory, and the data argument indicates the data frame where these variables are located.

Let’s fit an ANOVA to our data and check the model output.

# ANOVA
gene_aov <- aov(ct~treatment,data=gene)
gene_aov

# ANOVA table
summary(gene_aov)


Stop, Think: Interpret the model output from R (ANOVA table). Stop and review the steps 1 to 5 of the Worked example. Think about how you can obtain each value in the model output.


Info-Box! Let’s interpret R’s aov() summary output.

# ANOVA table
summary(gene_aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## treatment    2  10.32   5.162   6.213 0.00513 **
## Residuals   33  27.41   0.831                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, treatment corresponds to the “groups” and Residuals to the “error” within groups. Following the nomenclature of the Worked example:

Degrees of freedom; Df:

\(Df_{groups}=3-1=2\)

\(Df_{error}=36-3=33\)


Sum of squares; Sum Sq:

\(SS_{groups}=10.32\)

\(SS_{error}=27.41\)


Mean sum of squares; Mean Sq:

\(MS_{groups}=\frac{10.32}{2}=5.162\)

\(MS_{error}=\frac{27.41}{33}=0.831\)


F-ratio; F value:

\(F=\frac{5.162}{0.831}=6.213\)


p-value, Pr(>F):

For \(df\) of 2 and 33 and an F-statistic = 6.213, the \(p-value<0.01\).
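We can confirm this from the F-distribution in R:

# p-value for F = 6.213 with df = 2 and 33
pf(6.213, df1 = 2, df2 = 33, lower.tail = FALSE)  # ~0.005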


**Slight differences due to rounding may be present.


3. Estimate the \(R^2\)

# R squared
gene_R2 <- 10.32/(10.32+27.41)
gene_R2


Question:

  1. How much variation in the data is explained by differences among treatment groups?


Challenge. Plot your data. Add all components for a good visualization.
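As one possible starting point, here is a minimal sketch (the axis labels are my own; adjust them to what ct measures in your dataset):

# boxplot of gene expression by treatment, with raw observations overlaid
library(ggplot2)

ggplot(gene, aes(x = treatment, y = ct)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.5) +
  labs(x = "Triclosan treatment", y = "Gene expression (ct)") +
  theme_classic()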


4. Carry out a multiple comparison using the Tukey test

Now, we know the ANOVA has a p-value < 0.05, but we don’t know which groups are actually different from each other. To compare group means, let’s use the Tukey test, a single-step multiple-comparison procedure used in conjunction with an ANOVA. The Tukey test allows us to find means that are significantly different from each other by comparing all possible pairs of means with a t-test-like approach.

For this, we can use the function TukeyHSD() where the first argument is the fitted model.

# Tukey test
gene_tukey <- TukeyHSD(gene_aov)
gene_tukey

# Visualization of the Tukey test
plot(gene_tukey)


Questions:

  1. Are all treatment groups different to each other?
  2. How can you interpret these results in terms of the effect of triclosan on gene expression?
  3. What are the assumptions of the ANOVA?


5. Check model assumptions

To test for model assumptions, we will examine the model residuals using the function residuals(). In an ANOVA, the model residuals (the difference between each observation and its group mean) should be normally distributed.

# extracting model residuals
gene_aov_res <- residuals(gene_aov)
gene_aov_res

# converting the residuals into a data frame for ggplot
gene_aov_res_df <- as.data.frame(gene_aov_res)
gene_aov_res_df

# histogram of model residuals
library(ggplot2)

p <- ggplot(gene_aov_res_df, aes(x = gene_aov_res)) +
  geom_histogram(binwidth = 0.1)
p


Calling plot() on the fitted aov object produces a series of model-checking plots: a plot of residuals vs fitted values, a Q-Q plot, a scale-location plot, and a constant-leverage plot.

# model checking plots
plot(gene_aov)


Info-Box! Let’s interpret the model checking plots.



Residuals vs Fitted shows the pattern of the residuals across the fitted values. Under the model assumptions, we expect a similar spread of residuals at each fitted value.

Normal Q-Q is a scatter plot created by plotting two sets of quantiles against one another. Under the model assumptions, we expect the standardized residuals to fall close to the straight line.

Scale-Location shows whether the spread of the residuals changes with the fitted values. Under the model assumptions, we expect no systematic change.


Considering that our sample size per group is not very large (\(n_i=12\)), and thus a few values may appear extreme, the normality assumption seems to hold.


Another way we can test for normality is the Shapiro-Wilk test, in which the null hypothesis states that our model residuals come from a normal distribution. For this, we can use shapiro.test().

# shapiro test
shapiro.test(gene_aov_res)


Question:

  1. According to the Shapiro-Wilk test, do you reject the null hypothesis? Explain.


6. Test for equal variances

Finally, we can use Bartlett’s test to test for equal variances. For this, we will use the function bartlett.test(). Its first argument is a formula of the form response ~ factor, and the data argument indicates the data frame.

# testing for equal variances
bartlett.test(ct~treatment, data=gene)  


If the test results in a p-value > 0.05, the data are consistent with equal variances (no difference in variance across groups). However, if the variances are significantly different across groups, then we have to transform the data. Transforming data means applying some mathematical function to the response variable in an attempt to meet the assumptions of the ANOVA. You can use the following general command.

# transforming the data using the natural log
transformed <- log(response_variable)  

Once transformed, you can use the function bartlett.test() to see if your transformed variable meets the assumptions.
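For example, a hypothetical sketch assuming a natural-log transformation of the response ct (the column name log_ct is illustrative):

# adding the log-transformed response to the data frame and re-testing
gene$log_ct <- log(gene$ct)
bartlett.test(log_ct ~ treatment, data = gene)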


Discussion questions

  1. Interpret the estimated F ratio.
  2. Interpret the estimated coefficient of determination, R-squared.
  3. Mention two reasons why we test for model assumptions.


Great Work!