Today’s investigation. Growth and reproduction depend on essential genes encoding particular enzymes. Such genes maintain functions related to DNA replication, metabolism, and protein translation. These observations have motivated studies on the association between genome length and the number of essential genes encoding fundamental enzymes, and have inspired questions about how well genome length predicts the number of particular genes of interest. Today we will use correlation and linear regression analyses to test for these associations and to model linear relationships using data generated by the CSULB Comparative Microbial Genomics Lab.


Introduction

In this lab, we will carry out a correlation analysis and a linear regression analysis to determine associations between microbial genome lengths and the number of essential and non-essential genes encoding particular enzymes. Organisms with large genomes, and thus more DNA to replicate, are predicted to have multiple copies of essential genes encoding enzymes that catalyze DNA replication, such as the DNA polymerase III alpha subunit. In contrast, non-essential genes, also called accessory genes, encode enzymes that are not integral to life and are therefore predicted to show no association with genome length.

Today, we will test whether these predictions hold using data provided by the CSULB Comparative Microbial Genomics lab led by Dr. Renaud Berlemont. In his analyses, Dr. Berlemont used microbial genomic data from the public database PATRICbrc to test predictions about the associations between genome characteristics and the number of genes encoding certain enzymes with important functional activity. So, let’s explore how we can use correlation and regression analyses to test for associations and linear relationships between two variables.



Upon completion of this lab, you should be able to:

  1. Test for associations between two numerical variables using correlation analysis.
  2. Fit a linear regression and use it to predict one variable from another.
  3. Interpret R’s model output for correlation and linear regression.


Reference:

Statistical tables for the Student’s \(t\)-distribution: statsexamples.com



Worked example

When testing the linear relationship between two numerical variables, we employ correlation and linear regression analyses. Correlation analysis tells us whether two numerical variables are associated, how strong that association is, and in what direction it runs. If, instead, we are interested in predicting the values of one variable from the values of the other, and in determining the rate of change of one variable with the other, then we use linear regression analysis. Of course, we often employ both, so let’s work through an integrated example.

As with most of the statistical tests discussed in this manual, correlation methods follow assumptions of independence and normality:

  1. The observations represent a random sample from the population.
  2. The two variables follow a bivariate normal distribution.

Take, for example, the scatter plots below:

Figure 1. Illustration of potential correlations between two numerical variables (n = 12) with their corresponding coefficients of correlation \(r\) (red).


Scatter plots can help us detect correlations between numerical variables: whether they are strongly associated and in what direction.

Take, for instance, Figure 1A. As \(x\) increases, \(y\) increases with a relatively low amount of “scatter”. Thus, scatter plot A shows a relatively strong positive correlation. Similarly, scatter plot B shows a relatively strong correlation, but an inverse one: as \(x\) increases, \(y\) decreases. Finally, scatter plot C shows a very weak correlation or none at all: as \(x\) increases, \(y\) changes unpredictably, or at least not along a line. Of course, the correlation between variables can be quantified using the correlation coefficient \(r\).


1. Estimating the correlation coefficient \(r\)

The linear correlation coefficient, r, measures the tendency of two numerical variables to change together along a line and is defined as: \[ \begin{aligned} r=\frac{\sum_{i}(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum_{i}(x_i-\overline{x})^2}\sqrt{\sum_{i}(y_i-\overline{y})^2}}, \end{aligned} \]


where \(x\) and \(y\) are the two numerical variables and \(i\) indexes the observations.


Here, the numerator, \(\sum_{i}(x_i-\overline{x})(y_i-\overline{y})\), is the sum of products: the sum, over all observations, of the product of the deviations of \(x\) and \(y\) from their mean values \(\overline{x}\) and \(\overline{y}\). These deviations can be positive (when observations are larger than the mean value) or negative (when observations are smaller than the mean value), and thus give information about the direction of the correlation. The denominator contains the sums of squares of \(x\) and \(y\), as seen in Chapter 9, each under a square root.

The correlation coefficient \(r\) takes values between -1 and 1. The closer \(r\) is to either -1 or 1, the stronger the correlation and the more closely the observations lie along a line.


Let’s estimate \(r\) for Figure 1A. Here, our hypotheses are:

\(H_0\): There is no linear correlation between \(x\) and \(y\) (\(\rho=0\)).
\(H_A\): \(x\) and \(y\) are linearly correlated (\(\rho\neq0\)).


The data from Figure 1A are in the table below.

Variable Observations
X 3, 6, 5, 8, 7.3, 3, 4, 6, 4, 5.8, 7.6, 5.4
Y 6, 10, 7, 12.1, 10.2, 4.3, 5.2, 10, 6.5, 7.8, 12.3, 8.7

For our example, \[ \begin{aligned} \sum_{i}(x_i-\overline{x})(y_i-\overline{y})&=46.618\\\\ \sum_{i}(x_i-\overline{x})^2&=31.683\\\\ \sum_{i}(y_i-\overline{y})^2&=76.049 \end{aligned} \] Thus, \[ \begin{aligned} r=\frac{46.618}{\sqrt{31.683}\sqrt{76.049}}=0.95 \end{aligned} \]

Therefore, \(x\) and \(y\) in our example are strongly and positively correlated.
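These sums are easy to reproduce. The short R sketch below (not part of today’s exercise) recomputes \(r\) by hand from the table above and checks the result against R’s built-in cor():

# data from Figure 1A
x <- c(3, 6, 5, 8, 7.3, 3, 4, 6, 4, 5.8, 7.6, 5.4)
y <- c(6, 10, 7, 12.1, 10.2, 4.3, 5.2, 10, 6.5, 7.8, 12.3, 8.7)

# sum of products and sums of squares
sxy <- sum((x - mean(x)) * (y - mean(y)))   # 46.618
sxx <- sum((x - mean(x))^2)                 # 31.683
syy <- sum((y - mean(y))^2)                 # 76.049

# correlation coefficient by hand
r <- sxy / (sqrt(sxx) * sqrt(syy))
r                                           # ~0.95

# check against R's built-in function
cor(x, y, method = "pearson")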


2. Estimating the standard error of \(r\)

To estimate how precise our estimate of \(r\) is, we use the standard error.


\[ \begin{aligned} SE_r=\sqrt{\frac{1-r^2}{n-2}} \end{aligned} \]

For our example, with \(n=12\) observations, \[ \begin{aligned} SE_r=\sqrt{\frac{1-(0.95)^2}{12-2}}=0.099 \end{aligned} \]
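A quick check of this calculation in R, reusing x, y, and r from the sketch above:

# standard error of r (n = 12 observations)
n <- length(x)
r_se <- sqrt((1 - r^2) / (n - 2))
r_se                                        # ~0.099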


3. Testing the hypothesis for correlation


Finally, to test our null hypothesis, we can use the \(t\)-statistic and the Student’s \(t\)-distribution with \(df=n-2\): \[ \begin{aligned} t=\frac{r}{SE_r} \end{aligned} \] For our example, \[ \begin{aligned} t=\frac{0.95}{0.099}=9.6 \end{aligned} \]


Using a statistical table for the \(t\)-distribution (statsexamples.com), we can see that the critical value at 0.025 (remember that this is a two-tailed test) is \(t=2.23\) with 10 degrees of freedom. As our estimated \(t\) is much more extreme, \(p\) is less than 0.05 and we reject the null hypothesis.
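The same test can be run in R; for a Pearson correlation, cor.test() reports this \(t\)-statistic, its degrees of freedom, and the p-value. A sketch, continuing from above:

# t-statistic and two-tailed p-value with df = n - 2
t_stat <- r / r_se
t_stat                                      # ~9.6
2 * pt(-abs(t_stat), df = n - 2)

# built-in equivalent: test of the correlation
cor.test(x, y, method = "pearson")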


4. Estimating the linear regression

If we are interested in predicting the value of variable \(y\) given a value of variable \(x\), and in quantifying the rate of change of \(y\) with \(x\), then we need to fit a linear regression to our data.


Here, our hypotheses are:

\(H_0\): The slope of the regression equals zero (\(\beta=0\)).
\(H_A\): The slope of the regression is different from zero (\(\beta\neq0\)).


Let’s fit a linear regression to each dataset in Figure 1.

Figure 2. Fitted linear regressions (red) between x and y (n = 12). The regression is defined by the linear equation presented.


Each of these regressions is defined by a linear equation: \[ \begin{aligned} y=a+bx, \end{aligned} \] where \(y\) is the response variable, \(x\) is the explanatory variable, \(b\) is the slope, and \(a\) is the \(y\)-intercept. These fitted lines are determined by the method of least squares: we find the line that minimizes the sum of squared deviations, along the \(y\)-axis, between the data points and the regression line. As observed in Figure 2, a positive slope indicates a positive linear relationship; as \(x\) increases, \(y\) increases (Figure 2A). A negative slope indicates a negative linear relationship; as \(x\) increases, \(y\) decreases (Figure 2B). On the other hand, a nearly horizontal line with \(b\) close to 0 indicates no relationship (Figure 2C).


Let’s explore how to estimate the two parameters of the linear regression: (1) the slope and (2) the intercept.


A. Estimating the slope \(b\):

The slope is defined as \[ \begin{aligned} b=\frac{\sum_i(x_i-\overline{x})(y_i-\overline{y})}{\sum_i(x_i-\overline{x})^2}, \end{aligned} \]


where \(\overline{x}\) and \(\overline{y}\) are the sample means of \(x\) and \(y\), and \(x_i\) and \(y_i\) are the measurements of each variable for individual \(i\). As in the formula for the correlation coefficient \(r\), the numerator is the sum of products (46.618). The denominator is the sum of squares of \(x\), as seen in Chapter 9.

For our data in Figure 2A, the slope equals:

\[ \begin{aligned} b=\frac{46.618}{31.683}=1.471 \end{aligned} \]


In our example from Figure 2A, \(y\) increases by 1.471 for every one-unit increase in \(x\). The units of the slope are \(y\)-units per \(x\)-unit.


B. Estimating the intercept \(a\):

The intercept follows from the fact that any fitted regression line passes through the point \((\overline{x},\overline{y})\): \[ \begin{aligned} \overline{y}&=a+b\overline{x}\\\\ a&=\overline{y}-b\overline{x} \end{aligned} \]

For our example, \[ \begin{aligned} a&=8.34-(1.471)(5.425)=0.36 \end{aligned} \]


Thus, the formula predicting \(y\) from \(x\) in our example is \(y=0.36+1.471x\).
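Both parameters can be recomputed from the sums obtained earlier and checked against lm(). A sketch, reusing x, y, sxy, and sxx from above:

# slope and intercept by hand
b <- sxy / sxx                              # ~1.471
a <- mean(y) - b * mean(x)                  # ~0.36
c(intercept = a, slope = b)

# check with R's linear model
coef(lm(y ~ x))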


5. Estimating the standard error of the slope

As learned in previous chapters, the standard error tells us the precision of our estimate of \(b\). The standard error is defined as


\[ \begin{aligned} SE_b=\sqrt{\frac{MS_{residual}}{\sum_i{(x_i-\overline{x})^2}}}, \end{aligned} \]

where the variance of the residuals \(MS_{residuals}\) is defined as


\[ \begin{aligned} MS_{residuals}=\frac{\sum_i(y_i-\overline{y})^2-b\sum_i(x_i-\overline{x})(y_i-\overline{y})}{n-2} \end{aligned} \]


For our example in Figure 2A, \(MS_{residuals}\) is


\[ \begin{aligned} MS_{residuals}=\frac{76.049-(1.471)(46.618)}{10}=0.747 \end{aligned} \]


Thus, \(SE_b\) is \[ \begin{aligned} SE_b=\sqrt{\frac{0.747}{31.683}}=0.154 \end{aligned} \]


For our example in Figure 2A, \(b=1.471\pm0.154\) \(y\)-units per \(x\)-units.
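A quick check of this calculation in R, reusing b, sxy, sxx, syy, and n from the sketches above:

# residual mean square and standard error of the slope
ms_resid <- (syy - b * sxy) / (n - 2)       # ~0.747
b_se <- sqrt(ms_resid / sxx)                # ~0.154
b_se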


6. Testing the hypothesis of the slope

To test our null hypothesis, we can use the \(t\)-statistic with \(df=n−2\):


\[ \begin{aligned} t=\frac{b-\beta_0}{SE_b} \end{aligned} \]


Because our null hypothesis is that the slope of the regression is zero, \(\beta_0=0\): \[ \begin{aligned} t=\frac{1.471-0}{0.154}=9.552 \end{aligned} \]


Using a statistical table for the \(t\)-distribution (statsexamples.com), we can see that the critical value at 0.025 (again, a two-tailed test) is \(t=2.23\) with 10 degrees of freedom. As our estimated \(t\) is more extreme, \(p\) is less than 0.05 and we reject the null hypothesis.
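A sketch of the by-hand test in R, reusing b, b_se, and n from above; the same \(t\) and p-value appear in the coefficient table of summary():

# t-statistic for the slope and two-tailed p-value
t_b <- (b - 0) / b_se
t_b                                         # ~9.55
2 * pt(-abs(t_b), df = n - 2)

# the same test appears in the coefficient table
summary(lm(y ~ x))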



Materials and Methods


Today’s activity, Functional associations in the genome, is organized into one exercise and one challenge to determine associations between microbial genome lengths and the number of genes encoding essential enzymes. These exercises will help us become familiar with, and learn to interpret, R’s model output for correlation and linear regression.


Functional associations in the genome


Research question 1: Do large genomes have a higher number of genes encoding the DNA polymerase III alpha subunit?

The DNA polymerase III alpha subunit is an important enzyme for DNA replication, and we would expect at least one copy of its gene per genome. If there is a lot of DNA to replicate (i.e., a large genome), multiple copies of this gene would be expected, because more gene copies would produce more DNA polymerase, making the replication process faster. Let’s test whether there is an association between genome length and the number of genes encoding the DNA polymerase III alpha subunit.

1. Import the data

Let’s explore the “genomics” dataset.

# importing the gene dataset
gene <- read.table("genomics.txt",header=TRUE,sep="\t",stringsAsFactors = TRUE)

gene


Questions:

  1. Considering the research question, what are the variables of interest?
  2. What type of variables are they and what visualization can be done?


2. Plot the data

Challenge 1. Plot the data with an appropriate visualization.
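Hint: Step 6 below adds a regression line to a ggplot object named p1, so store your plot under that name. If you get stuck, a minimal ggplot2 sketch (using the column names N_Gene_tot and DNApol_III_a_sub from the later steps) might look like this:

# loading ggplot2
library(ggplot2)

# scatter plot: genome length vs. number of DNA polymerase III alpha subunit genes
p1 <- ggplot(gene, aes(x = N_Gene_tot, y = DNApol_III_a_sub)) +
  geom_point()

p1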


Questions:

  1. Is there any pattern in the data?
  2. Make a prediction about a potential correlation in the data.


3. Estimate the correlation coefficient

We can estimate the correlation coefficient between the number of DNA polymerase III alpha subunit genes and the genome length using the function cor_test() (from the rstatix package), where the arguments are the data and the two variables of interest, and the method argument sets the type of correlation. Here, we will apply Pearson’s correlation coefficient.

# loading the rstatix package, which provides cor_test()
# (install it first with install.packages("rstatix") if needed)
library(rstatix)

# Pearson correlation
gene1_cor <- cor_test(gene, DNApol_III_a_sub, N_Gene_tot, method = "pearson")

# correlation coefficient
r <- gene1_cor$cor

r


We can also visualize the correlation coefficient for all possible combinations of variables. The R package corrplot provides many visualizations for correlation matrices, allowing us to plot all correlations from a dataset for more efficient visual exploration. For that, let’s convert “gene” into a matrix using the function as.matrix() and drop the non-numeric column (i.e., Strain) from “gene”.

# converting gene into a matrix
gene2 <- as.matrix(subset(gene,select=-c(Strain)))

# correlation coefficient
gene3 <- cor(gene2,method="pearson")

# installing corrplot
install.packages("corrplot")

# loading package
library(corrplot)

# "corrplot" with different visual methods
corrplot(gene3, method = "circle")
corrplot(gene3, method = "color")
corrplot(gene3, method = "color",type="lower")


4. Estimate the standard error of \(r\)

Let’s now estimate the precision of r.

# sample size
n <- 1351
n

# standard error of r
r_se <- sqrt((1-r^2)/(n-2))
r_se


5. Test the hypothesis using the \(t\)-test

Finally, let’s test our hypothesis with the Student’s \(t\)-test using the previous correlation analysis done with the function cor_test().

# t-test for correlation analysis
gene1_cor


Question:

  1. Do you reject the null hypothesis? Use a statistical table.


6. Estimate the regression

We can also fit a linear regression to test whether the number of genes encoding the DNA polymerase III alpha subunit can be predicted from the length of the genome. For this, we use the function lm(), whose first argument is a formula: the response variable, a tilde (“~”), and the explanatory variable. The second argument is the data frame.

# linear regression
lm1 <- lm(DNApol_III_a_sub~N_Gene_tot, data=gene) 

# summary of the model output
summary(lm1)


Question:

  1. Interpret the model output.
  2. Write down the linear equation using the model coefficients.


Finally, we can visualize the linear regression using the function geom_smooth() from ggplot2. Let’s add the linear regression to our plot from Step 2.

# adding a fitted regression line to p1 to visualize the trend
p2 <- p1 + geom_smooth(method="lm",se=TRUE) 

p2


Question:

  1. What is the predicted change in the number of DNA polymerase III alpha subunit genes as the genome length changes?


Info-Box! Let’s interpret R’s lm() summary output.

## 
## Call:
## lm(formula = DNApol_III_a_sub ~ N_Gene_tot, data = gene)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4507 -0.5351 -0.1494  0.3034 13.3006 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.346e-01  6.006e-02   12.23   <2e-16 ***
## N_Gene_tot  1.799e-04  1.334e-05   13.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8887 on 1349 degrees of freedom
## Multiple R-squared:  0.1188, Adjusted R-squared:  0.1181 
## F-statistic: 181.8 on 1 and 1349 DF,  p-value: < 2.2e-16

\(y\)-intercept: \[ \begin{aligned} a&=\overline{y}-b\overline{x}=0.7346 \end{aligned} \]


Slope: \[ \begin{aligned} b=\frac{\sum_i(x_i-\overline{x})(y_i-\overline{y})}{\sum_i(x_i-\overline{x})^2}=0.0001799 \end{aligned} \]

Standard error of the slope:

\[ \begin{aligned} SE_b=\sqrt{\frac{MS_{residual}}{\sum_i{(x_i-\overline{x})^2}}}=0.00001334 \end{aligned} \]


\(t\)-statistic: \[ \begin{aligned} t=\frac{b-\beta_0}{SE_b}=13.48 \end{aligned} \]


The p-value is < 0.0001.
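To pull these numbers directly out of the fitted object, we can use the standard accessors (a short sketch; lm1 is the model fitted in Step 6):

# coefficient table: estimates, standard errors, t-values, p-values
coef(summary(lm1))

# confidence intervals for the intercept and slope
confint(lm1)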


Challenge 2. Carry out your own analysis! The tRNA synthetases are enzymes that attach amino acids to tRNAs. Amino acids are the building blocks of proteins, and organisms need at least 20 of these enzymes, one for each of the 20 amino acids. However, some organisms have been described as having more than 20 tRNA synthetases. Researchers suspect that having more than 20 tRNA synthetases supports “faster” protein production. (1) Generate a related research question, (2) test it with your newly gained skills, and (3) plot your data!


Questions:

  1. What is your research question?
  2. Do the data support your null hypothesis?


Discussion questions

  1. Why would you want to fit a linear regression to your data?
  2. How would a scatter plot of two associated variables with a low standard error of \(r\) look?
  3. Does a strong correlation imply cause and effect?


Great Work!