In this bioinformatics adventure, we are going to continue to look amino acid sequences and compute similarity scores using the BLOSUM62 matrix, but we will take a glimpse at some of the tools that are available in the world of Bioconductor. This set of tasks are set after Exercise 3 in the Sequence Similarity materials. Load the R markdown file (the file with the .rmd extension) in RStudio
In this activity, we will practice with R
code. Some useful Windows keyboard shortcuts that R
programmers and students like to use while coding include the following (and code blocks are the gray rectangles where the R
code goes):
(and replace CTRL
with CMD
for Mac)
These R markdown files are best viewed in the HTML output (but PDF and Word versions have been supplied for convenience).
Bioconductor is a collection of R
packages that has been built by several bioinformatics researchers to perform common calculations in their field. In their own words, ``Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, and an active user community.’’ They keep the project and code documentation on their website at https://bioconductor.org/
(This activity assumes that you have already installed R
and you are advised to also have installed RStudio
.)
If you have not worked with Bioconductor before, the mechanisms for installing packages is different than the usual packages that are hosted by CRAN. In this module, we will be using the Biostrings
package (documentation). Copy and paste the following line of code into the console
(bottom left area of RStudio
) and then press ENTER to run the installation process.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
Next, now that BiocManager
(the package manager for Bioconductor) has been installed, we can use the double-colon operator (::) to specifically call the install
function from the BiocManager
package and install the Biostrings
package from Bioconductor. Once again, copy and paste the following line of code into the console
(bottom left area of RStudio
) and then press ENTER to run the installation process.
BiocManager::install("Biostrings")
At this point, you can verify the presence of our new code packages in RStudio by looking in the Packages
pane (bottom right area in RStudio
).
After installing a code package, the library
command tells RStudio
that we want to have the Biostrings
package available for our use presently in the current coding environment. Run the following code (either by pressing the green triangle here in RStudio or using the keyboard shortcut).
library("Biostrings")
We will once again use the HomoloGene database at NCBI and the following sample of amino acid sequences.
Homo_sapiens <- "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE"
Macaca_mulatta <- "MGDVEKGKKIFVMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSNTAANKNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKREERADLIAYLKKATNE"
Bos_taurus <- "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKKGEREDLIAYLKKATNE"
Gallus_gallus <- "MGDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKSERVDLIAYLKDATSK"
Xenopus_Silurana_tropicalis <- "MGDAEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGIVWDEGTLLEYLENPKKYIPGTKMIFAGIKKKGERQDLIAYLKQSTSS"
Danio_rerio <- "MGDVEKGKKVFVQKCAQCHTVENGGKHKVGPNLWGLFGRKTGQAEGFSYTDANKSKGIVWGEDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKSATS"
Drosophila_melanogaster <- "MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLIGRKTGQAAGFAYTDANKAKGITWNEDTLFEYLENPKKYIPGTKMIFAGLKKPNERGDLIAYLKSATK"
Zea_mays <- "MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTAGYSYSAGNKNKAVVWEEDTLYEYLLNPKKYIPGTKMVFPGLKKPQERADLIAYLKEATA"
Saccharomyces_cerevisiae_S288c <- "MTEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYTDANIKKNVLWDENNMSEYLTNPKKYIPGTKMAFGGLKKEKDRNDLITYLKKACE"
There are several pieces (subsequences) where the amino acids are the same, but the sequences might not align directly.
The cytochrome c genome is found in the mitochondrial membrane of eukaryotes. Since its function is quite old (in the evolutionary sense), this genome should be common in many eukaryotes.
For convenience, the block substitution matrix BLOSUM62
is available in the Biostrings
package. First, we use the data
function to bring the matrix into the current coding environment.
data("BLOSUM62")
Copy, paste, and run the following code in the console
of RStudio (will open the matrix in a new tab); then type your observations about the matrix below and outside the code block.
View(BLOSUM62)
The matrix is symmetrical (for dealing with pairs of amino acids in the possible substitutions), there are positive numbers for similar amino acids, there are negative numbers for dissimilar amino acids, and the rows and columns are labeled with the one-letter codes.
In this section, we will get a sense of how the Biostrings
package code can help us perform pairwise sequence alignments. In addition to taking a pair of sequences for inputs, the pairwiseAlignment
function has a few parameters that bioinformatics researchers may use to modify their calculations. Copy, paste, and run the following code in the console
to bring up the manual page for this function. Skim over the documentation and notice the variety of parameters (after the “pattern, subject” places for inputs).
?pairwiseAlignment
We will continue to use the block substitution matrix BLOSUM62
for the calculation of similarity scores. We will also use the AAString
function along the way (technical note: this function eases handling of much longer sequences and sequences/alignments with gaps). Otherwise, we will use the default parameter settings which include gapOpening = 10
and gapExtension = 4
.
For example, the following code will perform a pairwise sequence alignment analysis and save the results into a variable called alignment_result
alignment_result <- pairwiseAlignment(AAString(Bos_taurus),
AAString(Danio_rerio),
substitutionMatrix = BLOSUM62)
Let us look at the contents of alignment_result
.
alignment_result
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLH...PKKYIPGTKMIFAGIKKKGEREDLIAYLKKATNE
## subject: MGDVEKGKKVFVQKCAQCHTVENGGKHKVGPNLW...PKKYIPGTKMIFAGIKKKGERADLIAYLKSATS-
## score: 491
compareStrings
function does. (Hint: you can use ?compareStrings
to bring up the manual page)compareStrings(alignment_result)
## [1] "MGDVEKGKK?FVQKCAQCHTVE?GGKHK?GPNL?GLFGRKTGQA?GFSYTDANK?KGI?WGE?TLMEYLENPKKYIPGTKMIFAGIKKKGER?DLIAYLK?AT?"
The common amino-acids are displayed, while a question mark indicates a mismatch (a substitution).
pid
function does.pid(alignment_result)
## [1] 89.42308
The pid
function computes the how identical the sequences are and outputs the percentage of similarity.
pairwiseAlignment
, compareStrings
, and pid
functions) on a different pair of sequences that you think are similar (in the evolutionary sense).alignment_result2 <- pairwiseAlignment(AAString(Homo_sapiens),
AAString(Macaca_mulatta),
substitutionMatrix = BLOSUM62)
compareStrings(alignment_result2)
## [1] "MGDVEKGKKIF?MKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYS?TAANKNKGI?WGEDTLMEYLENPKKYIPGTKMIFVGIKK?EERADLIAYLKKATNE"
pid(alignment_result2)
## [1] 96.19048
pairwiseAlignment
, compareStrings
, and pid
functions) on a different pair of sequences that you think are dissimilar (in the evolutionary sense).alignment_result3 <- pairwiseAlignment(AAString(Homo_sapiens),
AAString(Saccharomyces_cerevisiae_S288c),
substitutionMatrix = BLOSUM62)
compareStrings(alignment_result3)
## [1] "M-----G???KG???F???C?QCHTVEKGG?HK?GPNLHG?FGR??GQA?GYSYT?AN??K???W?E????EYL?NPKKYIPGTKM?F?G?KK???R?DLI?YLKKA?+E"
pid(alignment_result3)
## [1] 60.90909
Finally, run the following code to perform a multiple-sequence alignment on our sample of amino acid sequences.
shortest_sequence_length <- min(nchar(Bos_taurus),
nchar(Danio_rerio),
nchar(Drosophila_melanogaster),
nchar(Gallus_gallus),
nchar(Homo_sapiens),
nchar(Macaca_mulatta),
nchar(Saccharomyces_cerevisiae_S288c),
nchar(Xenopus_Silurana_tropicalis),
nchar(Zea_mays))
aama_result <- AAMultipleAlignment(c(Bos_taurus, Danio_rerio, Drosophila_melanogaster,
Gallus_gallus, Homo_sapiens, Macaca_mulatta,
Saccharomyces_cerevisiae_S288c, Xenopus_Silurana_tropicalis,
Zea_mays),
end = shortest_sequence_length)
rownames(aama_result) <- c("Bos_taurus", "Danio_rerio", "Drosophila_melanogaster",
"Gallus_gallus", "Homo_sapiens", "Macaca_mulatta",
"Saccharomyces_cerevisiae_S288c", "Xenopus_Silurana_tropicalis",
"Zea_mays")
print(aama_result)
## AAMultipleAlignment with 9 rows and 104 columns
## aln names
## [1] MGDVEKGKKIFVQKCAQCHTVEKGGK...KMIFAGIKKKGEREDLIAYLKKATN Bos_taurus
## [2] MGDVEKGKKVFVQKCAQCHTVENGGK...KMIFAGIKKKGERADLIAYLKSATS Danio_rerio
## [3] MGVPAGDVEKGKKLFVQRCAQCHTVE...IPGTKMIFAGLKKPNERGDLIAYLK Drosophila_melano...
## [4] MGDIEKGKKIFVQKCSQCHTVEKGGK...KMIFAGIKKKSERVDLIAYLKDATS Gallus_gallus
## [5] MGDVEKGKKIFIMKCSQCHTVEKGGK...KMIFVGIKKKEERADLIAYLKKATN Homo_sapiens
## [6] MGDVEKGKKIFVMKCSQCHTVEKGGK...KMIFVGIKKREERADLIAYLKKATN Macaca_mulatta
## [7] MTEFKAGSAKKGATLFKTRCLQCHTV...YIPGTKMAFGGLKKEKDRNDLITYL Saccharomyces_cer...
## [8] MGDAEKGKKIFVQKCSQCHTVEKGGK...KMIFAGIKKKGERQDLIAYLKQSTS Xenopus_Silurana_...
## [9] MASFSEAPPGNPKAGEKIFKTKCAQC...PKKYIPGTKMVFPGLKKPQERADLI Zea_mays
At first, these functions that aid in multiple-sequence alignment require that the sequences have the same length. Alignments become exponentially more complicated as more sequences are queried. Researchers may also need to adjust the parameters (such as gapOpening
and gapExtension
along with the substitution matrix itself) to better coincide with the fidelity of the data.
knit
button (top left of this coding session’s area, near the save
button). This process will combine the R markdown
, HTML
code, and R
code and create an HTML
file (that can be viewed in any internet browser).This module of activity was adapted from the materials Sequence Similarity: An inquiry based and “under the hood” approach for incorporating molecular sequence alignment in introductory undergraduate biology courses by Kleinschmit, et al. (CourseSource link)
Further information about the pairwise- and multiple-sequence alignment Biostrings
tools can be found in the vignettes (Bioconductor link)
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Biostrings_2.60.0 GenomeInfoDb_1.28.0 XVector_0.32.0
## [4] IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.4.1 digest_0.6.27 bitops_1.0-7
## [4] magrittr_2.0.1 evaluate_0.14 zlibbioc_1.38.0
## [7] rlang_0.4.11 stringi_1.6.2 rmarkdown_2.8
## [10] tools_4.1.0 stringr_1.4.0 RCurl_1.98-1.3
## [13] xfun_0.23 yaml_2.2.1 compiler_4.1.0
## [16] htmltools_0.5.1.1 knitr_1.33 GenomeInfoDbData_1.2.6