--- title: "Module 3R" author: "[type your name here in between the quotation marks]" date: "`r format(Sys.time(), '%d %B, %Y')`" output: word_document: toc: yes html_document: toc: yes theme: cerulean pdf_document: toc: yes --- # Fork in the Road In this bioinformatics adventure, we are going to continue to look amino acid sequences and compute similarity scores using the *BLOSUM62* matrix, but we will take a glimpse at some of the tools that are available in the world of *Bioconductor*. This set of tasks are set after Exercise 3 in the *Sequence Similarity* materials. Load the R markdown file (the file with the .rmd extension) in `RStudio` In this activity, we will practice with `R` code. Some useful Windows keyboard shortcuts that `R` programmers and students like to use while coding include the following (and code blocks are the gray rectangles where the `R` code goes): * create a new code block: CTRL-ALT-I * run a line of code: CTRL-ENTER * run an entire code block: CTRL-SHIFT-ENTER (and replace `CTRL` with `CMD` for Mac) These R markdown files are best viewed in the HTML output (but PDF and Word versions have been supplied for convenience). \newpage # Bioconductor *Bioconductor* is a collection of `R` packages that has been built by several bioinformatics researchers to perform common calculations in their field. In their own words, ``*Bioconductor* provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, and an active user community.'' They keep the project and code documentation on their website at [https://bioconductor.org/](https://bioconductor.org/) ## Installation (This activity assumes that you have already installed `R` and you are advised to also have installed `RStudio`.) If you have not worked with Bioconductor before, the mechanisms for installing packages is different than the usual packages that are hosted by CRAN. In this module, we will be using the `Biostrings` package ([documentation](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)). Copy and paste the following line of code into the `console` (bottom left area of `RStudio`) and then press ENTER to run the installation process. ```{r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") ``` Next, now that `BiocManager` (the package manager for Bioconductor) has been installed, we can use the double-colon operator (::) to specifically call the `install` function from the `BiocManager` package and install the `Biostrings` package from Bioconductor. Once again, copy and paste the following line of code into the `console` (bottom left area of `RStudio`) and then press ENTER to run the installation process. ```{r, eval = FALSE} BiocManager::install("Biostrings") ``` At this point, you can verify the presence of our new code packages in RStudio by looking in the `Packages` pane (bottom right area in `RStudio`). ## Starting a Bioconductor Session After installing a code package, the `library` command tells `RStudio` that we want to have the `Biostrings` package available for our use presently in the current coding *environment*. Run the following code (either by pressing the green triangle here in RStudio or using the keyboard shortcut). ```{r, message = FALSE} library("Biostrings") ``` \newpage # The Samples We will once again use the [HomoloGene database at NCBI](http://www.ncbi.nlm.nih.gov/homologene) and the following sample of amino acid sequences. ```{r} Homo_sapiens <- "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE" Macaca_mulatta <- "MGDVEKGKKIFVMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSNTAANKNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKREERADLIAYLKKATNE" Bos_taurus <- "MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKKGEREDLIAYLKKATNE" Gallus_gallus <- "MGDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKSERVDLIAYLKDATSK" Xenopus_Silurana_tropicalis <- "MGDAEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGIVWDEGTLLEYLENPKKYIPGTKMIFAGIKKKGERQDLIAYLKQSTSS" Danio_rerio <- "MGDVEKGKKVFVQKCAQCHTVENGGKHKVGPNLWGLFGRKTGQAEGFSYTDANKSKGIVWGEDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKSATS" Drosophila_melanogaster <- "MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLIGRKTGQAAGFAYTDANKAKGITWNEDTLFEYLENPKKYIPGTKMIFAGLKKPNERGDLIAYLKSATK" Zea_mays <- "MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTAGYSYSAGNKNKAVVWEEDTLYEYLLNPKKYIPGTKMVFPGLKKPQERADLIAYLKEATA" Saccharomyces_cerevisiae_S288c <- "MTEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYTDANIKKNVLWDENNMSEYLTNPKKYIPGTKMAFGGLKKEKDRNDLITYLKKACE" ``` ## Task 1 1. In previous modules, we talked about similarity and alignment. Upon doing a visual glimpse of the amino acid sequences above, what do you observe about the sequences? Type your answer into the space below. 2. Summarize why we sought out the *cytochrome c* genome in those species. \newpage # BLOSUM62 For convenience, the block substitution matrix `BLOSUM62` is available in the `Biostrings` package. First, we use the `data` function to bring the matrix into the current coding environment. ```{r} data("BLOSUM62") ``` ## Task 2 Copy, paste, and run the following code in the `console` of RStudio (will open the matrix in a new tab); then type your observations about the matrix below and outside the code block. ```{r, eval = FALSE} View(BLOSUM62) ``` \newpage # Pairwise Sequence Alignments In this section, we will get a sense of how the `Biostrings` package code can help us perform pairwise sequence alignments. In addition to taking a pair of sequences for inputs, the `pairwiseAlignment` function has a few parameters that bioinformatics researchers may use to modify their calculations. Copy, paste, and run the following code in the `console` to bring up the *manual* page for this function. Skim over the documentation and notice the variety of parameters (after the "pattern, subject" places for inputs). ```{r, eval = FALSE} ?pairwiseAlignment ``` We will continue to use the block substitution matrix `BLOSUM62` for the calculation of similarity scores. We will also use the `AAString` function along the way (technical note: this function eases handling of much longer sequences and sequences/alignments with gaps). Otherwise, we will use the default parameter settings which include `gapOpening = 10` and `gapExtension = 4`. For example, the following code will perform a pairwise sequence alignment analysis and save the results into a variable called `alignment_result` ```{r} alignment_result <- pairwiseAlignment(AAString(Bos_taurus), AAString(Danio_rerio), substitutionMatrix = BLOSUM62) ``` Let us look at the contents of `alignment_result`. ```{r} alignment_result ``` ## Task 3 1. Run the following code, and then describe in your own words what the `compareStrings` function does. (Hint: you can use `?compareStrings` to bring up the manual page) ```{r} compareStrings(alignment_result) ``` 2. Run the following code, and then describe in your own words what the `pid` function does. ```{r} pid(alignment_result) ``` \newpage ## Task 4 1. Perform a pairwise sequence alignment (with the `pairwiseAlignment`, `compareStrings`, and `pid` functions) on a different pair of sequences that you think are *similar* (in the evolutionary sense). ```{r} ``` 2. Perform a pairwise sequence alignment (with the `pairwiseAlignment`, `compareStrings`, and `pid` functions) on a different pair of sequences that you think are *dissimilar* (in the evolutionary sense). ```{r} ``` \newpage # Multiple Sequence Alignment Finally, run the following code to perform a multiple-sequence alignment on our sample of amino acid sequences. ```{r} shortest_sequence_length <- min(nchar(Bos_taurus), nchar(Danio_rerio), nchar(Drosophila_melanogaster), nchar(Gallus_gallus), nchar(Homo_sapiens), nchar(Macaca_mulatta), nchar(Saccharomyces_cerevisiae_S288c), nchar(Xenopus_Silurana_tropicalis), nchar(Zea_mays)) aama_result <- AAMultipleAlignment(c(Bos_taurus, Danio_rerio, Drosophila_melanogaster, Gallus_gallus, Homo_sapiens, Macaca_mulatta, Saccharomyces_cerevisiae_S288c, Xenopus_Silurana_tropicalis, Zea_mays), end = shortest_sequence_length) rownames(aama_result) <- c("Bos_taurus", "Danio_rerio", "Drosophila_melanogaster", "Gallus_gallus", "Homo_sapiens", "Macaca_mulatta", "Saccharomyces_cerevisiae_S288c", "Xenopus_Silurana_tropicalis", "Zea_mays") print(aama_result) ``` ## Task 5 1. Based on the results of the previous code block and your work with these materials, what complications do you think bioinformatics researchers encounter when wishing to perform a multiple-sequence alignment? \newpage # Wrap-Up * Click the `knit` button (top left of this coding session's area, near the `save` button). This process will combine the `R markdown`, `HTML` code, and `R` code and create an `HTML` file (that can be viewed in any internet browser). * This HTML file is what your instructor will request for the assignment submission. ## References This module of activity was adapted from the materials *Sequence Similarity: An inquiry based and "under the hood" approach for incorporating molecular sequence alignment in introductory undergraduate biology courses* by Kleinschmit, et al. ([CourseSource link](https://www.coursesource.org/courses/sequence-similarity-an-inquiry-based-and-under-the-hood-approach-for-incorporating-molecular#tabs-0-content=0)) Further information about the pairwise- and multiple-sequence alignment `Biostrings` tools can be found in the vignettes ([Bioconductor link](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)) ```{r} sessionInfo() ```