Genome Sequence Data in R using Biostrings (Swirl Lesson)

Author(s): Robert E Furrow

University of California, Davis

Summary:
By the end of this lesson, students should be able to load FASTA files into R as DNAStringSets and use width() and alphabetFrequency(), combined with other functions like sum() and mean(), to evaluate genome assembly quality and nucleotide frequencies. 

Description

This swirl lesson familiarizes students with DNAStringSets from the Biostrings package in the R programming language and builds their skills in manipulating and analyzing genomic sequence data. By the end of this lesson, students should be able to load FASTA files into R as DNAStringSets and use width() and alphabetFrequency(), combined with other functions like sum() and mean(), to evaluate genome assembly quality and nucleotide frequencies. The swc file is self-contained and can be used to install the complete lesson using swirl. The lesson plan PDF contains a longer description of the lesson and its context, as well as suggestions for implementation. The lab 2 Rmd and PDF files outline example material to use leading up to the swirl lesson and to assess student ability to use the tools. The lab 3 Rmd and PDF files contain follow-up material with more advanced approaches to working with DNAStringSets.
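The skills named above can be illustrated with a minimal sketch. It is not taken from the lesson itself: the three toy contigs, the temporary file, and all variable names are assumptions made for illustration, and a real analysis would read an actual assembly FASTA file instead. It assumes the Bioconductor package Biostrings is installed.

```r
# Requires Biostrings from Bioconductor: BiocManager::install("Biostrings")
library(Biostrings)

# Hypothetical toy "assembly" of three contigs (not data from the lesson).
toy <- DNAStringSet(c(contig1 = "ATGCGCGCTA",
                      contig2 = "GGGCCC",
                      contig3 = "ATAT"))

# Write the toy contigs to a temporary FASTA file, then load them back
# the way a real assembly file would be loaded.
fasta_path <- tempfile(fileext = ".fasta")
writeXStringSet(toy, fasta_path)
assembly <- readDNAStringSet(fasta_path)

# width() gives one length per contig; sum() and mean() summarize them.
contig_lengths <- width(assembly)
total_length <- sum(contig_lengths)
mean_length <- mean(contig_lengths)

# alphabetFrequency() with baseOnly = TRUE returns a matrix with one row
# per contig and columns A, C, G, T, and other.
base_counts <- alphabetFrequency(assembly, baseOnly = TRUE)

# GC content of the whole assembly: G and C counts over total length.
gc_content <- sum(base_counts[, c("G", "C")]) / total_length
```

Total length and mean contig length are simple first checks on an assembly, and the per-contig count matrix makes it easy to compare nucleotide composition across contigs.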

The lesson was designed and implemented in Spring Quarter 2020 for Genome Hunters, a course at the University of California, Davis. This course-based research experience guided students through exploratory data analysis on genome assemblies of novel microbes. These assemblies can be loaded into R using tools from the Biostrings package, and the genomes can be analyzed both with custom functions and with the package's many built-in tools. The materials are aimed at first-year undergraduates with an interest in biology, although the early versions of the class have included students from all four college years. Although this swirl lesson introduces only the basics of counting nucleotide frequencies and exploring contig lengths, many students went on to use additional tools in the Biostrings package for their individual projects. A full set of the computational lab materials for this class is available online at https://bookdown.org/joelrome88/bis23b/, and the labs before and after this swirl homework assignment are included in this resource as R Markdown and PDF files.

Cite this work

Researchers should cite this work as follows: