Lesson

Learning R for Biologists: A Mini Course Grab-Bag for Instructors

Author(s): Amanda D. Clark1, Laurie S. Stevison*1

Auburn University

Editor: Srebrenka Robic

Courses: Anatomy-Physiology, Bioinformatics, Ecology, Genetics

Keywords: R, data analysis, RStudio

Abstract

As biology becomes more data driven, teaching students data literacy skills has become central to the biology curriculum. Despite a wealth of online resources that teach researchers how to use R, few offer practical laboratory-based exercises with teaching resources such as keys, learning objectives, and assessment materials. Here, we present a modular set of lessons and lab activities to help teach R through the platform of RStudio. Both software applications are free and open source, making this curriculum highly accessible across various institutions. This curriculum was developed over several years of teaching a graduate level computational biology course. In response to the pandemic, the class was shifted to be completely online. These resources were then migrated to GitHub to make them broadly accessible to anyone wanting to learn R for the analysis of biological datasets. In the following year, these resources were used to teach the course in a flipped format, which is the lesson plan presented here. In general, students responded well to the flipped format, which used class time to conduct live coding demos and work through challenges with the instructor and teaching assistant. Overall, students were able to use these skills to practice analyzing and interpreting data, as well as producing publication quality graphics. While the modules presented range from very basic (simple summary statistics and plotting) to quite advanced (using R on the command line), teachers should feel free to pick and choose which elements to incorporate into their own curriculum.

Primary Image: R‐Mini‐Course: An Introduction to R. The primary image was generated with BioRender to be a small representation of the applicability of R that we cover in our course.

Citation

Clark AD, Stevison LS. 2023. Learning R for Biologists: A Mini Course Grab-Bag for Instructors. CourseSource 10. https://doi.org/10.24918/cs.2023.12

Society Learning Goals

Bioinformatics
  • Computation in the life sciences
    • What is the role of computation in hypothesis-driven discovery processes within the life sciences?
    • What computational concepts are important in bioinformatics?
    • What statistical concepts are important in bioinformatics?
  • Computational Skills
    • What higher-level computational skills can be used in bioinformatics research?

Lesson Learning Goals

Students will:
  • become comfortable in the RStudio working environment.
  • appreciate the importance of reproducibility in science.
  • understand basic programming and statistical concepts.
  • appreciate the importance of R as a platform for data manipulation and graphics through various applications to biological datasets.

Lesson Learning Objectives

  • Students will practice the art of reproducibility using R scripts.
  • Students will practice conducting basic summary statistics in R.
  • Students will practice manipulating files in R.
  • Students will practice performing basic statistical tests in R.
  • Students will be exposed to programming constructs in R (e.g., loops, conditionals, etc.).
  • Students will be exposed to using R on the command line.
  • Students will practice exploring data visually and make basic plots in R.
  • Students will be exposed to advanced graphics tools in R.
The mini course contains learning objectives specific to each module/activity.

Article Context

Introduction

Massive amounts of data from a wide range of scientific fields have been and are still being generated (1–3). With the scale of data, the art of reproducibility becomes key to the scientific process. The application of statistical analysis and appropriate visualization is crucial to making sense of large, complex data, and drawing reasonable conclusions about the data and its underlying structure. Today, these data analysis steps are often a bottleneck in research, frequently because statistical, visualization, and/or programming skills receive limited coverage as a formal part of the curriculum (4, 5). As the deluge of data continues, it is important to cultivate skills for properly manipulating, analyzing, and interpreting the “big data” that biologists face today. Introducing data analysis skills as early as the undergraduate level benefits individual students by encouraging the development of critical evaluation skills when consuming statistical methodology and data visualizations in the literature (6–8). These skills also help students think deeply about their own data, encouraging them to pursue creative methods of analysis and visualization to communicate their research findings without misrepresenting the data. In a larger context, research itself will progress and greatly benefit from more individuals learning, developing, and applying analytical skills to existing and expected data alike.

The R and RStudio environments (9, 10) provide a solid introduction to these skills without the steep learning curve of delving directly into Linux. As part of a larger course titled ‘Introduction to Computational Biology’, which is offered to advanced undergraduate and graduate level students, we have developed content that meets this need for instructors to use. As a result of the COVID-19 pandemic, we shifted the course to be fully online in Fall 2020, allowing us to adapt the last third of this semester-long course into a self-paced mini course focused on introductory data analysis and visualization. We then migrated this mini course to GitHub (Figure 1) in a modular format, giving instructors a publicly available “grab-bag” of instructional material and assessments (11). In Fall 2021, upon returning to in-person learning, we used the pre-recorded videos to adopt a flipped format. It is this most recent iteration of the course that we describe here.

To give students context for the volume and variety of today’s data, we expose students to a variety of datasets, including datasets built into R as well as publicly available, class-generated, randomly generated, and genomic datasets. We built our course material around R because it is a widely used programming language that specializes in both statistical and visual data analysis and is freely available across all computing platforms (Mac, PC, and Linux). R can be accessed via RStudio, an integrated development environment (IDE), which packages R into a GUI-based environment that we found to be more familiar to most students and ideal for introducing the R programming language.

Our mini course is not the only self-paced, free R tutorial available online. Notably, swirl is an R package from swirlstats where you “learn R, in R” (12). We appreciated and adopted this same concept in our course and suggest it as a prerequisite to our mini course. We also found a web-based tutorial by Hasse Walum and Desirée De Leon that covers R programming and statistics basics, as well as the popular visualization package “ggplot2” (13). We found their tutorial both creative and relevant to our student audience and suggest it as an additional prerequisite to our mini course. There are several other R tutorials, but many of the available online R courses are constrained to one facet of R (focusing strictly on programming, statistics, or visualization), lack accompanying live-coding demonstrations and tutorials within an R environment, or lack relevant exercises and datasets for applying what was learned. The more comprehensive options usually burden students or educators with access costs, but see (14). We were motivated to develop and package the R-focused modules of our course to contribute to freely available R resources that provide comprehensive and relevant approaches to biological data analysis and can be easily adapted for any relevant course, or even scaffolded across courses in a broader curriculum.

Intended Audience

The course material was generated for upper-level undergraduate and first-year graduate students within the sciences. In Fall 2021, our student make-up was 70% graduate and 30% undergraduate level students. Although this was the fourth time teaching this course, this was the first time we offered this course in a flipped format, leveraging the materials that had been developed to make it fully online the previous year. We see this mini course as having multiple intended audiences, and hope that the material is usable and informative for any interested audience, including independent, self-motivated users. First, instructors interested in developing lower-level assessments that use R and RStudio in introductory courses could follow the content on their own to feel more comfortable adapting it for their courses. Second, many new faculty may be asked to develop courses that cover these topics and could adapt this set of activities and tutorials into their courses to reduce the up-front labor. While they may not have students watch our provided videos, they could use our videos as a guide for developing their own lectures following a very similar format of an interactive walk-through demo. Third, we see the GitHub site as an excellent resource for independent undergraduate researchers in any biology lab to get an introduction to these skills and to the self-paced nature of research without a significant investment from faculty mentors. The assessments include keys that faculty could use to gauge proficiency at the end of the course. Finally, for instructors interested in developing a Course-Based Undergraduate Research Experience (CURE), the materials presented here would help jump-start a semester CURE where students use data to perform an authentic research project throughout the remainder of the semester.

Required Learning Time

Multiple course periods were necessary to cover the material, but, because the R unit is modular, instructors can select which components to adapt and integrate into their course materials (Table 1). We provided students with 5–35 minute live-coding demonstration videos prior to class. We used scheduled course time (1 hour and 15 minutes twice per week) to address questions and challenge students to apply concepts that were demonstrated in the coding demos or to conduct lab activities. All lab activities were accompanied by handouts with application and reflection questions. We covered the entirety of the material over a 4-week period, with two in-person meetings per week. The R mini course now on GitHub is also modular and completely self-paced; therefore, users can work through it in one week or one semester. Each module has a separate page that includes instructional videos (about 2.5 hours in total across modules), written materials or handouts, and accompanying activities to engage users in a hands-on learning experience.

Table 1. Lesson timeline. The R unit of our course was split across four weeks. Following a basic introductory video to R, there were six pre-recorded R walk-through tutorials covering various topics. We also designed three lab-based activities to correspond to the material in the walk-through videos. This material is also available on GitHub as an R mini course (11), which is split into seven individual modules, each with supplemental readings and the videos referenced here. Depending on how in-depth instructors would like to go in their courses, they can select which modules and activities to adapt for their classrooms.

| Activity | Description | Estimated Time | Notes |
| --- | --- | --- | --- |
| Preparation for Class | | | |
| Download R and RStudio | Have students download both R and RStudio | 10 minutes | Both software applications are free and compatible across all computing platforms. Alternatively, instructors can work with IT to have VMs or Dockers setup ahead of time. |
| Basic R tutorials | Two links to outside RStudio tutorials are provided | 30–45 minutes each | These two are fabulous for novices and we felt were better than trying to create something new. |
| Week 1 | | | |
| Introductory Video (Module 1 in GitHub) | R and the RStudio Environment | 5 minutes | This video introduces the layout of RStudio and shows students the environment they will be using throughout the mini course. |
| Lab-based activity (Activity #1 in GitHub) | Plotting coverage along a contig | 1.5 hours | A 2-page handout, the dataset, and a key are supplied. Students will learn basic summary statistics and data manipulation as well as graphically exploring a dataset. Students will make an Rscript. |
| Walk-through video tutorial (Module 2 in GitHub) | Students learn how to conduct basic summary statistics in R | 30 minutes | Two datasets are provided along with the code from the video. Students should follow along to make sure they feel comfortable applying the functions covered. |
| Week 2 | | | |
| Walk-through video tutorial (Module 3 in GitHub) | Data manipulation in R | 20 minutes | Using one of the same datasets from Module 2 and a built-in R dataset, students learn to manipulate datasets in R. Students learn how to install an R package. |
| Walk-through video tutorial (Module 4 in GitHub) | Advanced Statistical Concepts in R | 25 minutes | Using a class generated dataset, students will learn how to deal with factor data and subset data in different ways. Students will learn how to fit statistical models and extract p-values. |
| Week 3 | | | |
| Lab-based activity (Activity #2 in GitHub) | Practice Graphing in R | 1.5 hours | Students will work interactively in RStudio to do a walk-through tutorial that will show them how to manipulate and customize graphs in R. Reinforces concepts from video walk-throughs. |
| Walk-through video tutorial (Module 5 in GitHub) | Advanced Graphing in R (this video was also available to students in Week 2 for those that wanted to get ahead) | 20 minutes | Using a dataset from previous videos, students will learn how to manipulate graphs in R to produce publication quality images and write to a file. |
| Walk-through video tutorial (Module 6 in GitHub) | Using R on the Command Line | 35 minutes | Students will work with the data from the first lab activity to make a script that can be executed on the command line. This requires some shell experience from the students and teachers. Loess smoothing is also covered, though the video could be shortened to remove this part. |
| Week 4 | | | |
| Walk-through video tutorial (Module 7 in GitHub) | Programming in R | 15 minutes | Students learn how to apply programming concepts and make functions. Data used is from a 3-point mapping experiment. |
| Lab-based activity (Activity #3 in GitHub) | R on a supercomputer – building a pipeline | 2 hours | Students will use the script generated in the previous walk-through video to upload to a supercomputer. They will then design a script that runs the R code on multiple files to generate multiple graphics. Jobs should be submitted to a queueing system and resulting images downloaded. |

 

Prerequisite Student Knowledge

Since this is taught as part of a biology curriculum, we have a general expectation that students know some basic introductory biology concepts. These include knowledge of chromosome structure (e.g., that it is a contiguous molecule of DNA with features such as centromeres and telomeres), genetic mapping (e.g., that recombination occurs during meiosis and how to calculate recombination frequency), and allometry (that different body metrics often correlate with one another).

Additionally, this course has a prerequisite of a lower-level introductory statistics course. Still, we have seen students without this prerequisite perform well in the course. The main concepts we hope students already understand include how to read and interpret graphs of data with x and y axes, such as a histogram, and basic summary statistics such as mean and standard deviation. We try to assume little to no knowledge coming into the course, but in practice the range of abilities is quite broad across students. Our biggest advice is to make sure students have the right attitude when approaching this content. It is quite different from other Biology courses, and therefore can be intimidating. We spend time at the beginning of the course asking students to have a growth mindset in their approach to the content, a method shown to improve student achievement in mathematics (15). We also included testimonials from former students to help keep new students from feeling overwhelmed at the outset.

The R course material was integrated into a curriculum designed to introduce computational methods that are frequently employed in biological sciences. Before the R-focused unit of the course, we imparted knowledge foundational to working with large and complex raw biological data. This included navigating on the command line, using virtual machines, and high-performance computing clusters (HPC). These skills are only necessary for the most advanced activity of the R course material, particularly navigating the command line on an HPC. However, as this is a grab-bag, instructors can opt not to use that module if they do not intend to teach command line or HPC skills.

Prerequisite Teacher Knowledge

Instructors planning to adapt these modules into their course should feel confident with R and the RStudio environment. They should also make sure they are confident in the specific learning objectives covered in the modules they choose to integrate into their courses or labs. Because the datasets in the course material span several biological fields, it would be advantageous to have knowledge of basic genetics, ecology, and bioinformatics, as well as their associated data structures.

Scientific Teaching Themes

Active Learning

Students were engaged in active learning throughout the entire R unit of the course via discussion and reflection questions, think-pair-share activities combined with the Frayer model, and brainstorming activities. Students with varying levels of experience and confidence with respect to computational skills were assigned to diverse groups to encourage peer tutoring. We generated discussion questions within the Canvas course management system that required students to write their own responses as well as evaluate and respond to the ideas posted by their peers. To further increase course connectivity and discussion, we generated an organization on Microsoft Teams (Teams) for the entire class and the smaller groups to coordinate, share results, and get feedback from instructors and peers.

Assessment

To assess students’ grasp of the learning objectives, a vocabulary and concept quiz based on reading assignments and pre-module activities was given. Lab assignments also served as a method of assessment because of the accompanying reflection questions and action items required for completion. Submission of the lab was considered full credit regardless of completeness, and a key was provided after the deadline. Students were expected to check their own answers against the key to make sure they were proficient in the learning goals. The goal of this method was to shift student focus from grade performance to mastery of learning objectives, while potentially giving students a better understanding of gaps in their knowledge. At the end of the module, students were challenged with an independent capstone project that assessed the comprehension and application of the learning objectives. This final assignment is not included in the GitHub course but is available upon request to interested instructors.

Inclusive Teaching

Because the course had been converted entirely to online instruction the year before, the material was already adapted for an online format. This made the transition to a flipped format seamless. Students were provided with videos of each lecture with integrated active coding that they could consume at their own pace. The videos were edited for length and captioned to make lectures as accommodating as possible. These video resources were essential to our international student population, as they were able to consume lectures at their desired tempo and could view and download captions.

The R unit of our course was taught at no additional cost to the student (apart from tuition and fees), removing any associated financial barriers. Both R and RStudio are free software applications that can be installed locally onto student computers. This allows them to complete the bulk of their coursework removed from the internet if needed. Although we have a suggested text that is not free, students were provided with alternatives that are completely free and guidance for accessing the text in our university library. Our university library also provides students with well-equipped computer access to complete assignments. We value free educational resources, motivating the online mini course now publicly available on GitHub, with links to additional free learning resources. The mini course is self-paced and accessible to anyone at any time, allowing any learner to gain experience in using R for data analysis and visualization.

Finally, although we have ended our mini course with the use of an HPC that is available to instructors in our state, we realize that not all instructors/students will have access to such a resource in their own state or through their institutions. Still, it is worth noting that there are several free platforms through which instructors can gain access to these resources should they wish to adopt this final section of the course. For example, CyVerse gives instructors access to HPC, which they can then extend to their students. CyVerse also provides access to R through an online RStudio interface that could be used in lieu of the local installation we used. The downside of this approach is that it would require internet access, and that students would lose access after the semester. However, the benefit would be that the instructor could preset the environment and pre-load all of the files/packages to reduce time spent in the classroom troubleshooting personal computer issues of individual students. This could be very beneficial if assistance in the form of TAs or learning assistants is limited or class sizes are exceptionally large.

Lesson Plan

Overview

The R-focused unit required multiple class periods, for a total of 4 weeks of material. This introductory R unit came after three course units focused on an introduction to the command line for our students, but the R unit is stand-alone and can be taught independently of any other course material. The R mini course that was migrated to GitHub has been set up in a modular format as well, to make each topic stand-alone. Our course was taught in a “flipped” format, where students were expected to watch lectures asynchronously, and class time was used as an open time to apply and practice concepts and skills learned from recorded lectures or to conduct lab activities. These pre-recorded lectures have been uploaded to YouTube and are linked within the GitHub R mini course.

Pre-Module Requirements

Students were provided with 149 minutes of demonstrative video lectures over 6 recordings with an average time of 25 minutes per video (minimum 5 minutes; maximum 35 minutes). These videos covered data import and export, data manipulation, data summary statistics, basic statistical analysis, and data plotting and visualization. In addition to these topics, we reinforced using scripts for improved reproducibility, as well as programming concepts such as functions and if/else statements for flow of control in R. We encouraged students to review lecture videos that were available to them prior to attending associated in-class lab activities. Lecture videos were pre-recorded using Panopto, and a subset contained embedded “concept check” questions with answer explanations to help students digest and reflect on video materials. These are not included in the YouTube uploaded videos but have been provided as a separate table (Supporting File S1). We also provided students with the datasets and code in RMarkdown format with each video so that they could walk through the code and analyses as they watched lectures. Students were provided with alternative syntax, when possible, to demonstrate the built-in redundancy for performing certain tasks in R or to compare and contrast the methods. Relevant readings from the text “R in Action” were suggested at the beginning of each week (16).

Later modules apply concepts and generate figures that require a considerable volume of computational resources (i.e., memory, processors, and time) that may not be available on personal computers. Access to high-performance computing (HPC) resources may be necessary to mitigate these constraints. The Alabama Supercomputer (ASC), the HPC resource used in this course, is free to researchers in Alabama and has been very supportive in generating temporary accounts for students and providing a dedicated queue for student assignments. Please see Table 1 for a timeline with resources partitioned by week.

We used the Canvas LMS to present the course material to students, organizing this unit of the course as a module in Canvas. We made extensive use of pages within Canvas to organize the content. A module overview page contained a table of the topics and dates and the list of learning objectives for the unit. Each week had a similar but more detailed page that listed tasks for the week with dates and times, along with any required external materials and resources (e.g., chapter readings or tutorials). The weekly videos were embedded directly into the pages for each week, and the corresponding PDF of the slides from each video was provided for download. Within the module, lab assignments, corresponding keys, the quiz, and the capstone were also linked for easy student access. We found this centralized organization of course material helpful for keeping students on track and aware of what they needed to do each week. We also set up the module to require the labs to be done before completing the quiz, and for students to earn 80% or higher on the quiz to be able to view the capstone. This organization required students to work through the material in a specific order and ensured proficiency before they attempted the final assessment.

Week 1

In the first week, our students were given a guided video tour of using R in RStudio and on the command line, both with inline coding and with R scripts. We also strongly encouraged our students to complete one of the linked tutorials (swirl or tinystats). These tutorials were selected to provide an introductory overview of the R programming language and/or the RStudio environment at the outset. Each tutorial introduces basic R concepts and how to get datasets into R in advance of the lab activity (see Introduction); both are also presented in a modular format, so they could serve as advanced material following the course. The content from these introductory tutorials was reinforced by a walkthrough video (supplemental video ‘Basic Summary Statistics in R’ and RMarkdown file; USpop.csv, BodyFat.csv) focused on getting data into R and generating summary statistics about the data. For lecture videos and walkthroughs, we provided the first 200 years of US census data so that students could learn how to import comma-separated value (CSV) formatted data into R. This dataset is also built into R, but working from the file reinforces how to import CSV-formatted datasets. CSV is a common format for many types of data in which fields are separated by commas and records by new lines. We covered calculating the population mean, standard deviation, and range; generating R summaries that provide minimums, maximums, and quartiles for data; and running statistical correlations between vectors. We wanted students to understand the importance of using summary statistics in conjunction with basic plotting to explore datasets. We also used the built-in R dataset ‘anscombe’, which is based on Anscombe’s quartet, to demonstrate how different datasets can generate identical descriptive statistics while the visual distributions of these data are very distinct. We provided students with a body fat dataset that includes several other body measurements to apply learned skills (17). We encouraged our students to perform the analyses that the lectures walked through on this new dataset and validate answers with their groups via MS Teams.
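
The import-and-summarize workflow described above might look roughly like the following sketch. It is not the code from the course videos: because USpop.csv is not reproduced here, the sketch writes R's built-in `uspop` series to a temporary CSV file first, and the column names `year` and `population` are assumptions made for illustration.

```r
# Build a stand-in for USpop.csv from the built-in 'uspop' time series
# (US census populations, in millions). Column names are assumed.
census <- data.frame(year = as.numeric(time(uspop)),
                     population = as.numeric(uspop))
csv_path <- tempfile(fileext = ".csv")
write.csv(census, csv_path, row.names = FALSE)

# Import the CSV file, as one would for a downloaded course dataset
uspop_df <- read.csv(csv_path)

# Basic summary statistics on one column (a vector)
pop_mean  <- mean(uspop_df$population)
pop_sd    <- sd(uspop_df$population)
pop_range <- range(uspop_df$population)
print(summary(uspop_df$population))  # minimum, quartiles, mean, maximum

# Correlation between two vectors
year_pop_cor <- cor(uspop_df$year, uspop_df$population)

# Anscombe's quartet: nearly identical summaries for four x/y pairs
anscombe_y_means <- sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)
anscombe_cors    <- mapply(cor, anscombe[, 1:4], anscombe[, 5:8])
print(round(anscombe_y_means, 2))
print(round(anscombe_cors, 2))

# Plotting the four pairs shows how different the datasets actually are
quartet_plot <- tempfile(fileext = ".pdf")
pdf(quartet_plot)
par(mfrow = c(2, 2))  # 2 x 2 grid of panels
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
invisible(dev.off())
```

Students could swap `csv_path` for the path to the provided body fat file and repeat the same calls on its columns.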

For the lab activity in week one, we brought together all of the data summary and manipulation coding from the asynchronous lecture material in an application where students summarized and plotted quantitative genomic data from a published dataset (18). It is very important to ensure students have completed the installation of the two software applications ahead of time, as installation problems can significantly delay the exercise. This can be checked by asking students to open RStudio, type 2+2 in the console, and press Enter. To allow time to help those with installation hiccups, and to reinforce some of the content from the out-of-class assignments, we began with an in-class concept check that reviewed R terminology using the Frayer Model (19) before starting the lab activity. Each group was provided with an R command, which they were asked to define and describe, and for which they provided examples of correct and incorrect usage.

For the lab activity, we introduced the concept of genome coverage and provided students a description of how the files they are using were generated. This information is included in the mini course as a handout (Supporting File S2) that walks through the production of a quality genome coverage plot across a single chromosome and the necessary data file. The data file is the output of samtools (20), a bioinformatics software package commonly used to analyze genomics data. The specific samtools command is called ‘depth’, and its output is a three-column, tab-delimited file that lists the number of sequencing reads at each position on each chromosome. Once the students have completed the assignment, they have generated descriptive statistics and distributions of genome coverage for the entire genome and a single chromosome, a coverage plot for a single chromosome, and an R script to reproduce these commands and plots for similar data. The assignment handout includes conceptual questions throughout and reflective questions at the end of the assignment to reinforce student interpretation of advanced graphical data and to help students appreciate the importance of computational skills for studying central biological constructs and data structures. Students worked within their assigned groups for lab assignments, sharing helpful tips and validating results as they completed the activity.
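
As an illustration of the steps in this activity, the sketch below reads a three-column depth file, summarizes coverage, and plots a single chromosome. It is not the handout's key: the data are simulated so the example is self-contained, and the chromosome names, column names, and depth values are all assumptions.

```r
set.seed(1)

# Simulated stand-in for 'samtools depth' output: chromosome, position, depth,
# tab-delimited with no header. All values here are made up for illustration.
depth_sim <- data.frame(
  chrom = rep(c("chr2", "chrX"), each = 5000),
  pos   = rep(1:5000, times = 2),
  depth = rpois(10000, lambda = rep(c(30, 15), each = 5000))
)
depth_path <- tempfile(fileext = ".depth.out")
write.table(depth_sim, depth_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the tab-delimited file and name its three columns
depth <- read.table(depth_path, sep = "\t", header = FALSE,
                    col.names = c("chrom", "pos", "depth"))

# Descriptive statistics for the whole genome and per chromosome
genome_summary <- summary(depth$depth)
mean_by_chrom  <- tapply(depth$depth, depth$chrom, mean)
print(genome_summary)
print(mean_by_chrom)

# Subset a single chromosome
chrX <- depth[depth$chrom == "chrX", ]

# Write a coverage distribution and a coverage plot to a file
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path, width = 8, height = 4)
hist(chrX$depth, xlab = "Depth", main = "Distribution of coverage on chrX")
plot(chrX$pos, chrX$depth, type = "l",
     xlab = "Position (bp)", ylab = "Depth", main = "Coverage along chrX")
abline(h = mean(chrX$depth), col = "red", lty = 2)  # mean depth for reference
invisible(dev.off())
```

Saving these commands in an R script is what makes the analysis reproducible: pointing `depth_path` at a different depth file regenerates the same statistics and plots.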

Instructors and TAs should work through the assignment ahead of time, making any necessary edits to the handout. Class time should be used to move between groups, encouraging students to make informative comments within their code and gauging individual progression through the assignment. In our experience, students will progress at various paces, so by encouraging them to work collaboratively in groups, they can help each other as much as possible. Groups that finish early can be instructed to customize their plots and practice writing the images to files. Final images can be shared via MS Teams if desired. The RMarkdown key provided can also be used by instructors as a follow-up to the activity to walk through with the students towards the end of class.

Week 2

For week two, we covered manipulating datasets and applying introductory statistical concepts in R. An introductory statistics course is a prerequisite for our course; therefore, we did not delve into statistical theory or the general application of statistics. Our main objective was to cultivate computational thinking and provide students with the R syntax to apply the statistical concepts they had already learned. Lecture videos and walkthroughs for this week revisited the body fat dataset to demonstrate isolating and subsetting datasets, creating vectors (an object or data structure in R), adding new vectors to datasets, and transforming vectors (supplemental video ‘Data Manipulation’ and RMarkdown file; BodyFat.csv). Additionally, we compiled a dataset generated by the class in a previous lab assignment, in which the students tested the programmatic efficiency of three different programming languages by performing basic data manipulation and recording the times and conditions in a Microsoft Form embedded into Canvas. The class-generated dataset was used to demonstrate how to handle missing data, convert data to factors, subset data in alternative ways, and handle outliers. We showed how to run an analysis of variance (ANOVA) on these data and how to interpret and compare the resulting ANOVA table and boxplots (supplemental video ‘Advanced Statistical Concepts’ and RMarkdown file). We encouraged students to change or add predictor variables, use different subsets of the data, and share or compare results on Teams. This dataset did not have a sufficient sample size to explore other statistical models, so we revisited the body fat dataset to fit a general linear model to the relationships students had been asked to explore with correlation tests the week before. We did not have a lab activity this week but resumed in week three.
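The steps in this week can be sketched in a few lines of base R. The example below is only an illustration under stated assumptions: the timing data, the column names (language, seconds), and the body measurements are simulated stand-ins, not the class-generated dataset or BodyFat.csv.

```r
# Hypothetical class-generated timing data; names and values are invented.
set.seed(42)
timing <- data.frame(
  language = rep(c("R", "Python", "Bash"), each = 8),
  seconds  = c(rnorm(8, 12, 2), rnorm(8, 10, 2), rnorm(8, 15, 2))
)
timing$seconds[c(3, 17)] <- NA   # simulate two missing entries

# Handle missing data and convert the predictor to a factor.
complete <- timing[!is.na(timing$seconds), ]
complete$language <- factor(complete$language)

# Add a transformed vector to the dataset as a new column.
complete$log_seconds <- log(complete$seconds)

# One-way ANOVA and its table.
fit <- aov(seconds ~ language, data = complete)
print(summary(fit))

# Boxplot to compare against the ANOVA table (written to a temporary PDF).
box_file <- tempfile(fileext = ".pdf")
pdf(box_file)
boxplot(seconds ~ language, data = complete,
        xlab = "Language", ylab = "Time (s)")
dev.off()

# Linear model on two continuous variables (simulated stand-in data).
body <- data.frame(weight = runif(50, 50, 110))
body$bodyfat <- 5 + 0.2 * body$weight + rnorm(50, sd = 3)
model <- lm(bodyfat ~ weight, data = body)
print(coef(summary(model)))
```

Swapping the predictor in the aov formula, or subsetting complete before fitting, mirrors the kind of exploration students were encouraged to try.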

Week 3

In the third week of the R unit, we applied basic and advanced graphical concepts in R and introduced using the R language in a shell environment (e.g., HPC). The lecture videos and walkthroughs for this week explored more advanced visualization by revisiting the body fat data, this time focusing on incorporating results from statistical modeling into graphics. We covered rounding numerical values, concatenating strings and numbers, adding and manipulating multiple axes, making multi-panel graphics, and using plotting devices to write images to files (supplemental video ‘Advanced Graphing in R’ and RMarkdown file; BodyFat.csv; Figure 2). Revisiting the depth activity from week one, we improved our previously generated genome coverage plots by introducing smoothing functions in R (e.g., loess) to highlight trends in high-density data, demonstrated using R scripts on the command line for high-throughput generation of graphs from high-density data, and reinforced the importance of reproducibility by using the same R script to generate genome coverage plots for multiple chromosomes (supplemental video ‘Using R on the Command Line’ and RMarkdown file; chrX.depth.out.zip file; Figure 3).
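A minimal sketch of several of these ideas together (loess smoothing, a label built with round and paste, a multi-panel layout, and a PDF plotting device) might look like the following. The depth values are simulated, and the span value and output file name are arbitrary choices for illustration rather than settings taken from the course materials.

```r
# Simulated depth along one chromosome (values invented for illustration).
set.seed(7)
pos   <- seq(1, 100000, by = 100)
depth <- rpois(length(pos), lambda = 30 + 10 * sin(pos / 15000))

# Fit a loess curve; span controls how much smoothing is applied.
smooth_fit <- loess(depth ~ pos, span = 0.2)
smoothed   <- predict(smooth_fit)

# Build a label that combines text with a rounded number.
label <- paste("Mean depth =", round(mean(depth), 1))

# Write a two-panel figure to a PDF using a plotting device.
out_file <- file.path(tempdir(), "coverage_smoothed.pdf")
pdf(out_file, width = 8, height = 6)
par(mfrow = c(2, 1))
plot(pos, depth, pch = ".", col = "grey50",
     main = "Raw coverage", xlab = "Position (bp)", ylab = "Depth")
plot(pos, depth, type = "n",
     main = "Loess-smoothed coverage", xlab = "Position (bp)", ylab = "Depth")
lines(pos, smoothed, col = "blue", lwd = 2)
mtext(label, side = 3, line = 0.2, cex = 0.8)
dev.off()
```

Smaller span values follow local fluctuations more closely, while larger values emphasize the broad trend, which is a useful parameter for students to vary.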

For the second lab activity, we built an interactive tutorial that functions within R using the “learnr” package (21). This tutorial allowed students to view, complete, or correct R syntax to generate and customize a variety of graph types in base R (i.e., without requiring additional packages) using randomly generated datasets. Students completed this tutorial as an individual lab activity and were then tasked with using provided datasets to generate and export a PDF of an effective graphic highlighting a feature of the data. To provide more flexibility, advanced graduate students also had the option to use their own datasets, if applicable. This assignment required students to integrate and apply the concepts of data manipulation, summary statistics, and effective scientific graphics.

Instructors should strongly encourage students to complete the instructions listed under the heading “Installation” on the activity page before attending class. These include obtaining a personal access token from GitHub and downloading the tutorial. GitHub personal access tokens are fairly new, but they are required for RStudio-GitHub communication. Before installing the tutorial, students will also need to install additional software on their computers, which is required by the R package “devtools” used to install the tutorial. For PC/Windows users, this software is Rtools from the CRAN website, and for Mac users, it is the Xcode developer tools from the Apple store. The latter may ask students to update their main OS, but this can be avoided by obtaining older versions of the software online. If this process proves too tedious, instructors can explore alternatives to individual student downloads, such as CyVerse (see above section on Inclusive Teaching). This platform would allow the instructor to set up the tutorial in an online environment and provide direct access to students without any required installation.

In class, students should work through the tutorial independently but are encouraged to sit with their groups for troubleshooting support. While students work on the tutorial, instructors and TAs should move between groups, encouraging students to modify and test plotting parameters much as they would test hypotheses using the scientific method. Groups that finish the tutorial early should revisit and discuss effective scientific graphics before they begin generating graphics with a different provided dataset. Figures can be shared via MS Teams and evaluated using the guidance provided on effective scientific graphics. It is worth noting that within the tutorial, students are using the R package “tidyverse”, so if they want to work on graphics outside the tutorial using the same commands, they will need to load the library as they did when they installed the tutorial before class (e.g., library(tidyverse)).

Week 4

In the fourth and final week of the module, we taught R programming structures (e.g., conditional statements, functions) and integrated the R language with the shell environment on the Alabama Supercomputer (an HPC) for high-resource and high-throughput application of R. The lecture videos and walkthroughs for this week explored more advanced programming concepts. We covered conditional statements by computationally classifying progeny into crossover classes, similar to the 3-point mapping and recombination frequency calculations typically covered in an introductory genetics class. We also covered creating functions and conditional statements to improve flow control and efficiency (supplemental video ‘Programming in R’ and Experiment5_rawdata.csv file). This data file is from an experimental genetic cross from the author’s research (22), connecting the students to real scientific data where they have direct access to the researcher.
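To make the crossover classification concrete, here is a small sketch of how conditional statements and a function could assign progeny to classes and compute recombination frequencies. The genotype table, the marker names (A, B, C, assumed to be in map order), and the helper functions classify_crossover and recomb_freq are hypothetical; they do not reflect the layout of Experiment5_rawdata.csv or the code in the course videos.

```r
# Hypothetical progeny genotypes at three linked markers in map order A-B-C;
# 1 = allele from parent 1, 0 = allele from parent 2. Values are invented.
progeny <- data.frame(
  A = c(1, 0, 1, 0, 1, 0, 1, 1, 0, 0),
  B = c(1, 0, 0, 1, 1, 0, 0, 1, 0, 1),
  C = c(1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

# Classify one individual with conditional statements.
classify_crossover <- function(m1, m2, m3) {
  if (m1 == m2 && m2 == m3) {
    "NCO"    # no crossover
  } else if (m1 != m2 && m2 == m3) {
    "SCO1"   # single crossover between A and B
  } else if (m1 == m2 && m2 != m3) {
    "SCO2"   # single crossover between B and C
  } else {
    "DCO"    # double crossover: middle marker differs from both flanks
  }
}

# Apply the function to every row of the table.
progeny$class <- mapply(classify_crossover, progeny$A, progeny$B, progeny$C)
print(table(progeny$class))

# Recombination frequency for an interval counts its single crossovers
# plus the double crossovers, divided by the total number of progeny.
recomb_freq <- function(classes, single) {
  mean(classes %in% c(single, "DCO"))
}
rf_AB <- recomb_freq(progeny$class, "SCO1")
rf_BC <- recomb_freq(progeny$class, "SCO2")
```

Wrapping the classification in a function keeps the branching logic in one place, so the same rules can be reused on any table with three ordered markers.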

In the final lab activity, students were challenged to revisit the R scripts they generated in week one and modify them to run on an HPC. Students were introduced to this resource earlier in the semester when learning the command line. This activity required students to generate genome coverage plots for each chromosome in the fly genome and reinforced earlier shell concepts in the course by integrating them with R. Students were encouraged to modify their code with the improvements demonstrated in the week three videos (i.e., loess smoothing). We gave the students the option to structure their code to produce these plots in parallel, by submitting multiple scripts to the Alabama Supercomputer, or serially, by looping over the chromosomes in a single script within a single job submission. In this final week, students were also working on a capstone project using R to make publication quality figures of two different datasets, one from earlier in the semester and another with biomedical data.
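The serial option can be sketched as a single script that loops over chromosomes and writes one plot per chromosome. This is an illustrative outline, not the course key: the script name in the comment, the chromosome names, the output folder, and the plot_coverage helper are all hypothetical, and simulated data are used when no depth file is supplied on the command line.

```r
# Optional command-line argument: path to a depth file, for example
#   Rscript coverage_by_chrom.R sample.depth.txt
# (script and file names are hypothetical). Without an argument, simulated
# data are used so that the sketch runs on its own.
args <- commandArgs(trailingOnly = TRUE)

if (length(args) >= 1 && file.exists(args[1])) {
  cov <- read.table(args[1], sep = "\t", header = FALSE,
                    col.names = c("chrom", "pos", "depth"))
} else {
  set.seed(3)
  chroms <- c("2L", "2R", "3L", "3R", "X")   # invented set of names
  cov <- data.frame(
    chrom = rep(chroms, each = 300),
    pos   = rep(seq(1, by = 1000, length.out = 300), times = length(chroms)),
    depth = rpois(300 * length(chroms), lambda = 25)
  )
}

out_dir <- file.path(tempdir(), "coverage_plots")
dir.create(out_dir, showWarnings = FALSE)

# Plot raw and loess-smoothed coverage for one chromosome; return the path.
plot_coverage <- function(chrom_data, chrom_name, out_dir) {
  out_file <- file.path(out_dir, paste0(chrom_name, "_coverage.pdf"))
  pdf(out_file, width = 8, height = 4)
  plot(chrom_data$pos, chrom_data$depth, pch = ".", col = "grey50",
       main = paste("Coverage across", chrom_name),
       xlab = "Position (bp)", ylab = "Depth")
  fit <- loess(depth ~ pos, data = chrom_data, span = 0.3)
  lines(chrom_data$pos, predict(fit), col = "blue", lwd = 2)
  dev.off()
  out_file
}

# Serial approach: loop over every chromosome within a single script.
written <- character(0)
for (chrom_name in unique(as.character(cov$chrom))) {
  chrom_data <- cov[cov$chrom == chrom_name, ]
  chrom_data <- chrom_data[order(chrom_data$pos), ]
  written <- c(written, plot_coverage(chrom_data, chrom_name, out_dir))
}
print(written)
```

For the parallel option, one plausible adaptation is to pass a chromosome name as an additional argument and call plot_coverage once per job instead of looping, leaving the job submission itself to the scheduler on the HPC.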

To prepare students for this lab, we reminded them to refresh their knowledge of using the HPC. Still, because many had last used it earlier in the semester, they were slower to get started. They also struggled to transfer files, so we set up a shared folder for the class with the files there as a backup. Many students did not finish this lab in the class period, but it helped them to see how R code could be scaled up to a larger problem and make use of additional computational resources. This was especially useful for graduate students who would eventually use this approach for their own research projects, but perhaps less useful for undergraduates, who struggled quite a bit. In the following class period, we did a walkthrough using the provided keys so that students could follow along and rebuild their confidence after having struggled. Still, it has been shown that frustration and struggle can be important for the learning process overall (23).

Teaching Discussion

Lesson Effectiveness

Overall, this section of the course has been highly engaging and well received by students. The students had mixed feedback on the growth mindset intervention at the beginning of the semester. Most responded positively that it helped them build confidence, even though they found the material very challenging. For some students, it had the opposite effect and made them more wary of the course. Finally, a subset of students felt it was unnecessary because they had prior exposure. Because R/RStudio comes in the latter part of the course, which is otherwise dominated by the command line, the students seem much more relaxed in the GUI environment of R and RStudio. The RStudio environment is essential to reinforcing concepts such as reproducibility. We also return to the same datasets multiple times, which can feel repetitive at first but helps students make connections across lessons. The flipped format gave students class time to practice what they were learning in the videos and to take part in live coding demos. As with many courses taught in this way, feedback was mixed, with some students still preferring traditional lectures. However, we saw an overall improvement in attitudes compared to previous semesters. Students seemed to have much more confidence, perhaps because they could watch the videos multiple times and build a better foundation in the material. Based on feedback at the end of the course, this was particularly true of our international students.

One challenge we encountered was that, because student backgrounds were so varied, a subset of students had previously been exposed to R. These students often neglected the Week 1 material only to get stumped in Week 2, so we made sure to emphasize that we were going to build on the material and that it might differ from their previous experiences. As R is introduced in various classes earlier in college curricula, we expect this challenge to continue.

We intentionally kept the statistical concepts basic. We made sure to reiterate to students that this course was not meant to teach them appropriate statistics for data analysis but rather how to implement these commands in R. The first two labs were relatively simple, whereas the third lab was clearly the hardest. Students struggled with implementing shell and R together, and many did not finish this activity. Since this course was mostly targeted at graduate students, they often preferred the ability to use their own data in class assignments. For example, in the second lab activity, students were asked to use either provided datasets or their own datasets to turn in a plot of their own. In previous iterations of the course with fewer or no undergraduates, we did not provide the additional sample datasets and relied on students having their own data. For a primarily undergraduate course, this would likely not be preferred since it would rely on students having some connection to a research lab.

For all the labs, anything students turned in received full credit, so finishing was not required. This style of grading took the pressure off of students and provided flexibility for graduate students who may not need all of the skills we taught. When combined with providing a key right after the deadline, students felt that if they did not fully understand any topic, then they could just turn something in and figure it out with the key later.

A year after the online course, we reached out to students to provide inspirational testimonies for their peers taking the course the following year. Although this request was targeted to some of the best students in the course, the responses indicated that the students were still using the course content in their dissertation work a year later. One student commented, “It was the toughest, yet most rewarding course I’ve ever taken... Stick with it; you’ll be amazed at what you can do by the end, and you’ll be a better scientist for it”. Another student commented, “the course is one you won't regret taking ... I'd say it's the most important course I have taken since the start of my program ... I routinely apply the concept of programming and data analysis I picked up from class in solving problems”.

Alternative Implementations and Adaptations

The first lab activity has been used several times as a stand-alone module within a weeklong Bioinformatics Bootcamp that mainly focuses on shell/Linux coding of genomics data. The exercise was included towards the end as an example of how to visualize genomic data and as an introduction to R. The students were given a brief introduction to the major concepts (e.g., depth and where the files came from) and help installing the software, but otherwise this exercise was done with no prior exposure to R. It was well received each time, and we made incremental tweaks after each offering. Additionally, this activity is provided on a separate GitHub page by itself (24) but has since been built upon in the version contained within the R mini course and handout provided here.

Supporting Materials

  • S1. R Mini Course – Embedded Quiz Questions. This file contains a table of the quizzes that were embedded into each video. These can also be used for reading comprehension quizzes or similar assessments.

  • S2. R Mini Course – Activity #1 Handout. This two-page handout was used for Lab Activity #1 with students. R and RStudio are updated often, so we usually review the handout before each semester to make any necessary updates. A link to a Google Doc that may be more up-to-date is also included on the GitHub page for this activity.

Acknowledgments

The authors would like to thank David Young and the Alabama Supercomputer Authority for course HPC support. We would like to thank the students that have provided extensive feedback on these assignments throughout the semesters where this was taught. We would like to thank various GitHub users who piloted the mini course and provided feedback for improvement. Finally, we thank Frank McCowen and Todd Steury for permission to use their datasets as part of our course.

References

  1. Pevzner P, Shamir R. 2009. Computing has changed biology—biology education must catch up. Science 325:541–542. DOI:10.1126/science.1173876.
  2. Ekbia H, Mattioli M, Kouper I, Arave G, Ghazinejad A, Bowman T, Suri VR, Tsou A, Weingart S, Sugimoto CR. 2015. Big data, bigger dilemmas: A critical review. J Assoc Inf Sci Technol 66:1523–1545. DOI:10.1002/asi.23294.
  3. Markowetz F. 2017. All biology is computational biology. PLOS Biol 15:e2002050. DOI:10.1371/journal.pbio.2002050.
  4. Buitrago Flórez F, Casallas R, Hernández M, Reyes A, Restrepo S, Danies G. 2017. Changing a generation’s way of thinking: Teaching computational thinking through programming. Rev Educ Res 87:834–860. DOI:10.3102/0034654317710096.
  5. Mariano D, Martins P, Helene Santos L, de Melo-Minardi RC. 2019. Introducing programming skills for life science students. Biochem Mol Biol Educ 47:288–295. DOI:10.1002/bmb.21230.
  6. Metz AM. 2008. Teaching statistics in biology: Using inquiry-based learning to strengthen understanding of statistical analysis in biology laboratory courses. CBE Life Sci Educ 7:317–326. DOI:10.1187/cbe.07-07-0046.
  7. Klug JL, Carey CC, Richardson DC, Darner Gougis R. 2017. Analysis of high-frequency and long-term data in undergraduate ecology classes improves quantitative literacy. Ecosphere 8:e01733. DOI:10.1002/ecs2.1733.
  8. Porter SG, Smith TM. 2019. Bioinformatics for the masses: The need for practical data science in undergraduate biology. OMICS J Integr Biol 23:297–299. DOI:10.1089/omi.2019.0080.
  9. R Core Team. 2020. R: A Language and Environment for Statistical Computing [Computer software]. Vienna, Austria. Available from https://cran.r-project.org/.
  10. RStudio Team. 2020. RStudio: Integrated Development for R [Computer software]. Boston, MA. Available from https://posit.co/.
  11. Stevison L, Clark A. 2022. R Mini Course. Auburn (AL): GitHub; [accessed 2022 May 27] https://stevisonlab.github.io/R-Mini-Course/. DOI:10.5281/zenodo.6588432.
  12. Kross S, Carchedi N, Bauer B, Grdina G, Schouwenaars F, Wu W. 2020. swirl: Learn R, in R [Computer software]. System requirements: R (>= 3.1.0). R package; Available from https://swirlstats.com/.
  13. Walum H, De Leon D. 2022. Teacups, giraffes, & statistics. Retrieved from https://tinystats.github.io/teacups-giraffes-and-statistics/index.html.
  14. Peterson MP, Malloy JT, Buonaccorsi VP, Marden JH. 2015. Teaching RNAseq at undergraduate institutions: A tutorial and R package from the Genome Consortium for Active Teaching. CourseSource 2. DOI:10.24918/cs.2015.14.
  15. Yeager DS, Hanselman P, Walton GM, Murray JS, Crosnoe R, Muller C, Tipton E, Schneider B, Hulleman CS, Hinojosa CP, Paunesku D, Romero C, Flint K, Roberts A, Trott J, Iachan R, Buontempo J, Yang SM, Carvalho CM, Hahn PR, Gopalan M, Mhatre P, Ferguson R, Duckworth AL, Dweck CS. 2019. A national experiment reveals where a growth mindset improves achievement. Nature 573:364–369. DOI:10.1038/s41586-019-1466-y.
  16. Kabacoff RI. 2011. R in action: Data analysis and graphics with R, 1st ed. Manning Publication Co, Shelter Island, NY.
  17. Johnson RW. 1995. Body fat dataset. Retrieved from http://lib.stat.cmu.edu/datasets/bodyfat.
  18. McGaugh SE, Noor MA. 2012. Genomic impacts of chromosomal inversions in parapatric Drosophila species. Philos Trans R Soc Lond B Biol Sci 367:422–429. DOI:10.1098/rstb.2011.0250.
  19. Frayer DA, Fredrick WC, Klausmeier HJ, Herbert J. 1969. A schema for testing the level of concept mastery: Report from the Project on Situational Variables and Efficiency of Concept Learning. Wisconsin Research and Development Center for Cognitive Learning, Madison, WI.
  20. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. DOI:10.1093/bioinformatics/btp352.
  21. Aden-Buie G, Schloerke B, Allaire JJ. 2020. learnr: Interactive Tutorials for R [Computer software]. Version 0.10.1. Accompanied by: 1 manual. System requirements: pandoc (>= 1.14) - http://pandoc.org. R Package; Available from https://cran.r-project.org/web/packages/learnr/index.html. DOI:10.5281/zenodo.3666930.
  22. Altindag UH, Stevison L. 2021. Peak-Plasticity-Project. Auburn (AL): GitHub; [accessed 2022 April 8]. https://github.com/StevisonLab/Peak-Plasticity-Project. DOI:10.5281/zenodo.4477672.
  23. Lopatto D, Rosenwald AG, DiAngelo JR, Hark AT, Skerritt M, Wawersik M, Allen AK, Alvarez C, Anderson S, Arrigo C, Arsham A, Barnard D, Bazinet C, Bedard JEJ, Bose I, Braverman JM, Burg MG, Burgess RC, Croonquist P, Du C, Dubowsky S, Eisler H, Escobar MA, Foulk M, Furbee E, Giarla T, Glaser RL, Goodman AL, Gosser Y, Haberman A, Hauser C, Hays S, Howell CE, Jemc J, Johnson ML, Jones CJ, Kadlec L, Kagey JD, Keller KL, Kennell J, Key SCS, Kleinschmit AJ, Kleinschmit M, Kokan NP, Kopp OR, Laakso MM, Leatherman J, Long LJ, Manier M, Martinez-Cruzado JC, Matos LF, McClellan AJ, McNeil G, Merkhofer E, Mingo V, Mistry H, Mitchell E, Mortimer NT, Mukhopadhyay D, Myka JL, Nagengast A, Overvoorde P, Paetkau D, Paliulis L, Parrish S, Preuss ML, Price JV, Pullen NA, Reinke C, Revie D, Robic S, Roecklein-Canfield JA, Rubin MR, Sadikot T, Sanford JS, Santisteban M, Saville K, Schroeder S, Shaffer CD, Sharif KA, Sklensky DE, Small C, Smith M, Smith S, Spokony R, Sreenivasan A, Stamm J, Sterne-Marr R, Teeter KC, Thackeray J, Thompson JS, Peters ST, Van Stry M, Velazquez-Ulloa N, Wolfe C, Youngblom J, Yowler B, Zhou L, Brennan J, Buhler J, Leung W, Reed LK, Elgin SCR. 2020. Facilitating growth through frustration: Using genomics research in a course-based undergraduate research experience. J Microbiol Biol Educ 21:21.1.6. DOI:10.1128/jmbe.v21i1.2005.
  24. Stevison L. 2017. Intro to R and the RStudio environment. Auburn (AL): GitHub; [accessed 2017 Nov 7]. https://github.com/StevisonLab/Intro-to-R-and-the-RStudio-Environment. DOI:10.5281/zenodo.1043577.

About the Authors

*Correspondence to: Laurie Stevison; 101 Rouse Life Science Bldg, Auburn, AL 36849;  lss0021@auburn.edu 

Competing Interests

None of the authors have a financial, personal, or professional conflict of interest related to this work.
