Teaching introductory bioinformatics with Jupyter notebook-based active learning
With growing evidence that active learning is more effective than traditional lecturing with respect to student performance in STEM, there has been increasing interest within the bioinformatics community in adopting active learning approaches in our courses. Active learning can take many forms, from brief in-class problems or quizzes interspersed between segments of a traditional lecture, to “flipped” classrooms, in which students watch video lectures outside of class and participate in activities during the class period. Along these lines, the bioinformatics community has been developing materials that support active learning, including the creation and sharing of video lectures and programming problem-based activities.
To experiment with such active learning approaches in teaching undergraduate-level bioinformatics, I recently revamped the course “Introduction to Bioinformatics” at the University of Wisconsin-Madison, turning what had previously been a traditional lecture-based class into a largely flipped classroom. With a focus on computer science and statistics foundations, this course covers the topics of sequence assembly, sequence alignment, phylogenetic trees, genome annotation, clustering, and biological network analysis. Prior to each class period, the students were asked to watch one or more short video lectures, complete an assigned reading, take a short online quiz, and submit questions to an online discussion board. After a short discussion of the most common questions from the pre-class material, the bulk of the in-class time was spent with the students completing programming or written problems within cloud-based Jupyter notebooks. The other components of the course, homework and exams, were kept comparable to those used in previous versions of the course.
The most novel aspect of the revamped course was the set of over 30 Jupyter notebooks (Python kernel) that the students completed as part of their in-class activities. For the typical class period, the students were presented with a new notebook template that contained an average of three problems that they were to complete for a small yet non-negligible part of their overall grade. The most common format of a problem was to fill in the definition of a Python function that performed some subtask of an algorithm that had been covered in the pre-class materials. Other common problems involved visualizing the results of an analysis, taking advantage of the interactive plotting features of Jupyter notebooks. Students were encouraged to work with each other to complete the in-class notebooks and were arranged in groups of four within the classroom, which was equipped with laptops for every student. The notebooks were autograded and students were allowed to submit their work multiple times until they passed the autograder tests.
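To make the problem format concrete, the following is a minimal sketch of what a fill-in-the-function exercise with autograder-style checks could look like. It is a hypothetical illustration, not a notebook from the course: the function name `global_alignment_score`, the scoring parameters, and the `run_autograder_checks` helper are all assumptions chosen for this example, with sequence alignment picked only because it is one of the topics the course covers.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    """Return the optimal global (Needleman-Wunsch) alignment score of x and y.

    Hypothetical exercise: in a notebook template, the body of this
    function would be left blank for students to fill in.
    """
    # prev[j] is the best score for aligning the processed prefix of x with y[:j].
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        # Aligning x[:i] with the empty prefix of y costs i gaps.
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = prev[j] + gap        # x[i-1] aligned to a gap
            left = curr[j - 1] + gap  # y[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def run_autograder_checks():
    """Illustrative stand-in for autograder tests (not the course's grader)."""
    assert global_alignment_score("", "") == 0
    assert global_alignment_score("ACGT", "ACGT") == 4
    assert global_alignment_score("ACGT", "") == -4
    assert global_alignment_score("ACGT", "AGT") == 2
    return "all checks passed"
```

Because the checks are plain assertions on small inputs, a student can rerun them after every change, which mirrors the submit-until-passing policy described above.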
Course evaluations revealed that the students generally enjoyed the notebook activities and the flipped format of the course. The most common criticism of the course by students was that the in-class activities required too much time, with many students spending hours after the class period to complete the notebooks. A direct comparison of grades across semesters is difficult due to numerous varying factors, but the median undergraduate score increased by roughly three percentage points in the revamped course compared to my last offering of the course; this difference was not statistically significant. As an instructor, I enjoyed the fact that the flipped format enabled me to spend more time working one-on-one with students who were struggling in the class. One lesson learned was that three-day-per-week, 50-minute class periods were suboptimal for notebook-based activities, and thus the next offering of the course will use a two-day-per-week, 75-minute class period format.
Fangzhou Mu, Chris Magnano, Debora Treu and Anthony Gitter
The ml4bio Workshop: Machine Learning Literacy for Biologists
Machine learning has been incredibly successful in mining large-scale biological datasets. Despite its popularity among computational researchers, machine learning remains elusive to experimental biologists, who form the majority of the life sciences research community, leaving powerful computational tools underappreciated and data generated in wet labs underexplored. Recent years have seen a growing interest among biology trainees to embark on machine learning projects that complement their research. However, most machine learning courses and tutorials require substantial background knowledge in coding and mathematics, which many biologists may lack. On the other hand, bioinformatics workshops for biologists assume less coding experience, but participants are often taught to mechanically run through a software pipeline for certain tasks without learning the best practices in various stages of the workflow. Such an approach, though effective in the short term, can lead to error-prone data analysis, misinterpretation of results, and difficulty in adapting to other tasks in the long run of a scientist’s research effort. The community clearly needs to explore novel educational frameworks in order to address these challenges in teaching machine learning to biologists.
Unlike traditional task-centric approaches, our educational objective is to equip biologists with the proper mindset when it comes to applying machine learning in their research and the ability to critically analyze machine learning applications in their domain. Built around this core idea, our ml4bio workshop prioritizes teaching machine learning literacy, that is, the right way to set up learning problems, how to reason about learning algorithms, and how to assess learned models. We have developed interactive software with a graphical interface and a set of accompanying slides and tutorials for use during workshop sessions. The software and interactive exercises guide participants through a full cycle of the machine learning workflow while doing proper model training, validation, selection, and testing. By following instructions in the slides and tutorials, participants build intuition about the strengths and weaknesses of various model classes and evaluation metrics by visualizing model behavior under different data distributions and sets of model hyperparameters. We further attempt to bridge the gap between theory and practice by illustrating machine learning applications on real biological tasks. Overall, our approach encourages beginners to take a holistic view of the machine learning workflow rather than immediately dive into the technicalities of coding and mathematics. We have successfully offered two pilot workshops attended by graduate students and postdocs with diverse backgrounds and research interests. The feedback we collected provides strong preliminary evidence of the effectiveness of our approach.
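The training, validation, selection, and testing cycle described above can be sketched in a few lines of code. This is only an illustrative sketch using the Python standard library; it is not the ml4bio software, which is graphical and built on scikit-learn. The toy data, the k-nearest-neighbors classifier, the 60/20/20 split, and all function names are assumptions made for this example.

```python
import random
from collections import Counter


def make_toy_data(n_per_class=100, seed=0):
    """Generate two overlapping 2D Gaussian blobs as (point, label) pairs."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


def split(data, train_frac=0.6, val_frac=0.2):
    """Split shuffled data into disjoint training, validation, and test sets."""
    n_train = round(len(data) * train_frac)
    n_val = round(len(data) * val_frac)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]


def knn_predict(train, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    def sq_dist(example):
        (px, py), _ = example
        return (px - point[0]) ** 2 + (py - point[1]) ** 2

    nearest = sorted(train, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, examples, k):
    """Fraction of examples whose label the k-NN classifier predicts correctly."""
    correct = sum(knn_predict(train, point, k) == label for point, label in examples)
    return correct / len(examples)


def select_and_test(data, candidate_ks=(1, 3, 5, 7, 9)):
    """Pick k on the validation set, then evaluate once on the test set."""
    train, val, test = split(data)
    # Model selection: compare hyperparameter settings on validation data only.
    val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
    best_k = max(candidate_ks, key=lambda k: val_scores[k])
    # Testing: the held-out test set is used only after k has been fixed.
    test_acc = accuracy(train, test, best_k)
    return best_k, val_scores, test_acc


if __name__ == "__main__":
    best_k, val_scores, test_acc = select_and_test(make_toy_data())
    print("validation accuracy by k:", val_scores)
    print("selected k:", best_k, "test accuracy:", test_acc)
```

The point of the sketch is the separation of roles: the hyperparameter is chosen using the validation set alone, so the test accuracy remains an estimate that has not been influenced by model selection.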
Moving forward, our short-term plan is to tailor the workshop material to better serve our educational objective and the needs of participants. The current version of the software only supports classification models. For future releases, we will expand the set of models to include those for regression and clustering. We are also looking for new biological case studies that highlight good and bad practices of machine learning in the biological literature. Our long-term software development plan is to more closely link the ml4bio graphical interface and the Python scikit-learn code on which it is built in order to guide participants who wish to later customize their own machine learning pipeline. Our ultimate goal is the national distribution of the workshop. As an initial step towards this end, we are working closely with educators and facilitators on and off campus to outline a timetable for future workshop development and to adopt best practices of successful workshops such as Software and Data Carpentry. Our workshop materials are available at https://github.com/gitter-lab/ml-bio-workshop/ under the CC-BY-4.0 license and our ml4bio software is available at https://github.com/gitter-lab/ml4bio/ and PyPI under the MIT license.
Candace Savonen, Deepashree Prasad, Casey Greene and Jaclyn Taroni
Increasing data analysis skills in the pediatric cancer community with the Childhood Cancer Data Lab training workshops
The vast amount of genomic data generated each year holds valuable information about the underlying biology of complex diseases. Often biomedical researchers are not readily equipped to use these types of data to answer their biological questions of interest. The Childhood Cancer Data Lab (CCDL) is an initiative of Alex’s Lemonade Stand Foundation (ALSF), an organization devoted to fighting childhood cancer that has funded almost 1000 grants at 135 institutions. The CCDL was founded in late 2017 to empower childhood cancer researchers to harness the power of “big data.” Here, we present our early experiences designing, implementing, and executing short, 3-day training workshops centered on gene expression analysis for pediatric cancer researchers with little to no experience in bioinformatics or programming as part of the CCDL.
We identified RNA-seq analysis as a major area of need in the pediatric cancer community based on an online survey of primarily pediatric cancer-focused researchers and discussions with researchers at Alex’s-focused and national meetings. Accordingly, we constructed our workshop curriculum to prepare researchers to perform processing and analysis of transcriptomic data with an emphasis on reproducibility. The workshop is done in an interactive, small-group setting (20 participants or fewer). Participants are primarily drawn from ALSF-funded research groups and others in the pediatric cancer field.
All analyses are conducted within a Docker container prepared by CCDL staff to promote a reproducible, reusable software stack. We use the download, quality control (FastQC), and quantification of RNA-seq data (Salmon) as an opportunity to introduce the command line and shell scripting for reproducibility. Participants use the R programming language and Bioconductor to perform downstream analyses on childhood cancer-specific data, such as differential expression analyses and hierarchical clustering. We use R Notebooks that are prepared by CCDL staff in part to emphasize documenting computational results at the time that they are obtained.
A portion of time near the end of the workshop is set aside for researchers to bring in their own data or identify publicly available data relevant to their scientific question of interest. This allows CCDL staff to provide help with situation-specific issues. During registration and a pre-workshop survey for accepted participants, we ask about their research question, what kind of data they have, and what challenges they have been encountering. Participants use the processing and analysis steps that they have learned in prior modules and present their results to the rest of the group.
All participants of the pilot workshop said they would recommend the training to their peers on a post-workshop quality improvement questionnaire. Based on the questionnaire from this pilot, we further refined the curriculum to include introduction to R and single-cell RNA-seq modules. Ideally, future workshops would include modules that emphasize version control and reproducibility beyond data analysis (e.g., effectively sharing wet lab research products such as experimental protocols). The workshop’s curriculum is publicly maintained and updated on Alex’s Lemonade GitHub (https://github.com/AlexsLemonade/training-modules). The larger vision is to train and support individuals at other institutions to use our curricula to host their own workshops. This is a scalable solution to increase bioinformatics skills throughout the childhood-cancer research community. ALSF will encourage the expansion of these workshops by providing financial and administrative support. The CCDL will continue to develop these workshops in efforts to catalyze the search for childhood cancer cures by equipping more researchers with foundational bioinformatic skills.
Pamela Shaw, Matthew Carson, Sara Gonzales, Kristi Holmes, Robin Champieux and Ted Laderas
Bioinformatics in the Library: Bridging the Skills Gap for Biomedical Researchers
Computational skills training for the biomedical research workforce can be challenging. Informatics courses for graduate students have to compete for space in crowded biomedical sciences curricula, and established researchers (principal investigators, postdoctoral researchers, research staff, and research faculty) approach training from a variety of computational competency levels. Graduate students, faculty, and staff alike often lack technical skills in basic computer literacy, programming languages, data management and analysis, and data workflow management. There is a need for extracurricular training to bridge these skills gaps. One-time workshops and boot camps provide a large amount of information in a short time, but knowledge gained from these workshops is lost quickly. One solution to overcome these gaps and knowledge losses is to develop extracurricular programs that provide skill-building sessions at a variety of levels of computational competence throughout the year, supplemented by online, self-paced training materials. Such educational variety can best be achieved by collaborative partnerships between campus units and resource centers.
Library-based informatics programs have been in place at several universities and medical schools for over fifteen years. These programs are staffed by Master’s or PhD level science graduates, and provide training and consultation in a variety of computational skills. The library is a perfect partner in providing supplemental computational skills training for researchers: it is a neutral, trusted entity; it is known for knowledge management; and the library has strong collaborative partnerships with other campus centers, core facilities, and research computing services.
We present the library as a primary point of contact for informatics and data management. The library offers consultation and training to researchers for their computational needs. In cases where longer-term or more intensive support is needed, the library provides referral services to core facilities and specialists on campus.
We also present BioData Club: a kit of resources available on GitHub. BioData Club was developed by Oregon Health and Science University under the Clinical Data to Health (CD2H) cooperative agreement. A goal of the CD2H educational initiative is to pilot and implement the BioData Club kit at CTSA institutions and other academic health sciences sites. The kit provides templates and guidance for establishing a BioData Club at an institution, along with links to repositories with developed instructional materials. It is hoped that the kit will expand with each institution’s instance to provide a wide variety of instructional and communication materials for improving computational competence among biomedical researchers.