The ml4bio Workshop: Machine Learning Literacy for Biologists

Author(s): Fangzhou Mu (1), Chris Magnano (1), Debora Treu (2), Anthony Gitter (1)

(1) University of Wisconsin - Madison; (2) Morgridge Institute for Research

Summary:
Presentation on machine learning literacy for biologists at the 2019 Great Lakes Bioinformatics Conference

Licensed under CC Attribution-NonCommercial-ShareAlike 4.0 International

Version 1.0 - published on 02 Jun 2019, doi:10.25334/Q44Q97

Description

Machine learning has been incredibly successful in mining large-scale biological datasets. Despite its popularity among computational researchers, machine learning remains elusive to experimental biologists, who form the majority of the life sciences research community. As a result, powerful computational tools are underappreciated and data generated in wet labs are underexplored. Recent years have seen growing interest among biology trainees in embarking on machine learning projects that complement their research. However, most machine learning courses and tutorials require substantial background knowledge in coding and mathematics, which many biologists may lack. Bioinformatics workshops for biologists, on the other hand, assume less coding experience, but participants are often taught to run mechanically through a software pipeline for certain tasks without learning the best practices for each stage of the workflow. Such an approach, though effective in the short term, can lead to error-prone data analysis, misinterpretation of results, and difficulty in adapting to other tasks over the course of a scientist’s research. The community needs to explore novel educational frameworks to address these challenges in teaching machine learning to biologists.

Unlike traditional task-centric approaches, our educational objective is to equip biologists with the proper mindset for applying machine learning in their research and the ability to critically analyze machine learning applications in their domain. Built around this core idea, our ml4bio workshop prioritizes teaching machine learning literacy, that is, the right way to set up learning problems, how to reason about learning algorithms, and how to assess learned models. We have developed interactive software with a graphical interface and a set of accompanying slides and tutorials for use during workshop sessions. The software and interactive exercises guide participants through a full cycle of the machine learning workflow, including proper model training, validation, selection, and testing. By following instructions in the slides and tutorials, participants build intuition about the strengths and weaknesses of various model classes and evaluation metrics by visualizing model behavior under different data distributions and sets of model hyperparameters. We further attempt to bridge the gap between theory and practice by illustrating machine learning applications on real biological tasks. Overall, our approach encourages beginners to take a holistic view of the machine learning workflow rather than immediately dive into the technicalities of coding and mathematics. We have successfully offered two pilot workshops attended by graduate students and postdocs with diverse backgrounds and research interests. The feedback we collected provides strong preliminary evidence of the effectiveness of our approach.
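The training, validation, selection, and testing cycle described above can be illustrated with a short sketch. This is not the ml4bio code: it is a minimal example in plain Python (standard library only), and the synthetic two-class dataset, the k-nearest-neighbors classifier, the candidate values of k, and every function name are assumptions chosen for illustration.

```python
import random


def make_toy_data(n, rng):
    """Generate n labeled 2-D points from two noisy classes (synthetic, illustrative)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 2.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def knn_predict(train, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        train,
        key=lambda ex: (ex[0][0] - point[0]) ** 2 + (ex[0][1] - point[1]) ** 2,
    )[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_one > k else 0


def accuracy(train, examples, k):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, p, k) == y for p, y in examples)
    return correct / len(examples)


rng = random.Random(0)  # fixed seed so the split is reproducible
data = make_toy_data(300, rng)

# Split once into training, validation, and test sets.
train, val, test = data[:180], data[180:240], data[240:]

# Model selection: compare hyperparameter settings on the validation set only.
candidate_ks = [1, 5, 15]  # odd values avoid tied votes
val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: val_scores[k])

# Testing: the held-out test set is used once, after selection is finished.
test_accuracy = accuracy(train, test, best_k)

print("validation accuracy per k:", val_scores)
print("selected k:", best_k)
print("test accuracy:", test_accuracy)
```

The point of the sketch is the separation of roles: the hyperparameter is chosen using only the validation set, so the test accuracy remains an honest estimate of performance on unseen data.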

Moving forward, our short-term plan is to tailor the workshop material to better serve our educational objective and the needs of participants. The current version of the software supports only classification models. In future releases, we will expand the set of models to include those for regression and clustering. We are also looking for new biological case studies that highlight good and bad practices of machine learning in the biological literature. Our long-term software development plan is to link the ml4bio graphical interface more closely to the Python scikit-learn code on which it is built, in order to guide participants who later wish to customize their own machine learning pipelines. Our ultimate goal is the national distribution of the workshop. As an initial step toward this end, we are working closely with educators and facilitators on and off campus to outline a timetable for future workshop development and to adopt best practices from successful workshops such as Software and Data Carpentry. Our workshop materials are available at https://github.com/gitter-lab/ml-bio-workshop/ under the CC-BY-4.0 license, and our ml4bio software is available at https://github.com/gitter-lab/ml4bio/ and on PyPI under the MIT license.

Cite this work

Researchers should cite this work as follows: