The bioinformatics competencies that NIBLSE recommends as essential for undergraduate life sciences students are listed below. The competencies are informed by the results of the national NIBLSE survey, analysis of ninety syllabi with bioinformatics content, and the cumulative expertise and experience of the authors. Following each competency is a list of three representative examples illustrating the competency.
A paper describing these core competencies was recently published in PLOS ONE.
C1. Explain the role of computation and data mining in addressing hypothesis-driven and hypothesis-generating questions within the life sciences. Life sciences students should have a clear understanding of the role computing and data mining play in modern biology. Given a traditional hypothesis-driven research question, students should have ideas about what types of data and software exist that could help them answer the question quickly and efficiently. They should also appreciate that mining large datasets can generate novel hypotheses to be tested in the lab or field.
- Compare and contrast computer-based research with wet-lab research.
- Explain the role of computation in finding genes, detecting the function of protein domains, and inferring protein function.
- Describe the role of various databases in identifying potential gene targets for drug development.
C2. Summarize key computational concepts, such as algorithms and relational databases, and their applications in the life sciences. To make use of sophisticated software and database tools, students should have a basic understanding of the principles upon which these tools are based and should be exposed to how these tools work.
- Explain the underlying algorithm(s) employed in sequence alignment (e.g., BLAST).
- Modify software parameters to achieve biologically meaningful results.
- Explain how data are organized in relational databases (e.g., NCBI and model organisms
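As one way to make the alignment example above concrete, the following minimal sketch computes a Smith-Waterman local alignment score by dynamic programming. This is not how BLAST itself works (BLAST uses a faster seed-and-extend heuristic), but it is the exact method that such heuristics approximate. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not recommended settings.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions; real tools use
    substitution matrices and affine gap penalties.
    """
    # prev holds the previous row of the dynamic-programming matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Open a gap in one sequence or the other.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Local alignment: a score never drops below zero.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Changing the match, mismatch, and gap values and watching the score change is a small-scale version of modifying software parameters to obtain biologically meaningful results.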
C3. Apply statistical concepts used in bioinformatics. In addition to the basic statistics found in many biology curricula, modern life scientists should have an understanding of the statistics of large datasets and multiple comparisons.
- Understand that there is a probability of finding a given sequence similarity score by chance (the P value) and that the size of the target database affects how many hits with a particular score you expect to see by chance in a particular search (the E-value).
- Explain the statistical modeling used to identify differentially expressed genes.
- Interpret data from a well-designed drug trial.
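The relationship between score, database size, and chance hits in the first example can be sketched with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and S is the raw alignment score. The P value for seeing at least one such hit is then 1 - exp(-E). The values of lambda and K below are assumed for illustration only; in practice they depend on the scoring system, and the function names are hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.267, k=0.041):
    """E-value: expected number of chance hits scoring at least `score`.

    lam and k are illustrative assumptions; real values depend on the
    substitution matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit, given the E-value."""
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    small_db = expected_hits(score=100, query_len=300, db_len=1e8)
    large_db = expected_hits(score=100, query_len=300, db_len=2e8)
    # Doubling the database size doubles the E-value for the same score.
    print(small_db, large_db, p_value(small_db))
```

For small E-values the P value and the E-value are nearly equal, which is why the two are often used interchangeably for strong hits.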
C4. Use bioinformatics tools to examine complex biological problems in evolution, information flow, and other important areas of biology. This competency is written broadly so as to encompass a variety of problems that can be addressed using bioinformatics tools, such as understanding the evolutionary underpinnings of sequence comparison and homology detection; distinguishing between genomic sequences, RNA sequences, and protein sequences; and interpreting phylogenetic trees. “Complex” biological problems require that students be able to work through a problem with multiple steps, not just perform isolated tasks.
- Using multiple lines of evidence, annotate a gene.
- Develop and interpret a “tree of life” based on a BLAST search, multiple alignment, and
- Explain how a mutation in a gene causes cancer, using a genome browser to identify the gene, transcript, and affected protein, and tools such as OMIM, GO, and KEGG to place it in the context of a function and pathway important to the disease.
C5. Find, retrieve, and organize various types of biological data. Given the numerous and varied datasets currently being generated from all of the ‘omics fields, students should develop the facility to: identify appropriate data repositories; navigate and retrieve data from these databases; and organize data relevant to their area of study in flat files or small local stand-alone databases.
- Navigate and retrieve data from genome browsers.
- Retrieve data from protein and genome databases (e.g., PDB, UniProt, NCBI).
- Store and interrogate small datasets using spreadsheets or delimited text files.
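Storing and interrogating a small dataset in a delimited text file, as in the last example, might look like the following sketch. The gene names, organisms, and lengths are made-up placeholders, and the column names and function names are assumptions chosen for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["gene", "organism", "length_bp"]

# Placeholder records, not real measurements.
RECORDS = [
    {"gene": "geneA", "organism": "species1", "length_bp": "1200"},
    {"gene": "geneB", "organism": "species1", "length_bp": "450"},
    {"gene": "geneC", "organism": "species2", "length_bp": "3100"},
]


def write_table(path, records):
    """Store records as a tab-delimited text file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerows(records)


def genes_longer_than(path, min_length):
    """Return the names of genes whose length exceeds min_length."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [row["gene"] for row in reader if int(row["length_bp"]) > min_length]


# Write the table to a temporary directory, then query it.
with tempfile.TemporaryDirectory() as workdir:
    table_path = os.path.join(workdir, "genes.tsv")
    write_table(table_path, RECORDS)
    long_genes = genes_longer_than(table_path, 1000)

print(long_genes)
```

A tab-delimited file like this opens directly in a spreadsheet, so the same data can be inspected by hand and queried by script.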
C6. Explore and/or model biological interactions, networks and data integration using bioinformatics. Modeling of biological systems at all levels, from cellular to ecological, is being facilitated by technological and algorithmic advances. These models provide novel insights into the perturbations in systems that can cause disease, interactions of microbes with various eukaryotic systems, how metabolic networks respond to environmental stresses, etc. Students should be familiar with the techniques used to generate these analyses and should be able to interpret the outputs and use the data to generate novel hypotheses.
- Predict impact of a gene knockout on a cell-signaling pathway.
- Analyze gene expression data to build an expression network.
- Analyze metagenomics data from microbial samples obtained from environmental sources.
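A very simplified version of the gene knockout example can be sketched by treating a signaling pathway as a directed graph and asking which downstream components can no longer be reached from the receptor once a gene is removed. The pathway below is entirely hypothetical, and reachability is a crude stand-in for real pathway models, which also account for inhibition, feedback, and signal strength.

```python
from collections import deque

# Hypothetical pathway: each key activates the components in its list.
PATHWAY = {
    "receptor": ["kinase1", "kinase2"],
    "kinase1": ["tf_a"],
    "kinase2": ["tf_a", "tf_b"],
    "tf_a": ["target_gene"],
    "tf_b": [],
    "target_gene": [],
}


def reachable(graph, start, knocked_out=frozenset()):
    """Breadth-first search for nodes reachable from start, skipping knockouts."""
    if start in knocked_out:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in knocked_out:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def lost_on_knockout(graph, start, gene):
    """Components that lose their signal when `gene` is knocked out."""
    before = reachable(graph, start)
    after = reachable(graph, start, knocked_out={gene})
    return before - after - {gene}


if __name__ == "__main__":
    for gene in ("kinase1", "kinase2", "tf_a"):
        print(gene, sorted(lost_on_knockout(PATHWAY, "receptor", gene)))
```

In this toy network, knocking out kinase1 loses nothing downstream because kinase2 provides a redundant route to tf_a, which is the kind of prediction students can then test against experimental data.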
C7. Use command-line bioinformatics tools and write simple computer scripts. Most biological datasets (e.g., genomic and proteomic sequences, BLAST results, RNA-Seq and resulting differential expression data) are available as text files; the most powerful and dynamic way to interact with these datasets is through the command line or shell scripting. Students should be able to manipulate their own data and to create and modify complex data processing and analysis workflows.
- Run BLAST using command line options.
- Build and run statistical analyses using R or Python scripts.
- Write simple shell scripts to manipulate files.
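Because BLAST results are plain text, a short script is enough to filter them. The sketch below assumes the default 12-column tabular output produced by BLAST+ with `-outfmt 6` and keeps only hits at or below an E-value cutoff. The two sample lines are invented for illustration, and the function name and cutoff are assumptions.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(handle, max_evalue=1e-5):
    """Return hits (as dicts) whose E-value is at or below max_evalue."""
    hits = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        hit = dict(zip(BLAST_COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Invented sample lines standing in for a real results file.
SAMPLE = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t365\n"
    "query1\tsubjB\t75.00\t80\t20\t0\t10\t89\t5\t84\t0.5\t30.1\n"
)

hits = parse_blast_tabular(io.StringIO(SAMPLE))
for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

To use this on real output, replace the `io.StringIO(SAMPLE)` handle with an opened results file.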
C8. Describe and manage biological data types, structure, and reproducibility. This competency addresses two distinct concerns: 1) each of the varied ‘omics fields produces data in formats particular to its needs, and these formats evolve with changes in technologies and refinements in downstream software; and 2) all experimental data is subject to error and the user must be cognizant of the need to verify the reproducibility of their data. Students need to develop an awareness of different data types, and the ability to manipulate them, given the versioning of formats. They also need to exercise caution, to carry out appropriate statistical analyses on their data as part of normal operating procedures and report the uncertainty of their results, and to provide the relevant information to enable reproduction of their results.
- Describe the various sequence formats used to store DNA and protein sequences (e.g., FASTA, FASTQ).
- Understand the representation of gene features using General Feature Format (GFF) files.
- Compare reproducibility of biological and technical replicate data (e.g., transcriptomic data) using statistical tests (Spearman rank test and false discovery calculations).
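The difference between the FASTA and FASTQ formats named in the first example can be shown with two small readers. This is a minimal sketch that assumes well-formed input: FASTA headers are non-empty, FASTQ records occupy exactly four lines each, and quality strings use the Phred+33 encoding. The function names are hypothetical, and established libraries handle many more edge cases.

```python
def read_fasta(lines):
    """Map each sequence name to its sequence, joining wrapped lines."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after '>' (assumed non-empty).
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def read_fastq(lines):
    """Return (name, sequence, qualities) tuples from four-line records."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    records = []
    for i in range(0, len(rows) - 3, 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        # Phred+33: quality score = ASCII code of the character minus 33.
        records.append((header[1:], seq, [ord(c) - 33 for c in qual]))
    return records


if __name__ == "__main__":
    print(read_fasta([">seq1 example", "ACGT", "ACGT", ">seq2", "GG"]))
    print(read_fastq(["@read1", "ACGT", "+", "II5!"]))
```

FASTA stores only names and sequences, while FASTQ adds a per-base quality score, which is why the two formats serve different stages of a sequencing workflow.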
C9. Interpret the ethical, legal, medical, and social implications of biological data. The increasing scale and penetration of human genetic and genomic data has greatly enhanced our ability to identify disease-related loci, druggable targets, and potential genes for replacement with developing techniques. However, with this information also come many ethical, legal, and social questions; suggested resolutions are often outpaced by the technological advances. As part of their scientific training, students should debate the medical, societal, and ethical implications of these information sets and techniques.
- Explain the implications, good and bad, of being able to walk into a doctor’s office and have your genome sequenced and analyzed or of being able to obtain information from direct-to-consumer testing services.
- Discuss different perspectives on who should have access to this data and how it should be protected.
- Describe how the scientific community protects against the falsification or manipulation of large datasets.