Introduction to nucleotide sequence analysis and protein modeling in MEGA and PyMol using coronavirus SARS-CoV-2

Author(s): Maria Shumskaya1, Christopher Zambell1, Nicholas Lorusso2

1. Kean University 2. University of North Texas at Dallas

An introduction to computational approaches in phylogeny and protein modeling, based on the coronavirus SARS-CoV-2 (the cause of the COVID-19 pandemic). Two self-guided tutorials for standard lab classes of 2.5 hours each. Level: undergraduate students majoring in biology.

Version 1.2 - published on 03 Feb 2022 doi:10.25334/GBJ6-A006


This exercise is designed to introduce learners to how a variety of computational approaches can be used to answer biological questions. It is built around a specific example: the coronavirus SARS-CoV-2, which caused the 2019-2020 pandemic.

The exercise consists of two assignments, each designed for one standard 2.5-hour lab period.

During lab 1, publicly available nucleotide sequences of SARS-CoV-2 and other viruses will be used to compare similarity and generate hypotheses about the relationships among SARS-related coronaviruses from bats, pangolins, hares, and humans (SARS-CoV, SARS-CoV-2, MERS). MEGA (Kumar et al. 2018), a freely available software package, will be used to compare different RNA viruses within the Coronaviridae. The learner will 1) align sequences of the RNA-dependent RNA polymerase (RdRp) gene, 2) create a hypothetical phylogeny using maximum parsimony, and 3) create a second hypothetical phylogeny using maximum likelihood. We will then compare the resulting predictions, hypothesize about the origins of SARS-CoV-2, and consider the strengths and weaknesses of each approach.
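To give a feel for what "comparing similarity" means once sequences are aligned, here is a minimal Python sketch that computes the proportion of differing sites (the p-distance) for each pair of aligned sequences. It is an illustration only, not part of the handout and not how MEGA is operated in class. The sequence names and the short toy alignment are hypothetical placeholders, not real RdRp data.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites, ignoring columns where either sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip alignment gaps
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_table(alignment):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (NOT real viral sequences).
toy_alignment = {
    "virus_A": "ATGGCTAACGTT",
    "virus_B": "ATGGCTAACGTA",
    "virus_C": "ATGACTTACG-A",
}

for pair, dist in distance_table(toy_alignment).items():
    print(pair, round(dist, 3))
```

Smaller distances suggest closer relationships. Maximum parsimony and maximum likelihood go further than this by evaluating candidate trees rather than only pairwise differences.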

During lab 2, the learner will use PyMol ("The PyMOL Molecular Graphics System" 2010), software that is free for educational use, to visualize the receptor-binding domain of the spike protein of SARS-CoV-2 together with the host receptor ACE2, and to label the regions and amino acids important for receptor binding and antibody recognition.
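The kind of visualization described for lab 2 can also be scripted. The sketch below is a hedged illustration, not the handout's procedure: it is a small Python helper that writes a PyMOL command script as text. The PDB entry 6M0J, the chain letters (A for ACE2, E for the spike receptor-binding domain), and the residue numbers are assumptions chosen for illustration. Check them against the structure and residue list given in the student handout before use.

```python
def build_pymol_script(pdb_id, receptor_chain, rbd_chain, rbd_residues):
    """Return PyMOL commands that show two chains and label selected RBD residues.

    All arguments are supplied by the user; nothing here is taken from the handout.
    """
    if not rbd_residues:
        raise ValueError("provide at least one residue number")
    resi = "+".join(str(r) for r in rbd_residues)  # PyMOL list syntax, e.g. 417+484
    lines = [
        "fetch {}".format(pdb_id),
        "hide everything",
        "show cartoon",
        "color green, chain {}".format(receptor_chain),
        "color cyan, chain {}".format(rbd_chain),
        "select rbd_sites, chain {} and resi {}".format(rbd_chain, resi),
        "show sticks, rbd_sites",
        # Label each selected residue once, at its alpha carbon.
        'label rbd_sites and name CA, "%s%s" % (resn, resi)',
        "zoom rbd_sites",
    ]
    return "\n".join(lines) + "\n"


# Assumed example values (placeholders, not the handout's list).
script = build_pymol_script("6m0j", "A", "E", [417, 484, 501])
print(script)
```

Saving the printed text to a file with a .pml extension and opening it in PyMol should run the commands in order. The same steps can be carried out through the graphical interface.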

Learning objectives:

After successful completion of this exercise, students will be able to:

  • Develop the abilities necessary to understand scientific inquiry.
  • Identify appropriate computational approaches addressing biological questions.
  • Evaluate data and graphical representation.
  • Critically assess experimental results and draw conclusions based on the data.
  • Apply bioinformatics methods to solve biological problems:
    • Find and interpret data from major online databases such as NCBI GenBank and PDB.
    • Use basic bioinformatics software such as MEGA and PyMol and analyze the results provided by such software.

Activity software:

MEGA X and PyMol

Dataset used:

All nucleotide sequences are publicly available from NCBI GenBank. Full genomic sequences were trimmed to keep only the RdRp gene for teaching purposes.
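As an illustration of what trimming a genome to a single gene involves, the following minimal Python sketch reads FASTA text, cuts each record to a coordinate range, and writes FASTA back out. It is an assumption about the general approach, not the authors' actual procedure. The record names, sequences, and coordinates are hypothetical placeholders. Real RdRp coordinates differ between genomes and would come from each GenBank record's annotation.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of record name -> sequence."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep the first word of the header
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data found before any header line")
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def trim_region(sequence, start, end):
    """Return positions start..end (1-based, inclusive, as in GenBank features)."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("coordinates fall outside the sequence")
    return sequence[start - 1:end]


def to_fasta(records, width=60):
    """Format a dict of name -> sequence as FASTA text."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical toy input (NOT real genomes or real gene coordinates).
toy_fasta = ">toy_genome_1 placeholder\nAAACCCGGG\nTTTAAA\n>toy_genome_2\nGGGTTTCCCAAAGGG\n"
toy_coords = {"toy_genome_1": (4, 9), "toy_genome_2": (4, 9)}

genomes = parse_fasta(toy_fasta)
trimmed = {n: trim_region(s, *toy_coords[n]) for n, s in genomes.items()}
print(to_fasta(trimmed))
```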

The protein sequence is publicly available from the Protein Data Bank (PDB):

What is included in this module:

1. Student handout to distribute in class. The handout is a step-by-step, self-guided tutorial for two classes that requires minimal assistance from the instructor.

2. Two PowerPoint presentations.

3. Dataset: FASTA file with nucleotide sequences to distribute to students during the first laboratory exercise on phylogeny.

4. YouTube video: a recording of a live introduction to this activity:


The authors thank Daniel Fried and Christopher Zambell for their original tutorials, and Julia Annuzzi and Christian Meekins for critical reading of the manuscript.


Disclaimer: Because data on the novel coronavirus 2019-nCoV are emerging rapidly, this exercise does not present the most current findings but rather predictions based on publications available as of 3/30/2020. Nucleotide sequences were downloaded from NCBI GenBank and then trimmed for educational purposes. The authors would appreciate any adaptations suggested by colleagues.

