
Introduction to nucleotide sequence analysis and protein modeling in MEGA and PyMol using coronavirus SARS-CoV-2

Author(s): Maria Shumskaya (1), Christopher Zambell (1), Nicholas Lorusso (2)

(1) Kean University; (2) University of North Texas at Dallas




Introduction to computational approaches in phylogeny and protein modeling, based on the coronavirus SARS-CoV-2 (which caused the COVID-19 pandemic). Two self-guided tutorials for standard 2.5-hour lab classes. Level: undergraduate students majoring in biology.


This exercise is designed to introduce learners to how a variety of computational approaches can be used to answer biological questions. It is built around a specific example: the coronavirus SARS-CoV-2, which caused the 2019-2020 pandemic.

The exercise consists of two assignments, each designed for one standard 2.5-hour lab period.

During lab 1, publicly available nucleotide sequences of SARS-CoV-2 and other viruses will be used to compare similarity and generate hypotheses about the relationships among SARS-related coronaviruses from bats, pangolins, hares, and humans (SARS-CoV, SARS-CoV-2, MERS). The freely available software MEGA (Kumar et al. 2018) will be used to compare different RNA viruses within the Coronaviridae. The learner will 1) align sequences of the RNA-dependent RNA polymerase (RdRp) gene, 2) create a hypothetical phylogeny using maximum parsimony, and 3) create a second hypothetical phylogeny using maximum likelihood. We will then compare the resulting predictions, hypothesize about the origins of SARS-CoV-2, and consider the strengths and weaknesses of each approach.
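To make the idea behind maximum parsimony concrete, the following minimal sketch scores two candidate tree topologies by counting the fewest nucleotide changes each one requires, using the Fitch small-parsimony algorithm. It is an illustration only and is not part of the module: the four short sequences, the taxon labels, and both topologies are invented for the example, and a program such as MEGA searches over many topologies rather than comparing two hand-written ones.

```python
# Toy, already-aligned sequences. These are invented for illustration
# and are NOT real viral data.
ALIGNMENT = {
    "human": "ACGTA",
    "pangolin": "ACGTT",
    "bat": "ACCTT",
    "outgroup": "TCCAT",
}

# A tree is either a leaf name (str) or a pair of subtrees (2-tuple).
# Both topologies are hypothetical.
TREE_A = ((("human", "pangolin"), "bat"), "outgroup")
TREE_B = ((("human", "bat"), "pangolin"), "outgroup")


def fitch_site(tree, column):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_cost = fitch_site(tree[0], column)
    right_states, right_cost = fitch_site(tree[1], column)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: one change must occur on this part of the tree.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


if __name__ == "__main__":
    for label, tree in (("A", TREE_A), ("B", TREE_B)):
        print(f"Tree {label}: {parsimony_score(tree, ALIGNMENT)} changes")
```

Under the parsimony criterion, the topology with the lower score is preferred. Maximum likelihood instead compares trees by their probability under an explicit substitution model, which is one reason the two methods can disagree.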

During lab 2, the learner will use PyMol ("The PyMOL Molecular Graphics System" 2010), which is free for educational use, to visualize the receptor-binding domain of the SARS-CoV-2 spike protein together with the host receptor ACE2, and to label the regions and amino acids important for receptor binding and antibody recognition.
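One common way to shortlist residues that may matter for binding is to look for residues on the two chains that lie within a few ångströms of each other. The sketch below shows that idea in plain Python. It is not taken from the handout: the chain labels, residue names, residue numbers, coordinates, and the 4 Å cutoff are all assumptions chosen for illustration, and a real analysis would read atoms from a structure file.

```python
import math

# Hypothetical atoms: (chain, residue_name, residue_number, x, y, z).
# Chain "A" stands in for the receptor and chain "E" for the viral domain.
# All values are invented for illustration.
ATOMS = [
    ("A", "LYS", 10, 0.0, 0.0, 0.0),
    ("A", "ASP", 11, 8.0, 0.0, 0.0),
    ("E", "ASN", 100, 3.0, 0.0, 0.0),
    ("E", "GLY", 101, 20.0, 0.0, 0.0),
]


def interface_residues(atoms, chain_a, chain_b, cutoff=4.0):
    """Return residues from both chains that have an atom within `cutoff`
    (in angstroms) of an atom on the other chain."""
    side_a = [atom for atom in atoms if atom[0] == chain_a]
    side_b = [atom for atom in atoms if atom[0] == chain_b]
    contacts = set()
    for a in side_a:
        for b in side_b:
            # Elements 3..5 of each tuple are the x, y, z coordinates.
            if math.dist(a[3:], b[3:]) <= cutoff:
                contacts.add(a[:3])
                contacts.add(b[:3])
    return sorted(contacts)


if __name__ == "__main__":
    for chain, name, number in interface_residues(ATOMS, "A", "E"):
        print(f"chain {chain}: {name}{number}")
```

Residues flagged this way are only candidates: proximity suggests a possible contact, while the published structural studies cited below identify which residues actually contribute to binding.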

Learning objectives:

After successful completion of this exercise, students will be able to:

  • Develop the abilities necessary to understand scientific inquiry.
  • Identify appropriate computational approaches addressing biological questions.
  • Evaluate data and their graphical representations.
  • Critically assess experimental results and draw conclusions based on the data.
  • Apply bioinformatics methods to solve biological problems:
    • Find and interpret data from major online databases such as NCBI GenBank and PDB.
    • Use basic bioinformatics software such as MEGA and PyMol, and analyze the results that such software provides.

Activity software:

MEGA X and PyMol

Dataset used:

All nucleotide sequences are publicly available from NCBI GenBank. For teaching purposes, full genomic sequences were trimmed to keep only the RdRp gene.
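The trimming step can be pictured as slicing each genome down to the coordinates of one gene. The following sketch shows that operation on FASTA-formatted text using only the Python standard library. It does not reproduce how the module's dataset was prepared: the record names, sequences, and start/end coordinates are hypothetical, and real coordinates differ between genomes and must be taken from each record's annotation.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def trim_region(records, regions):
    """Keep only the 1-based, inclusive (start, end) region of each record."""
    trimmed = {}
    for name, seq in records.items():
        start, end = regions[name]
        trimmed[name] = seq[start - 1:end]
    return trimmed


def to_fasta(records, width=60):
    """Write records back out as FASTA text, wrapped at `width` characters."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical input: names, sequences, and coordinates are invented.
FASTA_TEXT = ">virus_1\nAAACCCGGG\nTTT\n>virus_2\nGGGTTTAAACCC\n"
REGIONS = {"virus_1": (4, 9), "virus_2": (1, 6)}

if __name__ == "__main__":
    genomes = parse_fasta(FASTA_TEXT)
    print(to_fasta(trim_region(genomes, REGIONS)), end="")
```

Giving each record its own coordinates matters because the same gene does not start at the same position in every genome.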

The protein sequence is publicly available from the Protein Data Bank (PDB):

What is included in this module:

1. Student handout to distribute in class. The handout is a step-by-step, self-guided tutorial for two classes that requires minimal assistance from the instructor.

2. Two PowerPoint presentations.

3. Dataset: FASTA file with nucleotide sequences to distribute to students during the first laboratory exercise on phylogeny.

4. YouTube recording of a live introduction to this activity:


The authors thank Daniel Fried and Christopher Zambell for their original tutorials, and Julia Annuzzi and Christian Meekins for critical reading of the manuscript.


Andersen K, Rambaut A, Lipkin WI, Holmes EC, Garry RF (2020) The proximal origin of SARS-CoV-2. Nature Medicine. doi:10.1038/s41591-020-0820-9

Domingo E, Perales C (2019) Viral quasispecies. PLOS Genetics 15. doi:10.1371/journal.pgen.1008271

Kumar S, Stecher G, Li M, Knyaz C, Tamura K (2018) MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Molecular Biology and Evolution 35: 1547-1549. doi:10.1093/molbev/msy096

Lan J, Ge J, Yu J, Shan S, Zhou H, Fan S, Zhang Q, Shi X, Wang Q, Zhang L, Wang X (2020) Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature. doi:10.1038/s41586-020-2180-5

Martinez J, Longdon B, Bauer S, Chan Y-S, Miller WJ, Bourtzis K, Teixeira L, Jiggins FM (2014) Symbionts Commonly Provide Broad Spectrum Resistance to Viruses in Insects: A Comparative Analysis of Wolbachia Strains. PLOS Pathogens 10. doi:10.1371/journal.ppat.1004369

Stadler K, Masignani V, Eickmann M, Becker S, Abrignani S, Klenk HD, Rappuoli R (2003) SARS - Beginning to understand a new virus. Nature Reviews Microbiology 1: 209-218. doi:10.1038/nrmicro775

The PyMOL Molecular Graphics System (2010) 1.3r1 edu ed. Schrödinger, LLC.

Walls AC, Park Y-J, Tortorici MA, Wall A, McGuire AT, Veesler D (2020) Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein. Cell. doi:10.1016/j.cell.2020.02.058

Wang C, Liu Z, Chen Z, Huang X, Xu M, He T, Zhang Z (2020) The establishment of reference sequence for SARS-CoV-2 and variation analysis. Journal of Medical Virology. doi:10.1002/jmv.25762

Zhang T, Wu Q, Zhang Z (2020) Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak. Current Biology 30: 1-6. doi:10.1016/j.cub.2020.03.022

Disclaimer: Because data on the novel coronavirus 2019-nCoV are emerging rapidly, this exercise does not present the most current findings but rather predictions based on publications available as of 3/30/2020. Nucleotide sequences were downloaded from NCBI GenBank and then trimmed for educational purposes. The authors will appreciate any adaptations suggested by colleagues.
