Exercise

Introduction to BLAST using human leptin

Authors: Justin R. DiAngelo, Alexis Nagengast, Wilson Leung
Last Update: Aug 31, 2019
Version: 0.0.1

What is BLAST?

BLAST stands for Basic Local Alignment Search Tool and it is a program that reports regions of similarity (at the nucleotide or protein level) between a query (your input) sequence and sequences within a database. BLAST uses a robust statistical framework that determines if the alignment between two sequences is statistically significant (i.e. has a low probability of the reported alignment being produced by chance alone). The ability to detect sequence similarity allows scientists to determine if a gene or a protein is related to other known genes or proteins in the same species or between species.
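To make "statistically significant" more concrete, the sketch below uses the standard Karlin-Altschul relationship between a normalized bit score S' and the expected number of chance hits, E = m × n × 2^(−S'), where m is the query length and n is the total database length. The function name and the numbers are hypothetical, chosen only for illustration. BLAST itself uses adjusted ("effective") lengths, so its reported E-values will differ somewhat from this simplified estimate.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Simplified Karlin-Altschul estimate: E = m * n * 2^(-S').

    bit_score is the normalized bit score S'; query_len (m) and db_len (n)
    are raw lengths here, whereas BLAST uses effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical values, for illustration only.
low_score = expected_chance_hits(bit_score=20.0, query_len=200, db_len=100_000_000)
high_score = expected_chance_hits(bit_score=50.0, query_len=200, db_len=100_000_000)

# A higher bit score means far fewer alignments are expected by chance alone.
print(f"E at 20 bits: {low_score:.2e}")
print(f"E at 50 bits: {high_score:.2e}")
```

Each additional bit of score halves the expected number of chance hits, which is why high-scoring alignments are so unlikely to arise at random.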

The theory of evolution is based on all organisms descending from common ancestors by speciation. At the molecular level, an ancestral DNA sequence diverges over time (through accumulation of point mutations, duplications, deletions, transpositions, recombination events, etc.) to produce diverse sequences in the genomes of living organisms. Such sequences are classified as homologs if they come from the same ancestral gene.

Mutations in genes with an important biological function have a higher probability of being harmful to the organism and are less likely to become fixed in a population. Such sequences are said to be under negative selection, which causes them to be conserved against change over time. Therefore, it is expected that two homologous copies of a functional sequence will show a higher degree of sequence conservation (observed as base-by-base similarity at the nucleotide level) than either two unrelated sequences or two sequences that are not under strong negative selection. This similarity is the “signal” detected by a BLAST search.
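As a rough illustration of base-by-base similarity, the following sketch computes percent identity between two sequences that have already been aligned (gaps written as "-"). The function name and the toy sequences are made up for this example. It is a simplification: columns containing a gap are skipped here, whereas BLAST reports identity over the full alignment length.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped alignment columns where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap character.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Toy sequences that differ at one of eight positions.
print(percent_identity("ATGCATGC", "ATGAATGC"))  # 87.5
```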

Obtaining sequence using NCBI

Learning Objectives

  • Obtain a desired DNA or protein sequence from the NCBI public database

The National Center for Biotechnology Information (NCBI) hosts public databases of molecular biology information, including sequences from thousands of species ranging from mammals to fungi. We will explore some of the basic functionality of the NCBI web site using leptin (LEP), a gene in which mutations have been associated with severe obesity and the development of type 2 diabetes. First, open a web browser and navigate to the NCBI web site at https://www.ncbi.nlm.nih.gov. To get information on obesity, click on the “Genetics & Medicine” link on the left (Figure 1), scroll about one third of the way down the page, and click on the “Genes and Disease” link. Scroll down the page and click on the “Nutritional and Metabolic Diseases” link, then click on the “Obesity” link to find a non-technical description of the hormone leptin and its role in weight control. On the right panel, you will find links to other parts of NCBI that contain more information about this disease. For example, you can access the corresponding gene record in the OMIM database (a catalog of human genes and disorders) through the “OMIM” link.

Figure 1. Click on the “Genetics & Medicine” link to access the “Genes and Disease” database.

Question 1

Based on the information on this page, how does leptin control feeding?

To obtain the sequence for the human LEP gene, go back to the NCBI homepage at https://www.ncbi.nlm.nih.gov/. At the top of the page, use the pull-down menu next to “Search” and select Gene. Enter LEP homo sapiens into the text box and click Search (Figure 2).

Note

This search menu is also available at the top of the NCBI Bookshelf page. Hence you could scroll to the top of the page, click on the drop-down box to select Gene and then search for LEP homo sapiens directly. You can also access the Entrez gene record through the “Entrez Gene” link under the “Gene sequence” section on the right panel.

Figure 2. Search for “LEP homo sapiens” in the NCBI Gene database

Scroll down to the “Search results” section. At the time of writing, this search produced 54 results, with the first entry being the gene record for human leptin; the exact count may change as the database is updated. The remaining results are records that may mention leptin in their detailed summaries. Even before we click on the leptin entry, we can already obtain some useful information about the leptin gene from the search results page. For example, the chromosomal location and OMIM entry number for leptin are shown.
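The same Gene search can also be expressed as a request to NCBI's E-utilities web service. The sketch below only builds the ESearch URL and does not contact the server. The helper name gene_search_url and the retmax value are choices made for this example, not part of the exercise.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that queries the Gene database for `term`."""
    params = {
        "db": "gene",           # same database chosen in the pull-down menu
        "term": term,           # same text typed into the search box
        "retmax": max_results,  # number of record IDs to return
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(gene_search_url("LEP homo sapiens"))
```

Opening the printed URL in a browser should return a list of Gene record IDs rather than the formatted results page shown in Figure 2.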

Question 2

According to these search results, on which chromosome is the leptin gene located?

Click on the first match to the human LEP gene to learn more about this gene and its sequence. The Entrez Gene record for LEP shows lots of detailed information about gene structure and function.

The first section of the Entrez Gene record is the Summary section and it contains:

  • The official symbol and name approved by HUGO Gene Nomenclature Committee (HGNC)
  • Other synonyms that have been used to describe this gene
  • The organism the gene is from and the lineage of that organism
  • Links to external databases (e.g., HGNC and Ensembl) that contain similar information
  • Link to the Online Mendelian Inheritance in Man (OMIM, abbreviated MIM in the record) entry, which provides comprehensive information intended for the medical and scientific research community
  • RefSeq status. The NCBI Reference Sequence Database (RefSeq) is a comprehensive, curated database of non-redundant sequence records. The accession number for a RefSeq record begins with two characters, followed by an underscore. This prefix denotes the type of sequence record:
    • chromosomes: NC_
    • genomic regions: NG_
    • mRNA: NM_
    • proteins: NP_
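The prefix rule above can be captured in a small lookup table. This is a minimal sketch covering only the four prefixes listed here; the function name is hypothetical, and RefSeq defines additional prefixes that are reported as "unknown prefix" by this example.

```python
# Only the four prefixes listed in this section are included.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NG": "genomic region",
    "NM": "mRNA",
    "NP": "protein",
}


def refseq_record_type(accession: str) -> str:
    """Return the record type implied by a RefSeq accession prefix."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    return REFSEQ_PREFIXES.get(prefix.upper(), "unknown prefix")


print(refseq_record_type("NP_000221.1"))  # protein
print(refseq_record_type("NM_000230.3"))  # mRNA
```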

The “Genomic regions, transcripts, and products” section has a map of the gene with links to the sequence. The sequence record can be retrieved in many different formats, including FASTA (a simple text format that contains only a definition line and the sequence, which makes it the easiest format for BLAST searching) and GenBank (a more descriptive format with the sequence listed at the end). The leptin gene is on the positive strand of the chromosome (meaning that if the chromosome were oriented from left to right, the gene would be on the strand running from 5’ to 3’). Specifically, the leptin gene can be found at 128,241,201-128,257,629 of human chromosome 7 in the current assembly (GRCh38.p13).
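The coordinates quoted above also give the size of the gene region. The sketch below parses the range as it is written on the page and computes its length, assuming the coordinates are 1-based and inclusive (the usual convention for positions displayed in NCBI records). The helper names are made up for this example.

```python
def parse_range(text: str) -> tuple[int, int]:
    """Parse a range such as '128,241,201-128,257,629' into two integers."""
    start_text, end_text = text.replace(",", "").split("-")
    return int(start_text), int(end_text)


def span_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


start, end = parse_range("128,241,201-128,257,629")
print(span_length(start, end))  # 16429
```

Under that assumption, the annotated LEP gene region spans 16,429 base pairs of chromosome 7.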

Question 3

How do you know that the leptin gene is on the + strand of the chromosome it’s located on?

Question 4

The picture depicts two annotated transcripts (XM_005250340.5 and NM_000230.3). What is the difference between the RefSeq record that begins with the XM_ prefix and the record that begins with the NM_ prefix? Both mRNA records have 4 green boxes, 2 light green in color (1 on each end) and 2 dark green. What do you think each of those boxes refers to?

Tip

  • Examine the “COMMENT” section of the GenBank record or the RefSeq FAQ page to determine the difference between the “NM_” and “XM_” RefSeq records.
  • Click on the green boxes in the diagram to get more information on the LEP gene.

Protein-coding sequences are generally better conserved between species at the amino acid level than at the nucleotide level, so we will use the protein sequence in our BLAST searches. The protein sequences for the two annotated transcripts are available under the “NCBI Reference Sequences (RefSeq)” section. Note that there are multiple options to obtain the protein sequence for the leptin precursor, including the RefSeq protein database (NP_000221.1), the Consensus CDS (CCDS) database (CCDS5800.1), the UniProtKB/Swiss-Prot database (P41159) and the UniProtKB/TrEMBL database (A4D0Y8).

To do your own BLAST search, scroll down to the “NCBI Reference Sequences (RefSeq)” section and click on the NP_000221.1 link.

Question 5

Why do you think you want the protein sequence as opposed to the nucleotide sequence?

This screen gives you information on the LEP protein sequence. Click on the FASTA link directly beneath the gene name at the top of the screen. This gives you the FASTA definition line (indicated by the leading > character) followed by the one-letter amino acid sequence. The definition line varies, but it always begins with the accession number of the sequence record (e.g., NP_000221.1), followed by the gene name and the organism. Copy the sequence, including the definition line, for use in a BLAST search.
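For readers who prefer to script this step, the sketch below shows two things: how the FASTA record for an accession could be requested through NCBI's E-utilities EFetch service (the URL is only built, not fetched), and how a FASTA record splits into its definition line and sequence. The function names are hypothetical, and the example record uses a placeholder description and a placeholder sequence, not the real leptin sequence.

```python
from urllib.parse import urlencode

# Address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def protein_fasta_url(accession: str) -> str:
    """Build an EFetch URL that returns a protein record in FASTA format."""
    params = {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def parse_fasta_record(text: str) -> tuple[str, str, str]:
    """Split one FASTA record into (accession, description, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' definition line")
    # The accession is the first word after '>'; the rest is the description.
    accession, _, description = lines[0][1:].partition(" ")
    # Sequence lines are wrapped in the file, so join them back together.
    sequence = "".join(lines[1:])
    return accession, description, sequence


# Placeholder record: the description and sequence below are NOT real data.
example = """>NP_000221.1 example description [Homo sapiens]
ACDEFGHIKL
MNPQRSTVWY
"""

accession, description, sequence = parse_fasta_record(example)
print(protein_fasta_url(accession))
print(accession, len(sequence))
```

When pasting a record into BLAST by hand, no parsing is needed; the web form accepts the definition line and sequence exactly as copied.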