
Data Science with SARS-CoV-2 Antibody Structures

Database growth since 1996

For the past 10 years, I've blogged about the annual Nucleic Acids Research (NAR) database issue. One of the challenges in teaching data science is having models and data to access. Bio databases are a rich source of both.

Last year I commented: “I typically cover the issue in January, but I got busy, and then a pandemic broke out, so this year I'm later than usual.” The pandemic continues, and I’m even later in getting this year’s summary written. Last year I wrote about how databases were responding to the pandemic. SARS-CoV-2 remains relevant, so I’ll continue the theme. But first, the numbers.

The opening overview by DJ Rigden and XM Fernández indicates that 90 new databases were added to the NAR compendium and 86 databases were removed in 2020. According to the editorial, the archive now lists 1641 databases. The net growth is small (four databases) because the number of new databases submitted to the archive has leveled out, while a similar number are being deprecated.

As one might imagine, database work related to SARS-CoV-2 (and coronaviruses) is active. The accompanying table (below the image) lists 25 coronavirus-related databases. Each entry includes its name and link, a brief description, area of focus (immunology, genomics, RNA, proteins, drugs, literature), and whether it is listed in the NAR compendium. Six of the resources are not listed in NAR, and of the 19 that are listed, five are new.

CoV-AbDab

Databases are cool, and the best way to learn about a resource is to explore it with a specific goal in mind. In between last year’s and this year’s writing, Digital World Biology was awarded an NSF-ATE (National Science Foundation, Advanced Technological Education) grant to use hackathons as a way to develop education resources for undergraduate research in antibody engineering. Apropos of our grant, COVID-19, and structural biology education resources, the Digital World Biology team is interested in collecting molecular structures of antibodies that are bound to the RBD epitope of the SARS-CoV-2 spike protein (spike:RBD), as a way to teach and explore how antibodies bind antigens via paratope (antibody) / epitope (antigen) interactions. Because CoV-AbDab (The Coronavirus Antibody Database) is a database of antibodies that bind to coronavirus proteins, and part of a collection of antibody structure databases, it is a resource worth exploring.

Click to view the full table

Brief Background

CoV-AbDab is supported by the Oxford Protein Informatics Group (OPIG), and is a member of a collection of antibody structural databases that includes SAbDab (The Structural Antibody Database), Thera-SAbDab (The Therapeutic Structural Antibody Database), and SAbDab-nano (a subset of SAbDab consisting of nanobody structures). The SAb databases collect antibody structure files from the PDB (RCSB Protein Data Bank) and curate these data by adding annotations about experimental details, antibody nomenclature (heavy and light chain pairings), affinity data, sequence annotations, and literature references. Other OPIG resources include SAbPred (The Antibody Prediction Toolbox; a collection of algorithms for predicting antibody properties), STCRDab (the Structural T-Cell Receptor Database), and OAS (Observed Antibody Space). OAS is a project seeking to collect and annotate immune repertoires.*

Antibody:spike:RBD Structures

As its name suggests, CoV-AbDab focuses on antibodies and nanobodies that bind to coronavirus (SARS-CoV-1, SARS-CoV-2, MERS, and others) proteins. The November 2021 database release holds 4,706 entries,** and its web-enabled search interface can be used to isolate subsets of the data using different combinations of 13 filters. In this way, one can learn that there are 1531 records of antibodies (Type = Antibody) that bind to SARS-CoV-2 (Binds to = SARS-CoV2), are noted as neutralizing (Neutralising against = SARS-CoV2), and bind to the spike:RBD (Protein/Epitope = Spike Protein:RBD).
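The same four-filter query can be sketched offline against a downloaded table. This is a minimal sketch, not the database's own method: the column names ("Ab or Nb", "Binds to", "Neutralising Vs", "Protein + Epitope"), the cell values, and the sample rows are assumptions made up for illustration, so check them against the header of the file you actually download.

```python
import csv
import io

# Made-up sample standing in for a downloaded CoV-AbDab CSV.
# Column names and cell values are assumptions for illustration only.
SAMPLE_CSV = """Name,Ab or Nb,Binds to,Neutralising Vs,Protein + Epitope
ab-001,Ab,SARS-CoV2,SARS-CoV2,S; RBD
ab-002,Ab,SARS-CoV2;SARS-CoV1,SARS-CoV2 (weak),S; RBD
nb-003,Nb,SARS-CoV2,SARS-CoV2,S; RBD
ab-004,Ab,SARS-CoV2,,S; NTD
ab-005,Ab,MERS-CoV,MERS-CoV,S; RBD
"""


def is_match(row):
    """Antibody that binds and neutralises SARS-CoV-2 via the spike RBD."""
    return (
        row["Ab or Nb"] == "Ab"
        and "SARS-CoV2" in row["Binds to"]
        and "SARS-CoV2" in row["Neutralising Vs"]
        and "RBD" in row["Protein + Epitope"]
    )


def filter_records(csv_text):
    """Return the rows of the CSV text that pass all four filters."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if is_match(row)]


matches = filter_records(SAMPLE_CSV)
print(len(matches), [row["Name"] for row in matches])
```

To run it on a real download, replace `io.StringIO(csv_text)` with an open file handle and adjust the column names and match strings to whatever the file uses.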

Data Science in a Spreadsheet

We can download a table (CSV [comma-separated values] or Excel) for the above list and do additional analyses to learn more. For example, the “Origin” column tells us something about the sample from which the antibody was derived, and the “Neutralising Vs” column tells us what virus(es) the antibody binds to and if its binding is weak. Using these data, we learn that most (83%) of the antibodies in the dataset are from human patient B cells. They bind well and are specific for SARS-CoV-2. 17% of the antibodies bind weakly, and a very small fraction cross-react with SARS-CoV-1. A few antibodies, all weak binders, came from healthy subjects, perhaps from other coronavirus infections. The dataset also includes antibodies from mice, phage display, and engineering experiments.
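For readers who prefer a script to a spreadsheet, a pivot-style tally like the one behind these percentages could look like the sketch below. The five records are invented, and the "(weak)" marker inside the “Neutralising Vs” value is an assumption about how weak binding is flagged, so the printed percentages illustrate the method, not the real dataset.

```python
from collections import Counter

# Invented records; only the two column names come from the post.
records = [
    {"Origin": "Human patient B cells", "Neutralising Vs": "SARS-CoV2"},
    {"Origin": "Human patient B cells", "Neutralising Vs": "SARS-CoV2 (weak)"},
    {"Origin": "Human patient B cells", "Neutralising Vs": "SARS-CoV2;SARS-CoV1"},
    {"Origin": "Mouse", "Neutralising Vs": "SARS-CoV2"},
    {"Origin": "Phage display", "Neutralising Vs": "SARS-CoV2"},
]


def percent_table(values):
    """Count each distinct value and express it as a percentage of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.most_common()}


def binding_label(neutralising):
    """Assumption: weak binders carry the word 'weak' in the cell text."""
    return "weak" if "weak" in neutralising.lower() else "not flagged weak"


origin_pct = percent_table(r["Origin"] for r in records)
weak_pct = percent_table(binding_label(r["Neutralising Vs"]) for r in records)

print(origin_pct)
print(weak_pct)
```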

1531 neutralizing antibodies is a lot of antibodies; how many have solved structures? 163 records have one or more structures that were obtained from the PDB. Another 1169 records have a structure that is predicted from homology modeling. The majority of the records with a PDB structure (116) have a single structure, as indicated by a single PDB ID or SAbDab link. 44 records contain between two and six PDB structures, and three records have seven, eight, and 11 PDB structures. Altogether, there are 265 PDB structures of antibodies that might bind to a spike:RBD epitope.

Origins of antibodies identified as neutralizing SARS-CoV-2 (data from the Sep 2021 CoV-AbDab release)


To get the information I described and show in the above graph, I used pivot analyses to count the occurrences of specific values in different columns. At first, I found that some of the data were too granular. That is, the “Origin” column data included details about mouse strains, phage display experiments, and other information that made the graph messy and distracted from this story. Thus, I converted the data to less granular descriptions such as “Mouse” or “Phage display.” For counting structures, I could see that some of the records have multiple PDB IDs. To count them, I had to create a new column and add a counting formula (number of “;” [the PDB ID separator] + 1). I’m not going to get into all of the details, but rather emphasize that this kind of work is the core of data science: shaping, cleaning, and asking questions.
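Both cleaning steps can be sketched in a few lines of Python. The keyword rules, the sample “Origin” strings, and the placeholder IDs are hypothetical; only the “;” separator and the “number of separators + 1” rule come from the description above. One difference from the spreadsheet formula is that the function returns 0 for an empty cell, where the bare formula would report 1.

```python
from collections import Counter

# Hypothetical keyword -> coarse label rules, checked in order.
ORIGIN_RULES = [
    ("phage", "Phage display"),
    ("mouse", "Mouse"),
    ("mice", "Mouse"),
    ("human", "Human"),
]


def coarsen_origin(origin):
    """Map a detailed Origin string to a less granular label."""
    text = origin.lower()
    for keyword, label in ORIGIN_RULES:
        if keyword in text:
            return label
    return "Other"


def count_structures(field):
    """Count ';'-separated IDs in a cell; an empty cell counts as zero."""
    field = field.strip()
    if not field:
        return 0
    return field.count(";") + 1


# Invented (Origin, structures) pairs with placeholder IDs.
rows = [
    ("B-cells; SARS-CoV2 Human Patient", "ID01;ID02"),
    ("Immunised Mouse (BALB/c)", "ID03"),
    ("Phage Display Library (human)", ""),
    ("B-cells; SARS-CoV2 Human Patient", "ID04; ID05; ID06"),
]

origin_counts = Counter(coarsen_origin(origin) for origin, _ in rows)
structure_counts = [count_structures(ids) for _, ids in rows]

print(origin_counts)
print(structure_counts, sum(structure_counts))
```

Rule order matters here: a phage display library built from human material is labeled “Phage display” because that rule is checked first. If the real file marks missing structures with a placeholder instead of an empty cell, that value would need the same guard.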

The Structures

The above data are interesting, but the big question is how many of the 265 structures are actually models of antibody/spike:RBD complexes. In that respect, the resource could be better. For example, to identify which of the 265 structures are what I want, I have to look at each structure individually. CoV-AbDab’s web interface provides links to resources with structure viewers to simplify the process, but 265 clicks, each followed by viewing a structure, is a lot of work. To test how this could work, I randomly clicked on a few of the PDB links and learned that some of the links are broken and that other structures were of just an antibody. This confirmed my suspicion that structures of antibody/spike:RBD complexes are a subset of the 265 available structures, minus the dead links. Spotting antibody-only structures, likely deposited as part of a research study exploring neutralizing antibodies, requires some experience too. Finally, which structures were inspected and what was learned must be recorded in some way outside of the interface. Curation is hard.
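One low-tech way to keep track of that inspection work is a checklist file with one row per PDB ID and blank columns to fill in by hand. The sketch below is only one possible layout: the checklist column names, the antibody names, and the placeholder IDs are all hypothetical.

```python
import csv
import io

# Hypothetical checklist columns; the last three are filled in by hand.
CHECKLIST_FIELDS = ["antibody", "pdb_id", "link_ok", "contains_rbd", "notes"]


def build_checklist(records):
    """Expand (name, ';'-separated IDs) pairs into one blank row per ID."""
    rows = []
    for name, structures in records:
        for pdb_id in structures.split(";"):
            pdb_id = pdb_id.strip()
            if pdb_id:
                rows.append({
                    "antibody": name,
                    "pdb_id": pdb_id,
                    "link_ok": "",
                    "contains_rbd": "",
                    "notes": "",
                })
    return rows


def write_checklist(rows, handle):
    """Write the checklist as CSV to any open text handle."""
    writer = csv.DictWriter(handle, fieldnames=CHECKLIST_FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Invented records with placeholder IDs.
records = [("ab-001", "ID01;ID02"), ("ab-002", ""), ("ab-003", "ID03")]
checklist = build_checklist(records)

buffer = io.StringIO()
write_checklist(checklist, buffer)
print(buffer.getvalue())
```

Writing to a file instead of the in-memory buffer (for example, a handle opened with `newline=""`) gives a spreadsheet-ready log that survives between sessions.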

In summary, OPIG’s antibody resources are great for studying antibody/nanobody structural biology. As evidenced by the numbers of antibodies added to CoV-AbDab between September and November 2021 (see ** below), the OPIG team is clearly keeping up with developments in SARS-CoV-2 immunology. The dataset is rich in detail. In addition to the fields discussed earlier, other fields include data about the V and J genes, their sequences, and the sequences of the CDRH3 and CDRL3 regions. As for structures relevant to our question, a few improvements to the database would turn an excellent resource into an outstanding resource. CoV-AbDab is also a great example of why specialized databases are needed, another topic I’ve discussed in the past. Finally, this exploration provides an example of how an existing resource can be used for introductory data science projects.

* To learn more about immune repertoires, check out What is Immunoprofiling and Immunoprofiling: How it works.
** I began writing this post in early Oct. At that time the database had approximately 3700 records. 902 were identified as neutralizing using the filters described above.

