Coronavirus Database

Overview

CoVdb extensively collects published coronavirus data and, as of the update on May 22, 2020, holds the genomes of 5709 strains. The strains were collected from 32 organisms between 1941 and 2020 (Figure 1). Of these, 3414 (59.8%) are human isolates and 217 (3.8%) are bat isolates; bats are regarded as the possible source of human coronaviruses. Porcine coronaviruses also account for a large share (945, 16.6%), and coronaviruses have caused damage to the pig industry. The number of documented human isolates varies by year and shows three peaks, which reflect the outbreaks of SARS-CoV in 2003, MERS-CoV in 2014-2015 and 2019-nCoV in 2019-2020, respectively. Using all documented coronavirus genomes in CoVdb, we generated a phylogenetic tree (Figure 2A), in which the nearest non-human isolate to 2019-nCoV is Bat_MN996532 (Bat-CoV-RaTG13), isolated from Rhinolophus affinis, a bat species in the family Rhinolophidae. Strains isolated from pangolins also lie close to 2019-nCoV; the pangolin was once considered a potential intermediate host of 2019-nCoV. We developed search tools that let users search within this large phylogenetic tree (Figure 2B).

On average, one coronavirus strain has 5-14 possible open reading frames (ORFs) or genes. We grouped homologous coronavirus genes (requiring identity > 0.5 and coverage > 0.8) into 628 clusters (for details, see Materials and Methods). This number indicates that the diversity among coronavirus strains is not low. Although the structure of coronaviruses is not complex, we performed a subcellular localization analysis of the 628 clusters to predict their roles in infection. Based on prediction only, 21% (133 clusters) are predicted to be located in the host nucleus or host cytoplasm, while 40% (250 clusters) are predicted to be membrane proteins (Figure S1). CoVdb includes more than 50000 function annotations and more than 300000 GO records. Using WEGO, we found that coronavirus genes are enriched in the membrane (Figure S2). We searched the Protein Data Bank (PDB) for possible 3D protein structures of coronavirus genes and found more than 3000000 mappings with an E-value < 0.05 and a coverage > 50%. For all coronavirus strains, using 9 representative human coronavirus genomes as references, we performed sliding window analyses of nucleotide diversity (Pi), Tajima’s D, composite likelihood ratio (CLR) and fixation index (Fst). For Pi, Tajima’s D and CLR, the target groups are strains belonging to 2019-nCoV, MERS, SARS, or other human coronaviruses. We ran the same analyses for human isolates, bat isolates, and isolates from other hosts documented in CoVdb. Fst is computed between human coronaviruses and one non-human coronavirus. All these data can be viewed in the genome browser.
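The grouping of homologous genes by identity and coverage thresholds can be illustrated with a minimal sketch. The actual procedure is described in Materials and Methods and may differ. The sketch assumes that pairwise hits are already available as (query, subject, identity, coverage) tuples and that genes are merged by single linkage; the function name `cluster_genes` and the input layout are hypothetical and are not part of CoVdb.

```python
from collections import defaultdict


def cluster_genes(genes, hits, min_identity=0.5, min_coverage=0.8):
    """Group genes by single linkage over hits that pass both thresholds.

    genes: iterable of gene identifiers (hypothetical input layout).
    hits:  iterable of (query, subject, identity, coverage) tuples, with
           identity and coverage given as fractions between 0 and 1.
    Single linkage is an assumption made for this sketch only.
    """
    genes = list(genes)
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the cluster root, shortening the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for query, subject, identity, coverage in hits:
        # Strict inequalities mirror "identity > 0.5 and coverage > 0.8".
        if identity > min_identity and coverage > min_coverage:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q

    clusters = defaultdict(list)
    for g in genes:
        clusters[find(g)].append(g)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    example_genes = ["a", "b", "c", "d"]
    example_hits = [
        ("a", "b", 0.90, 0.95),  # passes both thresholds
        ("b", "c", 0.60, 0.85),  # passes both thresholds
        ("c", "d", 0.40, 0.99),  # identity too low
        ("a", "d", 0.90, 0.50),  # coverage too low
    ]
    print(cluster_genes(example_genes, example_hits))
```

With single linkage, two genes can end up in the same cluster through an intermediate gene even if their direct hit fails a threshold, so the number of clusters depends on the linkage rule as well as on the thresholds.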

Figure 1. Distribution of documented coronavirus strains in CoVdb by collection date (X axis) and host (indicated by color). The Y axis is the number of strains in each group. Red triangles point to the peaks in the number of documented human coronavirus strains over time.

Figure 2. (A) A partial view of the phylogenetic tree built from all coronavirus genomes documented in CoVdb. (B) A snapshot showing that users can search the phylogenetic tree in CoVdb by strain name. Both A and B center on the split between Bat_MN996532 and 2019-nCoV.

Main Functions

The genome browser (GBrowser) in CoVdb lists the analysis tracks (CLR, Pi, Tajima’s D, and Fst) after the gene segments. It is also equipped with general genome browser tools, such as zooming in/out and moving along the genome. In addition to basic information, CoVdb presents gene information mainly in terms of function annotation, subcellular localization, topology and protein structure. The search engine in CoVdb supports fuzzy search, BLAT and BLAST, and also allows searching by cellular location. For personalized analyses, CoVdb can provide gene links when users input a list of genome positions or gene accessions.

CoVdb has tools to facilitate specific uses in coronavirus research, such as tracing origins and designing vaccines or drugs. The tool “Protein” lists information related to the 3D structures of genes; users can view the amino acids of a coronavirus protein that overlap with its 3D structure counterpart and run online protein structure analyses. Users can also search with a protein sequence and view the mapped region in the target’s 3D structure. The tool “AlnBrowser” allows users to retrieve the multiple alignment of two or more strains at a given position and build a phylogenetic tree from the alignment (Figure 3). With the tool “PopAnalyzer”, users can run personalized online sliding window analyses, choosing the window size, the step size, the target region and the population genetic tests (Figure 4). “Phylo Tree” is a tool for viewing and searching phylogenetic trees built from genomic or proteomic sequences (Figure 2B). Users can also go to the GBrowser page by clicking the name of a strain.
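To make the roles of window size, step size and target region concrete, the following minimal sketch computes nucleotide diversity (Pi) in sliding windows over a small alignment. It is not the PopAnalyzer implementation. The function names, the 0-based half-open coordinates and the simplification of dividing by the full window length (rather than by the number of gap-free sites) are assumptions made for illustration.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def pairwise_diff(seq_a, seq_b):
    """Count sites where two aligned sequences differ.

    Sites with a gap or an ambiguous base in either sequence are skipped.
    """
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def sliding_window_pi(alignment, window_size, step_size, start=0, end=None):
    """Return (window_start, window_end, pi) for each full window.

    alignment: list of equal-length aligned sequences (uppercase strings).
    start/end: target region in 0-based, half-open coordinates (assumed).
    pi is the mean number of pairwise differences divided by the window
    length, a simplification that ignores gaps in the denominator.
    """
    if len(alignment) < 2:
        raise ValueError("at least two sequences are required")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")

    end = length if end is None else min(end, length)
    pairs = list(combinations(range(len(alignment)), 2))
    results = []
    # Only windows that fit entirely inside the target region are reported.
    for w_start in range(start, end - window_size + 1, step_size):
        w_end = w_start + window_size
        total = sum(
            pairwise_diff(alignment[i][w_start:w_end], alignment[j][w_start:w_end])
            for i, j in pairs
        )
        results.append((w_start, w_end, total / (len(pairs) * window_size)))
    return results


if __name__ == "__main__":
    toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
    for w_start, w_end, pi in sliding_window_pi(toy_alignment, window_size=4, step_size=4):
        print(w_start, w_end, round(pi, 4))
```

A step size smaller than the window size produces overlapping windows and a smoother curve, at the cost of more windows to compute.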

Figure 3. A snapshot showing the usage of “Aln Browser”, where users select the reference strain, the start position, the end position and the strains to include in the alignment. Clicking “Retrieve Alignment” shows the multiple alignment of the selected strains below. Clicking “Make Tree” builds a phylogenetic tree from the alignment and shows it at the bottom.

Figure 4. A snapshot of the tool “PopAnalyzer”.