Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2,767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6,795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator.
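On January 26, 2020, Steven Salzberg's group at Johns Hopkins University posted a preprint on bioRxiv titled "Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank". The study analyzes sequence contamination in the NCBI Reference Sequence Database (RefSeq), the GenBank database, and the NR non-redundant protein database, and finds that all three contain a large number of contaminated sequences.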
FIG. 1. How contamination occurs and how Conterminator detects it. a, DNA extraction from an organism (red) is imperfect and often introduces contamination by other species (violet). DNA sequencing then generates short reads that are assembled into longer contigs. Contaminated DNA is typically assembled into separate, small contigs, but sometimes is erroneously included in the same contigs as DNA from the source organism. Contigs may also be linked by scaffolding, which can produce scaffolds containing a mixture of different species. Final assemblies are submitted to GenBank, and higher-quality assemblies are entered in RefSeq. b, Conterminator detects contamination in proteins and nucleotide sequences across kingdoms; e.g., bacterial contaminants in plant genomes. The following describes the nucleotide contamination detection workflow. (1) We take taxonomically labeled input sequences and cut them into non-overlapping segments of length 1000 and extract a subset of k-mers. (2) We group the k-mers by sorting them and compute ungapped alignments between the first and all succeeding sequences per group. (3) We extract each region of the first sequence that has an alignment to other kingdoms that is longer than 100 nucleotides with a sequence identity greater than 90 %. We perform an exhaustive alignment of the input sequence segments against the multi-kingdom regions. We offset the alignment’s start and end position to the respective coordinates in the input sequence. (4) We reconstruct contig lengths within scaffolds by searching for the scaffold breakpoints (indicated by N characters in the DNA sequence) on the left and right side from the alignment start and end position. We predict that contamination is present if an alignment hits a contig that is shorter than 20 kb and that aligns to a different kingdom with a contig length longer than 20 kb.
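Steps (1) and (4) of the workflow can be illustrated with a minimal Python sketch. This is not Conterminator's implementation: the function names, the 0-based half-open coordinates, and the reading of the 20 kb rule as a comparison of contig lengths on the two sides of a cross-kingdom alignment are all assumptions made for illustration. The k-mer grouping and alignment steps (2) and (3) are omitted.

```python
from typing import List, Tuple

SEGMENT_LEN = 1000        # segment length from step (1)
CONTIG_CUTOFF = 20_000    # 20 kb threshold from step (4)


def cut_segments(seq: str, seg_len: int = SEGMENT_LEN) -> List[Tuple[int, str]]:
    """Cut a sequence into non-overlapping segments.

    Returns (offset, segment) pairs so that alignment coordinates found on a
    segment can later be shifted back into the input sequence.
    """
    return [(i, seq[i:i + seg_len]) for i in range(0, len(seq), seg_len)]


def contig_bounds(scaffold: str, aln_start: int, aln_end: int) -> Tuple[int, int]:
    """Find the contig that contains the alignment [aln_start, aln_end).

    Scaffold breakpoints are runs of N. We look for the nearest N to the left
    of the alignment start and to the right of the alignment end.
    Assumption: the aligned region itself contains no N.
    """
    s = scaffold.upper()
    left = s.rfind("N", 0, aln_start)      # -1 when no breakpoint on the left
    right = s.find("N", aln_end)           # -1 when no breakpoint on the right
    start = left + 1
    end = right if right != -1 else len(s)
    return start, end


def predicts_contamination(hit_contig_len: int, other_contig_len: int,
                           hit_kingdom: str, other_kingdom: str) -> bool:
    """Hypothetical encoding of the 20 kb rule: a short contig matching a
    long contig from a different kingdom is flagged as contamination."""
    return (hit_kingdom != other_kingdom
            and hit_contig_len < CONTIG_CUTOFF
            and other_contig_len > CONTIG_CUTOFF)


if __name__ == "__main__":
    # Toy scaffold: 5 kb contig, gap, 300 bp contig, gap, 30 kb contig.
    scaffold = "A" * 5000 + "N" * 100 + "C" * 300 + "N" * 100 + "G" * 30000
    start, end = contig_bounds(scaffold, 5150, 5350)
    print(start, end, end - start)
    print(predicts_contamination(end - start, 30000, "Viridiplantae", "Bacteria"))
```

In the toy scaffold, an alignment inside the 300 bp middle contig resolves to that contig alone, because the N runs on both sides stop the search. Flagging only short contigs reflects the observation in panel a that contaminating DNA usually assembles into separate, small contigs.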
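In RefSeq, only 52% of the contamination occurs in eukaryotic genomes. One possible reason is that the filtering used to decide which GenBank genomes are included in RefSeq is stricter. The number of species identified as contaminants (that is, the species causing the contamination) is 2,881 in RefSeq and 13,981 in GenBank. The main contaminant species are Homo sapiens, Saccharomyces cerevisiae, Stenotrophomonas maltophilia and Serratia marcescens (see Fig. 1).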
FIG. 2. Results of contamination within RefSeq. a Distribution of contaminated species in RefSeq across five kingdoms: Bacteria&Archaea (violet), Fungi (yellow), Metazoa (red), Viridiplantae (green) and other Eukaryotes (turquoise). b Sankey plot of the top 13 contaminated species in RefSeq. We show the taxonomic ranks domain, kingdom, phylum and species. Numbers shown above each taxonomic node indicate the total number of contaminated sequences. The tree uses the same color code for kingdoms as in a. c, d Same as a, b but for GenBank.
FIG. 3. Contamination in the reference genomes of Homo sapiens and Caenorhabditis elegans. a Alignment of Homo sapiens alternative scaffold NT_187580 of chromosome 10 against RefSeq. Chromosome 10 (NC_000010.11) aligns with 100 % sequence identity from position 1 to 169,918. The remaining 18,397 residues of NT_187580 align only to Acidithiobacillus thiooxidans at 98 % sequence identity. Shown are only 6 out of 15 alignments to Acidithiobacillus thiooxidans. b The X chromosome of Caenorhabditis elegans (NC_003284.9) aligns on the left and right flanks of the region from position 5,907,856 to 5,912,458. E. coli genomes align from 5,907,856 to 5,912,087, a total of 4,231 residues. Shown are only 3 out of 8,199 alignments to E. coli.
FIG. 4. Multiple sequence alignment of 31 spurious bacterial proteins encoded on short contaminated contigs. Shown here are 31 out of 185 spurious proteins from bacterial genomes. A majority of the sequences are 100 % identical. The only differing residues are highlighted in white. This “protein” is highly conserved across different bacterial phyla, suggesting it is likely a contaminant that has been erroneously translated as part of automated annotation procedures. The respective short contigs (< 1 kb) encoding these spurious proteins align with high sequence identity and coverage to the Ovis aries genome.
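The authors predict that 14,132 proteins represent contaminants, 7,359 of which are also present in the UniProt database. Most of these proteins (70.46%) come from eukaryotes, and 29.34% come from bacteria. More than 6,114 of the contaminant proteins come from Arthropoda, and 2,401 of those come from Trichonephila clavipes, the golden silk orb-weaver spider, which makes it the largest single source of contamination.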