
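[Original] Wait, what? More than 2 million sequences in GenBank are contaminated! And the protein contamination comes mainly from one spider?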

生信宝典  · WeChat public account  · Biology  · 2020-02-26 08:20

Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. (Related reading: Nature review: a 20,000-word systematic introduction to shotgun metagenomics experiments and analysis.) Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2,767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6,795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator.


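On January 26, 2020, Steven Salzberg's group at Johns Hopkins University posted a preprint on bioRxiv titled "Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank". The study analyzes sequence contamination in the NCBI Reference Sequence Database (RefSeq), the GenBank database, and the NR non-redundant protein database, and finds a large number of contaminated sequences in all three.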

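We have long known that submitted sequences carry some degree of contamination, and the causes are varied: reagents, laboratory materials, sample handling, or cross-contamination from pooled sequencing runs can all introduce foreign DNA. Errors of this kind then lead to further problems downstream.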

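To counter the problem, NCBI applies two filters to detect contaminated fragments. First, VecScreen detects synthetic sequences (vectors, adapters, linkers, primers, and so on). Second, BLAST alignments against common contaminants identify a broader range of contaminating sequences. Even with these filters, contamination still gets through, and detecting it remains challenging.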

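Because sequencing labs are run by people, DNA from Homo sapiens is a major source of contamination in genome projects. For example, a recent study showed that thousands of human DNA fragments can be found in draft bacterial genomes, many of which have been erroneously translated and annotated as proteins.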

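The authors present Conterminator (Figure 1b), a fast method that detects contamination in nucleotide and protein databases by computing local alignments across kingdoms. It builds on the linear-time all-against-all comparison algorithm introduced by Linclust and then performs an exhaustive alignment with MMseqs2. The authors apply the method to quantify the current level of contamination in the nucleotide databases GenBank and RefSeq and in the NR protein database.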
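The k-mer grouping idea behind the linear-time comparison can be illustrated with a toy sketch. This is not Conterminator's or Linclust's actual code: the k-mer size `K`, the number of k-mers kept per segment, the hash function, and all function names are assumptions made for illustration, and reverse-complement matches are ignored. Only the segment length of 1000 comes from the Figure 1 caption.

```python
from collections import defaultdict
import hashlib

SEGMENT_LEN = 1000      # segment length stated in the Figure 1 caption
K = 24                  # k-mer size: an assumption for this sketch
KMERS_PER_SEGMENT = 5   # k-mers kept per segment: also an assumption


def stable_hash(kmer: str) -> int:
    """Deterministic hash so the selected k-mers are reproducible across runs."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def segments(seq: str, length: int = SEGMENT_LEN):
    """Yield (offset, segment) pairs for non-overlapping segments."""
    for start in range(0, len(seq), length):
        yield start, seq[start:start + length]


def selected_kmers(segment: str, k: int = K, m: int = KMERS_PER_SEGMENT):
    """Keep the m distinct k-mers with the smallest hash values."""
    kmers = {segment[i:i + k] for i in range(len(segment) - k + 1)}
    return sorted(kmers, key=stable_hash)[:m]


def cross_kingdom_candidates(records):
    """records: iterable of (seq_id, kingdom, sequence).

    Returns pairs ((seq_id, offset), (seq_id, offset)) of segments that share
    a selected k-mer but carry different kingdom labels. Each group member is
    compared only with the first member of its group, so the number of
    comparisons grows linearly with the number of selected k-mers.
    """
    groups = defaultdict(list)
    for seq_id, kingdom, seq in records:
        for offset, seg in segments(seq):
            for kmer in selected_kmers(seg):
                groups[kmer].append((seq_id, offset, kingdom))

    candidates = set()
    for members in groups.values():
        first = members[0]
        for other in members[1:]:
            if other[2] != first[2]:
                candidates.add((first[:2], other[:2]))
    return candidates
```

In a real pipeline the candidate pairs would only be a prefilter: each pair would still need an actual alignment to confirm length and identity.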


How contamination arises and how Conterminator works

As shown in Figure 1.


Figure 1

FIG. 1. How contamination occurs and how Conterminator detects it. a, DNA extraction from an organism (red) is imperfect and often introduces contamination by other species (violet). DNA sequencing then generates short reads that are assembled into longer contigs. Contaminated DNA is typically assembled into separate, small contigs, but sometimes is erroneously included in the same contigs as DNA from the source organism. Contigs may also be linked by scaffolding, which can produce scaffolds containing a mixture of different species. Final assemblies are submitted to GenBank, and higher-quality assemblies are entered in RefSeq. b, Conterminator detects contamination in proteins and nucleotide sequences across kingdoms; e.g., bacterial contaminants in plant genomes. The following describes the nucleotide contamination detection workflow. (1) We take taxonomically labeled input sequences and cut them into non-overlapping segments of length 1000 and extract a subset of k-mers. (2) We group the k-mers by sorting them and compute ungapped alignments between the first and all succeeding sequences per group. (3) We extract each region of the first sequence that has an alignment to other kingdoms that is longer than 100 nucleotides with a sequence identity greater than 90 %. We perform an exhaustive alignment of the input sequence segments against the multi-kingdom regions. We offset the alignment’s start and end position to the respective coordinates in the input sequence. (4) We reconstruct contig lengths within scaffolds by searching for the scaffold breakpoints (indicated by N characters in the DNA sequence) on the left and right side from the alignment start and end position. We predict that contamination is present if an alignment hits a contig that is shorter than 20 kb that aligns to a different kingdom with a contig length longer than 20 kb.
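Step (4) of the caption can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the tool's implementation: coordinates are 0-based and end-exclusive, gaps are uppercase `N` characters, and the function names and the two threshold constants (set to the 20 kb from the caption) are hypothetical.

```python
MAX_CONTAMINANT_CONTIG = 20_000  # "shorter than 20 kb" in the caption
MIN_OTHER_CONTIG = 20_000        # "longer than 20 kb" in the caption


def contig_bounds(scaffold: str, aln_start: int, aln_end: int):
    """Return (start, end) of the contig that surrounds an alignment.

    The contig is extended left from aln_start and right from aln_end
    until an N (scaffold breakpoint) or the end of the scaffold is reached.
    """
    left_gap = scaffold.rfind("N", 0, aln_start)   # -1 if no gap on the left
    right_gap = scaffold.find("N", aln_end)        # -1 if no gap on the right
    start = left_gap + 1
    end = right_gap if right_gap != -1 else len(scaffold)
    return start, end


def is_predicted_contamination(contig_len: int, contig_kingdom: str,
                               other_contig_len: int, other_kingdom: str) -> bool:
    """Apply the caption's rule: a short contig that aligns to a long contig
    from a different kingdom is predicted to be contamination."""
    return (contig_kingdom != other_kingdom
            and contig_len < MAX_CONTAMINANT_CONTIG
            and other_contig_len > MIN_OTHER_CONTIG)


# Toy scaffold: three contigs separated by runs of N.
scaffold = "A" * 50 + "N" * 10 + "C" * 30 + "N" * 5 + "G" * 40
start, end = contig_bounds(scaffold, 65, 80)
print(start, end, end - start)  # 60 90 30
```

The intuition behind the rule is that a short contig is more likely to be a stray fragment, while a long contig from the other kingdom is more likely to carry the correct label.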


In GenBank, more than 95% of the contamination occurs in eukaryotic genomes

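Figure 2 summarizes the contamination that Conterminator found in RefSeq (Figure 2a, b) and GenBank (Figure 2c, d). Conterminator reports 114,035 contaminated sequences in RefSeq and 2,161,746 in GenBank, affecting 2,767 and 6,795 species respectively. In GenBank, more than 95% of the contamination occurs in eukaryotic genomes. Eukaryotic genomes are larger and more repeat-rich than prokaryotic ones, and many of the smaller contigs in their assemblies were found to be contaminated.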

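In RefSeq, only 52% of the contamination occurs in eukaryotic genomes. One likely reason is that the filtering used to decide which GenBank genomes enter RefSeq is stricter. The number of species identified as contaminants (that is, the species causing the contamination) is 2,881 in RefSeq and 13,981 in GenBank. The main contaminating species are Homo sapiens, Saccharomyces cerevisiae, Stenotrophomonas maltophilia, and Serratia marcescens (see Figure 1).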

Figure 2

FIG. 2. Results of contamination within RefSeq. a Distribution of contaminated species in RefSeq across five kingdoms: Bacteria & Archaea (violet), Fungi (yellow), Metazoa (red), Viridiplantae (green) and other Eukaryotes (turquoise). b Sankey plot of the top 13 contaminated species in RefSeq. We show the taxonomic ranks domain, kingdom, phylum and species. Numbers shown above each taxonomic node indicate the total number of contaminated sequences. The tree uses the same color code for kingdoms as in a. c, d Same as a, b but for GenBank.


A. thiooxidans (Acidithiobacillus thiooxidans) in the human genome sequence

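The human reference genome (currently GRCh38) consists of chromosomal scaffolds, unplaced scaffolds, and “alternate” scaffolds. In NT_187580, an alternate scaffold of chromosome 10 in GRCh38.p13 that is 188,315 base pairs long, the authors found that positions 169,917–188,315 match Acidithiobacillus thiooxidans (Figure 3a), whereas the first part (positions 1–169,918) matches human chromosome 10 perfectly. The last ~18 kb of this human alternate scaffold therefore appears to be bacterial sequence. (Related reading: NGS basics: reference genomes and gene annotation files)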
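The size of the suspect region follows directly from the coordinates reported in the Figure 3 caption. A small worked example, where the helper name `foreign_tail` is hypothetical and introduced only for this illustration:

```python
def foreign_tail(scaffold_length: int, host_aligned_end: int):
    """Return (tail_length, tail_fraction) for the part of a scaffold that
    lies beyond the last position aligned to the host chromosome."""
    if scaffold_length <= 0:
        raise ValueError("scaffold_length must be positive")
    if not 0 <= host_aligned_end <= scaffold_length:
        raise ValueError("host_aligned_end must lie within the scaffold")
    tail_length = scaffold_length - host_aligned_end
    return tail_length, tail_length / scaffold_length


# NT_187580: 188,315 bp in total, chromosome 10 aligns up to position 169,918.
tail_length, tail_fraction = foreign_tail(188_315, 169_918)
print(tail_length)             # 18397
print(f"{tail_fraction:.1%}")  # 9.8%
```

So roughly a tenth of this alternate scaffold aligns only to the bacterium.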

Figure 3

FIG. 3. Contamination in the reference genomes of Homo sapiens and Caenorhabditis elegans. a Alignment of Homo sapiens alternative scaffold NT_187580 of chromosome 10 against RefSeq. Chromosome 10 (NC_000011.10) aligns with 100 % sequence identity from position 1 to 169,918. The remaining 18,397 residues of NT_187580 align only to Acidithiobacillus thiooxidans at 98 % sequence identity. Shown are only 6 out of 15 alignments to Acidithiobacillus thiooxidans. b The X chromosome of Caenorhabditis elegans NC_003284.9 aligns on the left and right flanking position around 5,907,856 until 5,912,458. E. coli genomes align from 5,907,856 to 5,912,087, a total of 4,231 residues. Shown are only 3 out of 8,199 alignments to E. coli.


Contamination in the protein database comes mainly from one spider

The authors found that 19.4% of the contaminated RefSeq contigs carry protein annotations, encoding 47,943 proteins in total.

Figure 4

FIG. 4. Multiple sequence alignment of 31 spurious bacterial proteins encoded on short contaminated contigs. Shown here are 31 out of 185 spurious proteins from bacterial genomes. A majority of the sequences are 100 % identical. The only differing residues are highlighted in white. This highly conserved “protein” is conserved across different bacterial phyla, suggesting it is likely a contaminant that has been erroneously translated as part of automated annotation procedures. The respective short contigs (< 1 kb) encoding these spurious proteins align with high sequence identity and coverage to the Ovis aries genome.
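The signal described in the caption, one "protein" that is identical across unrelated bacterial phyla, can be screened for with a simple grouping step. The sketch below is a toy illustration rather than the authors' procedure: it requires exact sequence identity (the figure also shows near-identical copies), and the function name, the `min_phyla` parameter, and the example records are made up.

```python
from collections import defaultdict


def sequences_shared_across_phyla(proteins, min_phyla: int = 2):
    """proteins: iterable of (protein_id, phylum, amino_acid_sequence).

    Returns {sequence: {"ids": [...], "phyla": {...}}} for every exact
    sequence that is annotated in at least `min_phyla` different phyla.
    """
    by_sequence = defaultdict(lambda: {"ids": [], "phyla": set()})
    for protein_id, phylum, sequence in proteins:
        entry = by_sequence[sequence]
        entry["ids"].append(protein_id)
        entry["phyla"].add(phylum)
    return {seq: entry for seq, entry in by_sequence.items()
            if len(entry["phyla"]) >= min_phyla}


# Made-up records for illustration only.
toy_proteins = [
    ("p1", "Proteobacteria", "MKTLLVAG"),
    ("p2", "Firmicutes", "MKTLLVAG"),
    ("p3", "Actinobacteria", "MKTLLVAG"),
    ("p4", "Firmicutes", "MSSQRRLE"),
]
suspicious = sequences_shared_across_phyla(toy_proteins)
print(sorted(suspicious["MKTLLVAG"]["phyla"]))
# ['Actinobacteria', 'Firmicutes', 'Proteobacteria']
```

A hit from such a screen is only a lead: genuinely conserved proteins also exist, so the encoding contigs still need to be checked against other genomes, as the authors did with Ovis aries.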


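The authors predict that 14,132 proteins are contaminants, 7,359 of which are also present in the UniProt database. Most of these proteins (70.46%) come from eukaryotes, and 29.34% come from bacteria. More than 6,114 contaminant proteins come from the phylum Arthropoda, and 2,401 of those come from Trichonephila clavipes, the golden silk orb weaver spider, which makes it the single largest source of contamination.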

(Female on the left, male on the right)


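Comparative analysis further shows that this spider's genome is itself contaminated by Gemmobacter lutimaris sp YJ-T1-11, a marine bacterium first isolated only in 2019. It hardly gets more tangled than that!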

Commentary

Contamination is a long-standing problem, and both the number and the proportion of contaminated sequences have grown over time.

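“In 2012 the contamination rate we detected for bacteria and archaea was still only 2%-3%,” said Lipman, director of NCBI, “but it climbed rapidly after that. The rate was already close to 10% in 2014, and by 2015 it had reached 23%.”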

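Scientists at the Sanger Institute have also found that stray bacteria in DNA extraction kits, chemical reagents, and the laboratory environment readily cause contamination and distort the results of microbiome analyses.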

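The researchers found that a control sample, which should contain a single bacterial species when there is no contamination, sometimes showed as many as 270 different bacteria. Compared with high-biomass samples such as stool, low-biomass samples from blood or lung are especially prone to contamination.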

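Tools for removing contamination already exist. DeconSeq, developed by the Edwards group, asks the user to specify the contaminant species and then automatically removes the sequences belonging to that species from the genome assembly.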
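Once a set of contaminated sequence IDs is known, the removal step itself is simple. The sketch below is a generic standard-library illustration and not DeconSeq's interface: the function names, the rule that the ID is the first whitespace-delimited token of the header, and the example records are all assumptions.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_flagged(handle, flagged_ids):
    """Yield only the records whose ID is not in flagged_ids.

    The ID is taken to be the first whitespace-delimited token of the header.
    """
    for header, sequence in read_fasta(handle):
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id not in flagged_ids:
            yield header, sequence


# Made-up assembly with one flagged contig.
assembly = io.StringIO(">c1 host contig\nACGT\nAC\n>c2 suspect\nGGGG\n>c3\nTT\n")
for header, sequence in drop_flagged(assembly, {"c2"}):
    print(header, sequence)
# c1 host contig ACGTAC
# c3 TT
```

Dropping whole records only handles contaminants that sit on their own contigs. Contamination embedded inside a longer scaffold, as in the human alternate scaffold above, would need trimming by coordinates instead.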

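Still, I have always believed that prevention matters far more than cleaning up afterwards: careful, well-controlled bench work is a much sounder approach than post-processing the data with a tool whose reliability is unknown.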

References

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank (doi: https://doi.org/10.1101)
