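[Tool] isolateR: an R package for automated processing of Sanger sequencing data, taxonomic classification, and generation of microbial strain libraries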

Table of contents

    • Introduction
    • Code
    • Example
      • Step 1: isoQC - Automated quality trimming of sequences
      • Step 2: isoTAX - Assign taxonomy
      • Step 3: isoLIB - Generate strain library
    • References

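Introduction

Sanger sequencing of taxonomic marker genes (such as 16S, 18S, ITS, rpoB, and cpn60) is the leading method for identifying a broad range of microorganisms, including bacteria, archaea, and fungi. In practice, though, processing the sequence data by hand and working around the limits of conventional BLAST searches slows down the construction of strain libraries, which are essential for cataloging microbial diversity and discovering new species.

isolateR tackles these problems with a standardized, scalable three-step workflow: (1) automated batch processing of Sanger sequence files; (2) taxonomic classification by global alignment against type strain databases, following the latest international nomenclature standards; and (3) simple creation of strain libraries, including handling of clonal isolates, customizable sequence dereplication thresholds, and merging of data from multiple sequencing runs into a single library. The package also produces interactive HTML outputs that make the data easier to explore and analyze. In addition, in silico benchmarking on two comprehensive human gut genome catalogues (IMGG and the Hadza hunter-gatherer population) shows that isolateR can uncover and catalog fine-grained microbial diversity, and it argues for more targeted, finer-grained sampling within individual hosts to reach the highest possible strain-level resolution when building strain libraries.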

Abstract

Motivation

Sanger sequencing of taxonomic marker genes (e.g. 16S/18S/ITS/rpoB/cpn60) represents the leading method for identifying a wide range of microorganisms including bacteria, archaea, and fungi. However, the manual processing of sequence data and limitations associated with conventional BLAST searches impede the efficient generation of strain libraries essential for cataloging microbial diversity and discovering novel species.

Results

isolateR addresses these challenges by implementing a standardized and scalable three-step pipeline that includes: (1) automated batch processing of Sanger sequence files, (2) taxonomic classification via global alignment to type strain databases in accordance with the latest international nomenclature standards, and (3) straightforward creation of strain libraries and handling of clonal isolates, with the ability to set customizable sequence dereplication thresholds and combine data from multiple sequencing runs into a single library. The tool's user-friendly design also features interactive HTML outputs that simplify data exploration and analysis. Additionally, in silico benchmarking done on two comprehensive human gut genome catalogues (IMGG and Hadza hunter-gatherer populations) showcases the proficiency of isolateR in uncovering and cataloging the nuanced spectrum of microbial diversity, advocating for a more targeted and granular exploration within individual hosts to achieve the highest strain-level resolution possible when generating culture collections.

Code

https://github.com/bdaisley/isolateR

Example

Install the package

#Install BiocManager if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

#Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")
  
#Install the required Bioconductor dependencies
BiocManager::install(c("Biostrings", "msa", "sangeranalyseR", "sangerseqR"), update=FALSE)

#Install isolateR
devtools::install_github("bdaisley/isolateR")
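Before moving on, it may be worth confirming that the dependencies and isolateR itself can actually be loaded. The following is a minimal check using only base R; the package vector simply repeats the names from the install commands above.

```r
# Packages named in the install commands above
pkgs <- c("Biostrings", "msa", "sangeranalyseR", "sangerseqR", "isolateR")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```

If any package is reported as missing, rerunning the corresponding install command is usually the first thing to try.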

Step 1: isoQC - Automated quality trimming of sequences

library(isolateR)

#Set the path of the directory containing the .ab1 files. Here, the example dataset bundled with the package is used
fpath1 <- system.file("extdata/abif_examples/rocket_salad", package = "isolateR")

isoQC.S4 <- isoQC(input=fpath1,
                  export_html=TRUE,
                  export_csv=TRUE,
                  export_fasta=TRUE,
                  verbose=FALSE,
                  min_phred_score = 20,
                  min_length = 200,
                  sliding_window_cutoff = NULL,
                  sliding_window_size = 15,
                  date=NULL)
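To build some intuition for what parameters such as `min_phred_score` and `sliding_window_size` control, here is a rough sketch of a generic sliding-window quality trim on a vector of Phred scores. It is not isolateR's internal algorithm, which may differ in its details; the helper `trim_by_window` and the toy scores are hypothetical and exist only for illustration.

```r
# Hypothetical helper (not part of isolateR): return the span from the first
# to the last window whose mean Phred score is at or above a minimum.
trim_by_window <- function(phred, window_size = 15, min_phred = 20) {
  n <- length(phred)
  no_region <- c(start = NA_integer_, end = NA_integer_)
  if (n < window_size) return(no_region)

  # Mean quality of every full window, indexed by the window's start position
  starts <- seq_len(n - window_size + 1)
  window_mean <- vapply(
    starts,
    function(i) mean(phred[i:(i + window_size - 1)]),
    numeric(1)
  )

  passing <- which(window_mean >= min_phred)
  if (length(passing) == 0) return(no_region)

  # Keep everything from the start of the first passing window
  # to the end of the last passing window
  c(start = min(passing), end = as.integer(max(passing) + window_size - 1))
}

# Toy Phred scores: noisy start, clean middle, noisy tail (made-up values)
phred <- c(rep(8, 20), rep(40, 60), rep(12, 20))
region <- trim_by_window(phred)
kept_length <- unname(region["end"] - region["start"] + 1)
```

In a scheme like this, the kept length would then be compared against a minimum such as `min_length = 200`; the toy read here is far shorter than that and would be rejected.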

Step 2: isoTAX - Assign taxonomy

#Specify location of CSV output from 'isoQC' step containing quality trimmed sequences
fpath2 <- file.path(fpath1, "isolateR_output/01_isoQC_trimmed_sequences_PASS.csv")

isoTAX.S4 <- isoTAX(input=fpath2,
                    export_html=TRUE,
                    export_csv=TRUE,
                    db="16S",
                    quick_search=TRUE,
                    phylum_threshold=75.0,
                    class_threshold=78.5,
                    order_threshold=82.0,
                    family_threshold=86.5,
                    genus_threshold=96.5,
                    species_threshold=98.7)
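The rank thresholds in the call above are percent-identity cutoffs. One plausible reading is that a query is only trusted down to the deepest rank whose cutoff its best hit meets. The sketch below illustrates that idea in base R. The helper `deepest_rank` is hypothetical, and exactly how isoTAX applies these values internally is an assumption here.

```r
# Thresholds (percent identity) copied from the isoTAX call above
rank_thresholds <- c(phylum = 75.0, class = 78.5, order = 82.0,
                     family = 86.5, genus = 96.5, species = 98.7)

# Hypothetical helper: deepest rank whose threshold the identity meets.
# Relies on the thresholds being listed from shallowest to deepest rank.
deepest_rank <- function(pct_identity, thresholds = rank_thresholds) {
  met <- names(thresholds)[pct_identity >= thresholds]
  if (length(met) == 0) "unclassified" else met[length(met)]
}

# Made-up identities for four imaginary isolates
identities <- c(99.2, 97.0, 85.0, 70.0)
assigned <- vapply(identities, deepest_rank, character(1))
```

Under this reading, an isolate whose best hit falls below the species cutoff but above the genus cutoff would be a candidate novel species within a known genus.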

Step 3: isoLIB - Generate strain library

#Specify location of CSV output from isoTAX in Step 2
fpath3 <- file.path(fpath1, "isolateR_output/02_isoTAX_results.csv")

isoLIB.S4 <- isoLIB(input=fpath3,
                    old_lib_csv=NULL,
                    group_cutoff=0.995,
                    include_warnings=FALSE)
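To make the meaning of a cutoff like `group_cutoff=0.995` more concrete, the sketch below groups toy sequences greedily by pairwise identity. It assumes equal-length, already aligned sequences and is only an illustration of threshold-based dereplication, not isolateR's actual procedure. The helpers `pairwise_identity` and `group_sequences` and the toy sequences are hypothetical.

```r
# Fraction of identical positions between two equal-length, pre-aligned sequences
pairwise_identity <- function(a, b) {
  a_chars <- strsplit(a, "", fixed = TRUE)[[1]]
  b_chars <- strsplit(b, "", fixed = TRUE)[[1]]
  stopifnot(length(a_chars) == length(b_chars))
  mean(a_chars == b_chars)
}

# Greedy grouping: each sequence joins the first representative it matches
# at or above the cutoff, otherwise it becomes a new representative
group_sequences <- function(seqs, cutoff = 0.995) {
  reps <- character(0)
  group <- integer(length(seqs))
  for (i in seq_along(seqs)) {
    hit <- 0L
    for (j in seq_along(reps)) {
      if (pairwise_identity(seqs[i], reps[j]) >= cutoff) {
        hit <- j
        break
      }
    }
    if (hit == 0L) {
      reps <- c(reps, seqs[i])
      hit <- length(reps)
    }
    group[i] <- hit
  }
  group
}

# Toy 400-base sequences: a reference, a 1-mismatch and a 3-mismatch variant
base_seq <- strrep("ACGT", 100)
one_snp <- base_seq
substr(one_snp, 1, 1) <- "T"
three_snp <- base_seq
for (pos in c(1, 5, 9)) substr(three_snp, pos, pos) <- "T"

groups <- group_sequences(c(base_seq, one_snp, three_snp))
```

With 400 bases, one mismatch gives an identity of 0.9975 and stays in the reference group, while three mismatches give 0.9925 and start a new group. A stricter or looser cutoff changes how many distinct strains the library ends up with.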

References
