MSM_metagenomics (Part 5): Co-presence Analysis

Follow the 生信学习者 series on:

  • WeChat official account: 生信学习者
  • Xiaohongshu: 生信学习者
  • Zhihu: 生信学习者
  • CSDN: 生信学习者2

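Introduction

This tutorial uses a Python script to analyze the co-presence patterns of multiple microbial taxa (e.g., strains, species, or genera).

Data

Download the data from the following link: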

Python packages required

Co-presence pattern analysis

Co-presence analysis with step_curve_drawer.py

  • Code
#!/usr/bin/env python

"""
NAME: step_curve_drawer.py
DESCRIPTION: This script analyzes the co-presence of multiple species in different categories
             by drawing step curves.
"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sys
import argparse
import textwrap

def read_args(args):
    # This function is to parse arguments

    parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,
                                    description = textwrap.dedent('''\
                                     This program draws step curves to analyze the co-presence of multiple species in different groups.
                                     '''),
                                    epilog = textwrap.dedent('''\
                                    examples:step_curve_drawer.py --abundance_table <abundance_table_w_md.tsv> --variable <variable_name> --species_number <nr_sps> --output <output.svg>
                                    '''))
    parser.add_argument('--abundance_table',
                        nargs = '?',
                        help = 'Input the MetaPhlAn4 abundance table containing only the group of species whose co-presence you want to analyze, with metadata rows merged in.',
                        type = str,
                        default = None)

    parser.add_argument('--variable',
                        nargs = '?',
                        help = 'Specify the header of the variable in the metadata table you want to assess. For example, \
                        the [Diet] variable column has three categories - [vegan]/[Flexitarian]/[Omnivore].',
                        type = str,
                        default = None)

    parser.add_argument('--minimum_abundance',
                        nargs = '?',
                        help = 'Specify the minimum abundance used for determining presence. Range: [0, 100]; default: 0.0.',
                        type = float,
                        default = 0.0)

    parser.add_argument('--species_number',
                        nargs = '?',
                        help = 'Specify the total number of multiple species in the analysis.',
                        type = int)


    parser.add_argument('--output',
                        nargs = '?',
                        help = 'Specify the output figure name.',
                        type = str,
                        default = None)
    parser.add_argument('--palette',
                        nargs = '?',
                        help = 'Input a tab-delimited mapping file where keys (first column) are group names and values (second column) are color codes.',
                        type = str,
                        default = None)

    return vars(parser.parse_args())

class PandasDealer:
    """
    This is an object for dealing pandas dataframe.
    """

    def __init__(self, df_):

        self.df_ = df_

    def read_csv(self):
        # Read a tab-delimited file into a pandas dataframe.

        return pd.read_csv(self.df_, sep = '\t', index_col = False, low_memory=False)

    def rotate_df(self):
        # Rotate the MetaPhlAn-style table (samples as columns) into a tidy dataframe (samples as rows) to ease lookups.

        df_ = self.read_csv()
        df_rows_lists = df_.values.tolist()
        rotated_df_dict = {df_.columns[0]: df_.columns[1:]}
        for i in df_rows_lists:
            rotated_df_dict[i[0]] = i[1:]

        rotated_df = pd.DataFrame.from_dict(rotated_df_dict)
        
        return rotated_df

class CopEstimator:

    def __init__(self, sub_df_md):
        self.sub_df_md = sub_df_md # sub_df_md: path to an abundance table containing only the group of species to include in the co-presence analysis.

    def make_copresense_df(self, factor, total_species_nr, threshold = 0.0):
        # factor: the metadata variable whose categories are compared.
        # total_species_nr: the total number of species included in the co-presence analysis.
        # threshold: abundance above which a species counts as present.


        rotated_df = PandasDealer(self.sub_df_md)
        rotated_df = rotated_df.rotate_df()
        cols = rotated_df.columns[-total_species_nr: ].to_list() 
        categories = list(set(rotated_df[factor].to_list()))
        

        copresense = []
        cate_name = []
        ratios = []
        for c in categories:
            sub_df = rotated_df[rotated_df[factor] == c]
            species_group_df = sub_df[cols]
            species_group_df = species_group_df.apply(pd.to_numeric)
            species_group_df['total'] = species_group_df[cols].gt(threshold).sum(axis=1)
            for i in range(1, total_species_nr + 1):
                ratio = count_non_zero_rows(species_group_df, i)
                copresense.append(i)
                cate_name.append(c)
                ratios.append(ratio)

        return pd.DataFrame.from_dict({"copresense": copresense,
                                        factor: cate_name,
                                        "percentage": ratios})

def count_non_zero_rows(df_, nr):
    total_rows = len(df_.index)
    
    sub_df = df_[df_['total'] >= nr]
    ratio = len(sub_df.index)/total_rows

    return ratio
    

class VisualTools:
    def __init__(self, processed_df, factor):
        self.processed_df = processed_df
        self.factor = factor

    def step_curves(self, opt_name, palette = None):
        categories = list(set(self.processed_df[self.factor].to_list()))
        if palette:
            palette_dict = {i.rstrip().split('\t')[0]: i.rstrip().split('\t')[1] for i in open(palette).readlines()}
            for c in categories:
                sub_df = self.processed_df[self.processed_df[self.factor] == c]
                plt.step(sub_df["percentage"]*100, sub_df["copresense"], label = c, color = palette_dict[c])
        else:
            for c in categories:
                sub_df = self.processed_df[self.processed_df[self.factor] == c]
                plt.step(sub_df["percentage"]*100, sub_df["copresense"], label = c)

        plt.title("Number of species in an individual if present")
        plt.xlabel("Percentage")
        plt.ylabel("Co-presence")
        plt.legend(title = self.factor)
        plt.savefig(opt_name, bbox_inches = "tight")


if __name__ == "__main__":

    pars = read_args(sys.argv)
    cop_obj = CopEstimator(pars['abundance_table'])
    p_df = cop_obj.make_copresense_df(pars['variable'], pars['species_number'], pars['minimum_abundance'])
    vis_obj = VisualTools(p_df, pars['variable'])
    vis_obj.step_curves(pars['output'], palette = pars['palette'])
  • Usage
usage: step_curve_drawer.py [-h] [--abundance_table [ABUNDANCE_TABLE]] [--variable [VARIABLE]] [--minimum_abundance [MINIMUM_ABUNDANCE]] [--species_number [SPECIES_NUMBER]] [--output [OUTPUT]]
                            [--palette [PALETTE]]

This program draws step curves to analyze the co-presence of multiple species in different groups.

optional arguments:
  -h, --help            show this help message and exit
  --abundance_table [ABUNDANCE_TABLE]
                        Input the MetaPhlAn4 abundance table containing only the group of species whose co-presence you want to analyze, with metadata rows merged in.
  --variable [VARIABLE]
                        Specify the header of the variable in the metadata table you want to assess. For example, the [Diet] variable column has three categories - [vegan]/[Flexitarian]/[Omnivore].
  --minimum_abundance [MINIMUM_ABUNDANCE]
                        Specify the minimum abundance used for determining presence. Range: [0, 100]; default: 0.0.
  --species_number [SPECIES_NUMBER]
                        Specify the total number of multiple species in the analysis.
  --output [OUTPUT]     Specify the output figure name.
  --palette [PALETTE]   Input a tab-delimited mapping file where keys (first column) are group names and values (second column) are color codes.

examples:

python step_curve_drawer.py --abundance_table <abundance_table_w_md.tsv> --variable <variable_name> --species_number <nr_sps> --output <output.svg>
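
The --palette option expects a two-column, tab-delimited file. In the script's step_curves method, the first column is read as the group name and the second as the color code. The following is a minimal sketch of such a file and of how it could be parsed. The group names and hex colors are assumptions made for illustration, and load_palette is a hypothetical helper that is not part of the script.

```python
import os
import tempfile


def load_palette(path):
    """Parse a two-column, tab-delimited file into {group_name: color}."""
    palette = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            group, color = line.split("\t")[:2]
            palette[group] = color
    return palette


# Hypothetical color map: group names must match the values found in the
# column passed via --variable; the hex codes are arbitrary.
example = "MSM\t#1f77b4\nNon-MSM\t#ff7f0e\n"
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write(example)
    tmp_path = tmp.name

palette = load_palette(tmp_path)
os.remove(tmp_path)
print(palette)
```

Every category of the chosen variable needs an entry in the file, because the script looks up the color of each category by name.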

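To demonstrate step_curve_drawer.py, we plot the co-presence pattern of eight lineages of Segatella copri (formerly Prevotella copri) from a MetaPhlAn relative abundance table: ./data/mpa4_pcopri_abundances_md.tsv. The samples come from MSM and Non-MSM populations. MSM and Non-MSM samples are drawn in custom colors assigned by a color map file: ./data/copresence_color_map.tsv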
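
Judging from rotate_df and the columns[-total_species_nr:] slice in the script, the input table appears to have one column per sample, with metadata rows first and the taxon rows last. The sketch below builds a tiny table in that layout and transposes it into one row per sample. All sample IDs, lineage names, and values are invented, and the layout itself is an inference from the code rather than a documented specification.

```python
import io

import pandas as pd

# Hypothetical MetaPhlAn-style table: one column per sample, metadata rows
# first, taxon rows last. All names and values are made up for illustration.
raw = (
    "sample_id\tS1\tS2\tS3\n"
    "sexual_orientation\tMSM\tNon-MSM\tMSM\n"
    "lineage_A\t0.5\t0.0\t1.2\n"
    "lineage_B\t0.0\t0.0\t0.3\n"
)
wide = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

# Transpose so that each row is a sample and each column a variable or taxon.
tidy = wide.T.rename_axis("sample_id").reset_index()
tidy.columns.name = None

n_taxa = 2  # plays the role of --species_number
taxa_cols = tidy.columns[-n_taxa:].to_list()

# Abundances arrive as strings because they share columns with metadata.
tidy[taxa_cols] = tidy[taxa_cols].apply(pd.to_numeric)
print(tidy)
```

Because the taxa are selected by position, --species_number has to equal the number of taxon rows at the bottom of the table. A smaller value would silently drop taxa, and a larger one would pull metadata columns into the numeric conversion.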

python step_curve_drawer.py \
  --abundance_table ./data/mpa4_pcopri_abundances_md.tsv \
  --variable sexual_orientation \
  --species_number 8 \
  --palette ./data/copresence_color_map.tsv \
  --output copresence_plot.png
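
Each point on a step curve is the share of samples in a group that carry at least k of the selected taxa above the abundance threshold. The sketch below reproduces that calculation on a toy table, assuming the data are already in one-row-per-sample form. The group labels, lineage names, and abundances are invented, and the column names are illustrative rather than the ones the script uses internally.

```python
import pandas as pd

# Toy tidy table: one row per sample. Labels and abundances are invented.
df = pd.DataFrame({
    "group": ["MSM", "MSM", "MSM", "MSM", "Non-MSM", "Non-MSM"],
    "lineage_A": [0.5, 1.0, 0.0, 2.0, 0.0, 0.1],
    "lineage_B": [0.2, 0.0, 0.0, 0.4, 0.0, 0.0],
    "lineage_C": [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
})
taxa = ["lineage_A", "lineage_B", "lineage_C"]
threshold = 0.0  # same role as --minimum_abundance

# Number of taxa present (abundance > threshold) in each sample.
df["n_present"] = df[taxa].gt(threshold).sum(axis=1)

records = []
for group, sub in df.groupby("group"):
    for k in range(1, len(taxa) + 1):
        # Percentage of samples in this group carrying at least k taxa.
        records.append({
            "group": group,
            "copresence": k,
            "percentage": 100 * (sub["n_present"] >= k).mean(),
        })
result = pd.DataFrame(records)
print(result)
```

In this toy example, 75% of the MSM samples carry at least one lineage, 50% carry at least two, and 25% carry all three. Among the Non-MSM samples, 50% carry at least one lineage and none carry two or more. The percentage never increases with k, which is why each curve steps in one direction only.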