文章MSM_metagenomics(五):共现分析

欢迎大家关注全网生信学习者系列:

  • WX公zhong号:生信学习者
  • Xiao hong书:生信学习者
  • 知hu:生信学习者
  • CDSN:生信学习者2

介绍

本教程是使用一个Python脚本来分析多种微生物(即strains, species, genus等)的共现模式。

数据

大家通过以下链接下载数据:

Python packages required

Co-presence pattern analysis

使用step_curve_drawer.py 做共线性分析

  • 代码
python 复制代码
#!/usr/bin/env python

"""
NAME: step_curve_drawer.py
DESCRIPTION: This script is to analyze the co-prsense of multiple species in different categories,
             by drawing step curves.
"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sys
import argparse
import textwrap

def read_args(args):
    # This function is to parse arguments

    parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,
                                    description = textwrap.dedent('''\
                                     This program is to do draw step curves to analyze co-presense of multiple species in different groups.
                                     '''),
                                    epilog = textwrap.dedent('''\
                                    examples:step_curve_drawer.py --abundance_table <abundance_table_w_md.tsv> --variable <variable_name> --species_number <nr_sps> --output <output.svg>
                                    '''))
    parser.add_argument('--abundance_table',
                        nargs = '?',
                        help = 'Input the MetaPhlAn4 abundance table which contains only a group of species one wants to analyze their co-presense state, with metadata being wedged.',
                        type = str,
                        default = None)

    parser.add_argument('--variable',
                        nargs = '?',
                        help = 'Specify the header of the variable in the metadata table you want to assess. For example, \
                        [Diet] variable columns has three categries - [vegan]/[Flexitarian]/[Omnivore].',
                        type = str,
                        default = None)

    parser.add_argument('--minimum_abundance',
                        nargs = '?',
                        help = 'Specify the minimum abundance used for determining presense. note: [0, 100] and [0.0] by default',
                        type = float,
                        default = 0.0)

    parser.add_argument('--species_number',
                        nargs = '?',
                        help = 'Specify the total number of multiple species in the analysis.',
                        type = int)


    parser.add_argument('--output',
                        nargs = '?',
                        help = 'Specify the output figure name.',
                        type = str,
                        default = None)
    parser.add_argument('--palette',
                        nargs = '?',
                        help = 'Input a tab-delimited mapping file where values are group names and keys are color codes.',
                        type = str,
                        default = None)

    return vars(parser.parse_args())

class PandasDealer:
    """
    This is an object for dealing pandas dataframe.
    """

    def __init__(self, df_):

        self.df_ = df_

    def read_csv(self):
        # Ths fucntion will read tab-delimitted file into a pandas dataframe.

        return pd.read_csv(self.df_, sep = '\t', index_col = False, low_memory=False)

    def rotate_df(self):
        # this function is to rotate the metaphlan-style table into tidy dataframe to ease searching work,

        df_ = self.read_csv()
        df_rows_lists = df_.values.tolist()
        rotated_df_dict = {df_.columns[0]: df_.columns[1:]}
        for i in df_rows_lists:
            rotated_df_dict[i[0]] = i[1:]

        rotated_df = pd.DataFrame.from_dict(rotated_df_dict)
        
        return rotated_df

class CopEstimator:

    def __init__(self, sub_df_md):
        self.sub_df_md = sub_df_md # sub_df_md: a subset of dataframe which contains only a group of species one wants to do co-presence analysis.

    def make_copresense_df(self, factor, total_species_nr, threshold = 0.0):
        # factor: the factor you want to assess the category percentage.
        # total_species_nr: specify the total number of species you want to do co-presense analysis.


        rotated_df = PandasDealer(self.sub_df_md)
        rotated_df = rotated_df.rotate_df()
        cols = rotated_df.columns[-total_species_nr: ].to_list() 
        categories = list(set(rotated_df[factor].to_list()))
        

        copresense = []
        cate_name = []
        ratios = []
        for c in categories:
            sub_df = rotated_df[rotated_df[factor] == c]
            species_group_df = sub_df[cols]
            species_group_df = species_group_df.apply(pd.to_numeric)
            species_group_df['total'] = species_group_df[cols].gt(threshold).sum(axis=1)
            for i in range(1, total_species_nr + 1):
                ratio = count_non_zero_rows(species_group_df, i)
                copresense.append(i)
                cate_name.append(c)
                ratios.append(ratio)

        return pd.DataFrame.from_dict({"copresense": copresense,
                                        factor: cate_name,
                                        "percentage": ratios})

def count_non_zero_rows(df_, nr):
    total_rows = len(df_.index)
    
    sub_df = df_[df_['total'] >= nr]
    ratio = len(sub_df.index)/total_rows

    return ratio
    

class VisualTools:
    def __init__(self, processed_df, factor):
        self.processed_df = processed_df
        self.factor = factor

    def step_curves(self, opt_name, palette = None):
        categories = list(set(self.processed_df[self.factor].to_list()))
        if palette:
            palette_dict = {i.rstrip().split('\t')[0]: i.rstrip().split('\t')[1] for i in open(palette).readlines()}
            for c in categories:
                sub_df = self.processed_df[self.processed_df[self.factor] == c]
                plt.step(sub_df["percentage"]*100, sub_df["copresense"], label = c, color = palette_dict[c])
        else:
            for c in categories:
                sub_df = self.processed_df[self.processed_df[self.factor] == c]
                plt.step(sub_df["percentage"]*100, sub_df["copresense"], label = c)

        plt.title("Number of species in an individual if present")
        plt.xlabel("Percentage")
        plt.ylabel("Co-presense")
        plt.legend(title = self.factor)
        plt.savefig(opt_name, bbox_inches = "tight")


if __name__ == "__main__":

    pars = read_args(sys.argv)
    cop_obj = CopEstimator(pars['abundance_table'])
    p_df = cop_obj.make_copresense_df(pars['variable'], pars['species_number'], pars['minimum_abundance'])
    vis_obj = VisualTools(p_df, pars['variable'])
    vis_obj.step_curves(pars['output'], palette = pars['palette'])
  • 用法
python 复制代码
usage: step_curve_drawer.py [-h] [--abundance_table [ABUNDANCE_TABLE]] [--variable [VARIABLE]] [--minimum_abundance [MINIMUM_ABUNDANCE]] [--species_number [SPECIES_NUMBER]] [--output [OUTPUT]]
                            [--palette [PALETTE]]

This program is to do draw step curves to analyze co-presense of multiple species in different groups.

optional arguments:
  -h, --help            show this help message and exit
  --abundance_table [ABUNDANCE_TABLE]
                        Input the MetaPhlAn4 abundance table which contains only a group of species one wants to analyze their co-presense state, with metadata being wedged.
  --variable [VARIABLE]
                        Specify the header of the variable in the metadata table you want to assess. For example, [Diet] variable columns has three categries - [vegan]/[Flexitarian]/[Omnivore].
  --minimum_abundance [MINIMUM_ABUNDANCE]
                        Specify the minimum abundance used for determining presense. note: [0, 100] and [0.0] by default
  --species_number [SPECIES_NUMBER]
                        Specify the total number of multiple species in the analysis.
  --output [OUTPUT]     Specify the output figure name.
  --palette [PALETTE]   Input a tab-delimited mapping file where values are group names and keys are color codes.

examples:

python step_curve_drawer.py --abundance_table <abundance_table_w_md.tsv> --variable <variable_name> --species_number <nr_sps> --output <output.svg>

为了演示step_curve_drawer.py的使用,我们将绘制基于metaphlan相对丰度表特定于Segatalla copri (之前称为Prevotella copri )的八个谱系:./data/mpa4_pcopri_abundances_md.tsv的共现模式,这些数据来自MSMNon-MSM 人群。MSMNon-MSM 样本将使用自定义颜色进行标记,颜色分配来自一个颜色映射文件color map file: ./data/copresence_color_map.tsv

bash 复制代码
python step_curve_drawer.py \
  --abundance_table mpa_pcopri_abundances_md.tsv \
  --variable sexual_orientation \
  --species_number 8 \
  --palette copresence_color_map.tsv \
  --output copresence_plot.png
相关推荐
hboot3 天前
AI工程师第二课 - 数据处理
人工智能·python·数据分析
王小王-1234 天前
基于 Hive 的网易云音乐数据分析及可视化系统
hive·hadoop·数据分析·音乐数据分析·网易云音乐分析·hive音乐分析·hadoop网易云
Database_Cool_4 天前
大规模数据分析降本指南:AnalyticDB Serverless 弹性架构实战
数据仓库·阿里云·架构·数据分析·serverless
YangYang9YangYan4 天前
2026初入职场学习数据分析的价值
学习·数据挖掘·数据分析
砚底藏山河4 天前
沪深A股:如何获取基金持股数据
java·python·数据分析·maven
GZ同学4 天前
单双变量Ripley’s K函数 R 语言实现
开发语言·r语言
jarreyer4 天前
【数据分析绘图】excel绘图和bi工具区别
数据挖掘·数据分析·excel
chatexcel4 天前
ChatExcel Max使用教程:图片、PDF、网页与复杂Excel的一站式数据分析
数据分析·pdf·excel
yangshicong4 天前
第16章:AI数据分析与Text-to-SQL
人工智能·python·sql·数据分析·langchain