文章MSM_metagenomics(五):共现分析

欢迎大家关注全网生信学习者系列:

  • WX公zhong号:生信学习者
  • Xiao hong书:生信学习者
  • 知hu:生信学习者
  • CDSN:生信学习者2

介绍

本教程是使用一个Python脚本来分析多种微生物(即strains, species, genus等)的共现模式。

数据

大家通过以下链接下载数据:

Python packages required

Co-presence pattern analysis

使用step_curve_drawer.py 做共线性分析

  • 代码
python 复制代码
#!/usr/bin/env python

"""
NAME: step_curve_drawer.py
DESCRIPTION: This script is to analyze the co-prsense of multiple species in different categories,
             by drawing step curves.
"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sys
import argparse
import textwrap

def read_args(args):
    # This function is to parse arguments

    parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,
                                    description = textwrap.dedent('''\
                                     This program is to do draw step curves to analyze co-presense of multiple species in different groups.
                                     '''),
                                    epilog = textwrap.dedent('''\
                                    examples:step_curve_drawer.py --abundance_table <abundance_table_w_md.tsv> --variable <variable_name> --species_number <nr_sps> --output <output.svg>
                                    '''))
    parser.add_argument('--abundance_table',
                        nargs = '?',
                        help = 'Input the MetaPhlAn4 abundance table which contains only a group of species one wants to analyze their co-presense state, with metadata being wedged.',
                        type = str,
                        default = None)

    parser.add_argument('--variable',
                        nargs = '?',
                        help = 'Specify the header of the variable in the metadata table you want to assess. For example, \
                        [Diet] variable columns has three categries - [vegan]/[Flexitarian]/[Omnivore].',
                        type = str,
                        default = None)

    parser.add_argument('--minimum_abundance',
                        nargs = '?',
                        help = 'Specify the minimum abundance used for determining presense. note: [0, 100] and [0.0] by default',
                        type = float,
                        default = 0.0)

    parser.add_argument('--species_number',
                        nargs = '?',
                        help = 'Specify the total number of multiple species in the analysis.',
                        type = int)


    parser.add_argument('--output',
                        nargs = '?',
                        help = 'Specify the output figure name.',
                        type = str,
                        default = None)
    parser.add_argument('--palette',
                        nargs = '?',
                        help = 'Input a tab-delimited mapping file where values are group names and keys are color codes.',
                        type = str,
                        default = None)

    return vars(parser.parse_args())

class PandasDealer:
    """
    This is an object for dealing pandas dataframe.
    """

    def __init__(self, df_):

        self.df_ = df_

    def read_csv(self):
        # Ths fucntion will read tab-delimitted file into a pandas dataframe.

        return pd.read_csv(self.df_, sep = '\t', index_col = False, low_memory=False)

    def rotate_df(self):
        # this function is to rotate the metaphlan-style table into tidy dataframe to ease searching work,

        df_ = self.read_csv()
        df_rows_lists = df_.values.tolist()
        rotated_df_dict = {df_.columns[0]: df_.columns[1:]}
        for i in df_rows_lists:
            rotated_df_dict[i[0]] = i[1:]

        rotated_df = pd.DataFrame.from_dict(rotated_df_dict)
        
        return rotated_df

class CopEstimator:

    def __init__(self, sub_df_md):
        self.sub_df_md = sub_df_md # sub_df_md: a subset of dataframe which contains only a group of species one wants to do co-presence analysis.

    def make_copresense_df(self, factor, total_species_nr, threshold = 0.0):
        # factor: the factor you want to assess the category percentage.
        # total_species_nr: specify the total number of species you want to do co-presense analysis.


        rotated_df = PandasDealer(self.sub_df_md)
        rotated_df = rotated_df.rotate_df()
        cols = rotated_df.columns[-total_species_nr: ].to_list() 
        categories = list(set(rotated_df[factor].to_list()))
        

        copresense = []
        cate_name = []
        ratios = []
        for c in categories:
            sub_df = rotated_df[rotated_df[factor] == c]
            species_group_df = sub_df[cols]
            species_group_df = species_group_df.apply(pd.to_numeric)
            species_group_df['total'] = species_group_df[cols].gt(threshold).sum(axis=1)
            for i in range(1, total_species_nr + 1):
                ratio = count_non_zero_rows(species_group_df, i)
                copresense.append(i)
                cate_name.append(c)
                ratios.append(ratio)

        return pd.DataFrame.from_dict({"copresense": copresense,
                                        factor: cate_name,
                                        "percentage": ratios})

def count_non_zero_rows(df_, nr):
    total_rows = len(df_.index)
    
    sub_df = df_[df_['total'] >= nr]
    ratio = len(sub_df.index)/total_rows

    return ratio
    

class VisualTools:
    def __init__(self, processed_df, factor):
        self.processed_df = processed_df
        self.factor = factor

    def step_curves(self, opt_name, palette = None):
        categories = list(set(self.processed_df[self.factor].to_list()))
        if palette:
            palette_dict = {i.rstrip().split('\t')[0]: i.rstrip().split('\t')[1] for i in open(palette).readlines()}
            for c in categories:
                sub_df = self.processed_df[self.processed_df[self.factor] == c]
                plt.step(sub_df["percentage"]*100, sub_df["copresense"], label = c, color = palette_dict[c])
        else:
            for c in categories:
                sub_df = self.processed_df[self.processed_df[self.factor] == c]
                plt.step(sub_df["percentage"]*100, sub_df["copresense"], label = c)

        plt.title("Number of species in an individual if present")
        plt.xlabel("Percentage")
        plt.ylabel("Co-presense")
        plt.legend(title = self.factor)
        plt.savefig(opt_name, bbox_inches = "tight")


if __name__ == "__main__":

    pars = read_args(sys.argv)
    cop_obj = CopEstimator(pars['abundance_table'])
    p_df = cop_obj.make_copresense_df(pars['variable'], pars['species_number'], pars['minimum_abundance'])
    vis_obj = VisualTools(p_df, pars['variable'])
    vis_obj.step_curves(pars['output'], palette = pars['palette'])
  • 用法
python 复制代码
usage: step_curve_drawer.py [-h] [--abundance_table [ABUNDANCE_TABLE]] [--variable [VARIABLE]] [--minimum_abundance [MINIMUM_ABUNDANCE]] [--species_number [SPECIES_NUMBER]] [--output [OUTPUT]]
                            [--palette [PALETTE]]

This program is to do draw step curves to analyze co-presense of multiple species in different groups.

optional arguments:
  -h, --help            show this help message and exit
  --abundance_table [ABUNDANCE_TABLE]
                        Input the MetaPhlAn4 abundance table which contains only a group of species one wants to analyze their co-presense state, with metadata being wedged.
  --variable [VARIABLE]
                        Specify the header of the variable in the metadata table you want to assess. For example, [Diet] variable columns has three categries - [vegan]/[Flexitarian]/[Omnivore].
  --minimum_abundance [MINIMUM_ABUNDANCE]
                        Specify the minimum abundance used for determining presense. note: [0, 100] and [0.0] by default
  --species_number [SPECIES_NUMBER]
                        Specify the total number of multiple species in the analysis.
  --output [OUTPUT]     Specify the output figure name.
  --palette [PALETTE]   Input a tab-delimited mapping file where values are group names and keys are color codes.

examples:

python step_curve_drawer.py --abundance_table <abundance_table_w_md.tsv> --variable <variable_name> --species_number <nr_sps> --output <output.svg>

为了演示step_curve_drawer.py的使用,我们将绘制基于metaphlan相对丰度表特定于Segatalla copri (之前称为Prevotella copri )的八个谱系:./data/mpa4_pcopri_abundances_md.tsv的共现模式,这些数据来自MSMNon-MSM 人群。MSMNon-MSM 样本将使用自定义颜色进行标记,颜色分配来自一个颜色映射文件color map file: ./data/copresence_color_map.tsv

bash 复制代码
python step_curve_drawer.py \
  --abundance_table mpa_pcopri_abundances_md.tsv \
  --variable sexual_orientation \
  --species_number 8 \
  --palette copresence_color_map.tsv \
  --output copresence_plot.png
相关推荐
小白跃升坊1 天前
数据分析报表如何选?详解 DataEase 四大表格:明细表、汇总表、透视表与热力图的适用场景与选择策略
数据挖掘·数据分析·开源软件·数据可视化·dataease
wudl55661 天前
Pandas-之数据可视化
信息可视化·数据分析·pandas
Q26433650231 天前
【有源码】基于Hadoop+Spark的起点小说网大数据可视化分析系统-基于Python大数据生态的网络文学数据挖掘与可视化系统
大数据·hadoop·python·信息可视化·数据分析·spark·毕业设计
中杯可乐多加冰1 天前
基于网易CodeWave智能开发平台构建宝可梦图鉴
深度学习·低代码·ai·数据分析·数据采集·无代码·网易codewave征文
qq_436962181 天前
奥威BI:打破数据分析的桎梏,让决策更自由
人工智能·数据挖掘·数据分析
B站计算机毕业设计之家1 天前
大数据python招聘数据分析预测系统 招聘数据平台 +爬虫+可视化 +django框架+vue框架 大数据技术✅
大数据·爬虫·python·机器学习·数据挖掘·数据分析
Tiandaren2 天前
大模型应用03 || 函数调用 Function Calling || 概念、思想、流程
人工智能·算法·microsoft·数据分析
人大博士的交易之路2 天前
龙虎榜——20251106
大数据·数学建模·数据分析·缠论·缠中说禅·龙虎榜
YangYang9YangYan2 天前
中专服装设计专业职业发展指南
大数据·人工智能·数据分析
ClouGence2 天前
百草味数据架构升级实践:打造 Always Ready 的企业级数据平台
大数据·数据库·数据分析