写在前面
本教程来自粉丝投稿,主要做得是顺式元件的预测和热图绘制。类似的教程,在我们基于TBtools做基因家族分析也做过,流程基本一致。我们前期的教程,主要是基于TBtools,本教程主要是基于纯代码,也是值得学习+收藏
的教程。
目的
在进行基因家族分析时,我们经常会对候选基因的启动子元件类型进行分析,该步骤主要是实现顺式元件热图+柱状堆积图的组合图。
方法
1. 数据准备及整理
1、ref #基因组:如果有多个基因组可以直接合并到一起,但注意染色体编号的区分,可以更改成染色体号_物种
>Chr01_Ath
ATGC....
>Chr01_Sol
ATGC....
2、bed #基因组注释文件(bed格式),如果只有gff,可以使用gff2bed等软件或者awk基础命令行进行更改
3、genelist #基因分组关系文件*.list,文件分两列,第一列为基因id,第二列为物种/聚类分组等信息
部分数据展示:
gene_id Genetype
geneA1 Ath
geneA2 Ath
geneA3 Ath
geneB1 Sol
geneB2 Sol
geneB3 Sol
2. 依赖软件
conda install -c bioconda seqkit
3. 分析步骤
1)获得启动子序列
启动子是DNA序列,可以被RNA 聚合酶识别、与其结合从而开始转录,它含有RNA 聚合酶特异性结合和转录起始所需的保守序列,多数位于结构基因转录起始点的上游,启动子本身不被转录。启动子的特性最初是通过能增加或降低基因转录速率的突变而鉴定的。启动子一般位于转录起始位点的上游几百至几千bp不等。
# 提取目的基因位置文件
grep -wf <(cut -f 1 ${genelist}) ${gff%.*}.bed >pick_merge.bed
# 提取基因启动子序列,一般默认上游3K
seqkit subseq -u 3000 --bed pick_merge.bed -f -o gene_up3000.fa ${ref} -w 0
2)元件预测
登录网站:(http://bioinformatics.psb.ugent.be/webtools/plantcare/html/),提交提取的启动子序列和邮箱号,预测结束后会将结果(plantCARE_output_PlantCARE_*.tab文件)发至邮箱。
## plantCARE_output_PlantCARE_*.tab
geneA1 CAT-box GCCACT 1003 6 - Arabidopsis thaliana cis-acting regulatory element related to meristem expression
geneA2 CAT-box GCCACT 1525 6 + Arabidopsis thaliana cis-acting regulatory element related to meristem expression
geneA3 LTR CCGAAA 459 6 - Hordeum vulgare cis-acting element involved in low-temperature responsiveness
#注意文件长度不等
数据解读:
A:ID
B:名称
C:motif
D:起始位置
E:motif 长度
F:正负链
G-H:功能描述
3)手动分类
根据最后一列的功能描述进行分类,例如生长素相关,光周期相关(英文标注)等等,将分类补充至最后一列。手动分类好后摘取关注分类的数据即可,并重新命名文件即可。(如:plant_class.tab)
#部分数据展示
geneA1 CAT-box GCCACT 1003 6 - Arabidopsis thaliana cis-acting regulatory element related to meristem expression cell cycle
geneA1 CAT-box GCCACT 1525 6 + Arabidopsis thaliana cis-acting regulatory element related to meristem expression cell cycle
geneA1 LTR CCGAAA 459 6 - Hordeum vulgare cis-acting element involved in low-temperature responsiveness low-temperature
geneA2 LTR CCGAAA 1178 6 + Hordeum vulgare cis-acting element involved in low-temperature responsiveness low-temperature
geneA2 TGACG-motif TGACG 61 5 + Hordeum vulgare cis-acting regulatory element involved in the MeJA-responsiveness MeJA
geneA2 TGACG-motif TGACG 89 5 + Hordeum vulgare cis-acting regulatory element involved in the MeJA-responsiveness MeJA
4)提取分类信息,准备绘图数据
awk -F '\t' 'NF==9' plant_class.tab >plant_class.tab.filt.xls
echo -e "type\tclass" >core_class.xls && cut -f 2,9 plant_class.tab.filt.xls |sort -u |sed -e "s/ /|/g" |sort -k 2 |sed -e "s/|/ /g" >>column_class.xls
python stat.py plant_class.tab.filt.xls row_anno.xls column_class.xls total.plot.xls
cut -f 2- stat_class.xls >plot_bar.xls
grep -vw '#AAA' stat_core.xls |cut -f 2- >plot_core.xls
join <(sort -k 1,1 plot_core.xls |sed -e 's/ /_/g' ) <(sort -k 1,1 plot_bar.xls |sed -e 's/ /_/g' ) |sed -e "s/\s\+/\t/g" -e "s/_/ /g" >total.plot.xls
#上述步骤整理得到绘图文件
total.plot.xls
5)作图,最后AI调整一下细节即可。
Rscript cis_heatmap_bar.r --ra new_row_anno.xls --class column_class.xls --plot total_plot.xls --out out.pdf
脚本
import sys
import pandas as pd
from collections import defaultdict
import numpy as np
col_names = ["gene","type","seq","s_pos","score","strand","spe","anno","group"]
df = pd.read_csv(sys.argv[1],sep="\t",names=col_names)
df2 = pd.read_csv(sys.argv[2],sep="\t")
df3 = pd.read_csv(sys.argv[3],sep="\t")
stat_yunjian_dic =defaultdict(list)
stat_type_dic = defaultdict(list)
# np.array(chr_group["group"].drop_duplicates()).tolist()
yuanjian_type = np.array(df3["type"].drop_duplicates()).tolist()
gene_id = np.array(df["gene"].drop_duplicates()).tolist()
class_id = np.array(df3["class"].drop_duplicates()).tolist()
cl = ["Class","#AAA"]
stat_yunjian_dic["gene_id"]=gene_id
stat_type_dic["gene_id"]=gene_id
for g in gene_id:
stat_yunjian_dic["spe"].append(df2[df2["gene_id"]== g].iloc[0,1])
stat_type_dic["spe"].append(df2[df2["gene_id"]== g].iloc[0,1])
for t in yuanjian_type:
num = len(df[(df["gene"]== g) & (df["type"] == t)])
stat_yunjian_dic[t].append(num)
for c in class_id:
c_num = len(df[(df["gene"]== g) & (df["group"] == c)])
stat_type_dic[c].append(c_num)
for t in yuanjian_type:
cl.append(df3[df3["type"] == t].iloc[0,1])
core_df =pd.DataFrame(data=stat_yunjian_dic)
#out_core_df.columns=cl
core_df.loc[len(core_df.index)] =cl
out_class_df = pd.DataFrame(data=stat_type_dic)
#sort
out_class_df.sort_values(by=["spe"] , inplace=True, ascending=True)
core_df.sort_values(by=["spe"] , inplace=True, ascending=True)
out_core_df = core_df.loc[:,["spe","gene_id"]+yuanjian_type]
out_class_df = out_class_df.loc[:,["spe","gene_id"]+class_id]
out_total=pd.merge(out_core_df,out_class_df,on='gene_id').drop('spe_x', axis=1)
out_total.to_csv(sys.argv[4],index=False,sep="\t")
library(tidyverse)
library(RColorBrewer)
library(ggplot2)
library(ComplexHeatmap)
library(colorRamp2)
library(argparser)
#设置参数类型、默认值及说明
argv <- arg_parser('')
argv <- add_argument(argv,"--ra", help="输入文件,row_anno,gene_id\tGenetype")
argv <- add_argument(argv,"--class", help="输入文件,core_class,type\tclass")
argv <- add_argument(argv,"--plot",help="输入文件,total_plot.xls")
argv <- add_argument(argv,"--out",default="pdf",help="输出文件类型:pdf/jpg/png")
argv <- parse_args(argv)
#将参数传入变量中
row_anno <- argv$ra
core_class<-argv$class
plot_core<-argv$plot
out <-argv$out
# 这些需要将导入的注释信息转化为data.frame 格式
annotation_row<-read.table(row_anno,header=T,sep="\t")
annotation_row <- as.data.frame(annotation_row)
annotation_col<-read.table(core_class,header=T,sep="\t")
annotation_col <- as.data.frame(annotation_col)
#颜色配置,可手动调整
col_fun = colorRamp2(c(0,3,5,10,15), c('#d9d9d9', '#addd8e', '#b10026', '#fd8d3c','#009db2'))
lgd_row_len <- length(unique(annotation_row$Genetype))
lgd_col_len <- length(unique(annotation_col$class))
core_len <-length(annotation_col$class)
d_s <- core_len+1
d_e <- core_len+lgd_col_len
#数据集
df <-read.table(plot,sep="\t",header=T,row.names = 1)
cell1 <-df[ ,1:core_len]
cell1[cell1==0]<-NA
data1 <- df[ ,d_s:d_e]
p1 = Heatmap(cell1,
col = col_fun,
na_col = "white",
# 去掉行列聚类:
cluster_rows = F,
cluster_columns = F,
row_names_side = "left",
column_names_side = "top",
#show_heatmap_legend = F,
# 行名和列名:
row_names_gp = gpar(fontsize = 10, font = 3),
column_names_gp = gpar(fontsize = 10, font = 3),
row_order = annotation_row$gene_id,
row_split = annotation_row$Genetype,#行截断(按照pathway,不像之前随机)
row_gap = unit(1, "mm"),
column_gap = unit(1, "mm"),
border = TRUE,
#column_order = annotation_col$type,
column_split = annotation_col$class,#列截断
#格子大小
heatmap_width = unit(0.5,"npc"),
heatmap_height = unit(0.5, "npc"),
#柱状图
right_annotation = rowAnnotation(
bar = anno_barplot(axis_param = list(side="top"),width = unit(30,"cm"),data.matrix(data1),gp=gpar(fill=rainbow(lgd_col_len)))
),
# 添加文字注释:
cell_fun = function(j, i, x, y, width, height, fill) {
if (!is.na(cell1[i,j])) {
grid.text(sprintf("%s", cell1[i, j]), x, y,
gp = gpar(fontsize = 8, col = "black"))
grid.rect(x, y, width, height,
gp = gpar(col = "grey", fill = NA, lwd = 0.8))
}else{
grid.rect(x, y, width, height,
gp = gpar(col = "grey", fill = NA, lwd = 0.8))
}
})
p2 <-Heatmap(t(annotation_col$class),
column_split = annotation_col$class,
col = rainbow(lgd_col_len),
cluster_columns = FALSE,
heatmap_height = unit(0.5, "npc"),
border = T)
pdf(out,height = 15,width = 30)
p1_p2 = p1 %v% p2
p1_p2
dev.off()
往期文章:
1. 最全WGCNA教程(替换数据即可出全部结果与图形)
--
2. 精美图形绘制教程
小杜的生信筆記,主要发表或收录生物信息学的教程,以及基于R的分析和可视化(包括数据分析,图形绘制等);分享感兴趣的文献和学习资料!!