python脚本过滤得到non-overlap的utr

使用该脚本对上述的结果"lin_20240321_calculating_rG4score.R"进行过滤

python 复制代码
import csv

def read_file(file_path):
    with open(file_path, 'r') as file:
        reader = csv.DictReader(file, delimiter='\t')
        return list(reader)

def process_sequences(data):
    gene_sequences = {}
    for row in data:
        gene_id = row['Id']
        start = int(row['Start'])
        end = int(row['End'])
        length=int(row['total_length'])
        score = float(row['G4Hscore'])

        if gene_id not in gene_sequences:
            gene_sequences[gene_id] = []

        gene_sequences[gene_id].append({
            'Type': row['Type'],
            'Start': start,
            'End': end,
            'Length': length,
            'Sequence': row['Sequence'],
            'Score': score
        })

    # 对每个基因的序列按分数降序排序
    for gene_id, sequences in gene_sequences.items():
        gene_sequences[gene_id] = sorted(sequences, key=lambda x: x['Score'], reverse=True)

    # 保留分数最高且不重叠的序列
    final_selection = {}
    for gene_id, sequences in gene_sequences.items():
        final_selection[gene_id] = []
        for seq in sequences:
            if not any(seq['Start'] < s['End'] and seq['End'] > s['Start'] for s in final_selection[gene_id]):
                final_selection[gene_id].append(seq)

    return final_selection

def write_results(gene_sequences, output_file):
    with open(output_file, 'w', newline='') as file:
        writer = csv.writer(file, delimiter='\t')
        writer.writerow(['Id', 'Type', 'Start', 'End', 'Total_length','Sequence', 'Score'])
        for gene_id, sequences in gene_sequences.items():
            for seq in sequences:
                writer.writerow([gene_id, seq['Type'], seq['Start'], seq['End'], seq['Length'], seq['Sequence'], seq['Score']])

# 输入和输出文件路径
#usage:python lin_filter_non-overlap_rg4.py -f1 lijinonextended_3utr_allrg4output1.fasta -f2 lijinonextended_3utr_allrg4output2.fasta
import argparse
parser = argparse.ArgumentParser(description="Advanced screening always by hash")
parser.add_argument("-f1","--file1",help="input1")
parser.add_argument("-f2","--file2",help="input2")
args = parser.parse_args()

# 读取文件
data = read_file(args.file1)
# 处理序列,保留得分最高且不重叠的序列
gene_sequences = process_sequences(data)
# 将结果写入新文件
write_results(gene_sequences, args.file2)
相关推荐
apocelipes13 分钟前
POSIX兼容系统上read和write系统调用的行为总结
linux·c语言·c++·python·golang·linux编程
暴风鱼划水23 分钟前
算法题(Python)数组篇 | 6.区间和
python·算法·数组·区间和
Derrick__137 分钟前
Web Js逆向——加密参数定位方法(Hook)
python·js
南汐汐月1 小时前
重生归来,我要成功 Python 高手--day33 决策树
开发语言·python·决策树
lzjava20241 小时前
Spring AI使用知识库增强对话功能
人工智能·python·spring
B站_计算机毕业设计之家1 小时前
深度血虚:Django水果检测识别系统 CNN卷积神经网络算法 python语言 计算机 大数据✅
python·深度学习·计算机视觉·信息可视化·分类·cnn·django
Q_Q5110082851 小时前
python+django/flask的校园活动中心场地预约系统
spring boot·python·django·flask·node.js·php
工会主席-阿冰1 小时前
数据索引是无序时,直接用这个数据去画图的话,显示的图是错误的
开发语言·python·数据挖掘
Naiva2 小时前
【小技巧】PyCharm建立项目,VScode+CodeX+WindowsPowerShell开发Python pyQT6 (二)
vscode·python·pycharm
Lucifer__hell2 小时前
【python+tkinter】图形界面简易计算器的实现
开发语言·python·tkinter