使用 ChatGPT 为生物信息学初学者赋能

论文:Empowering Beginners in Bioinformatics with ChatGPT. 2023

对于生信初学者而言,最大的困难是身边没有经验丰富的人给予指导。而ChatGTP的出现可能改变这一现状,学生可以自己作为导师,指导ChatGPT完成数据分析工作。

众所周知,与ChatGPT互动,给予的指令越精确,那么它给出的答案越精准。这篇论文提出一个与ChatGPT互动的模型:OPTICAL。其基本思想是通过迭代不断优化给予ChatGPT的指令。

该模型的流程图如下:

  1. 给予初始提示。

  2. 机器人产生分析代码。

  3. 运行代码。

    如果出现错误,转向优化提示词。

    如果代码正确,继续下一步。

  4. 评估结果。

    如果结果不符合预期,转向优化提示词。

    如果结果符合预期,继续下一步。

  5. 审查代码,得到最终提示词并归档方法。

这个模型本身平平无奇,符合平常人们使用ChatGPT的习惯:即不断优化提示词,直至得到正确答案。下面两个案例很好地体现了这一过程。

案例一:下一代测序的短读段比对和视觉检查

定义聊天机器人的行为:

Act as an experienced bioinformatician proficient in ChIP-Seq data analysis, you will assist me by writing code with number of lines as minimal as possible. Rest the thread if asked to. Reply "YES" if understand.

迭代0

I have two fastq files in current folder from single-end sequencing of a ChIP-Seq library: ENCFF000AVS_1m.fastq.gz, and ENCFF000AVS_10m.fastq.gz. For each fastq file, align reads to the human reference genome, save to bam file, and then covert it to bigwig file. Tools to use: bowtie2, samtools, and deepTools. The index for bowtie2 is in the folder "../data/indx/bowtie2_whole_genome/" with "hg38" as the prefix. Use 24 CPU for the alignment. Please draft the code in bash.

迭代1

E::idx_find_and_load Could not retrieve index file for 'ENCFF000AVS_1m.bam'

迭代2

Wait, I saw that you have "samtools index" before "bamcoverage". Does bamcoverage as bam to be sorted before using as input?

审查代码

I need to insert line-by-line comments to the below code which works well to address the needs for the data analysis task. Wait for my code.

最终提示词(粗体字是经过迭代加入的提示细节):

Act as an experienced bioinformatician proficient in ChIP-Seq data analysis, you will assist me by writing code with number of lines as minimal as possible. Rest the thread if asked to. Reply "YES" if understand.

I have two fastq files in current folder from single-end sequencing of a ChIP-Seq library: ENCFF000AVS_1m.fastq.gz, and ENCFF000AVS_10m.fastq.gz. For each fastq file, align reads to the human reference genome, save to bam file, index it , and then covert it to bigwig file with CPM normalization. Tools to use: bowtie2, samtools, and deepTools. The index for bowtie2 is in the folder "../data/indx/bowtie2_whole_genome/" with "hg38" as the prefix. Use 24 CPU for the alignment. Please draft the code in bash.

安全二:推断DNA序列的分子进化系统发育树

定义聊天机器人的行为:

Act as an experienced bioinformatician proficient in R, you will write code with number of lines as minimal as possible. Rest the thread if asked to. Reply "YES" if understand.

迭代0

You have a multiple alignment file named as tp53.clustal in ClustalW format. Please write R code that can load the file, calculate evolutionary distance, build a NJ tree, and visualize the phylogeny.

迭代1

I got an error message complaining "could not find function "read.alignment". Please fix it.

迭代2

I got a warning message " In dist.dna(aln) : NAs introduced by coercion". Please fix it.

迭代3

I wrote an R program to read a multiple alignment file named as tp53.clustal in ClustalW format, calculate evolutionary distance, build a NJ tree, and visualize the phylogeny. But I want to root the tree with the Zebrafish sequence as the outgroup. Can you help me revise the R code? Below is my R code.

php 复制代码
# Load the required packages
library(seqinr)
library(ape)
# Read in the alignment file
aln <- read.alignment("tp53.clustal", format="clustal")
# Calculate the evolutionary distance
dist <- dist.dna(as.DNAbin(aln))
# Build the NJ tree
tree <- nj(dist)
# Plot the phylogeny
plot(tree)

迭代4

I got an error message complaining " Error in nj(dist, outgroup = zebrafish_idx) unused argument (outgroup = zebrafish_idx)". Please fix it.

迭代5

I got an error message complaining "Error in if (newroot == ROOT) { : argument is of length zero". Please fix it.

审查代码

I created the following R code. Please add inline comments.

最终提示词

无。

关于简说基因

生信平台

Galaxy中国(UseGalaxy.cn)致力于打造中国人的云上生物信息基础设施。大量在线工具免费使用。无需安装,用完即走。活跃的用户社区,随时交流使用心得。
*

生信培训

简说基因的生信培训班,荣获学员的一致好评。如果你也对生物信息学感兴趣,欢迎来跟简说基因,学真生信

生信分析

我们能够承接所有 NGS 组学数据分析业务,包括但不限于 WGS / WES / RNA-seq 等。基因组组装、注释,以及各种重测序业务都可以与简说基因合作。

相关推荐
zhiSiBuYu0517几秒前
建立 AI 辅助开发的 Code Review 流程实战指南
人工智能·代码复审
装不满的克莱因瓶2 分钟前
自然语言处理中的分词——从语言切分到模型输入的第一步
人工智能·pytorch·python·深度学习·ai·自然语言处理
这个DBA有点耶4 分钟前
Vibe Coding 是什么?当“感觉编程”遇上数据库
数据库·人工智能·架构·学习方法·ai编程·程序员创富·改行学it
QiLinkOS4 分钟前
QiLink开源生态的三维重构:基于时间、空间与社会价值的底层规则创新白皮书
大数据·c++·人工智能·科技·算法·gitee·开源
测试开发技术5 分钟前
AI 测试赋能全流程实战 | Agent Skill + AI 赋能「需求分析」
自动化测试·人工智能·自动化·需求分析·ai编程·ai测试
MartinYeung56 分钟前
[论文学习]CAMIA:基于上下文感知的成员资格推断攻击:针对预训练大型语言模型的深度分析
人工智能·学习·语言模型
qq_436962187 分钟前
从“技术稀缺”到“人人可用”:奥威BI+AI如何复刻工业革命级变革
大数据·人工智能
运维小欣9 分钟前
2026年AI 可观测平台选型指南
大数据·人工智能
Ztopcloud极拓云视角13 分钟前
我用AI辅助做了一个多端工具:解决2026世界杯回放被剧透的问题
人工智能·windows·个人开发
数智化精益手记局14 分钟前
拆解项目管理平台核心功能:看项目管理平台如何解决跨部门协作难题与多项目并行场景
大数据·运维·数据库·人工智能·产品运营