Follow me on CSDN: https://spike.blog.csdn.net/
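- OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization
OpenFold is a trainable, open-source implementation that reproduces the structure-prediction capability of AlphaFold2. Its main features are:
- Training and performance: OpenFold was trained from scratch and reaches prediction accuracy comparable to AlphaFold2. It is also faster and more memory-efficient than AlphaFold2, and it runs on PyTorch.
- Learning mechanisms: analyzing the structures OpenFold predicts over the course of training reveals several interesting phenomena, such as staged learning across spatial dimensions, secondary-structure elements, and tertiary scales, as well as approximately low-dimensional behavior under PCA projection.
- Generalization: OpenFold's ability to generalize to unseen regions of protein fold space was evaluated by training on sets of different sizes and diversity, and by holding out parts of the training data according to structural classification. Even on drastically reduced training sets, OpenFold remains surprisingly robust and accurate.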
GitHub: aqlaboratory/openfold
1. Structure Inference
Prepare the model checkpoint finetuning_ptm_2.pt (see Huggingface - OpenFold):
bash
pip install bypy
bypy info
bypy downfile /huggingface/openfold/finetuning_ptm_2.pt finetuning_ptm_2.pt
The inference command used for testing:
bash
python3 run_pretrained_openfold.py \
mydata/test \
af2-data-v230/pdb_mmcif/mmcif_files \
--uniref90_database_path af2-data-v230/uniref90/uniref90.fasta \
--mgnify_database_path af2-data-v230/mgnify/mgy_clusters_2022_05.fa \
--pdb70_database_path af2-data-v230/pdb70/pdb70 \
--uniclust30_database_path msa_databases/deepmsa2/uniclust30/uniclust30_2018_08 \
--output_dir mydata/output \
--bfd_database_path af2-data-v230/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--model_device "cuda:0" \
--jackhmmer_binary_path /opt/openfold/hhsuite-speed/jackhmmer \
--hhblits_binary_path /opt/conda/envs/openfold/bin/hhblits \
--hhsearch_binary_path /opt/conda/envs/openfold/bin/hhsearch \
--kalign_binary_path /opt/conda/envs/openfold/bin/kalign \
--config_preset "model_1_ptm" \
--openfold_checkpoint_path openfold/resources/openfold_params/finetuning_ptm_2.pt
The run log:
bash
INFO:openfold/openfold/utils/script_utils.py:Loaded OpenFold parameters at openfold/resources/openfold_params/finetuning_ptm_2.pt...
INFO:openfold/run_pretrained_openfold.py:Generating alignments for A...
INFO:openfold/openfold/utils/script_utils.py:Running inference for A...
INFO:openfold/openfold/utils/script_utils.py:Inference time: 10.128928968682885
INFO:openfold/run_pretrained_openfold.py:Output written to mydata/output/predictions/A_model_1_ptm_unrelaxed.pdb...
INFO:openfold/run_pretrained_openfold.py:Running relaxation on mydata/output/predictions/A_model_1_ptm_unrelaxed.pdb...
INFO:openfold/openfold/utils/script_utils.py:Relaxation time: 11.812019010074437
INFO:openfold/openfold/utils/script_utils.py:Relaxed output written to mydata/output/predictions/A_model_1_ptm_relaxed.pdb...
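When several targets are run in sequence, it can help to pull the stage timings out of the log automatically. The following is a minimal sketch, assuming the log lines keep the "Inference time:" and "Relaxation time:" wording shown above; the function name parse_stage_times is hypothetical and not part of OpenFold.

```python
import re

# Matches lines such as "...script_utils.py:Inference time: 10.128928968682885"
_TIME_RE = re.compile(r"(Inference|Relaxation) time: ([0-9.]+)")


def parse_stage_times(log_text):
    """Return a dict of stage name (lowercase) to seconds for stages found in the log."""
    times = {}
    for stage, seconds in _TIME_RE.findall(log_text):
        times[stage.lower()] = float(seconds)
    return times


# Two lines copied from the run log above.
sample_log = """\
INFO:openfold/openfold/utils/script_utils.py:Inference time: 10.128928968682885
INFO:openfold/openfold/utils/script_utils.py:Relaxation time: 11.812019010074437
"""
print(parse_stage_times(sample_log))
```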
Replace Jackhmmer with the high-performance build at the following location:
bash
cp backup/hhsuite-speed-3.3.2/jackhmmer /opt/openfold/hhsuite-speed/jackhmmer
The inference output directory contains:
bash
alignments/ # MSA files, same as AF2
predictions/ # predicted structures
timings.json # timing information
tmp_2711.fasta # cached FASTA
timings.json records the inference time:
bash
{"inference": 12.08716268837452}
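Because timings.json is plain JSON, the value can be read back with the standard library. This is a small sketch that assumes the file has the single "inference" key shown above; the helper name read_inference_seconds is hypothetical.

```python
import json
from pathlib import Path


def read_inference_seconds(timings_path):
    """Load timings.json and return the value stored under the "inference" key."""
    with Path(timings_path).open() as handle:
        timings = json.load(handle)
    return float(timings["inference"])


if __name__ == "__main__":
    # Output directory taken from the --output_dir argument used above.
    path = Path("mydata/output/timings.json")
    if path.exists():
        print(f"inference: {read_inference_seconds(path):.2f} s")
```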
The alignments/A folder contains the MSA files, with the following line counts:
bash
mgnify_hits.a3m # 56 lines
pdb70_hits.hhr # 159 lines
uniref90_hits.a3m # 58 lines
bfd_uniref_hits.a3m
Note: unlike AF2, which uses the sto format, OpenFold writes these MSAs in a3m format.
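A line count is not the same as a sequence count, since each a3m record has a header line starting with ">" followed by its sequence. A minimal sketch for counting records is given here; count_a3m_sequences is a hypothetical helper, and the example text is made up for illustration.

```python
def count_a3m_sequences(a3m_text):
    """Count records in a3m/FASTA text: one per header line starting with '>'."""
    return sum(1 for line in a3m_text.splitlines() if line.startswith(">"))


# Illustrative two-record alignment (not taken from a real search result).
example = ">query\nMKV\n>hit_1\nM-V\n"
print(count_a3m_sequences(example))
```

To apply it to a real file, read the file first, for example with open("alignments/A/uniref90_hits.a3m").read().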
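By default, the predictions folder contains a single predicted structure and its relaxed counterpart:
bash
A_model_1_ptm_relaxed.pdb
A_model_1_ptm_unrelaxed.pdb
timings.json
The prediction results follow. Yellow is the reference structure, dark blue is the AF2 single-model prediction, and light blue is the prediction from OpenFold's finetuning_ptm_2.pt model.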
- AF2:
{'TMScore': 0.9036, 'RMSD(local)': 1.66, 'Align.Len.': 117, 'DockQ': 0.0}
- OpenFold:
{'TMScore': 0.8601, 'RMSD(local)': 1.7, 'Align.Len.': 115, 'DockQ': 0.0}
2. Environment Setup
Build the base Docker environment on top of the AF2 Docker image:
bash
nvidia-docker run -it --name openfold-[your name] -v [nfs path]:[nfs path] af2:v1.02
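2.1 Configuring fast conda and pip mirrors
When installing the environment, use conda and pip mirrors hosted in mainland China to speed up downloads.
After entering the container, first update the conda and pip configuration. Create or edit ~/.condarc:
bash
vim ~/.condarc
# Add the following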
channels:
- defaults
show_channel_urls: true
default_channels:
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
channel_priority: disabled
allow_conda_downgrades: true
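The Docker image ships with default pip configuration files that take higher priority. Delete them before writing your own, so that the new settings are neither overridden nor in conflict: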
bash
rm /opt/conda/pip.conf
rm /root/.config/pip/pip.conf
Then edit ~/.pip/pip.conf. The Aliyun pip mirror is recommended, because the Tsinghua mirror is missing some packages:
bash
vim ~/.pip/pip.conf
# Add the following
# This file has been autogenerated or modified by NVIDIA PyIndex.
# In case you need to modify your PIP configuration, please be aware that
# some configuration files may have a priority order. Here are the following
# files that may exists in your machine by order of priority:
#
# [Priority 1] Site level configuration files
# 1. `/opt/conda/pip.conf`
#
# [Priority 2] User level configuration files
# 1. `/root/.config/pip/pip.conf`
# 2. `/root/.pip/pip.conf`
#
# [Priority 3] Global level configuration files
# 1. `/etc/pip.conf`
# 2. `/etc/xdg/pip/pip.conf`
[global]
no-cache-dir = true
index-url = http://mirrors.aliyun.com/pypi/simple/
extra-index-url = https://pypi.ngc.nvidia.com
trusted-host = mirrors.aliyun.com pypi.ngc.nvidia.com
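Since pip.conf uses INI syntax, the [global] section can be checked with Python's configparser before running any install. The sketch below parses a copy of the settings above from a string; read_pip_global is a hypothetical helper. Running pip config list inside the container is another way to see the effective values.

```python
import configparser


def read_pip_global(conf_text):
    """Parse pip.conf text and return the [global] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return dict(parser["global"])


# Same settings as the pip.conf above; "#" comment lines are skipped by the parser.
sample = """\
# comment lines are ignored
[global]
no-cache-dir = true
index-url = http://mirrors.aliyun.com/pypi/simple/
extra-index-url = https://pypi.ngc.nvidia.com
trusted-host = mirrors.aliyun.com pypi.ngc.nvidia.com
"""
settings = read_pip_global(sample)
print(settings["index-url"])
```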
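2.2 Configuring the Docker environment
Building the image with the default command, docker build -t openfold ., is not recommended: downloads are slow and some steps conflict. Use the Dockerfile as a reference instead.
Configure the OpenFold system environment manually:
bash
# Refresh the NVIDIA apt signing keys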
apt-key del 7fa2af80
apt-key del 3bf863cc
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
# Update the package index and install system packages
apt-get update && apt-get install -y wget libxml2 cuda-minimal-build-11-3 libcusparse-dev-11-3 libcublas-dev-11-3 libcusolver-dev-11-3 git
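Note: on a slow network, wget downloads take a while; be patient and retry a few times if necessary.
Configure the OpenFold conda environment, openfold:
bash
# Enter the repository
cd openfold
# Install from the environment file
conda env update -n openfold --file environment.yml && conda clean --all
If the install is interrupted, rerun the update: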
bash
# Re-apply the environment file
conda activate openfold
conda env update --file /opt/openfold/environment.yml --prune
Note: this step takes a long time. If installing the pip packages fails, install them manually.
Manual installation is the recommended approach when the automated install fails, because the logs are clearer:
bash
# Create the environment
conda create -n openfold python=3.9
# Activate it so that the following installs target this environment
conda activate openfold
# Install conda packages
conda install -y -c conda-forge python=3.9 setuptools=59.5.0 pip openmm=7.5.1 pdbfixer cudatoolkit==11.3.*
conda install -y -c bioconda hmmer==3.3.2 hhsuite==3.3.0 kalign2==2.04
conda install -y -c pytorch pytorch=1.12.*
# Install pip packages
pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
pip install biopython==1.79 deepspeed==0.5.10 dm-tree==0.1.6 ml-collections==0.1.0 numpy==1.21.2 PyYAML==5.4.1 requests==2.26.0 scipy==1.7.1 tqdm==4.62.2 typing-extensions==3.10.0.2 pytorch_lightning==1.5.10 wandb==0.12.21 modelcif==0.7
# Bug fix
conda install -c anaconda numpy-base==1.22.3 # fixes the np.object bug while avoiding a conflict with scipy
Note: openmm 7.5.1 lives under the simtk namespace, i.e. from simtk.openmm import app, and has no standalone folder of its own in site-packages.
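To see which namespace an environment provides before choosing the patch directory, a quick probe can help. This is a sketch under the assumption that newer OpenMM releases expose a top-level openmm package while 7.5.1 only exposes simtk.openmm; first_importable is a hypothetical helper.

```python
import importlib.util


def first_importable(candidates):
    """Return the first module name in candidates that can be located, else None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ModuleNotFoundError:
            # Raised when a parent package (e.g. simtk) is itself missing.
            continue
    return None


# "openmm" is probed first: if it is found, the environment is not the 7.5.1 layout.
print(first_importable(["openmm", "simtk.openmm"]))
```

A result of simtk.openmm suggests the 7.5.1 layout described in the note; None means OpenMM is not installed in the active environment.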
2.3 Patching files and building the project
Download the resource stereo_chemical_props.txt and apply the simtk.openmm patch:
bash
cd openfold
# Note: the target directory is openfold/openfold/resources
wget -q -P openfold/resources https://git.scicore.unibas.ch/schwede/openstructure/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
# Note: choose the patch directory according to where simtk.openmm is installed
# conda list openmm
# import simtk
# print(simtk.__file__)
# /opt/conda/envs/openfold/lib/python3.9/site-packages/
patch -p0 -d /opt/conda/envs/openfold/lib/python3.9/site-packages/ < lib/openmm.patch
# Output log
patching file simtk/openmm/app/topology.py
Hunk #1 succeeded at 353 (offset -3 lines).
Note: openmm 7.5.1 needs this patch to fix some bugs; later versions do not. See "About AlphaFold2's openmm.patch".
Build the project so that the conda environment includes the openfold package:
bash
cd openfold
python3 setup.py install
2.4 Related files
Setting up the conda environment relies on environment.yml:
bash
name: openfold_venv
channels:
- conda-forge
- bioconda
- pytorch
dependencies:
- conda-forge::python=3.9
- conda-forge::setuptools=59.5.0
- conda-forge::pip
- conda-forge::openmm=7.5.1
- conda-forge::pdbfixer
- conda-forge::cudatoolkit==11.3.*
- bioconda::hmmer==3.3.2
- bioconda::hhsuite==3.3.0
- bioconda::kalign2==2.04
- pytorch::pytorch=1.12.*
- pip:
- biopython==1.79
- deepspeed==0.5.10
- dm-tree==0.1.6
- ml-collections==0.1.0
- numpy==1.21.2
- PyYAML==5.4.1
- requests==2.26.0
- scipy==1.7.1
- tqdm==4.62.2
- typing-extensions==3.10.0.2
- pytorch_lightning==1.5.10
- wandb==0.12.21
- modelcif==0.7
- git+https://github.com/NVIDIA/dllogger.git
Setting up the system environment relies on the Dockerfile:
bash
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu18.04
# metainformation
LABEL org.opencontainers.image.version = "1.0.0"
LABEL org.opencontainers.image.authors = "Gustaf Ahdritz"
LABEL org.opencontainers.image.source = "https://github.com/aqlaboratory/openfold"
LABEL org.opencontainers.image.licenses = "Apache License 2.0"
LABEL org.opencontainers.image.base.name="docker.io/nvidia/cuda:10.2-cudnn8-runtime-ubuntu18.04"
RUN apt-key del 7fa2af80
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
RUN apt-get update && apt-get install -y wget libxml2 cuda-minimal-build-11-3 libcusparse-dev-11-3 libcublas-dev-11-3 libcusolver-dev-11-3 git
RUN wget -P /tmp \
"https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" \
&& bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda \
&& rm /tmp/Miniconda3-latest-Linux-x86_64.sh
ENV PATH /opt/conda/bin:$PATH
COPY environment.yml /opt/openfold/environment.yml
# installing into the base environment since the docker container wont do anything other than run openfold
RUN conda env update -n base --file /opt/openfold/environment.yml && conda clean --all
COPY openfold /opt/openfold/openfold
COPY scripts /opt/openfold/scripts
COPY run_pretrained_openfold.py /opt/openfold/run_pretrained_openfold.py
COPY train_openfold.py /opt/openfold/train_openfold.py
COPY setup.py /opt/openfold/setup.py
COPY lib/openmm.patch /opt/openfold/lib/openmm.patch
RUN wget -q -P /opt/openfold/openfold/resources \
https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
RUN patch -p0 -d /opt/conda/lib/python3.9/site-packages/ < /opt/openfold/lib/openmm.patch
WORKDIR /opt/openfold
RUN python3 setup.py install
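2.5 Committing the Docker image
Log in to the Docker registry:
docker login harbor.[ip address].com
Note: if login fails, ask an administrator to configure access, or switch to a registry you can log in to.
Set up the BOS command alias: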
bash
alias bos='bcecmd/bcecmd --conf-path bcecmd/bceconf/ bos'
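Commit the container as an image, tag it, and upload it:
bash
# Commit the container as an image
docker ps -l
docker commit [container id] openfold:v1.0
# Prepare the remote tag (when pushing to Harbor, the target name needs the registry path as a prefix)
docker tag openfold:v1.0 openfold:v1.0
docker images | grep "openfold"
# Push to the remote registry
docker push openfold:v1.0
# Pull from the remote registry
docker pull openfold:v1.0
# Or save to a local archive
docker save openfold:v1.0 | gzip > openfold_v1_0.tar.gz
# Load the saved Docker image
docker image load -i openfold_v1_0.tar.gz
docker images | grep "openfold"
The uploaded Docker image and its versions are then visible on the Harbor page.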
3. Bug Fixes
3.1 Incompatible NumPy version
Error log:
bash
openfold/openfold/data/templates.py:88: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
"template_domain_names": np.object,
Traceback (most recent call last):
File "openfold/run_pretrained_openfold.py", line 47, in <module>
from openfold.data import templates, feature_pipeline, data_pipeline
File "openfold/openfold/data/templates.py", line 88, in <module>
"template_domain_names": np.object,
File "/opt/conda/envs/openfold/lib/python3.9/site-packages/numpy/__init__.py", line 319, in __getattr__
raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
The installed NumPy version is too new and no longer provides the np.object attribute, which was removed in NumPy 1.24. Downgrade to an earlier release; 1.22.3 is used here:
bash
conda list numpy
# numpy-base is currently at 1.25.2
# packages in environment at /opt/conda/envs/openfold:
#
# Name Version Build Channel
numpy 1.21.2 pypi_0 pypi
numpy-base 1.25.2 py39hb5e798b_0 defaults
# Downgrade to 1.22.3
conda install -c anaconda numpy-base==1.22.3 # fixes the np.object bug while avoiding a conflict with scipy
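The version rule can be expressed as a small check. This sketch relies on the alias having been removed in NumPy 1.24, so any release below that still defines it; has_np_object_alias is a hypothetical helper that works on version strings and does not import NumPy.

```python
def has_np_object_alias(version):
    """Return True if this NumPy version string still defines the np.object alias.

    The alias was removed in NumPy 1.24, so any (major, minor) below (1, 24) qualifies.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 24)


# Versions that appear in the conda listing and the install command above.
for v in ("1.21.2", "1.22.3", "1.25.2"):
    print(v, has_np_object_alias(v))
```

In a live environment the string to pass in is numpy.__version__.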
Alternatively, edit the source files openfold/data/templates.py and openfold/data/data_pipeline.py and replace np.object with object. Search the whole project; there are 2 occurrences to change:
python
TEMPLATE_FEATURES = {
"template_aatype": np.int64,
"template_all_atom_mask": np.float32,
"template_all_atom_positions": np.float32,
"template_domain_names": np.object, # replace with object
"template_sequence": np.object, # replace with object
"template_sum_probs": np.float32,
}
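The replacement can also be scripted. The following is a minimal sketch of a regex-based rewrite; replace_np_object is a hypothetical helper, and the pattern is deliberately written so that np.object_ (a valid NumPy type) is left alone.

```python
import re

# \b stops the match before "_" so the still-valid np.object_ is left untouched.
_NP_OBJECT = re.compile(r"\bnp\.object\b")


def replace_np_object(source):
    """Return source with every np.object alias replaced by the builtin object."""
    return _NP_OBJECT.sub("object", source)


# First line mirrors the dictionary entry above; second line must stay unchanged.
before = '"template_domain_names": np.object,\n"other": np.object_,\n'
print(replace_np_object(before))
```

Applying it to a file means reading the text, passing it through replace_np_object, and writing the result back; review the diff before committing.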
References for this bug:
- StackOverflow - module 'numpy' has no attribute 'object' [closed]
- On scipy and numpy compatibility, see: Toolchain Roadmap