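A Walkthrough of the AlphaFold Experimental Colab Notebook (ipynb)

This post walks through the code flow that AlphaFold uses for protein structure prediction, with detailed annotations on the code.

Glossary

Homologous structures: structures in different species that resemble one another in form, structure, or function because the species share a common ancestor, although they may differ as a result of adaptation to different environments.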

BFD (Big Fantastic Database): a large protein sequence database. It contains 65,983,866 protein families, each represented as an MSA and an HMM, covering 2,204,359,010 protein sequences.

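pLDDT (predicted Local Distance Difference Test): a per-residue confidence score, on a scale of 0 to 100, that structure prediction models such as AlphaFold use to estimate the quality of their predictions. A residue with pLDDT >= 70 is generally considered likely to be modelled correctly; with pLDDT < 70, the predicted structure at that residue may be unreliable or wrong.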
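The notebook's visualization code splits the pLDDT scale at 50, 70, and 90. A minimal sketch of that lookup, assuming scores on the 0 to 100 scale; the helper name `plddt_band` and the label strings are illustrative and not part of AlphaFold. Exact boundary values are assigned to the upper band here, which is an assumption of this sketch (the notebook's own loop stops at the first matching band, so it puts an exact boundary in the lower one).

```python
def plddt_band(plddt):
    """Return a confidence label for one per-residue pLDDT score (hypothetical helper)."""
    if not 0 <= plddt <= 100:
        raise ValueError(f'pLDDT must be within [0, 100], got {plddt}')
    if plddt < 50:
        return 'very low'
    if plddt < 70:
        return 'low'
    if plddt < 90:
        return 'confident'
    return 'very high'


# Label a few example scores.
labels = [plddt_band(score) for score in (42.0, 69.9, 70.0, 93.5)]
print(labels)
```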

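Changes in this experimental version compared with AlphaFold 2.3.2

This Colab notebook uses no templates (homologous structures) and only a selected portion of the BFD database. We have validated these changes on several thousand recent PDB structures. While accuracy is nearly identical to the full AlphaFold system on many targets, a small fraction of targets show a large drop in accuracy because of the smaller MSA and the lack of templates. For the best reliability, we recommend using the full open-source AlphaFold or the AlphaFold Protein Structure Database.

Compared with a local AlphaFold installation, this Colab notebook has slightly lower average accuracy for multimers. To get full multimer accuracy, running AlphaFold locally is strongly recommended. In addition, AlphaFold-Multimer needs to search for an MSA for each sequence in the complex, so it is noticeably slower. If your notebook times out because of the slow multimer MSA search, consider using Colab Pro or running AlphaFold locally.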
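The notebook code limits this cost by searching once per unique sequence, and it only adds the extra UniProt search when the complex contains more than one distinct sequence. A small sketch of that bookkeeping; `plan_msa_searches` and its result keys are hypothetical names used only for illustration.

```python
def plan_msa_searches(sequences):
    """Summarize the MSA search work for a list of chain sequences (illustrative)."""
    unique = sorted(set(sequences))
    return {
        'num_chains': len(sequences),
        # Identical chains share one search.
        'num_searches': len(unique),
        # More than one distinct sequence means the complex is a heteromer.
        'is_heteromer': len(unique) > 1,
    }


homotrimer = plan_msa_searches(['MKTAYIAK', 'MKTAYIAK', 'MKTAYIAK'])
heterodimer = plan_msa_searches(['MKTAYIAK', 'GSHMLEDP'])
print(homotrimer)
print(heterodimer)
```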

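Please note that this Colab notebook is provided for theoretical modelling only and should be used with caution.

The PAE file format has been updated to match AFDB. See the AFDB FAQ for a description of the new format.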
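As a rough illustration of reading such a file: the sketch below assumes the layout is a JSON list holding one object with the keys `predicted_aligned_error` (a square matrix) and `max_predicted_aligned_error`. That layout is my understanding of what the notebook writes and should be checked against the AFDB FAQ; the values and the `load_pae` helper are made up.

```python
import json

# A tiny 3-residue example in the assumed layout (values are invented).
pae_json_text = json.dumps([{
    'predicted_aligned_error': [[0.2, 4.1, 9.8],
                                [3.9, 0.3, 7.5],
                                [9.1, 7.2, 0.2]],
    'max_predicted_aligned_error': 31.75,
}])


def load_pae(text):
    """Parse PAE JSON text and return (matrix, max_pae); hypothetical helper."""
    entry = json.loads(text)[0]
    matrix = entry['predicted_aligned_error']
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError('The PAE matrix must be square.')
    return matrix, entry['max_predicted_aligned_error']


matrix, max_pae = load_pae(pae_json_text)
print(len(matrix), max_pae)
```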

Setup

Start by running the two cells below to set up AlphaFold and all the required software.

import os
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '4.0'

#@title 1. Install third-party software

#@markdown Please execute this cell by pressing the _Play_ button
#@markdown on the left to download and import third-party software
#@markdown in this Colab notebook. (See the [acknowledgements](https://github.com/deepmind/alphafold/#acknowledgements) in our readme.)

#@markdown **Note**: This installs the software on the Colab
#@markdown notebook in the cloud and not on your computer.

from IPython.utils import io
import os
import subprocess
import tqdm.notebook

# Progress bar format
TQDM_BAR_FORMAT = '{l_bar}{bar}| {n_fmt}/{total_fmt} [elapsed: {elapsed} remaining: {remaining}]'

try:
  # Set up the progress bar
  with tqdm.notebook.tqdm(total=100, bar_format=TQDM_BAR_FORMAT) as pbar:
    with io.capture_output() as captured:
      # Uninstall default Colab version of TF.
      %shell pip uninstall -y tensorflow keras

      # Install HMMER, used to search sequence databases, find sequence homologs, and build sequence alignments
      %shell sudo apt install --quiet --yes hmmer
      pbar.update(6)

      # Install py3Dmol for molecular visualization; it supports several molecular formats, including MMTF (Macromolecular Transmission Format) and SDF (Structure Data File)
      %shell pip install py3dmol
      pbar.update(2)

      # Install OpenMM and pdbfixer.
      %shell rm -rf /opt/conda
      %shell wget -q -P /tmp \
        https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
          && bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda \
          && rm /tmp/Miniconda3-latest-Linux-x86_64.sh
      pbar.update(9)

      PATH=%env PATH
      %env PATH=/opt/conda/bin:{PATH}
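      # Install OpenMM, a high-performance toolkit for molecular dynamics simulation.
      # Install PDBFixer, which fixes problems commonly found in PDB files before simulation:
      # - Missing hydrogens: structures from X-ray crystallography usually lack hydrogen atoms, and PDBFixer can add them.
      # - Missing heavy atoms in flexible regions, such as atoms at the ends of side chains.
      # - Nonstandard residues: residues added for crystallography that are not wanted in the simulation; PDBFixer can identify and replace them.
      # - Extra molecules: the file may contain molecules added for experimental purposes, such as salts and ligands.
      # - Multiple chains: the crystallographic unit cell may contain several chains when only one is to be simulated; PDBFixer helps select and keep the desired chain.
      # - Multiple ions: PDBFixer can also handle the various ions that may be present in the file.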
      %shell conda install -qy conda==24.1.2 \
          && conda install -qy -c conda-forge \
            python=3.10 \
            openmm=8.0.0 \
            pdbfixer
      pbar.update(80)

      # Create a ramdisk to store a database chunk to make Jackhmmer run fast.
      %shell sudo mkdir -m 777 --parents /tmp/ramdisk
      %shell sudo mount -t tmpfs -o size=9G ramdisk /tmp/ramdisk
      pbar.update(2)

      %shell wget -q -P /content \
        https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
      pbar.update(1)
except subprocess.CalledProcessError:
  print(captured)
  raise

executed_cells = set([1])
# Git URL of the AlphaFold source code
GIT_REPO = 'https://github.com/deepmind/alphafold'

# Location of the model parameters to download
SOURCE_URL = 'https://storage.googleapis.com/alphafold/alphafold_params_colab_2022-12-06.tar'
PARAMS_DIR = './alphafold/data/params'
PARAMS_PATH = os.path.join(PARAMS_DIR, os.path.basename(SOURCE_URL))

try:
  # Set up the progress bar
  with tqdm.notebook.tqdm(total=100, bar_format=TQDM_BAR_FORMAT) as pbar:
    with io.capture_output() as captured:
      %shell rm -rf alphafold
      # Clone the latest AlphaFold code
      %shell git clone --branch main {GIT_REPO} alphafold
      pbar.update(8)
      # Install the dependencies of the source code
      %shell pip3 install -r ./alphafold/requirements.txt
      # Run setup.py to install only AlphaFold (no dependencies)
      %shell pip3 install --no-dependencies ./alphafold
      %shell pip3 install pyopenssl==22.0.0
      pbar.update(10)

      # Make sure stereo_chemical_props.txt is in all locations where it could be searched for.
      %shell mkdir -p /content/alphafold/alphafold/common
      %shell cp -f /content/stereo_chemical_props.txt /content/alphafold/alphafold/common
      %shell mkdir -p /opt/conda/lib/python3.10/site-packages/alphafold/common/
      %shell cp -f /content/stereo_chemical_props.txt /opt/conda/lib/python3.10/site-packages/alphafold/common/

      # Download the parameters
      %shell mkdir --parents "{PARAMS_DIR}"
      %shell wget -O "{PARAMS_PATH}" "{SOURCE_URL}"
      pbar.update(27)

      %shell tar --extract --verbose --file="{PARAMS_PATH}" \
        --directory="{PARAMS_DIR}" --preserve-permissions
      %shell rm "{PARAMS_PATH}"
      pbar.update(55)
except subprocess.CalledProcessError:
  print(captured)
  raise

import jax
if jax.local_devices()[0].platform == 'tpu':
  raise RuntimeError('Colab TPU runtime not supported. Change it to GPU via Runtime -> Change Runtime Type -> Hardware accelerator -> GPU.')
elif jax.local_devices()[0].platform == 'cpu':
  raise RuntimeError('Colab CPU runtime not supported. Change it to GPU via Runtime -> Change Runtime Type -> Hardware accelerator -> GPU.')
else:
  print(f'Running with {jax.local_devices()[0].device_kind} GPU')

# Make sure everything we need is on the path.
import sys
sys.path.append('/opt/conda/lib/python3.10/site-packages')
sys.path.append('/content/alphafold')

executed_cells.add(2)
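
Each cell records itself in `executed_cells`, and the later cells call `notebook_utils.check_cell_execution_order` before doing any work. The sketch below shows one way such a guard could work. It is an illustrative reimplementation written for this post, not the library's code, and the error message is made up.

```python
def check_cell_execution_order(cells_ran, cell_number):
    """Raise if any cell numbered below cell_number has not run yet (sketch)."""
    missing = set(range(1, cell_number)) - set(cells_ran)
    if missing:
        missing_str = ', '.join(str(n) for n in sorted(missing))
        raise ValueError(
            f'Cell {cell_number} ran out of order. '
            f'Run these cells first: {missing_str}.')


executed_cells = {1}

# Cell 3 is attempted before cell 2 has run, so the guard complains.
try:
    check_cell_execution_order(executed_cells, 3)
    message = None
except ValueError as err:
    message = str(err)
print(message)

# After cell 2 registers itself, the same check passes silently.
executed_cells.add(2)
check_cell_execution_order(executed_cells, 3)
```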

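Making a prediction

Paste your protein sequence(s) into the text boxes below, then run the cell.

Note that searching the databases and running the actual prediction can take some time, from a few minutes to several hours, depending on the length of the protein and the GPU that Colab assigns.

1) Validate the amino acid sequence(s) to fold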

# If you enter a single sequence, the monomer model is used.
# If you enter multiple sequences, the multimer model is used.

from alphafold.notebooks import notebook_utils
# Track cell execution to ensure correct order.
notebook_utils.check_cell_execution_order(executed_cells, 3)

import enum

@enum.unique
class ModelType(enum.Enum):
  MONOMER = 0
  MULTIMER = 1
# One-letter amino acid codes used in the example sequence:
# A  alanine
# E  glutamic acid
# H  histidine
# K  lysine
# P  proline

sequence_1 = 'MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH'  #@param {type:"string"}
sequence_2 = ''  #@param {type:"string"}
sequence_3 = ''  #@param {type:"string"}
sequence_4 = ''  #@param {type:"string"}
sequence_5 = ''  #@param {type:"string"}
sequence_6 = ''  #@param {type:"string"}
sequence_7 = ''  #@param {type:"string"}
sequence_8 = ''  #@param {type:"string"}
sequence_9 = ''  #@param {type:"string"}
sequence_10 = ''  #@param {type:"string"}
sequence_11 = ''  #@param {type:"string"}
sequence_12 = ''  #@param {type:"string"}
sequence_13 = ''  #@param {type:"string"}
sequence_14 = ''  #@param {type:"string"}
sequence_15 = ''  #@param {type:"string"}
sequence_16 = ''  #@param {type:"string"}
sequence_17 = ''  #@param {type:"string"}
sequence_18 = ''  #@param {type:"string"}
sequence_19 = ''  #@param {type:"string"}
sequence_20 = ''  #@param {type:"string"}

input_sequences = (
    sequence_1, sequence_2, sequence_3, sequence_4, sequence_5, 
    sequence_6, sequence_7, sequence_8, sequence_9, sequence_10,
    sequence_11, sequence_12, sequence_13, sequence_14, sequence_15, 
    sequence_16, sequence_17, sequence_18, sequence_19, sequence_20)

MIN_PER_SEQUENCE_LENGTH = 16
MAX_PER_SEQUENCE_LENGTH = 4000
MAX_MONOMER_MODEL_LENGTH = 2500
MAX_LENGTH = 4000
MAX_VALIDATED_LENGTH = 3000

#@markdown Due to improved memory efficiency the multimer model has a maximum
#@markdown limit of 4000 residues, while the monomer model has a limit of 2500
#@markdown residues.

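# This option enables the multimer model for a single sequence. For proteins that are monomeric in their native form, or for very large single chains, the multimer model may give better accuracy and memory efficiency.
# Thanks to improved memory efficiency, the multimer model has a limit of 4000 residues, while the monomer model has a limit of 2500.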
use_multimer_model_for_monomers = False #@param {type:"boolean"}

# Validate the input sequences
sequences = notebook_utils.clean_and_validate_input_sequences(
    input_sequences=input_sequences,
    min_sequence_length=MIN_PER_SEQUENCE_LENGTH,
    max_sequence_length=MAX_PER_SEQUENCE_LENGTH)

if len(sequences) == 1:
  if use_multimer_model_for_monomers:
    print('Using the multimer model for single-chain, as requested.')
    model_type_to_use = ModelType.MULTIMER
  else:
    print('Using the single-chain model.')
    model_type_to_use = ModelType.MONOMER
else:
  print(f'Using the multimer model with {len(sequences)} sequences.')
  model_type_to_use = ModelType.MULTIMER

# Check that the total length does not exceed the limit
total_sequence_length = sum([len(seq) for seq in sequences])
if total_sequence_length > MAX_LENGTH:
  raise ValueError('The total sequence length is too long: '
                   f'{total_sequence_length}, while the maximum is '
                   f'{MAX_LENGTH}.')

# Check that the monomer model's length limit is not exceeded
if model_type_to_use == ModelType.MONOMER:
  if len(sequences[0]) > MAX_MONOMER_MODEL_LENGTH:
    raise ValueError(
        f'Input sequence is too long: {len(sequences[0])} amino acids, while '
        f'the maximum for the monomer model is {MAX_MONOMER_MODEL_LENGTH}. You may '
        'be able to run this sequence with the multimer model by selecting the '
        'use_multimer_model_for_monomers checkbox above.')
    
if total_sequence_length > MAX_VALIDATED_LENGTH:
  print('WARNING: The accuracy of the system has not been fully validated '
        'above 3000 residues, and you may experience long running times or '
        f'run out of memory. Total sequence length is {total_sequence_length} '
        'residues.')

executed_cells.add(3)
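
The cell above delegates the cleaning to `notebook_utils.clean_and_validate_input_sequences`. To make the checks concrete, here is a minimal standalone sketch of what such validation might do: strip whitespace, uppercase, reject unknown residue letters, and enforce the length bounds. The function `clean_and_validate` and its error messages are assumptions written for this post, not the library implementation.

```python
AMINO_ACIDS = set('ACDEFGHIKLMNPQRSTVWY')  # the 20 standard residues


def clean_and_validate(input_sequences, min_length=16, max_length=4000):
    """Return the cleaned, non-empty sequences or raise ValueError (sketch)."""
    cleaned = []
    for raw in input_sequences:
        # Drop all whitespace and normalize to upper case.
        sequence = ''.join(raw.split()).upper()
        if not sequence:
            continue  # empty text boxes are simply skipped
        unknown = set(sequence) - AMINO_ACIDS
        if unknown:
            raise ValueError(f'Non-amino-acid letters found: {sorted(unknown)}')
        if not min_length <= len(sequence) <= max_length:
            raise ValueError(
                f'Sequence length {len(sequence)} is outside '
                f'[{min_length}, {max_length}].')
        cleaned.append(sequence)
    if not cleaned:
        raise ValueError('No input amino acid sequence provided.')
    return cleaned


sequences = clean_and_validate(['maahk gaehh hkaae hhe', '', '   '])
print(sequences)
```

With a single surviving sequence, the notebook would go on to pick the monomer model unless `use_multimer_model_for_monomers` is set.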

2) Search against genetic databases

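# After this cell runs, you will see statistics about the multiple sequence alignment (MSA) that AlphaFold will use.
# In particular, you will see how well each residue is covered by similar sequences in the MSA.

# Track cell execution to ensure correct order.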
notebook_utils.check_cell_execution_order(executed_cells, 4)

# --- Python imports ---
import collections
import copy
from concurrent import futures
import json
import random
import shutil

from urllib import request
from google.colab import files
from matplotlib import gridspec
import matplotlib.pyplot as plt
import numpy as np
import py3Dmol

from alphafold.model import model
from alphafold.model import config
from alphafold.model import data

from alphafold.data import feature_processing
from alphafold.data import msa_pairing
from alphafold.data import pipeline
from alphafold.data import pipeline_multimer
from alphafold.data.tools import jackhmmer

from alphafold.common import confidence
from alphafold.common import protein

from alphafold.relax import relax
from alphafold.relax import utils

from IPython import display
from ipywidgets import GridspecLayout
from ipywidgets import Output

# Confidence color bands used for visualization
PLDDT_BANDS = [(0, 50, '#FF7D45'),
               (50, 70, '#FFDB13'),
               (70, 90, '#65CBF3'),
               (90, 100, '#0053D6')]

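# --- Find the closest source ---
# UniRef90: UniRef is a protein sequence database maintained by UniProt that provides non-redundant sets of protein sequences.
# UniRef90 is one such set, built by clustering sequences that share at least 90% sequence identity.
# FASTA is a text file format widely used in bioinformatics for storing one or more sequences.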
test_url_pattern = 'https://storage.googleapis.com/alphafold-colab{:s}/latest/uniref90_2022_01.fasta.1'
ex = futures.ThreadPoolExecutor(3)

# Download the test chunk from one source
def fetch(source):
  request.urlretrieve(test_url_pattern.format(source))
  return source
  
fs = [ex.submit(fetch, source) for source in ['', '-europe', '-asia']]
source = None
for f in futures.as_completed(fs):
  source = f.result()
  ex.shutdown()
  break

JACKHMMER_BINARY_PATH = '/usr/bin/jackhmmer'
DB_ROOT_PATH = f'https://storage.googleapis.com/alphafold-colab{source}/latest/'
# z_value is the number of sequences in a database.
MSA_DATABASES = [
    {'db_name': 'uniref90',
     'db_path': f'{DB_ROOT_PATH}uniref90_2022_01.fasta',
     'num_streamed_chunks': 62,
     'z_value': 144_113_457},
    {'db_name': 'smallbfd',
     'db_path': f'{DB_ROOT_PATH}bfd-first_non_consensus_sequences.fasta',
     'num_streamed_chunks': 17,
     'z_value': 65_984_053},
    {'db_name': 'mgnify',
     'db_path': f'{DB_ROOT_PATH}mgy_clusters_2022_05.fasta',
     'num_streamed_chunks': 120,
     'z_value': 623_796_864},
]

# Search UniProt and construct the all_seq features only for heteromers, not homomers.
if model_type_to_use == ModelType.MULTIMER and len(set(sequences)) > 1:
  MSA_DATABASES.extend([
      # Swiss-Prot and TrEMBL are concatenated together as UniProt.
      {'db_name': 'uniprot',
       'db_path': f'{DB_ROOT_PATH}uniprot_2021_04.fasta',
       'num_streamed_chunks': 101,
       'z_value': 225_013_025 + 565_928},
  ])

TOTAL_JACKHMMER_CHUNKS = sum([cfg['num_streamed_chunks'] for cfg in MSA_DATABASES])

MAX_HITS = {
    'uniref90': 10_000,
    'smallbfd': 5_000,
    'mgnify': 501,
    'uniprot': 50_000,
}


def get_msa(sequences):
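  """Searches for MSA for given sequences using chunked Jackhmmer search.

  Args:
    sequences: A list of sequences to search against all databases.

  Returns:
    A dictionary mapping unique sequences to dictionaries mapping each database
    to a list of results, one for each chunk of the database.
  """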
  sequence_to_fasta_path = {}
  # Deduplicate so that no redundant work is done for multiple copies of the same chain in homomers.
  for sequence_index, sequence in enumerate(sorted(set(sequences)), 1):
    fasta_path = f'target_{sequence_index:02d}.fasta'
    with open(fasta_path, 'wt') as f:
      f.write(f'>query\n{sequence}')
    sequence_to_fasta_path[sequence] = fasta_path

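  # Search the genetic databases chunk by chunk (the databases do not fit on the Colab disk).
  # This code runs jackhmmer for a set of protein sequences against several databases and tracks the progress.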
  raw_msa_results = {sequence: {} for sequence in sequence_to_fasta_path.keys()}
  print('\nGetting MSA for all sequences')
  with tqdm.notebook.tqdm(total=TOTAL_JACKHMMER_CHUNKS, bar_format=TQDM_BAR_FORMAT) as pbar:
    # Callback that advances the progress bar
    def jackhmmer_chunk_callback(i):
      pbar.update(n=1)

    for db_config in MSA_DATABASES:
      db_name = db_config['db_name']
      pbar.set_description(f'Searching {db_name}')
      # Initialize the jackhmmer runner
      jackhmmer_runner = jackhmmer.Jackhmmer(
          binary_path=JACKHMMER_BINARY_PATH,
          database_path=db_config['db_path'],
          get_tblout=True,
          num_streamed_chunks=db_config['num_streamed_chunks'],
          streaming_callback=jackhmmer_chunk_callback,
          z_value=db_config['z_value'])
          
      # Query all unique sequences against each chunk of the database, so that a chunk is not fetched again for every unique sequence.
      results = jackhmmer_runner.query_multiple(list(sequence_to_fasta_path.values()))
      
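      # Store each query sequence's result in the raw_msa_results dictionary: the outer key is the sequence, the inner key is the database name, and the value is the query result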
      for sequence, result_for_sequence in zip(sequence_to_fasta_path.keys(), results):
        raw_msa_results[sequence][db_name] = result_for_sequence

  return raw_msa_results

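# Generate the features that the AlphaFold model needs for a set of protein sequences (chains or domains).
# Initialize a dictionary that stores the features of each protein chain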
features_for_chain = {}

# Call get_msa to obtain the raw multiple sequence alignment (MSA) results
raw_msa_results_for_sequence = get_msa(sequences)

for sequence_index, sequence in enumerate(sequences, start=1):
  raw_msa_results = copy.deepcopy(raw_msa_results_for_sequence[sequence])

  # Extract the MSAs from the Stockholm files.
  # NB: deduplication happens later in pipeline.make_msa_features.
  single_chain_msas = []
  uniprot_msa = None
  # Iterate over the MSA results from each database
  for db_name, db_results in raw_msa_results.items():
    # Merge the chunked MSA results with merge_chunked_msa, capping the number of hits (set in the MAX_HITS dictionary)
    merged_msa = notebook_utils.merge_chunked_msa(
        results=db_results, max_hits=MAX_HITS.get(db_name))
    # If the merged MSA contains sequences and the database is not 'uniprot', add it to single_chain_msas and print the number of unique sequences
    if merged_msa.sequences and db_name != 'uniprot':
      single_chain_msas.append(merged_msa)
      msa_size = len(set(merged_msa.sequences))
      print(f'{msa_size} unique sequences found in {db_name} for sequence {sequence_index}')
    elif merged_msa.sequences and db_name == 'uniprot':
      # For the uniprot database, store the result in uniprot_msa instead
      uniprot_msa = merged_msa

  # Show detailed information about the MSA
  notebook_utils.show_msa_info(single_chain_msas=single_chain_msas, sequence_index=sequence_index)

  # Turn the raw data into model features.
  # Initialize an empty dictionary to store the features
  feature_dict = {}
  
  # Generate sequence features for the query with make_sequence_features and add them to feature_dict
  feature_dict.update(pipeline.make_sequence_features(
      sequence=sequence, description='query', num_res=len(sequence)))

  # Generate MSA features from the MSAs in single_chain_msas with make_msa_features
  feature_dict.update(pipeline.make_msa_features(msas=single_chain_msas))
  
  # The AlphaFold Colab notebook does not use templates, so add empty placeholder features with notebook_utils.empty_placeholder_template_features
  feature_dict.update(notebook_utils.empty_placeholder_template_features(
      num_templates=0, num_res=len(sequence)))

  # Construct the all_seq features only for heteromers, not homomers.
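  # Handle the multimer features:
  # only when the multimer model is used and the input contains more than one unique sequence (that is, it is not a homomer)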
  if model_type_to_use == ModelType.MULTIMER and len(set(sequences)) > 1:
    # Define valid_feats, the features that are used to build the "all_seq" features
    valid_feats = msa_pairing.MSA_FEATURES + (
        'msa_species_identifiers',
    )
    # Generate features for uniprot_msa with make_msa_features, keeping only those in valid_feats,
    # and add them to feature_dict with the suffix _all_seq appended to each key.
    all_seq_features = {
        f'{k}_all_seq': v for k, v in pipeline.make_msa_features([uniprot_msa]).items()
        if k in valid_feats}
    feature_dict.update(all_seq_features)

  # Store feature_dict in features_for_chain, keyed by the chain's PDB chain ID (taken from protein.PDB_CHAIN_IDS)
  features_for_chain[protein.PDB_CHAIN_IDS[sequence_index - 1]] = feature_dict


# Post-process the features according to the model type
if model_type_to_use == ModelType.MONOMER:
  # Monomer: take the features of the single chain directly as np_example
  np_example = features_for_chain[protein.PDB_CHAIN_IDS[0]]

elif model_type_to_use == ModelType.MULTIMER:
  # Dictionary that stores the features of all chains
  all_chain_features = {}
  
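  # Convert each chain's features in features_for_chain with convert_monomer_features and store the result in all_chain_features.
  # This function presumably converts monomer features into the format the multimer model expects.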
  for chain_id, chain_features in features_for_chain.items():
    all_chain_features[chain_id] = pipeline_multimer.convert_monomer_features(
        chain_features, chain_id)

  # Add the features related to multimer assembly to all_chain_features
  all_chain_features = pipeline_multimer.add_assembly_features(all_chain_features)

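  # Pair and merge the features in all_chain_features with pair_and_merge to produce np_example.
  # This function presumably handles inter-chain information and other features specific to multimer structures.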
  np_example = feature_processing.pair_and_merge(
      all_chain_features=all_chain_features)

  # Call pad_msa so that the MSA has at least min_num_seq sequences (512 here), which avoids a zero-sized extra_msa.
  # The padding guarantees the minimum MSA size that the model requires.
  np_example = pipeline_multimer.pad_msa(np_example, min_num_seq=512)

# Record that cell 4 has been executed
executed_cells.add(4)
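
The "find the closest source" step in the cell above races three mirrors and keeps whichever finishes first. The pattern can be tried in isolation with simulated downloads. In this sketch the latencies are invented, `fetch` merely sleeps instead of downloading, and `pick_fastest` is a hypothetical helper; unlike the notebook, it calls `shutdown(wait=False)` so the caller does not wait for the slower mirrors.

```python
import time
from concurrent import futures

# Simulated download times in seconds for each mirror suffix (made-up values).
SIMULATED_LATENCY = {'': 0.30, '-europe': 0.05, '-asia': 0.20}


def fetch(source):
    """Stand-in for downloading a small test file from one mirror."""
    time.sleep(SIMULATED_LATENCY[source])
    return source


def pick_fastest(sources):
    """Start all downloads at once and return the source that finishes first."""
    executor = futures.ThreadPoolExecutor(max_workers=len(sources))
    pending = [executor.submit(fetch, source) for source in sources]
    try:
        # as_completed yields futures in the order in which they finish.
        for finished in futures.as_completed(pending):
            return finished.result()
    finally:
        # Return immediately instead of blocking on the slower downloads.
        executor.shutdown(wait=False)


fastest = pick_fastest(['', '-europe', '-asia'])
print(repr(fastest))
```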

3) Run AlphaFold and download the prediction

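# After this cell runs, the prediction is downloaded to your computer automatically.
# If you run into problems with the relaxation stage, you can disable it below. The prediction might then contain distracting small stereochemical violations.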

run_relax = True  #@param {type:"boolean"}

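# Relaxation is faster with a GPU, but we have found it to be less stable.
# You may want to enable the GPU for higher performance, but if relaxation does not converge we suggest reverting to relaxation without the GPU.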
relax_use_gpu = False  #@param {type:"boolean"}

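# The multimer model keeps recycling until the predictions stop changing, up to the limit set here.
# For higher accuracy, at the cost of longer inference time, set this to 20.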


multimer_model_max_num_recycles = 3  #@param {type:"integer"}

# Track cell execution to ensure correct order.
notebook_utils.check_cell_execution_order(executed_cells, 5)

# --- Run the model ---
# Choose the model names according to the model type
if model_type_to_use == ModelType.MONOMER:
  # Run model_2_ptm in addition to the models in the monomer preset
  model_names = config.MODEL_PRESETS['monomer'] + ('model_2_ptm',)
elif model_type_to_use == ModelType.MULTIMER:
  model_names = config.MODEL_PRESETS['multimer']

output_dir = 'prediction'
os.makedirs(output_dir, exist_ok=True)

plddts = {}
ranking_confidences = {}
pae_outputs = {}
unrelaxed_proteins = {}

# Add a progress bar
with tqdm.notebook.tqdm(total=len(model_names) + 1, bar_format=TQDM_BAR_FORMAT) as pbar:
  for model_name in model_names:
    # Set the progress bar description
    pbar.set_description(f'Running {model_name}')

    # Get the configuration for this model and assign it to cfg
    cfg = config.model_config(model_name)

    if model_type_to_use == ModelType.MONOMER:
      # For monomer models, set the number of ensembles at evaluation to 1
      cfg.data.eval.num_ensemble = 1
    elif model_type_to_use == ModelType.MULTIMER:
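      # For multimer models, set the number of ensembles at evaluation to 1; the maximum number of recycles (num_recycle) and the early-stop tolerance (recycle_early_stop_tolerance) are set just below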
      cfg.model.num_ensemble_eval = 1

    if model_type_to_use == ModelType.MULTIMER:
      cfg.model.num_recycle = multimer_model_max_num_recycles
      cfg.model.recycle_early_stop_tolerance = 0.5

    # Load the model parameters with get_model_haiku_params
    params = data.get_model_haiku_params(model_name, './alphafold/data')
    
    # Create the model_runner object used to run the model
    model_runner = model.RunModel(cfg, params)

    # Process the features:
    # process_features takes the input features np_example and returns the processed feature dictionary processed_feature_dict
    processed_feature_dict = model_runner.process_features(np_example, random_seed=0)

    # Run the prediction with model_runner.predict, passing the processed feature dictionary and a random seed
    prediction = model_runner.predict(processed_feature_dict, random_seed=random.randrange(sys.maxsize))

    # Compute the mean pLDDT
    mean_plddt = prediction['plddt'].mean()

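    # Handle the prediction results:
    # predicted_aligned_error (PAE): the model's estimate of the position error of one residue when the predicted and true structures are aligned on another residue
    # plddt (predicted Local Distance Difference Test): the model's per-residue estimate of local structural accuracy
    # ranking_confidence: the confidence score used to rank the models' predictions
    # multimer_model_max_num_recycles: the maximum number of recycles for the multimer model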
    if model_type_to_use == ModelType.MONOMER:
      if 'predicted_aligned_error' in prediction:
        # For a monomer pTM model the prediction includes predicted_aligned_error and max_predicted_aligned_error; store them in pae_outputs
        pae_outputs[model_name] = (prediction['predicted_aligned_error'],
                                   prediction['max_predicted_aligned_error'])
      else:
        # Monomer models are sorted by mean pLDDT. Do not put monomer pTM models here as they
        # should never get selected.
        # Store ranking_confidence and plddt in their dictionaries
        ranking_confidences[model_name] = prediction['ranking_confidence']
        plddts[model_name] = prediction['plddt']
    elif model_type_to_use == ModelType.MULTIMER:
      # Multimer models are sorted by pTM+ipTM.
      ranking_confidences[model_name] = prediction['ranking_confidence']
      plddts[model_name] = prediction['plddt']
      pae_outputs[model_name] = (prediction['predicted_aligned_error'],
                                 prediction['max_predicted_aligned_error'])

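    # Set the b-factors to the per-residue plddt.
    # B-factors, also called temperature factors or atomic displacement parameters, describe the vibration or positional uncertainty of atoms in a crystal structure.
    # AlphaFold reuses this field to report prediction confidence: multiplying plddt by final_atom_mask assigns each residue's plddt to its atoms as their B-factor.
    # final_atom_mask marks which atoms are present in the final predicted structure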
    final_atom_mask = prediction['structure_module']['final_atom_mask']
    b_factors = prediction['plddt'][:, None] * final_atom_mask

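    # Build a protein structure from the prediction.
    # from_prediction creates a protein object from the feature dictionary (processed_feature_dict) and the prediction (prediction).
    # The B-factors are passed in so that the resulting structure carries them.
    # For monomer models remove_leading_feature_dimension is set to True, presumably because the monomer features have a different shape from the multimer ones and need special handling here.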
    unrelaxed_protein = protein.from_prediction(
        processed_feature_dict,
        prediction,
        b_factors=b_factors,
        remove_leading_feature_dimension=(
            model_type_to_use == ModelType.MONOMER))
            
    # Save the result in unrelaxed_proteins, keyed by model name
    unrelaxed_proteins[model_name] = unrelaxed_protein

    # Delete unused outputs to save memory.
    del model_runner
    del params
    del prediction
    pbar.update(n=1)

  # --- AMBER relax the best model ---

  # Find the best model by ranking confidence (mean pLDDT for monomers, pTM+ipTM for multimers)
  best_model_name = max(ranking_confidences.keys(), key=lambda x: ranking_confidences[x])

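  # If run_relax is true, relax the best model with the AmberRelaxation class.
  # Relaxation is a refinement step that lowers the energy of the model and improves its geometry; the result is a PDB string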
  if run_relax:
    pbar.set_description(f'AMBER relaxation')
    amber_relaxer = relax.AmberRelaxation(
        max_iterations=0,
        tolerance=2.39,
        stiffness=10.0,
        exclude_residues=[],
        max_outer_iterations=3,
        use_gpu=relax_use_gpu)
    relaxed_pdb, _, _ = amber_relaxer.process(prot=unrelaxed_proteins[best_model_name])
  else:
    print('Warning: Running without the relaxation stage.')
    relaxed_pdb = protein.to_pdb(unrelaxed_proteins[best_model_name])
  pbar.update(n=1)  # Finished AMBER relax.

# Build banded B-factors that indicate the confidence band; there are four bands:
# 0=very low, 1=low, 2=confident, 3=very high
banded_b_factors = []
for plddt in plddts[best_model_name]:
  # Compare each pLDDT value of the best model against PLDDT_BANDS (the predefined list of confidence ranges)
  for idx, (min_val, max_val, _) in enumerate(PLDDT_BANDS):
    if plddt >= min_val and plddt <= max_val:
      banded_b_factors.append(idx)
      break
# For each pLDDT value, the loop above finds the band it falls in and appends that band's index (0 to 3) to banded_b_factors
banded_b_factors = np.array(banded_b_factors)[:, None] * final_atom_mask

# Store the result in to_visualize_pdb for visualization
to_visualize_pdb = utils.overwrite_b_factors(relaxed_pdb, banded_b_factors)

# Write the PDB to a file to save the prediction
pred_output_path = os.path.join(output_dir, 'selected_prediction.pdb')
with open(pred_output_path, 'w') as f:
  f.write(relaxed_pdb)


# Protein visualization and confidence display
show_sidechains = True

# plot_plddt_legend draws the legend for the pLDDT confidence levels
def plot_plddt_legend():
  """Plots the legend for pLDDT."""
  # thresh lists the four confidence levels and their pLDDT ranges
  thresh = ['Very low (pLDDT < 50)',
            'Low (70 > pLDDT > 50)',
            'Confident (90 > pLDDT > 70)',
            'Very high (pLDDT > 90)']

  # Take the colors from PLDDT_BANDS
  colors = [x[2] for x in PLDDT_BANDS]

  # Draw an empty bar per color so that each confidence level gets a legend entry; hide the axes and frame so only the legend shows
  plt.figure(figsize=(2, 2))
  for c in colors:
    plt.bar(0, 0, color=c)
  plt.legend(thresh, frameon=False, loc='center', fontsize=20)
  plt.xticks([])
  plt.yticks([])
  ax = plt.gca()
  ax.spines['right'].set_visible(False)
  ax.spines['top'].set_visible(False)
  ax.spines['left'].set_visible(False)
  ax.spines['bottom'].set_visible(False)
  plt.title('Model Confidence', fontsize=20, pad=20)
  return plt

# Visualization for the multimer model
if model_type_to_use == ModelType.MULTIMER:
  # Create a py3Dmol view and add the PDB model as frames
  multichain_view = py3Dmol.view(width=800, height=600)
  multichain_view.addModelsAsFrames(to_visualize_pdb)
  multichain_style = {'cartoon': {'colorscheme': 'chain'}}
  multichain_view.setStyle({'model': -1}, multichain_style)
  multichain_view.zoomTo()
  multichain_view.show()

# Color the structure by per-residue pLDDT
# Define color_map, which maps each pLDDT band index to a color

color_map = {i: bands[2] for i, bands in enumerate(PLDDT_BANDS)}
view = py3Dmol.view(width=800, height=600)
view.addModelsAsFrames(to_visualize_pdb)
# Color the model by B-factor. This assumes the B-factors in to_visualize_pdb have already been replaced with the pLDDT band indices.
style = {'cartoon': {'colorscheme': {'prop': 'b', 'map': color_map}}}
# If show_sidechains is True, show the side chains as well
if show_sidechains:
  style['stick'] = {}
view.setStyle({'model': -1}, style)
view.zoomTo()

# Create a GridspecLayout for a 1x2 grid layout
grid = GridspecLayout(1, 2)
out = Output()
with out:
  view.show()
grid[0, 0] = out

out = Output()
with out:
  plot_plddt_legend().show()
grid[0, 1] = out

display.display(grid)

# Display pLDDT and predicted aligned error (if output by the model).
if pae_outputs:
  num_plots = 2
else:
  num_plots = 1

plt.figure(figsize=[8 * num_plots, 6])
plt.subplot(1, num_plots, 1)
plt.plot(plddts[best_model_name])
plt.title('Predicted LDDT')
plt.xlabel('Residue')
plt.ylabel('pLDDT')

if num_plots == 2:
  plt.subplot(1, 2, 2)
  pae, max_pae = list(pae_outputs.values())[0]
  plt.imshow(pae, vmin=0., vmax=max_pae, cmap='Greens_r')
  plt.colorbar(fraction=0.046, pad=0.04)

  # Display lines at chain boundaries.
  best_unrelaxed_prot = unrelaxed_proteins[best_model_name]
  total_num_res = best_unrelaxed_prot.residue_index.shape[-1]
  chain_ids = best_unrelaxed_prot.chain_index
  for chain_boundary in np.nonzero(chain_ids[:-1] - chain_ids[1:]):
    if chain_boundary.size:
      plt.plot([0, total_num_res], [chain_boundary, chain_boundary], color='red')
      plt.plot([chain_boundary, chain_boundary], [0, total_num_res], color='red')

  plt.title('Predicted Aligned Error')
  plt.xlabel('Scored residue')
  plt.ylabel('Aligned residue')

# Save the predicted aligned error (if it exists).
pae_output_path = os.path.join(output_dir, 'predicted_aligned_error.json')
if pae_outputs:
  # Save predicted aligned error in the same format as the AF EMBL DB.
  pae_data = confidence.pae_json(pae=pae, max_pae=max_pae.item())
  with open(pae_output_path, 'w') as f:
    f.write(pae_data)

# --- Download the predictions ---
shutil.make_archive(base_name='prediction', format='zip', root_dir=output_dir)
files.download(f'{output_dir}.zip')

executed_cells.add(5)
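
The PAE plot in the cell above draws red lines wherever the chain index changes between neighbouring residues, using `np.nonzero(chain_ids[:-1] - chain_ids[1:])`. A plain-Python sketch of the same boundary search, assuming `chain_index` is a flat sequence of integer chain labels; the name `chain_boundaries` and the toy input are illustrative.

```python
def chain_boundaries(chain_index):
    """Return every position i where residue i and residue i + 1 are in different chains."""
    return [i for i in range(len(chain_index) - 1)
            if chain_index[i] != chain_index[i + 1]]


# Toy complex: three residues in chain 0, two in chain 1, one in chain 2.
chain_index = [0, 0, 0, 1, 1, 2]
boundaries = chain_boundaries(chain_index)
print(boundaries)
```

A single-chain prediction gives an empty list, which is why the notebook checks `chain_boundary.size` before plotting.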