Preface
This post walks through installing MolT5 and using it to obtain molecule embeddings (GPU version).
0. Installation
conda activate base
conda install -c conda-forge mamba -y
conda create -n molt5 python=3.9 -y
conda activate molt5
mamba install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install sentencepiece accelerate
mamba install -c conda-forge rdkit -y
pip install "transformers==4.38.2"
# For offline use, download https://huggingface.co/laituan245/molt5-base and place it at ./molt5-base
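If the target machine has no network access, one way to script that download (run it somewhere with connectivity) is huggingface_hub's snapshot_download. A minimal sketch, assuming a reasonably recent huggingface_hub is installed (pip install huggingface_hub):

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="laituan245/molt5-base",  # MolT5-base checkpoint on the Hugging Face Hub
    local_dir="./molt5-base"          # matches the path passed to from_pretrained below
)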
1. Usage steps
1. Import the libraries
import torch
from transformers import T5Tokenizer, T5EncoderModel
2. Get the embedding
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("./molt5-base")
model = T5EncoderModel.from_pretrained("./molt5-base").to(device)
model.eval()
def get_molt5_embedding(
    smiles: str,
    pooling: str = "mean"  # "mean" | "cls"
):
    """
    Returns a 1D torch tensor embedding for a SMILES string.
    """
    inputs = tokenizer(
        smiles,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=512
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)  # last_hidden_state: [1, L, D]
    hidden = outputs.last_hidden_state.squeeze(0)  # [L, D]
    if pooling == "mean":
        emb = hidden.mean(dim=0)  # [D]
    elif pooling == "cls":
        emb = hidden[0]  # T5 has no real CLS token; this is simply the first token
    else:
        raise ValueError("pooling must be 'mean' or 'cls'")
    return emb.cpu()
smiles = "CCOC(=O)C1=CC=CC=C1" # unmapped canonical SMILES
emb = get_molt5_embedding(smiles)
print(emb.shape)
Output: torch.Size([768])
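The helper above embeds one SMILES at a time. When embedding many molecules, batching is usually much faster on a GPU. Below is a minimal sketch of a batched variant (the name get_molt5_embeddings_batch and the batch_size default are my own choices, not from the original); it pads to the longest sequence in each batch and uses the attention mask so padding tokens do not distort the mean pooling.

def get_molt5_embeddings_batch(smiles_list, batch_size=32):
    """Mean-pooled embeddings for a list of SMILES strings, shape [N, D]."""
    all_embs = []
    for i in range(0, len(smiles_list), batch_size):
        batch = smiles_list[i:i + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,        # pad to the longest SMILES in this batch
            truncation=True,
            max_length=512
        ).to(device)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # [B, L, D]
        mask = inputs["attention_mask"].unsqueeze(-1)   # [B, L, 1]
        # Masked mean: exclude padding positions from the average
        emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        all_embs.append(emb.cpu())
    return torch.cat(all_embs, dim=0)

embs = get_molt5_embeddings_batch(["CCO", "c1ccccc1", "CCOC(=O)C1=CC=CC=C1"])
print(embs.shape)  # torch.Size([3, 768])

Note that MolT5 operates on the raw SMILES text, so two different SMILES for the same molecule give different embeddings; the rdkit package installed in step 0 can canonicalize inputs first, e.g. Chem.MolToSmiles(Chem.MolFromSmiles(s)).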
Summary
MolT5 is a language model pretrained on small molecules. With the base checkpoint it produces a 768-dimensional embedding per molecule, which can be used for downstream modeling and analysis.
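As one hypothetical instance of such downstream modeling, the embeddings can serve as fixed feature vectors for a classical classifier. A minimal sketch, assuming scikit-learn is installed, reusing get_molt5_embeddings_batch from above, and with made-up SMILES and activity labels purely for illustration:

from sklearn.linear_model import LogisticRegression

# Hypothetical data: a few SMILES strings with made-up binary activity labels
smiles_list = ["CCO", "c1ccccc1", "CCOC(=O)C1=CC=CC=C1", "CC(=O)O"]
labels = [0, 1, 1, 0]

X = get_molt5_embeddings_batch(smiles_list).numpy()  # [N, 768] feature matrix
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:2]))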