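Setting up Apocrita on macOS: SSH access and getting GPU permission

Notes on configuring a Mac to reach Apocrita, the HPC cluster at Queen Mary University of London: SSH access and getting GPU permission.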

Your ITS Research account has been created, your credentials are printed below:

Username: qp1111

Password: ***********

✅ Step 1: Generate an SSH key on your Mac (skip this if you already have one). Open Terminal (Applications → Utilities → Terminal) and run:

ssh-keygen -t ed25519 -C "your_email@example.com"

You will see this prompt:

Enter file in which to save the key (/Users/your-username/.ssh/id_ed25519):

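Press Enter to accept the default path.

You will then be asked to set a passphrase for the key.

👉 You can set one, or press Enter to skip it.

If you set one, you will have to type it every time you log in.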

When it finishes you will have two files:

~/.ssh/id_ed25519	private key (never give it to anyone)
~/.ssh/id_ed25519.pub	public key (the one you upload to Apocrita)

View the public key:

cat ~/.ssh/id_ed25519.pub
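
Before pasting the key into the website, it can help to check that the line you copied is complete. The sketch below is only an illustration and not part of the Apocrita instructions; `check_public_key` is a hypothetical helper name. It assumes the usual OpenSSH public key line layout (key type, base64 blob, optional comment), where the blob itself starts with the length-prefixed key type.

```python
import base64
import struct


def check_public_key(line: str) -> dict:
    """Parse one line of an OpenSSH public key file and run basic checks."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected '<type> <base64> [comment]'")
    key_type, b64 = parts[0], parts[1]
    comment = parts[2] if len(parts) == 3 else ""

    # A truncated copy/paste usually fails here or in the checks below.
    blob = base64.b64decode(b64, validate=True)
    if len(blob) < 4:
        raise ValueError("key blob is too short")

    # The blob begins with a 4-byte big-endian length, then the key type.
    (name_len,) = struct.unpack(">I", blob[:4])
    embedded_type = blob[4:4 + name_len].decode("ascii", errors="replace")
    if embedded_type != key_type:
        raise ValueError("key type does not match the encoded blob")

    return {"type": key_type, "blob_bytes": len(blob), "comment": comment}
```

To use it, pass in the text printed by the `cat` command above. For an ed25519 key the decoded blob should be 51 bytes long (a 4-byte length, the 11-character type name, another 4-byte length, and the 32-byte key).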

✅ Step 2: Upload the public key on the Apocrita website

Look for something like:

"SSH Keys" → Add SSH Public Key

✅ Step 3: First login (password + private key)

In the Mac terminal, log in directly over SSH (replace USERNAME with your Apocrita username; note that this is not your university email name):

ssh -i ~/.ssh/id_ed25519 USERNAME@login.hpc.universityname.ac.uk

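This step will ask for:

the passphrase of the key (if you set one) and your Apocrita account password

This is Apocrita's security policy: logging in requires both the private key and the password.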

✅ Step 4: Change your password

passwd

Enter the old password, then the new one.

✅ Step 5 (optional, recommended): Make each login more convenient

Use ssh-agent so you do not have to type the key passphrase every time:

ssh-add ~/.ssh/id_ed25519

After that, log in directly and enter your new password when prompted:

ssh USERNAME@login.hpc.universityname.ac.uk
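
Another common convenience is a host alias in `~/.ssh/config`, so that a short name expands to the full login command. The Python sketch below only renders such an entry as text. `render_ssh_config` and the alias `apocrita` are made up for illustration, and the host name is the same placeholder used above. `Host`, `HostName`, `User`, `IdentityFile` and `AddKeysToAgent` are standard OpenSSH client options.

```python
def render_ssh_config(alias: str, hostname: str, username: str,
                      key_path: str = "~/.ssh/id_ed25519") -> str:
    """Return an ssh_config host entry as text (nothing is written to disk)."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {username}",
        f"    IdentityFile {key_path}",
        # Ask ssh to add the key to a running ssh-agent on first use.
        "    AddKeysToAgent yes",
    ]
    return "\n".join(lines) + "\n"


# Placeholders only: substitute your own host name and username.
print(render_ssh_config("apocrita", "login.hpc.universityname.ac.uk", "USERNAME"))
```

The sketch deliberately does not touch `~/.ssh/config`; review the printed text and append it yourself. With such an entry in place, `ssh apocrita` should behave like the full command above.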

✅ Next, submit a job to get GPU access. A tip: submit real code rather than an empty test run, haha.

Your account has now been created.

Access to GPUs is subject to the vetting process below.

Please can you run jobs on the short queue (set job runtime to 1 hour) and provide the job number(s) so we can double check they are optimised to run on the GPU cards? To speed up the validation process, please add echo $SGE_HGR_gpu into your job script to print out the GPU devices allocated to your job.

This is a one-off process which we ask all GPU users to complete before we grant access because we have limited GPU nodes and we want to ensure all jobs submitted to those nodes run correctly.

Once we have inspected your job and are satisfied it will run on the GPU nodes correctly, we'll grant access to run longer jobs.
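
The admins ask for `echo $SGE_HGR_gpu` in the job script. If you also want the same information logged from Python, a minimal sketch follows. It assumes the variable holds device IDs separated by spaces or commas, which I have not checked against the scheduler documentation; `allocated_gpus` is a hypothetical helper.

```python
import os
import re


def allocated_gpus(env=None) -> list:
    """Return the GPU device IDs found in SGE_HGR_gpu as a list of strings."""
    env = os.environ if env is None else env
    raw = env.get("SGE_HGR_gpu", "")
    # Assumption: IDs are separated by whitespace and/or commas.
    return [token for token in re.split(r"[,\s]+", raw.strip()) if token]


print("GPU devices allocated by the scheduler:", allocated_gpus())
```

Outside a GPU job the variable is normally unset, so the list is empty. This does not replace the `echo` line the admins asked for.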

✅ Create a conda environment

module load miniforge/25.3.0
conda --version
conda create -n llm_gpu python=3.10 -y
conda activate llm_gpu

Install the packages:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers
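
After installing, a quick check that the CUDA build of PyTorch was picked up can save a failed job. This is a minimal sketch that uses only `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`; `torch_report` is a hypothetical helper.

```python
import importlib.util


def torch_report() -> dict:
    """Summarise the PyTorch install without failing if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # None for CPU-only builds, a string such as "12.1" for CUDA builds.
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU is visible to this process.
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_report())
```

If you run this on a machine without a GPU (a login node, for example), `cuda_available` will be False even when the install is fine; `cuda_build` is the field that tells you whether the cu121 wheel was installed.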

Running inference would also require downloading weights, so let's run a small training job instead.

✅ Step1: mini_gpt_train.py


vi mini_gpt_train.py


import math
import time
import torch
import torch.nn as nn
import torch.nn.functional as F


# --------------------
# Configuration
# --------------------
VOCAB_SIZE = 4096
D_MODEL = 512
N_HEAD = 8
NUM_LAYERS = 8
DIM_FF = 2048
SEQ_LEN = 256
BATCH_SIZE = 32
NUM_STEPS = 20000
# NUM_STEPS = 3500 finished in only seven minutes
LOG_INTERVAL = 50


# --------------------
# Mini GPT-like Transformer Decoder
# --------------------
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        B, T, D = x.size()
        qkv = self.qkv_proj(x)
        qkv = qkv.view(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if attn_mask is not None:
            attn_scores = attn_scores.masked_fill(attn_mask == 0, float("-inf"))

        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_output = torch.matmul(attn_weights, v)

        attn_output = (
            attn_output.transpose(1, 2).contiguous().view(B, T, D)
        )
        out = self.out_proj(attn_output)
        return out


class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, dim_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, num_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.GELU(),
            nn.Linear(dim_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        x = x + self.attn(self.ln1(x), attn_mask=attn_mask)
        x = x + self.ff(self.ln2(x))
        return x


class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_head, num_layers, dim_ff, max_seq_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        self.layers = nn.ModuleList(
            [TransformerBlock(d_model, n_head, dim_ff) for _ in range(num_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.max_seq_len = max_seq_len

    def forward(self, idx):
        B, T = idx.size()
        pos = torch.arange(0, T, device=idx.device).unsqueeze(0)

        x = self.token_emb(idx) + self.pos_emb(pos)

        mask = (
            torch.tril(torch.ones(T, T, device=idx.device))
            .unsqueeze(0)
            .unsqueeze(0)
        )

        for layer in self.layers:
            x = layer(x, attn_mask=mask)

        x = self.ln_f(x)
        logits = self.head(x)
        return logits


def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Using device:", device, flush=True)

    model = MiniGPT(
        vocab_size=VOCAB_SIZE,
        d_model=D_MODEL,
        n_head=N_HEAD,
        num_layers=NUM_LAYERS,
        dim_ff=DIM_FF,
        max_seq_len=SEQ_LEN,
    ).to(device)

    print(
        "Model parameters:",
        sum(p.numel() for p in model.parameters()) / 1e6,
        "M",
        flush=True,
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    criterion = nn.CrossEntropyLoss()

    start_time = time.time()

    for step in range(1, NUM_STEPS + 1):
        x = torch.randint(
            0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LEN), device=device, dtype=torch.long
        )
        y = x[:, 1:].contiguous()
        x_input = x[:, :-1].contiguous()

        logits = model(x_input)
        logits = logits.view(-1, VOCAB_SIZE)
        y = y.view(-1)

        loss = criterion(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % LOG_INTERVAL == 0:
            elapsed_min = (time.time() - start_time) / 60.0
            print(
                f"Step {step}/{NUM_STEPS}, loss={loss.item():.4f}, elapsed={elapsed_min:.2f} min",
                flush=True,
            )

    total_min = (time.time() - start_time) / 60.0
    print(f"Training finished. Total time: {total_min:.2f} minutes", flush=True)


if __name__ == "__main__":
    main()
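
As a cross-check on what the script prints, the parameter count can be worked out by hand from the configuration. The sketch below is my own arithmetic, not something the script contains; `count_parameters` is a hypothetical helper that mirrors the layers defined above (two embeddings, eight blocks with biased linear layers and two LayerNorms each, a final LayerNorm, and a head without bias).

```python
import math


def count_parameters(vocab_size, d_model, num_layers, dim_ff, max_seq_len):
    """Count trainable parameters of the MiniGPT model defined above."""
    embeddings = vocab_size * d_model + max_seq_len * d_model
    layer_norm = 2 * d_model  # weight + bias
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    feed_forward = (d_model * dim_ff + dim_ff) + (dim_ff * d_model + d_model)
    per_block = 2 * layer_norm + attention + feed_forward
    head = d_model * vocab_size  # bias=False in the script
    return embeddings + num_layers * per_block + layer_norm + head


total = count_parameters(vocab_size=4096, d_model=512, num_layers=8,
                         dim_ff=2048, max_seq_len=256)
print(f"parameters: {total / 1e6:.2f} M")
print(f"loss of a uniform guess: {math.log(4096):.4f}")
```

This gives about 29.5 M parameters. The second number matters when reading the log: the batches are freshly sampled random tokens, so there is nothing to learn and the loss should stay close to ln(4096) ≈ 8.32 instead of decreasing. For the vetting job that is fine, since the point is only to keep the GPU busy.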

✅ Step2: Job script: mini_gpt_job.sh

vi mini_gpt_job.sh


#!/bin/bash
#$ -l h_rt=1:00:00        # walltime limit: 1 hour
#$ -cwd                   # run in current working directory
#$ -l gpu=1               # request 1 GPU
#$ -pe smp 8              # request 8 CPU cores (required per GPU)
#$ -l h_vmem=8G           # 8 GB RAM per core

echo "Job running on:"
hostname

echo "GPU allocated:"
echo $SGE_HGR_gpu

# Load Miniforge (conda) module
module load miniforge/25.3.0

# Properly initialize conda in a non-interactive shell
source "$(conda info --base)/etc/profile.d/conda.sh"

# Activate the LLM environment
conda activate llm_gpu

echo "Starting Mini-GPT training..."
python3 mini_gpt_train.py

echo "Job finished."
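
To see what the `#$` lines add up to, here is a small Python sketch that reads the directives and computes the total memory request. It takes the script's own comment at face value, namely that `h_vmem` is per core, so 8 cores at 8G means 64G in total. `parse_directives` is a hypothetical helper and only understands the options used in this script.

```python
import re

JOB_HEADER = """\
#!/bin/bash
#$ -l h_rt=1:00:00        # walltime limit: 1 hour
#$ -cwd                   # run in current working directory
#$ -l gpu=1               # request 1 GPU
#$ -pe smp 8              # request 8 CPU cores (required per GPU)
#$ -l h_vmem=8G           # 8 GB RAM per core
"""


def parse_directives(script: str) -> dict:
    """Collect slots, GPUs, runtime and per-core memory from '#$' lines."""
    result = {"slots": 1, "gpus": 0, "h_vmem_gb": None, "h_rt": None}
    for line in script.splitlines():
        if not line.startswith("#$"):
            continue
        # Drop the '#$' prefix and any trailing '# comment'.
        args = line[2:].split("#", 1)[0].split()
        if args[:2] == ["-pe", "smp"] and len(args) >= 3:
            result["slots"] = int(args[2])
        elif args[:1] == ["-l"] and len(args) >= 2:
            for item in args[1].split(","):
                key, _, value = item.partition("=")
                if key == "gpu":
                    result["gpus"] = int(value)
                elif key == "h_rt":
                    result["h_rt"] = value
                elif key == "h_vmem":
                    match = re.fullmatch(r"(\d+)G", value)
                    if match:
                        result["h_vmem_gb"] = int(match.group(1))
    return result


request = parse_directives(JOB_HEADER)
print(request)
print("total memory (GB):", request["slots"] * request["h_vmem_gb"])
```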

Save and exit the editor, then make the script executable:

chmod +x mini_gpt_job.sh

✅ Step3: Submit the job

qsub mini_gpt_job.sh

It returns:

Your job 6653935 ("mini_gpt_job.sh") has been submitted

Once the job has finished, send this job ID to the administrators. To check its status while it runs:

qstat
qstat -j <jobid>

View the results:

cat mini_gpt_job.sh.o6653935   # stdout (main output)
cat mini_gpt_job.sh.e6653935   # stderr (error output)
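
If you submit several jobs, pulling the job ID out of the qsub message and deriving the two file names by hand gets tedious. A minimal Python sketch, assuming the message format and the `<script>.o<jobid>` / `<script>.e<jobid>` naming shown above; `parse_job_id` and `output_files` are hypothetical helpers.

```python
import re


def parse_job_id(qsub_output: str) -> str:
    """Extract the numeric job ID from a 'Your job N (...)' message."""
    match = re.search(r"Your job (\d+) ", qsub_output)
    if match is None:
        raise ValueError("no job ID found in qsub output")
    return match.group(1)


def output_files(script_name: str, job_id: str) -> tuple:
    """Return the (stdout, stderr) file names for a finished job."""
    return (f"{script_name}.o{job_id}", f"{script_name}.e{job_id}")


job_id = parse_job_id('Your job 6653935 ("mini_gpt_job.sh") has been submitted')
print(job_id, output_files("mini_gpt_job.sh", job_id))
```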