Yes, I recommend consistently recording these 4 categories of information:
- Training parameters
- Training process and artifacts
- Evaluation parameters
- Evaluation results
The record below can be copied directly into your experiment documentation as the first example.
# Experiment Record 001
## Experiment Name
Baseline pretraining of nanoGPT GPT-2 Small (124M) on OpenWebText, with standalone evaluation
## Experiment Purpose
Establish baseline results for subsequent Transformer architecture modification/ablation experiments, and confirm that:
1. the training pipeline works
2. the standalone evaluation script works
3. the loss / ppl level of the current 124M baseline model can serve as the reference for later comparisons
## Date
Training run: 2026-04-06
Evaluation run: 2026-04-07
## Code and Environment
- Project path: `/home/qjh/llm_learning/transformer_research/nanoGPT`
- Training script: `train.py`
- Evaluation script: `eval.py`
- Training config: `config/train_gpt2_100k_1third.py`
- Evaluation config: `config/eval_gpt2_100k.py`
- Python environment: `conda activate transformer_research`
- Hardware: `2 x NVIDIA H200`
- dtype: `bfloat16`
- W&B: online sync
## Dataset
- Raw data path: `/home/qjh/llm_learning/opentext_dataset`
- Preprocessed data directory: `/home/qjh/llm_learning/transformer_research/nanoGPT/data/openwebtext`
- Total documents: `8,013,769`
- Training documents: `~8,009,762`
- Validation documents: `~4,007`
- Training tokens: `9,035,582,198`
- Validation tokens: `4,434,897` (both token counts can be re-derived from the `.bin` files; see the sketch after this list)
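The token counts above can be re-checked directly from the preprocessed data. This is a minimal sketch, assuming the layout produced by nanoGPT's `data/openwebtext/prepare.py` (a flat array of `uint16` GPT-2 BPE token ids per split); the path is the preprocessed directory listed above.
```python
import numpy as np

data_dir = "/home/qjh/llm_learning/transformer_research/nanoGPT/data/openwebtext"

# prepare.py stores token ids as a flat uint16 array, so the token count
# is simply the number of elements in the memmap.
for split in ("train", "val"):
    tokens = np.memmap(f"{data_dir}/{split}.bin", dtype=np.uint16, mode="r")
    print(f"{split}: {len(tokens):,} tokens")
    # expected: train ~= 9,035,582,198, val ~= 4,434,897 (numbers recorded above)
```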
## Model Configuration
- Model: GPT-2 Small / 124M (parameter count sanity-checked in the sketch after this list)
- `n_layer = 12`
- `n_head = 12`
- `n_embd = 768`
- `block_size = 1024`
- `dropout = 0.0`
- `bias = False`
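To confirm the "124M" label, the model can be instantiated directly from this configuration. A minimal sketch, assuming nanoGPT's `GPTConfig` / `GPT` classes in `model.py` and its padded GPT-2 vocabulary size of 50304 (run from the repo root):
```python
from model import GPT, GPTConfig  # nanoGPT's model.py

# Same architecture as recorded above.
config = GPTConfig(
    n_layer=12, n_head=12, n_embd=768,
    block_size=1024, dropout=0.0, bias=False,
    vocab_size=50304,
)
model = GPT(config)
# Non-embedding parameter count; should land near the 124M figure used in this record.
print(f"{model.get_num_params() / 1e6:.2f}M parameters")
```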
## Training Configuration
- `dataset = 'openwebtext'`
- `gradient_accumulation_steps = 2`
- `batch_size = 128`
- `block_size = 1024`
- `learning_rate = 6e-4`
- `max_iters = 31288`
- `weight_decay = 1e-1`
- `beta1 = 0.9`
- `beta2 = 0.95`
- `grad_clip = 1.0`
- `decay_lr = True`
- `warmup_iters = 2000`
- `lr_decay_iters = 31288`
- `eval_interval = 500`
- `log_interval = 10`
- `eval_iters = 100`
- `always_save_checkpoint = True`
- `init_from = 'scratch'`
- `out_dir = 'out-gpt2-124M-100k-31288'`
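The config file itself is not reproduced in this record. Below is a sketch of what `config/train_gpt2_100k_1third.py` would plausibly contain, assuming nanoGPT's convention of plain-Python override files (train.py applies them via `configurator.py`, which overrides the globals defined in train.py). The values mirror the list above; the W&B lines are assumptions, not a copy of the actual file.
```python
# config/train_gpt2_100k_1third.py -- sketch; values mirror the experiment record above
wandb_log = True
wandb_run_name = 'gpt2-124M-100k-04062159'  # assumed; run names may be generated at launch

dataset = 'openwebtext'
out_dir = 'out-gpt2-124M-100k-31288'

batch_size = 128
block_size = 1024
gradient_accumulation_steps = 2

max_iters = 31288
lr_decay_iters = 31288
learning_rate = 6e-4
warmup_iters = 2000
weight_decay = 1e-1
grad_clip = 1.0

eval_interval = 500
eval_iters = 100
log_interval = 10
always_save_checkpoint = True
init_from = 'scratch'
```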
## Training Command
```bash
source ~/miniconda3/etc/profile.d/conda.sh && \
conda activate transformer_research && \
cd /home/qjh/llm_learning/transformer_research/nanoGPT && \
export WANDB_MODE=online && \
nohup torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2_100k_1third.py > train_100k_optimized_128_31288_0404.log 2>&1 &
```
## Training Budget Calculation
According to the DDP logic in train.py, `gradient_accumulation_steps` is divided by `ddp_world_size = 2`, so the effective global tokens per iteration are:
- `tokens_per_iter = 2 * 128 * 1024 = 262,144`
Total training tokens:
- `262,144 * 31,288 = 8,201,961,472`
Equivalent token-epochs over the training set:
- `8.20B / 9.04B ≈ 0.91 token-epochs`
Note: this training budget corresponds to roughly one token-level epoch over the training set; it is not the 300B-token full-reproduction budget described in the nanoGPT README (arithmetic reproduced in the sketch below).
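The numbers above mirror the token accounting in nanoGPT's train.py (per-process `gradient_accumulation_steps //= ddp_world_size`, then `tokens_per_iter = grad_accum * world_size * batch_size * block_size`); plugging in the values from this record:
```python
# Token-budget arithmetic using the values recorded in this experiment
ddp_world_size = 2
gradient_accumulation_steps = 2 // ddp_world_size   # train.py splits it across ranks -> 1
batch_size, block_size = 128, 1024
max_iters = 31288
train_tokens = 9_035_582_198                        # training-set tokens recorded above

tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
total_tokens = tokens_per_iter * max_iters
print(f"tokens/iter  = {tokens_per_iter:,}")              # 262,144
print(f"total tokens = {total_tokens:,}")                 # 8,201,961,472
print(f"token-epochs = {total_tokens / train_tokens:.2f}")  # ~0.91
```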
## Training Process
Training log file:
`train_100k_optimized_128_31288_0404.log`
Excerpt from the tail of the training log:
```text
iter 31080: loss 3.0709, time 474.08ms, mfu 11.47%
iter 31090: loss 3.0994, time 466.29ms, mfu 11.54%
iter 31100: loss 3.0691, time 581.52ms, mfu 11.36%
iter 31110: loss 3.0200, time 464.12ms, mfu 11.44%
iter 31120: loss 3.0757, time 470.27ms, mfu 11.50%
iter 31130: loss 3.0245, time 467.35ms, mfu 11.56%
iter 31140: loss 3.0185, time 467.11ms, mfu 11.62%
iter 31150: loss 3.0040, time 474.78ms, mfu 11.65%
iter 31160: loss 3.0987, time 470.05ms, mfu 11.69%
iter 31170: loss 3.0482, time 467.03ms, mfu 11.73%
iter 31180: loss 3.0875, time 474.27ms, mfu 11.75%
iter 31190: loss 3.1008, time 460.32ms, mfu 11.81%
iter 31200: loss 3.0692, time 587.41ms, mfu 11.59%
iter 31210: loss 3.0625, time 468.34ms, mfu 11.64%
iter 31220: loss 3.0336, time 477.80ms, mfu 11.66%
iter 31230: loss 3.0616, time 465.00ms, mfu 11.71%
iter 31240: loss 3.0539, time 474.41ms, mfu 11.73%
iter 31250: loss 3.0166, time 480.89ms, mfu 11.74%
iter 31260: loss 3.0696, time 470.57ms, mfu 11.77%
iter 31270: loss 3.0194, time 473.77ms, mfu 11.79%
iter 31280: loss 3.0806, time 471.91ms, mfu 11.81%
```
Training W&B info:
- run name: `gpt2-124M-100k-04062159`
- run id: `kvc6ftu0`
## Training Artifacts
- Checkpoint directory: `out-gpt2-124M-100k-31288`
- Checkpoint used for this evaluation: `out-gpt2-124M-100k-31288/ckpt.pt`
- Checkpoint metadata:
  - `checkpoint_iter_num = 31000`
  - `checkpoint_best_val_loss = 3.0682730674743652`
Note: although the training stdout kept printing up to around iter 31280, checkpoints are saved every `eval_interval = 500` iterations, so the checkpoint actually loaded for this evaluation is the one saved at iter 31000 (this can be confirmed by reading the checkpoint directly; see below).
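The checkpoint metadata quoted above can be read back from `ckpt.pt`. A minimal sketch, assuming the checkpoint dict layout that nanoGPT's train.py saves (`iter_num`, `best_val_loss`, `model_args`, plus the model/optimizer state dicts):
```python
import torch

ckpt = torch.load("out-gpt2-124M-100k-31288/ckpt.pt", map_location="cpu")
# Metadata stored alongside the model/optimizer state dicts
print("iter_num      =", ckpt["iter_num"])        # expected: 31000
print("best_val_loss =", ckpt["best_val_loss"])   # expected: ~3.0683
print("model_args    =", ckpt["model_args"])
```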
## Evaluation Purpose
Use the standalone evaluation script `eval.py` for pure evaluation, avoiding the issue in train.py where the `eval_only` and `resume` paths are coupled and can end up continuing training.
## Evaluation Configuration
- `init_from = 'resume'`
- `out_dir = 'out-gpt2-124M-100k-31288'`
- `dataset = 'openwebtext'`
- `splits = ['val']`
- `batch_size = 64`
- `block_size = 1024`
- `eval_iters = 63`
- `log_interval = 10`
- `device = 'cuda'`
- `dtype = 'bfloat16'`
- `compile = False`
- `wandb_log = True`
Evaluation token budget:
`64 * 1024 * 63 = 4,128,768 tokens`
This covers roughly 93% of the validation set's `4,434,897` tokens, so this val evaluation result is representative (see the quick check below).
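A quick check of the coverage implied by these numbers; this is pure arithmetic on the values recorded above, not part of eval.py:
```python
# Evaluation token budget vs. validation-set size
batch_size, block_size, eval_iters = 64, 1024, 63
val_tokens = 4_434_897

eval_tokens = batch_size * block_size * eval_iters
print(f"eval tokens = {eval_tokens:,}")                 # 4,128,768
print(f"coverage    = {eval_tokens / val_tokens:.1%}")  # ~93% of the val split
```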
## Evaluation Command
```bash
python eval.py config/eval_gpt2_100k.py --splits="['val']"
```
## Evaluation Log
```text
eval val: step 1/63, loss 3.1506, running_loss 3.1506, running_ppl 23.3490
eval val: step 10/63, loss 3.0766, running_loss 3.0677, running_ppl 21.4929
eval val: step 20/63, loss 3.0647, running_loss 3.0677, running_ppl 21.4920
eval val: step 30/63, loss 3.0949, running_loss 3.0693, running_ppl 21.5278
eval val: step 40/63, loss 3.1713, running_loss 3.0774, running_ppl 21.7012
eval val: step 50/63, loss 3.0629, running_loss 3.0732, running_ppl 21.6107
eval val: step 60/63, loss 3.0045, running_loss 3.0729, running_ppl 21.6038
eval val: step 63/63, loss 3.0183, running_loss 3.0724, running_ppl 21.5941
```
## Final Evaluation Results
```json
{
"batch_size": 64,
"block_size": 1024,
"checkpoint_best_val_loss": 3.0682730674743652,
"checkpoint_iter_num": 31000,
"dataset": "openwebtext",
"device": "cuda",
"dtype": "bfloat16",
"eval_iters": 63,
"init_from": "resume",
"out_dir": "out-gpt2-124M-100k-31288",
"splits": ["val"],
"val": {
"loss": 3.072420835494995,
"loss_std": 0.044102780520915985,
"perplexity": 21.59411525115754
}
}
```
## Result Interpretation
- The metric to report officially is the final `val.loss = 3.0724`
- The per-step `loss` at step k is only the loss on the current batch and is not a formal result
- `running_loss` is the average loss up to the current step; at the last step, `running_loss` equals the final `val.loss`
- `val.perplexity = exp(val.loss) = 21.5941` (see the illustration after this list)
- This evaluation result is very close to the `best_val_loss = 3.0683` recorded in the checkpoint, indicating that the standalone evaluation script works correctly and the checkpoint state is trustworthy
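For reference, the relationship between the per-step losses, `running_loss`, and `perplexity` described above is just an arithmetic mean followed by an exponential. A minimal illustration of that bookkeeping (not the actual eval.py code; the four losses are taken from the log excerpt):
```python
import math

# Per-step eval losses stream in; the reported metric is their running mean.
step_losses = [3.1506, 3.0766, 3.0647, 3.0949]   # illustrative values only
running_loss = sum(step_losses) / len(step_losses)
perplexity = math.exp(running_loss)               # val.perplexity = exp(val.loss)
print(f"running_loss = {running_loss:.4f}, ppl = {perplexity:.4f}")
# With all 63 steps, running_loss -> 3.0724 and ppl -> 21.594, matching the JSON above.
```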
## Relation to the nanoGPT README Baselines
Reference numbers for OpenWebText given in the nanoGPT README:
- Original GPT-2 (124M) evaluated on OWT: val loss ≈ 3.12
- Fully trained reproduction target: loss ≈ 2.85
This experiment's result:
- val loss = 3.0724
The current result can therefore be read as:
- already better than the raw GPT-2-on-OWT baseline in the README
- not yet at the README's fully trained reproduction target (~2.85)
- adequate as a baseline for subsequent architecture modification/ablation experiments
## How This Experiment Will Be Used
This experiment serves as the baseline control for later Transformer architecture exploration:
- All new architectures should, as far as possible, be trained on the same data, with the same token budget and the same optimizer configuration
- When comparing later experiments, prioritize:
  - `val.loss`
  - `val.perplexity`
  - change in parameter count
  - training stability
- The current baseline is suitable as the first reference point for subsequent ablation experiments
## Next Steps
- Record every subsequent experiment in the same format:
  - architecture change
  - parameter count
  - total training tokens
  - training command
  - checkpoint path
  - evaluation command
  - `val.loss` / ppl
- If a change clearly beats this baseline under the same budget, consider spending more training tokens on a second round of validation
If you like, the next thing I can do for you is quite practical: put together a unified experiment-record template with the variable fields left as placeholders, so each new experiment only needs a copy and a few edited lines.