Cognitive Neuroscience Research Report [20260072]

Technical Report: Transformer-Based Chinese Text Generation with Sliding Window and Gradient Optimization

Abstract

This report presents a transformer-based language model for Chinese text generation, trained on a 31 MB corpus of 1,599 text files. The model uses a sliding window to extract fixed-length token sequences, a custom vocabulary of 266,216 tokens, and a causal transformer architecture with 141.85 million parameters. Training combines gradient accumulation, mixed precision (FP16), and gradient checkpointing to stay within the GPU's memory (15.77 GB total). After one epoch (37 minutes, 21,049 batches), the model reaches a cross-entropy loss of 7.51 (perplexity ≈ 1827). Generated samples show emerging linguistic patterns (e.g., "天气 是 主要", roughly "weather is mainly") despite the limited training. The system supports checkpoint resumption and standalone text generation.

1. Introduction

Large language models have achieved remarkable success in natural language generation, but training from scratch on domain-specific corpora remains computationally intensive. This work develops a lightweight transformer model tailored for Chinese text generation, leveraging a custom pipeline for data preprocessing, streaming dataset handling, and memory-efficient training. The primary contributions are:

  • A sliding-window preprocessing pipeline using spaCy for Chinese tokenization, producing fixed-length sequences.
  • A vocabulary of 266,216 tokens constructed from a 31MB corpus of 1599 text files.
  • A causal transformer with 4 layers, 8 attention heads, and embedding dimension 256.
  • Training optimizations: gradient accumulation, automatic mixed precision (AMP), gradient checkpointing, and periodic checkpoint saving.
  • A separate generation script supporting arbitrary starting prompts and temperature-controlled sampling.

2. Methodology

2.1 Data Preprocessing

All .txt files are recursively collected from a root directory. Each file is read with automatic encoding detection (UTF-8, GBK, GB18030) and tokenized using the spaCy Chinese model zh_core_web_sm with non-essential components disabled (tagger, parser, NER, lemmatizer) to improve speed.
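The file collection and encoding fallback described above could be sketched as follows. This is a minimal illustration rather than the report's actual code: the helper names are hypothetical, and trying the three listed encodings in the order shown is an assumption about how the detection works.

```python
from pathlib import Path
from typing import List, Tuple

# Encodings named in the report; trying them in this order is an assumption.
CANDIDATE_ENCODINGS = ("utf-8", "gbk", "gb18030")


def decode_with_fallback(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort (assumption): decode as UTF-8 with replacement characters.
    return raw.decode("utf-8", errors="replace"), "utf-8-replace"


def read_text_with_fallback(path: Path) -> str:
    """Hypothetical helper: read one file using the fallback above."""
    text, _ = decode_with_fallback(path.read_bytes())
    return text


def collect_txt_files(root: Path) -> List[Path]:
    """Recursively collect .txt files under root, sorted for reproducibility."""
    return sorted(root.rglob("*.txt"))
```

UTF-8 is tried first because GBK-encoded Chinese text is almost never valid UTF-8, so a failed UTF-8 decode is a reasonable signal to move on to the next candidate.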

A sliding window extracts consecutive token windows of length W (default 100) with step S (default 50), so adjacent windows overlap by 50 tokens at the default settings. For each window, tokens are mapped to integer IDs using a pre-built vocabulary. The vocabulary is built offline from the entire corpus and contains every token with frequency ≥ 2, plus four special tokens: <PAD> (0), <UNK> (1), <BOS> (2), <EOS> (3). The final vocabulary size is 266,216.
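The vocabulary construction and windowing steps can be illustrated with a short, framework-free sketch. The function names are hypothetical, and two details are assumptions not stated in the report: tokens below the frequency threshold are mapped to <UNK>, and trailing tokens that do not fill a whole window are dropped.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Special tokens and their IDs, as listed in the report.
SPECIALS = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}


def build_vocab(documents: Iterable[List[str]], min_freq: int = 2) -> Dict[str, int]:
    """Assign IDs to every token that occurs at least min_freq times."""
    counts = Counter(tok for doc in documents for tok in doc)
    vocab = dict(SPECIALS)
    # Sort by descending frequency, then by token, so IDs are deterministic.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to IDs; unseen or rare tokens fall back to <UNK> (assumption)."""
    unk = vocab["<UNK>"]
    return [vocab.get(tok, unk) for tok in tokens]


def sliding_windows(ids: List[int], window: int = 100, step: int = 50) -> List[List[int]]:
    """Return full-length windows only; an incomplete tail is dropped (assumption)."""
    return [ids[s:s + window] for s in range(0, len(ids) - window + 1, step)]
```

With the default settings, a file of 250 tokens would yield four windows, starting at offsets 0, 50, 100, and 150.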

2.2 Model Architecture

The transformer model is a causal language model inspired by GPT. Key parameters:

  • Embedding dimension d_model = 256
  • Number of attention heads nhead = 8
  • Number of transformer layers num_layers = 4
  • Maximum sequence length max_seq_len = 100 (matches window size)
  • Dropout rate = 0.1

The model consists of a token embedding layer, a positional embedding layer, a stack of transformer encoder layers with causal masking (an upper-triangular mask that prevents each position from attending to later ones), and a linear output head projecting to the vocabulary size. Total trainable parameters: 141,854,696.
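The reported parameter total can be reproduced with a short calculation. The breakdown below rests on several assumptions the report does not state: a feed-forward width of 2048 (the default for PyTorch's nn.TransformerEncoderLayer), a learned positional embedding, an output head with bias that is not tied to the token embedding, and no final layer norm. Under those assumptions the sum matches the reported figure exactly.

```python
def count_parameters(vocab_size: int = 266_216, d_model: int = 256,
                     num_layers: int = 4, max_seq_len: int = 100,
                     dim_feedforward: int = 2048) -> dict:
    """Parameter breakdown for the described architecture (assumptions in lead-in)."""
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    # Self-attention: fused Q/K/V projection plus output projection, with biases.
    attn = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward block: two linear layers with biases.
    ffn = (d_model * dim_feedforward + dim_feedforward) + (dim_feedforward * d_model + d_model)
    # Two layer norms per layer, each with a weight and a bias vector.
    norms = 2 * (2 * d_model)
    per_layer = attn + ffn + norms
    # Untied output head with bias (assumption).
    head = d_model * vocab_size + vocab_size
    total = token_emb + pos_emb + num_layers * per_layer + head
    return {"token_emb": token_emb, "pos_emb": pos_emb,
            "per_layer": per_layer, "head": head, "total": total}


counts = count_parameters()
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

The number of attention heads does not appear in the calculation because splitting d_model across heads leaves the projection sizes unchanged. The breakdown also shows that the four transformer layers contribute only about 5.3M parameters; the rest sits in the embedding and output head.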

2.3 Training Setup

Training uses the following configuration:

  • Optimizer: AdamW (learning rate 1e-3, weight decay 0.01)
  • Loss function: Cross-entropy (ignoring padding)
  • Batch size per device: 8
  • Gradient accumulation steps: 2 (effective batch size = 16)
  • Mixed precision: FP16 with GradScaler
  • Gradient checkpointing: enabled
  • Checkpoint saving: every epoch
  • Number of epochs: 5 planned (one completed at the time of this report)
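The effect of gradient accumulation in the configuration above can be shown with a framework-free toy example. This is only an illustration of the principle, using a hypothetical one-parameter regression model; the report's training loop presumably relies on PyTorch autograd and GradScaler, which are not reproduced here.

```python
from typing import Sequence

MICRO_BATCH = 8   # batch size per device, from the report
ACCUM_STEPS = 2   # gradient accumulation steps, from the report


def mse_grad(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: Sequence[float], ys: Sequence[float],
                     micro_batch: int = MICRO_BATCH,
                     accum_steps: int = ACCUM_STEPS) -> float:
    """Sum micro-batch gradients, each scaled by 1/accum_steps."""
    grad = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Scaling keeps the result a mean over the effective batch of 16.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return grad  # a real loop would take one optimizer step here, then reset


xs = [float(i) for i in range(MICRO_BATCH * ACCUM_STEPS)]
ys = [3.0 * x + 1.0 for x in xs]
full = mse_grad(0.5, xs, ys)          # one pass over all 16 samples
accum = accumulated_grad(0.5, xs, ys)  # two passes over 8 samples each
print(full, accum)
```

Both values agree, which is why accumulating over 2 micro-batches of 8 behaves like a batch of 16 while only holding activations for 8 samples at a time.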

Data is loaded via a custom IterableDataset that yields fixed-length windows on the fly rather than loading the whole corpus into memory, so the corpus size is not bounded by available RAM. Multiple worker processes (default 4) parallelize file reading and tokenization.
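One way to stream windows while splitting files across workers is sketched below as a plain generator. The function and parameter names are hypothetical, and the round-robin file assignment is an assumption; in PyTorch the worker index and count would typically come from torch.utils.data.get_worker_info() inside the dataset's __iter__ method.

```python
from typing import Callable, Iterator, List, Sequence


def stream_windows(paths: Sequence[str],
                   load_ids: Callable[[str], List[int]],
                   window: int = 100, step: int = 50,
                   worker_id: int = 0, num_workers: int = 1) -> Iterator[List[int]]:
    """Lazily yield fixed-length windows from this worker's share of the files."""
    # Each worker takes every num_workers-th file, so no window is yielded twice.
    for path in paths[worker_id::num_workers]:
        ids = load_ids(path)  # only one file's token IDs are in memory at a time
        for start in range(0, len(ids) - window + 1, step):
            yield ids[start:start + window]


# Tiny in-memory stand-in for "read, tokenize, and encode one file".
fake_corpus = {"a.txt": list(range(6)), "b.txt": list(range(100, 106))}
for win in stream_windows(sorted(fake_corpus), fake_corpus.__getitem__,
                          window=4, step=2):
    print(win)
```

Sharding by file matters because, without it, each worker process of an iterable-style dataset would iterate over the full file list and the loader would see every window several times per epoch.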

2.4 Generation Procedure

The generation script loads a saved checkpoint and performs autoregressive sampling:

  1. Tokenize the starting prompt (or use <BOS> if empty).
  2. For each step, feed the current sequence into the model (truncating to max_seq_len if needed).
  3. Apply temperature scaling to the final logits.
  4. Sample the next token from the softmax distribution.
  5. Append the token and repeat until <EOS> or maximum length reached.
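The sampling loop above can be sketched without any deep learning framework by treating the model as a function from a token-ID context to next-token logits. Here next_logits is a hypothetical stand-in for the trained model, and keeping the most recent max_seq_len tokens when truncating is an assumption, since the report does not say which end is cut.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

BOS_ID, EOS_ID = 2, 3  # special-token IDs from the vocabulary description


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits: Callable[[List[int]], Sequence[float]],
             prompt_ids: Sequence[int], max_new_tokens: int = 20,
             max_seq_len: int = 100, temperature: float = 1.0,
             rng: Optional[random.Random] = None) -> List[int]:
    """Autoregressive sampling following the five steps listed above."""
    rng = rng or random.Random()
    seq = list(prompt_ids) or [BOS_ID]      # step 1: empty prompt starts from <BOS>
    for _ in range(max_new_tokens):
        context = seq[-max_seq_len:]        # step 2: keep the latest tokens (assumption)
        probs = softmax_with_temperature(next_logits(context), temperature)  # step 3
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]        # step 4
        seq.append(token)                   # step 5
        if token == EOS_ID:
            break
    return seq
```

Temperatures below 1 sharpen the distribution toward the most likely token, while temperatures above 1 flatten it and increase diversity.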

3. Experiments

3.1 Training Progress

After one epoch (21,049 batches, 37 minutes), the average loss was 7.51, corresponding to a perplexity of exp(7.51) ≈ 1827. The training speed was 9.39 iterations per second on an NVIDIA GPU with 15.77GB memory (peak usage ~6-8GB).
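The figures in this paragraph can be cross-checked with a few lines of arithmetic. Note that the rounded loss of 7.51 gives a perplexity of roughly 1826, so the quoted 1827 presumably reflects the loss before rounding; that explanation is an assumption.

```python
import math

loss = 7.51            # average cross-entropy after one epoch (nats)
batches = 21_049       # batches in the epoch
its_per_sec = 9.39     # reported training speed

perplexity = math.exp(loss)
epoch_minutes = batches / its_per_sec / 60

print(f"perplexity from rounded loss: {perplexity:.1f}")
print(f"epoch time from throughput:   {epoch_minutes:.1f} min")
```

The throughput-based estimate of a little over 37 minutes agrees with the reported epoch duration.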

3.2 Generation Examples

Given the starting prompt "天气" (weather), the model at epoch 1 generated five samples:

  1. 天气 是 主要 也 识别 的 最 形式 事情 、 , 一 农耕 重要 。 危机 水平 , 理性 会 兴趣
  2. 天气 3% 的 年 我 , 那边 文化 做 现在 往 层次 对 古方 相互 当时 限度 啊 产权 航空 全国
  3. 天气 是 的 全面 随着 呢 国家 的 一 的 所有 以上 如果 , 状况 导致 , 好坏 , 呢 ,
  4. 天气 这些 这个 的 碧绿 的 强 的 , 的 影响 , 高 产品 研究 , 这个 机构 公演 , 注意
  5. 天气 我 星际 重力 , 而 卫星 呢 先生 小 , , 言行 和 与 的 考虑 现在 比较 可以 ,

Observations:

  • Frequent repetition of the particle "的" and of punctuation.
  • Some meaningful bigrams appear: "天气 是" (weather is), "农耕 重要" (farming important), "危机 水平" (crisis level).
  • The model has not yet learned long-range coherence but shows statistical patterns beyond random noise.

3.3 Resource Usage

  • GPU memory: ~6.5 GB (including activations, gradients, and optimizer states)
  • CPU memory: ~4 GB (streaming dataset keeps minimal overhead)
  • Disk storage: 520 MB for checkpoints (model weights only)

4. Discussion

4.1 Performance Analysis

The loss of 7.51 is well below the uniform-guessing baseline of ln(266,216) ≈ 12.49, indicating that the model has learned basic word co-occurrence and local ordering. The generated text, while still largely incoherent, contains recognizable Chinese words and simple syntactic fragments. With additional epochs the model should produce more coherent short sentences; the anticipated loss of about 5.0 after 5 epochs is a projection, not a measured result.
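The comparison against random guessing can be made concrete with a short calculation. The variable names are illustrative; the only inputs are the vocabulary size and the loss reported above.

```python
import math

VOCAB_SIZE = 266_216
loss = 7.51

# Cross-entropy of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)
gap_nats = uniform_loss - loss
# Factor by which the model's perplexity is below the uniform baseline.
perplexity_ratio = math.exp(gap_nats)

print(f"uniform baseline: {uniform_loss:.2f} nats")
print(f"gap to baseline:  {gap_nats:.2f} nats")
print(f"perplexity is about {perplexity_ratio:.0f}x lower than uniform guessing")
```

In other words, after one epoch the model is on average as uncertain as a uniform choice among roughly 1.8 thousand tokens rather than among all 266,216.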

4.2 Limitations

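  • The large vocabulary (266k tokens) leads to a large output projection layer (~68M parameters); together with the equally sized token embedding, it accounts for roughly 96% of the model's 141.85M parameters, which limits scalability.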
  • The sliding window size of 100 restricts long-distance dependencies; larger windows would improve context but increase memory.
  • Training is done from scratch; fine-tuning a pre-trained Chinese GPT would likely achieve better results with less data.

4.3 Future Work

  • Increase model capacity (d_model=512, layers=6) if GPU memory permits.
  • Implement learning rate scheduling (cosine decay) to improve convergence.
  • Add validation split to detect overfitting.
  • Explore top-k/top-p sampling to enhance generation diversity.
  • Convert the pipeline to work with the Hugging Face transformers library for easier integration.
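Two of the items above, cosine learning-rate decay and top-k/top-p sampling, are simple enough to sketch without a framework. Neither is part of the current system; the function names, the zero minimum learning rate, and the choice to apply top-k before top-p are all assumptions made for illustration.

```python
import math
from typing import List


def cosine_lr(step: int, total_steps: int,
              base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def top_k_top_p_filter(probs: List[float], top_k: int = 0,
                       top_p: float = 1.0) -> List[float]:
    """Zero out unlikely tokens, then renormalize the remaining probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]              # keep only the k most likely tokens
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:            # smallest prefix covering top_p mass
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


print(cosine_lr(0, 100), cosine_lr(50, 100), cosine_lr(100, 100))
print(top_k_top_p_filter([0.5, 0.3, 0.15, 0.05], top_k=2))
```

The filtered distribution would replace the plain softmax output just before the sampling step, which cuts off the long tail of rare tokens that a 266k-token vocabulary otherwise exposes.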

5. Conclusion

This report documents the design and initial training of a transformer-based Chinese text generation model on a 31 MB corpus. The model, with 141.85 million parameters, reached a loss of 7.51 after one epoch and showed emerging linguistic patterns. With its memory optimizations (gradient accumulation, mixed precision, gradient checkpointing), training fits within 8 GB of GPU memory, and a saved checkpoint can be used either to resume training or for inference. While the generation quality is still rudimentary, the framework provides a solid baseline for further experimentation and scaling.
