Cognitive Neuroscience Research Report [20260072]

Technical Report: Transformer-Based Chinese Text Generation with Sliding Window and Gradient Optimization

Abstract

This report presents a transformer-based language model for Chinese text generation, trained on a 31MB corpus of 1599 text files. The pipeline uses a sliding window to extract fixed-length token sequences, a custom vocabulary of 266,216 tokens, and a causal transformer with 141.85 million parameters. Training uses gradient accumulation, mixed precision (FP16), and gradient checkpointing to fit on a GPU with 15.77GB of memory. After one epoch (21,049 batches in 37 minutes), the model reaches a cross-entropy loss of 7.51 (perplexity ≈ 1827). Generated samples show emerging linguistic patterns (e.g., "天气 是 主要", roughly "weather is mainly") despite the limited training. The system also supports checkpoint resumption and standalone text generation.

1. Introduction

Large language models have achieved remarkable success in natural language generation, but training from scratch on domain-specific corpora remains computationally intensive. This work develops a lightweight transformer model tailored for Chinese text generation, leveraging a custom pipeline for data preprocessing, streaming dataset handling, and memory-efficient training. The primary contributions are:

  • A sliding-window preprocessing pipeline using spaCy for Chinese tokenization, producing fixed-length sequences.
  • A vocabulary of 266,216 tokens constructed from a 31MB corpus of 1599 text files.
  • A causal transformer with 4 layers, 8 attention heads, and embedding dimension 256.
  • Training optimizations: gradient accumulation, automatic mixed precision (AMP), gradient checkpointing, and periodic checkpoint saving.
  • A separate generation script supporting arbitrary starting prompts and temperature-controlled sampling.

2. Methodology

2.1 Data Preprocessing

All .txt files are recursively collected from a root directory. Each file is read with automatic encoding detection (UTF-8, GBK, GB18030) and tokenized using the spaCy Chinese model zh_core_web_sm with non-essential components disabled (tagger, parser, NER, lemmatizer) to improve speed.
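The collection and decoding step can be sketched as follows. This is a minimal illustration, not the report's actual code: it assumes "automatic encoding detection" means trying the three listed encodings in order, and the names `collect_txt_files` and `read_text_with_fallback` are hypothetical. Tokenization with spaCy is not shown.

```python
from pathlib import Path

# Candidate encodings named in the report; trying them in this order is an assumption.
ENCODINGS = ("utf-8", "gbk", "gb18030")


def collect_txt_files(root):
    """Recursively list every .txt file under root, in a stable order."""
    return sorted(Path(root).rglob("*.txt"))


def read_text_with_fallback(path, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep the file but replace undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

UTF-8 is tried first because GBK-encoded Chinese text almost never forms valid UTF-8, whereas the reverse mistake (decoding UTF-8 bytes as GBK) can succeed silently and produce garbage.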

A sliding window extracts consecutive token windows of length W (default 100) with step S (default 50), so adjacent windows overlap by 50 tokens. For each window, tokens are mapped to integer IDs using a pre-built vocabulary. The vocabulary is built offline from the entire corpus and contains every token that occurs at least twice, plus four special tokens: <PAD> (0), <UNK> (1), <BOS> (2), <EOS> (3). The final vocabulary size is 266,216.
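A minimal sketch of the vocabulary construction and windowing described above. The function names are hypothetical, and three details are assumptions the report does not state: IDs are assigned in descending frequency order, trailing tokens that do not fill a whole window are dropped, and a text shorter than one window is padded with <PAD>.

```python
from collections import Counter

SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]  # IDs 0, 1, 2, 3


def build_vocab(token_lists, min_freq=2):
    """Map each token seen at least min_freq times to an integer ID."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Assumption: IDs follow descending frequency, which keeps the mapping deterministic.
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def sliding_windows(tokens, vocab, window=100, step=50):
    """Yield lists of `window` token IDs, advancing `step` tokens each time."""
    unk, pad = vocab["<UNK>"], vocab["<PAD>"]
    ids = [vocab.get(tok, unk) for tok in tokens]
    if not ids:
        return
    for start in range(0, max(len(ids) - window, 0) + 1, step):
        chunk = ids[start:start + window]
        # Assumption: only a text shorter than one window needs padding.
        yield chunk + [pad] * (window - len(chunk))
```

With the defaults, a 250-token text yields four windows starting at tokens 0, 50, 100, and 150.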

2.2 Model Architecture

The transformer model is a causal language model inspired by GPT. Key parameters:

  • Embedding dimension d_model = 256
  • Number of attention heads nhead = 8
  • Number of transformer layers num_layers = 4
  • Maximum sequence length max_seq_len = 100 (matches window size)
  • Dropout rate = 0.1

The model consists of a token embedding layer, a positional embedding layer, a stack of transformer encoder layers with causal masking (upper triangular mask), and a linear output head projecting to vocabulary size. Total trainable parameters: 141,854,696.
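The reported parameter total can be reproduced with plain arithmetic. The sketch below assumes a feed-forward width of 2048 (the default `dim_feedforward` of PyTorch's `nn.TransformerEncoderLayer`), bias terms on every linear layer, two LayerNorms per layer, untied input and output embeddings, and no final LayerNorm. None of these is stated in the report, but together they give exactly 141,854,696, which suggests (without confirming) that the model uses those defaults. The number of attention heads does not affect the count.

```python
def count_parameters(vocab_size=266_216, d_model=256, num_layers=4,
                     max_seq_len=100, dim_feedforward=2048):
    """Break down the trainable parameters of the described architecture.

    Assumptions (not stated in the report): feed-forward width 2048,
    bias terms on every linear layer, two LayerNorms per layer,
    untied input/output embeddings, and no final LayerNorm.
    """
    token_embedding = vocab_size * d_model
    positional_embedding = max_seq_len * d_model
    # Fused Q/K/V projection plus the attention output projection.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two linear layers: d_model -> dim_feedforward -> d_model.
    feed_forward = (d_model * dim_feedforward + dim_feedforward) \
        + (dim_feedforward * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # weight and bias for each of the two norms
    encoder_stack = num_layers * (attention + feed_forward + layer_norms)
    output_head = d_model * vocab_size + vocab_size
    counts = {
        "token_embedding": token_embedding,
        "positional_embedding": positional_embedding,
        "encoder_stack": encoder_stack,
        "output_head": output_head,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_parameters()
for name, value in counts.items():
    print(f"{name:>22}: {value:>12,}")
```

Under these assumptions the four transformer layers contribute only about 5.3 million parameters; the rest sits in the two vocabulary-sized matrices.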

2.3 Training Setup

Training uses the following configuration:

  • Optimizer: AdamW (learning rate 1e-3, weight decay 0.01)
  • Loss function: Cross-entropy (ignoring padding)
  • Batch size per device: 8
  • Gradient accumulation steps: 2 (effective batch size = 16)
  • Mixed precision: FP16 with GradScaler
  • Gradient checkpointing: enabled
  • Checkpoint saving: every epoch
  • Number of epochs: 5 (one epoch completed as of report)
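The effect of gradient accumulation can be shown without any framework. The toy example below is an illustration only: it uses a one-parameter linear model, mean squared error, and a plain SGD update standing in for AdamW, and it leaves out mixed precision and gradient checkpointing. It checks that two micro-batches of 8 with averaged gradients give the same update as one batch of 16.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w, micro_batches, lr):
    """One optimizer step after accumulating gradients over several micro-batches."""
    accum_steps = len(micro_batches)
    grad = 0.0
    for batch in micro_batches:
        # Scale each micro-batch gradient so the sum is an average, not a total.
        grad += mse_grad(w, batch) / accum_steps
    return w - lr * grad  # plain SGD update, standing in for AdamW


data = [(float(i), 3.0 * i) for i in range(1, 17)]   # 16 toy (x, y) pairs
micro_batches = [data[:8], data[8:]]                 # 2 micro-batches of 8
w_accum = accumulated_step(0.0, micro_batches, lr=1e-3)
w_full = 0.0 - 1e-3 * mse_grad(0.0, data)            # one step on all 16 samples
print(abs(w_accum - w_full) < 1e-12)
```

In a PyTorch loop with AMP, the same idea is usually written as `scaler.scale(loss / accum_steps).backward()` on every batch, with `scaler.step(optimizer)`, `scaler.update()`, and `optimizer.zero_grad()` called only on every second batch. Whether the report's training script is structured exactly this way is not stated.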

Data is loaded via a custom IterableDataset that yields fixed-length windows without pre-loading everything into memory, enabling processing of arbitrarily large corpora. Multiple worker processes (default 4) parallelize file reading and tokenization.
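One detail worth illustrating is how several workers can share a streaming dataset without producing duplicate windows. The framework-free sketch below assumes the file list is split across workers by interleaving. `shard_for_worker`, `stream_windows`, and the `windows_for_file` callable are hypothetical names. In PyTorch, `worker_id` and `num_workers` would typically come from `torch.utils.data.get_worker_info()` inside the dataset's `__iter__`, since without such sharding every worker iterates the full dataset.

```python
def shard_for_worker(files, worker_id, num_workers):
    """Give each worker a disjoint, interleaved slice of the file list."""
    return files[worker_id::num_workers]


def stream_windows(files, worker_id, num_workers, windows_for_file):
    """Lazily yield windows from this worker's files, one file at a time.

    windows_for_file is a hypothetical callable that reads, tokenizes, and
    windows a single file; only that file's data is in memory at once.
    """
    for path in shard_for_worker(files, worker_id, num_workers):
        yield from windows_for_file(path)
```

Because `stream_windows` is a generator, a file is opened only when the previous file's windows have been consumed, which is what keeps memory use independent of corpus size.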

2.4 Generation Procedure

The generation script loads a saved checkpoint and performs autoregressive sampling:

  1. Tokenize the starting prompt (or use <BOS> if empty).
  2. For each step, feed the current sequence into the model (truncating to max_seq_len if needed).
  3. Apply temperature scaling to the final logits.
  4. Sample the next token from the softmax distribution.
  5. Append the token and repeat until <EOS> or maximum length reached.
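The five steps above can be sketched in plain Python. `next_token_logits` is a hypothetical stand-in for the trained model (it takes a list of token IDs and returns one logit per vocabulary entry), and truncation is assumed to keep the most recent `max_seq_len` tokens, which the report does not specify.

```python
import math
import random

BOS_ID, EOS_ID = 2, 3  # special-token IDs from the vocabulary description


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature); temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(next_token_logits, prompt_ids, max_new_tokens=20,
             max_seq_len=100, temperature=1.0, rng=random):
    """Autoregressive sampling loop following the five steps above."""
    ids = list(prompt_ids) or [BOS_ID]        # step 1: fall back to <BOS>
    for _ in range(max_new_tokens):
        context = ids[-max_seq_len:]          # step 2: assumed to keep the latest tokens
        logits = next_token_logits(context)   # hypothetical model call
        token = sample_with_temperature(logits, temperature, rng)  # steps 3 and 4
        ids.append(token)                     # step 5
        if token == EOS_ID:
            break
    return ids
```

Temperatures below 1 sharpen the distribution toward the most likely token, and temperatures above 1 flatten it.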

3. Experiments

3.1 Training Progress

After one epoch (21,049 batches, 37 minutes), the average loss was 7.51. The reported perplexity is about 1827; exp(7.51) itself evaluates to about 1826, so the reported figure was presumably computed from the unrounded loss. Training ran at 9.39 iterations per second on an NVIDIA GPU with 15.77GB of memory (peak usage roughly 6 to 8GB).
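The figures in this subsection can be cross-checked with a few lines of arithmetic. The only inputs are numbers quoted in the report; the variable names are illustrative.

```python
import math

loss = 7.51            # average cross-entropy in nats, as reported
batches = 21_049       # batches in one epoch, as reported
its_per_sec = 9.39     # training speed, as reported

perplexity = math.exp(loss)
minutes = batches / its_per_sec / 60

print(f"perplexity from rounded loss: {perplexity:.1f}")
print(f"epoch time from throughput:   {minutes:.1f} min")
```

Both values land within rounding of the reported perplexity and the 37-minute epoch time, so the batch count, throughput, and duration are mutually consistent.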

3.2 Generation Examples

Given the starting prompt "天气" (weather), the model at epoch 1 generated five samples:

  1. 天气 是 主要 也 识别 的 最 形式 事情 、 , 一 农耕 重要 。 危机 水平 , 理性 会 兴趣
  2. 天气 3% 的 年 我 , 那边 文化 做 现在 往 层次 对 古方 相互 当时 限度 啊 产权 航空 全国
  3. 天气 是 的 全面 随着 呢 国家 的 一 的 所有 以上 如果 , 状况 导致 , 好坏 , 呢 ,
  4. 天气 这些 这个 的 碧绿 的 强 的 , 的 影响 , 高 产品 研究 , 这个 机构 公演 , 注意
  5. 天气 我 星际 重力 , 而 卫星 呢 先生 小 , , 言行 和 与 的 考虑 现在 比较 可以 ,

Observations:

  • Frequent repetition of "的" and punctuation.
  • Some meaningful bigrams appear: "天气 是" (weather is), "农耕 重要" (farming important), "危机 水平" (crisis level).
  • The model has not yet learned long-range coherence but shows statistical patterns beyond random noise.

3.3 Resource Usage

  • GPU memory: ~6.5 GB (including activations, gradients, and optimizer states)
  • CPU memory: ~4 GB (streaming dataset keeps minimal overhead)
  • Disk storage: 520 MB for checkpoints (model weights only)

4. Discussion

4.1 Performance Analysis

The loss of 7.51 is well below the loss of uniform random guessing over the vocabulary (ln(266216) ≈ 12.49), indicating that the model has learned basic word co-occurrence and local ordering. The generated text is still largely incoherent but contains recognizable Chinese words and simple syntactic fragments. With additional epochs the model should produce more coherent short sentences; the loss of about 5.0 after 5 epochs is a projection, not a measured result.
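To make the comparison with random guessing concrete, the sketch below computes the uniform baseline and the gap to the observed loss. It uses only the vocabulary size and loss quoted in the report.

```python
import math

vocab_size = 266_216
observed_loss = 7.51

# A model that assigns probability 1/V to every token has loss -ln(1/V) = ln(V).
uniform_loss = math.log(vocab_size)

# A gap of g nats means the model is exp(g) times less "surprised" per token
# than the uniform baseline.
gap = uniform_loss - observed_loss
improvement = math.exp(gap)

print(f"uniform loss: {uniform_loss:.2f} nats")
print(f"gap:          {gap:.2f} nats")
print(f"improvement:  about {improvement:.0f}x")
```

The gap of roughly 5 nats means the model's per-token uncertainty is equivalent to choosing among a couple of thousand equally likely tokens rather than the full vocabulary.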

4.2 Limitations

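  • Vocabulary size (266k) leads to a large output projection layer (~68M parameters) and an equally large token embedding table (266,216 × 256 ≈ 68M); together they account for roughly 96% of all parameters, which limits scalability.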
  • The sliding window size of 100 restricts long-distance dependencies; larger windows would improve context but increase memory.
  • Training is done from scratch; fine-tuning a pre-trained Chinese GPT would achieve better results with less data.

4.3 Future Work

  • Increase model capacity (d_model=512, layers=6) if GPU memory permits.
  • Implement learning rate scheduling (cosine decay) to improve convergence.
  • Add validation split to detect overfitting.
  • Explore top-k/top-p sampling to enhance generation diversity.
  • Adapt the pipeline to the Hugging Face transformers library for easier integration.
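As an illustration of the top-k/top-p item, one possible form of the filter is sketched below. It is not part of the current system. The function name and the conventions `top_k=0` (no top-k limit) and `top_p=1.0` (no nucleus limit) are assumptions made for this example.

```python
import math


def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Return probabilities with tokens outside the top-k / nucleus set zeroed out."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break

    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize the surviving tokens
    return filtered
```

The returned list can be passed as `weights` to `random.choices` to draw the next token. Cutting off the low-probability tail is one way to reduce the stray tokens seen in the epoch-1 samples.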

5. Conclusion

This report documents the design and initial training of a transformer-based Chinese text generation model on a 31MB corpus. The model has 141.85 million parameters, reached a loss of 7.51 after one epoch, and shows emerging linguistic patterns. With its memory optimizations (gradient accumulation, mixed precision, gradient checkpointing), training runs within 8GB of GPU memory, and saved checkpoints allow training to be resumed or the model to be used for inference. The generation quality is still rudimentary, but the framework provides a solid baseline for further experimentation and scaling.
