Cognitive Neuroscience Research Report [20260072]

Technical Report: Transformer-Based Chinese Text Generation with Sliding Window and Gradient Optimization

Abstract

This report presents a transformer-based language model for Chinese text generation, trained on a 31MB corpus of 1599 text files. The model employs a sliding window approach to extract fixed-length token sequences, a custom vocabulary of 266,216 tokens, and a causal transformer architecture with 141.85 million parameters. Training incorporates gradient accumulation, mixed precision (FP16), and gradient checkpointing to keep memory use manageable on a GPU with 15.77GB of memory. After one epoch (37 minutes, 21,049 batches), the model achieves a cross-entropy loss of 7.51 (perplexity ≈ 1827). Generated samples show emerging linguistic patterns (e.g., "天气是主要", roughly "weather is mainly") despite limited training. The system supports checkpoint resumption and standalone text generation.

1. Introduction

Large language models have achieved remarkable success in natural language generation, but training from scratch on domain-specific corpora remains computationally intensive. This work develops a lightweight transformer model tailored for Chinese text generation, leveraging a custom pipeline for data preprocessing, streaming dataset handling, and memory-efficient training. The primary contributions are:

A sliding-window preprocessing pipeline using spaCy for Chinese tokenization, producing fixed-length sequences.
A vocabulary of 266,216 tokens constructed from a 31MB corpus of 1599 text files.
A causal transformer with 4 layers, 8 attention heads, and embedding dimension 256.
Training optimizations: gradient accumulation, automatic mixed precision (AMP), gradient checkpointing, and periodic checkpoint saving.
A separate generation script supporting arbitrary starting prompts and temperature-controlled sampling.

2. Methodology

2.1 Data Preprocessing

All .txt files are recursively collected from a root directory. Each file is read with automatic encoding detection (UTF-8, GBK, GB18030) and tokenized using the spaCy Chinese model zh_core_web_sm with non-essential components disabled (tagger, parser, NER, lemmatizer) to improve speed.
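The file collection and decoding step can be sketched as follows. This is a minimal illustration, assuming that "automatic encoding detection" means trying the three listed encodings in order; the original may use a detection library instead. The function names and the final replacement-character fallback are hypothetical, and spaCy tokenization is left out.

```python
from pathlib import Path

# Candidate encodings named in the text, tried in this order (an assumption).
ENCODINGS = ("utf-8", "gbk", "gb18030")


def collect_txt_files(root):
    """Recursively list all .txt files under root in a stable order."""
    return sorted(Path(root).rglob("*.txt"))


def read_text_with_fallback(path, encodings=ENCODINGS):
    """Decode a file with the first candidate encoding that succeeds."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Hypothetical last resort: keep the file, marking undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

Trying UTF-8 first matters because GBK and GB18030 accept many byte sequences that are not valid UTF-8, while the reverse is rarely true.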

A sliding window extracts consecutive token windows of length W (default 100) with step S (default 50), so adjacent windows overlap by 50 tokens. For each window, tokens are mapped to integer IDs using a pre-built vocabulary. The vocabulary is built offline from the entire corpus and contains every token that occurs at least twice, plus four special tokens: <PAD> (0), <UNK> (1), <BOS> (2), <EOS> (3). The final vocabulary size is 266,216.
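A minimal sketch of the vocabulary construction and window extraction described above. The function names are hypothetical. Two details are assumptions: out-of-vocabulary tokens map to <UNK>, and a tail shorter than W is dropped (the report does not say how short tails are handled).

```python
from collections import Counter

# Special tokens receive IDs 0..3, in the order given in the text.
SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]


def build_vocab(token_lists, min_freq=2):
    """Map tokens with corpus frequency >= min_freq to integer IDs."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def sliding_windows(tokens, vocab, window=100, step=50):
    """Yield fixed-length ID windows; unknown tokens map to <UNK>."""
    unk = vocab["<UNK>"]
    ids = [vocab.get(tok, unk) for tok in tokens]
    # Assumption: only full windows are emitted, the short tail is dropped.
    for start in range(0, len(ids) - window + 1, step):
        yield ids[start:start + window]
```

Under these assumptions a document of N ≥ W tokens yields floor((N - W) / S) + 1 windows, and documents shorter than W yield none.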

2.2 Model Architecture

The transformer model is a causal language model inspired by GPT. Key parameters:

Embedding dimension d_model = 256
Number of attention heads nhead = 8
Number of transformer layers num_layers = 4
Maximum sequence length max_seq_len = 100 (matches window size)
Dropout rate = 0.1

The model consists of a token embedding layer, a positional embedding layer, a stack of transformer encoder layers with causal masking (upper triangular mask), and a linear output head projecting to vocabulary size. Total trainable parameters: 141,854,696.
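The reported parameter total can be checked by hand. The sketch below is a back-of-the-envelope count, not the original code. It assumes a feed-forward width of 2048 (the report does not state it; 2048 is the default of PyTorch's nn.TransformerEncoderLayer), untied input and output embeddings, bias terms on every linear layer, and two LayerNorms per layer. Under those assumptions the sum matches the reported figure exactly.

```python
def count_parameters(vocab_size=266_216, d_model=256, num_layers=4,
                     max_seq_len=100, dim_feedforward=2048):
    """Count trainable parameters of the assumed architecture."""
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    # Self-attention: Q, K, V and output projections, each with a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Feed-forward block: two linear layers with biases (width assumed).
    ffn = (d_model * dim_feedforward + dim_feedforward) \
        + (dim_feedforward * d_model + d_model)
    # Two LayerNorms per layer, each with a weight and a bias vector.
    norms = 2 * (2 * d_model)
    layers = num_layers * (attn + ffn + norms)
    # Output head: linear projection to the vocabulary, with bias.
    head = d_model * vocab_size + vocab_size
    return {
        "token_emb": token_emb,
        "pos_emb": pos_emb,
        "layers": layers,
        "head": head,
        "total": token_emb + pos_emb + layers + head,
    }


counts = count_parameters()
print(counts)
```

The breakdown shows that the four transformer layers hold only about 5.3M parameters; the rest sits in the token embedding and the output head.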

2.3 Training Setup

Training uses the following configuration:

Optimizer: AdamW (learning rate 1e-3, weight decay 0.01)
Loss function: Cross-entropy (ignoring padding)
Batch size per device: 8
Gradient accumulation steps: 2 (effective batch size = 16)
Mixed precision: FP16 with GradScaler
Gradient checkpointing: enabled
Checkpoint saving: every epoch
Number of epochs: 5 (one epoch completed as of this report)
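Gradient accumulation with 2 steps of 8 samples is meant to reproduce the gradient of one batch of 16. The toy sketch below illustrates why, using a one-parameter model y = w * x in plain Python rather than the actual training loop; the model, data, and function names are all hypothetical.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size=8, accum_steps=2):
    """Sum micro-batch gradients, each scaled by 1 / accum_steps."""
    grad = 0.0
    for step in range(accum_steps):
        lo = step * micro_batch_size
        micro = batch[lo:lo + micro_batch_size]
        # Dividing by accum_steps keeps the result a mean over all samples.
        grad += mse_grad(w, micro) / accum_steps
    return grad


# 16 toy samples: one effective batch, or two micro-batches of 8.
data = [(float(x), 3.0 * x + 1.0) for x in range(16)]
full = mse_grad(0.5, data)
accum = accumulated_grad(0.5, data)
print(full, accum)
```

In a PyTorch loop with mixed precision, the same idea would typically appear as scaler.scale(loss / accum_steps).backward() on every micro-batch, with scaler.step(optimizer), scaler.update(), and optimizer.zero_grad() run once every accum_steps micro-batches. Whether the original loop divides the loss this way is not stated.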

Data is loaded via a custom IterableDataset that yields fixed-length windows without pre-loading everything into memory, enabling processing of arbitrarily large corpora. Multiple worker processes (default 4) parallelize file reading and tokenization.
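The streaming idea can be sketched with plain generators. With a PyTorch IterableDataset and several workers, each worker runs its own copy of the iterator, so the file list has to be split between workers or every window is produced multiple times. How the original splits files is not stated; the round-robin split below is an assumption, and tokenize and encode are hypothetical callables standing in for the spaCy tokenizer and the vocabulary lookup.

```python
def shard_files(files, worker_id, num_workers=4):
    """Files for one worker: every num_workers-th file, offset by worker_id."""
    return files[worker_id::num_workers]


def stream_windows(files, tokenize, encode, window=100, step=50):
    """Lazily yield ID windows, holding only one file in memory at a time."""
    for path in files:
        ids = encode(tokenize(path))
        for start in range(0, len(ids) - window + 1, step):
            yield ids[start:start + window]
```

Because stream_windows is a generator, nothing is read until the training loop asks for the next window, which is what keeps CPU memory roughly independent of corpus size.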

2.4 Generation Procedure

The generation script loads a saved checkpoint and performs autoregressive sampling:

Tokenize the starting prompt (or use <BOS> if empty).
For each step, feed the current sequence into the model (truncating to max_seq_len if needed).
Apply temperature scaling to the final logits.
Sample the next token from the softmax distribution.
Append the token and repeat until <EOS> or maximum length reached.
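The steps above can be sketched without any deep learning framework. Here next_logits is a hypothetical callable that stands in for the trained model and returns one logit per vocabulary entry for the next position. Truncation is assumed to keep the most recent max_seq_len tokens, and the default IDs follow the special-token table in Section 2.1.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Sample a token ID from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtracting the maximum keeps exp() from overflowing.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens=30, max_seq_len=100,
             temperature=1.0, bos_id=2, eos_id=3, rng=random):
    """Autoregressive sampling until <EOS> or max_new_tokens is reached."""
    ids = list(prompt_ids) or [bos_id]
    for _ in range(max_new_tokens):
        # Assumption: keep only the most recent max_seq_len tokens.
        context = ids[-max_seq_len:]
        token = sample_next(next_logits(context), temperature, rng)
        ids.append(token)
        if token == eos_id:
            break
    return ids
```

Temperatures below 1 sharpen the distribution toward the highest-scoring tokens, while temperatures above 1 flatten it and increase diversity.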

3. Experiments

3.1 Training Progress

After one epoch (21,049 batches, 37 minutes), the average loss was 7.51, corresponding to a perplexity of exp(7.51) ≈ 1827. The training speed was 9.39 iterations per second on an NVIDIA GPU with 15.77GB memory (peak usage ~6-8GB).
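The perplexity and timing figures follow directly from the reported numbers. The short check below uses only values quoted in this section. Because the loss is rounded to two decimals, exp(7.51) evaluates to roughly 1826, so the reported 1827 presumably comes from the unrounded loss.

```python
import math

avg_loss = 7.51          # mean cross-entropy in nats, as reported
batches = 21_049         # batches in one epoch
its_per_sec = 9.39       # reported throughput

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(avg_loss)

# Wall-clock time implied by the batch count and throughput.
epoch_minutes = batches / its_per_sec / 60

print(f"perplexity = {perplexity:.0f}, epoch = {epoch_minutes:.1f} min")
```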

3.2 Generation Examples

Given the starting prompt "天气" (weather), the model at epoch 1 generated five samples:

天气是主要也识别的最形式事情、，一农耕重要。危机水平，理性会兴趣
天气 3％的年我，那边文化做现在往层次对古方相互当时限度啊产权航空全国
天气是的全面随着呢国家的一的所有以上如果，状况导致，好坏，呢，
天气这些这个的碧绿的强的，的影响，高产品研究，这个机构公演，注意
天气我星际重力，而卫星呢先生小，，言行和与的考虑现在比较可以，

Observations:

Frequent repetition of "的" and punctuation.
Some meaningful bigrams appear ("天气是", "农耕重要", "危机水平").
The model has not yet learned long-range coherence but shows statistical patterns beyond random noise.

3.3 Resource Usage

GPU memory: ~6.5 GB (including activations, gradients, and optimizer states)
CPU memory: ~4 GB (streaming dataset keeps minimal overhead)
Disk storage: 520 MB for checkpoints (model weights only)

4. Discussion

4.1 Performance Analysis

The loss of 7.51 is well below the uniform-guessing baseline (ln(266216) ≈ 12.49), indicating that the model has learned basic word co-occurrence and local ordering. The generated text is still incoherent, but it contains recognizable Chinese words and simple syntactic fragments. Only one of the five planned epochs has been completed, so the projected loss of ~5.0 after 5 epochs is an extrapolation rather than a measurement; if it holds, the model should produce more coherent short sentences.
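The baseline comparison can be reproduced in a few lines. A model that assigns equal probability to every vocabulary entry has a cross-entropy of ln(V) and a perplexity of V, so the gap to the measured loss shows how far training has moved from chance. The variable names are illustrative.

```python
import math

vocab_size = 266_216
measured_loss = 7.51

# Cross-entropy of a uniform distribution over the vocabulary, in nats.
uniform_loss = math.log(vocab_size)

# Improvement over chance, in nats and as a perplexity ratio.
gap_nats = uniform_loss - measured_loss
perplexity_ratio = vocab_size / math.exp(measured_loss)

print(f"uniform = {uniform_loss:.2f}, gap = {gap_nats:.2f} nats")
```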

4.2 Limitations

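Vocabulary size (266k) leads to a large output projection layer (~68M parameters, 266,216 × 256) and a token embedding table of about the same size, so most of the 141.85M parameters depend on the vocabulary rather than on the transformer layers; this limits scalability.
The sliding window size of 100 restricts long-distance dependencies; larger windows would improve context but increase memory.
Training is done from scratch; fine-tuning a pre-trained Chinese GPT would likely achieve better results with less data.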

4.3 Future Work

Increase model capacity (d_model=512, layers=6) if GPU memory permits.
Implement learning rate scheduling (cosine decay) to improve convergence.
Add validation split to detect overfitting.
Explore top-k/top-p sampling to enhance generation diversity.
Convert the pipeline to support HuggingFace's transformers for easier integration.
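As an illustration of the cosine decay item above, the schedule can be written as a closed-form function of the optimizer step. This is a sketch of one common formulation, not a setting used in the reported run; base_lr matches the learning rate in Section 2.3, while min_lr and total_steps are hypothetical choices.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Learning rate that decays from base_lr to min_lr along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at base_lr, passes through the midpoint of base_lr and min_lr halfway through training, and stays at min_lr for any step past total_steps.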

5. Conclusion

This report documents the design and initial training of a transformer-based Chinese text generation model on a 31MB corpus. The model, featuring 141.85 million parameters, achieved a loss of 7.51 after one epoch and demonstrated emerging linguistic patterns. With optimizations for memory efficiency (gradient accumulation, mixed precision, checkpointing), the system runs within 8GB GPU memory and can be easily resumed or used for inference. While the generation quality is still rudimentary, the framework provides a solid baseline for further experimentation and scaling.

References

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
spaCy: Industrial-strength NLP (https://spacy.io)
PyTorch: Automatic Mixed Precision (https://pytorch.org/docs/stable/amp.html)