Technical Report: Transformer-Based Chinese Text Generation with Sliding Window and Gradient Optimization
Abstract
This report presents a transformer-based language model for Chinese text generation, trained on a 31MB corpus of 1599 text files. The model employs a sliding window approach to extract fixed-length token sequences, a custom vocabulary of 266,216 tokens, and a causal transformer architecture with 141.85 million parameters. Training incorporates gradient accumulation, mixed precision (FP16), and gradient checkpointing to manage GPU memory (15.77GB total). After one epoch (37 minutes, 21,049 batches), the model achieves a cross-entropy loss of 7.51 (perplexity ≈ 1827). Generated samples show emerging linguistic patterns (e.g., "天气 是 主要", roughly "weather is mainly") despite limited training. The system supports checkpoint resumption and standalone text generation.
1. Introduction
Large language models have achieved remarkable success in natural language generation, but training from scratch on domain-specific corpora remains computationally intensive. This work develops a lightweight transformer model tailored for Chinese text generation, leveraging a custom pipeline for data preprocessing, streaming dataset handling, and memory-efficient training. The primary contributions are:
- A sliding-window preprocessing pipeline using spaCy for Chinese tokenization, producing fixed-length sequences.
- A vocabulary of 266,216 tokens constructed from a 31MB corpus of 1599 text files.
- A causal transformer with 4 layers, 8 attention heads, and embedding dimension 256.
- Training optimizations: gradient accumulation, automatic mixed precision (AMP), gradient checkpointing, and periodic checkpoint saving.
- A separate generation script supporting arbitrary starting prompts and temperature-controlled sampling.
2. Methodology
2.1 Data Preprocessing
All .txt files are recursively collected from a root directory. Each file is read with automatic encoding detection (UTF-8, GBK, GB18030) and tokenized using the spaCy Chinese model zh_core_web_sm with non-essential components disabled (tagger, parser, NER, lemmatizer) to improve speed.
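The file collection and encoding fallback described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the function names `collect_txt_files` and `read_text` are hypothetical, and trying the encodings in the listed order (with a lossy UTF-8 decode as a last resort) is an assumption about how "automatic encoding detection" works. The spaCy tokenization step is not shown.

```python
from pathlib import Path

# Encodings tried in order; the order is an assumption based on the text.
ENCODINGS = ("utf-8", "gbk", "gb18030")

def collect_txt_files(root):
    """Recursively collect all .txt files under root, in a stable order."""
    return sorted(Path(root).rglob("*.txt"))

def read_text(path):
    """Decode a file with the first encoding that succeeds."""
    raw = Path(path).read_bytes()
    for enc in ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Assumed last resort: keep the file, replacing undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

Trying UTF-8 first matters because GBK will decode many byte sequences without raising an error, so a permissive encoding placed first could silently produce wrong characters.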
A sliding window extracts consecutive token windows of length W (default 100) with step S (default 50). For each window, tokens are mapped to integer IDs using a pre-built vocabulary. The vocabulary is built offline from the entire corpus, containing common tokens with frequency ≥2, plus special tokens: <PAD> (0), <UNK> (1), <BOS> (2), <EOS> (3). The final vocabulary size is 266,216.
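A minimal sketch of the vocabulary construction and window extraction, under stated assumptions: the helper names `build_vocab` and `sliding_windows` are hypothetical, tokens outside the vocabulary are mapped to <UNK>, and a trailing fragment shorter than W is dropped (the report does not say whether such tails are padded or discarded).

```python
from collections import Counter

SPECIALS = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]  # IDs 0-3, as listed in the text

def build_vocab(token_lists, min_freq=2):
    """Map special tokens to 0-3, then every token seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok, n in counts.most_common():
        if n >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def sliding_windows(tokens, vocab, window=100, step=50):
    """Yield full windows of token IDs; unknown tokens become <UNK>."""
    unk = vocab["<UNK>"]
    ids = [vocab.get(t, unk) for t in tokens]
    # Assumption: only complete windows are emitted, so a short tail is dropped.
    for start in range(0, len(ids) - window + 1, step):
        yield ids[start:start + window]
```

With W = 100 and S = 50, consecutive windows overlap by half, so most tokens appear in two training sequences.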
2.2 Model Architecture
The transformer model is a causal language model inspired by GPT. Key parameters:
- Embedding dimension d_model = 256
- Number of attention heads nhead = 8
- Number of transformer layers num_layers = 4
- Maximum sequence length max_seq_len = 100 (matches window size)
- Dropout rate = 0.1
The model consists of a token embedding layer, a positional embedding layer, a stack of transformer encoder layers with causal masking (upper triangular mask), and a linear output head projecting to vocabulary size. Total trainable parameters: 141,854,696.
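The parameter total can be checked with simple arithmetic. The sketch below rests on assumptions the report does not state: a learned positional embedding with max_seq_len rows, a feed-forward width of 2048 (the default of PyTorch's nn.TransformerEncoderLayer), separate (untied) embedding and output matrices, biases on all linear layers, and no final layer normalization. The function name `count_parameters` is hypothetical.

```python
def count_parameters(vocab_size=266_216, d_model=256, num_layers=4,
                     max_seq_len=100, dim_feedforward=2048):
    """Count trainable parameters of the assumed architecture, by component."""
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * d_model * d_model + 4 * d_model
    # Two linear layers: d_model -> dim_feedforward -> d_model, with biases.
    feed_forward = (d_model * dim_feedforward + dim_feedforward
                    + dim_feedforward * d_model + d_model)
    layer_norms = 2 * 2 * d_model  # two LayerNorms per layer (weight + bias)
    encoder = num_layers * (attention + feed_forward + layer_norms)
    output_head = d_model * vocab_size + vocab_size
    return {
        "token_emb": token_emb,
        "pos_emb": pos_emb,
        "encoder": encoder,
        "output_head": output_head,
        "total": token_emb + pos_emb + encoder + output_head,
    }

if __name__ == "__main__":
    for name, value in count_parameters().items():
        print(f"{name:>12}: {value:,}")
```

Under these assumptions the total comes to 141,854,696, matching the reported figure. The token embedding and output head together account for about 136.6M of it, while the four transformer layers contribute only about 5.3M.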
2.3 Training Setup
Training uses the following configuration:
- Optimizer: AdamW (learning rate 1e-3, weight decay 0.01)
- Loss function: Cross-entropy (ignoring padding)
- Batch size per device: 8
- Gradient accumulation steps: 2 (effective batch size = 16)
- Mixed precision: FP16 with GradScaler
- Gradient checkpointing: enabled
- Checkpoint saving: every epoch
- Number of epochs: 5 (one epoch completed as of report)
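How gradient accumulation and mixed precision fit together in one loop can be sketched as below. This is an illustration under assumptions, not the report's training script: the function name `train_steps` is hypothetical, the model is assumed to return logits of shape (batch, sequence, vocabulary), FP16 autocast is enabled only when a CUDA device is used, and gradient checkpointing (available through torch.utils.checkpoint) is left out for brevity.

```python
import torch
from torch import nn

PAD_ID = 0        # <PAD> index, ignored by the loss
ACCUM_STEPS = 2   # gradient accumulation steps from the configuration above

def train_steps(model, batches, optimizer, device):
    """Run optimizer updates over (inputs, targets) batches; return update count."""
    use_amp = device.type == "cuda"  # assumption: FP16 only on GPU
    # Older-style API; recent PyTorch releases also offer torch.amp.GradScaler.
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    amp_dtype = torch.float16 if use_amp else torch.bfloat16
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
    model.train()
    optimizer.zero_grad(set_to_none=True)
    updates = 0
    for i, (inputs, targets) in enumerate(batches, start=1):
        inputs, targets = inputs.to(device), targets.to(device)
        with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
            logits = model(inputs)
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        # Divide by ACCUM_STEPS so the summed gradients average the micro-batches.
        scaler.scale(loss / ACCUM_STEPS).backward()
        if i % ACCUM_STEPS == 0:
            scaler.step(optimizer)   # unscales gradients, then optimizer.step()
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            updates += 1
    return updates
```

A caller would pair this with `torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)`. With a per-device batch of 8 and two accumulation steps, each optimizer update then reflects 16 sequences.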
Data is loaded via a custom IterableDataset that yields fixed-length windows without pre-loading everything into memory, enabling processing of arbitrarily large corpora. Multiple worker processes (default 4) parallelize file reading and tokenization.
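The streaming idea can be illustrated with a plain generator. This is a sketch under assumptions: `stream_windows` is a hypothetical name, files are split across workers by index (the report does not say how work is divided), files are read as UTF-8 for simplicity, and the `tokenize` argument stands in for the spaCy tokenizer. In PyTorch, the same logic would sit inside `IterableDataset.__iter__`, with `torch.utils.data.get_worker_info()` supplying the worker index and count.

```python
from pathlib import Path

def stream_windows(paths, tokenize, worker_id=0, num_workers=1,
                   window=100, step=50):
    """Lazily yield token windows from the files assigned to one worker."""
    # Each worker takes every num_workers-th file, so shards never overlap.
    for path in paths[worker_id::num_workers]:
        text = Path(path).read_text(encoding="utf-8")
        tokens = tokenize(text)
        # Only one file's tokens are held in memory at a time.
        for start in range(0, len(tokens) - window + 1, step):
            yield tokens[start:start + window]
```

Without some form of sharding, every worker of an iterable dataset would replay the full corpus and each window would be seen several times per epoch.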
2.4 Generation Procedure
The generation script loads a saved checkpoint and performs autoregressive sampling:
- Tokenize the starting prompt (or use <BOS> if empty).
- For each step, feed the current sequence into the model (truncating to max_seq_len if needed).
- Apply temperature scaling to the final logits.
- Sample the next token from the softmax distribution.
- Append the token and repeat until <EOS> or maximum length reached.
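The sampling loop above can be sketched in plain Python. The names `sample_next` and `generate` are hypothetical, and `next_logits` is an assumed callable that stands in for the trained model: it takes a list of token IDs and returns one logit per vocabulary entry for the next position. Truncation is assumed to keep the most recent max_seq_len tokens.

```python
import math
import random

BOS_ID, EOS_ID = 2, 3  # special-token IDs from the vocabulary description

def sample_next(logits, temperature=1.0, rng=random):
    """Sample one token ID from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def generate(next_logits, prompt_ids, max_seq_len=100, max_new_tokens=50,
             temperature=1.0, rng=random):
    """Autoregressively extend prompt_ids until <EOS> or the length limit."""
    ids = list(prompt_ids) or [BOS_ID]   # empty prompt starts from <BOS>
    for _ in range(max_new_tokens):
        context = ids[-max_seq_len:]     # keep only the most recent tokens
        token = sample_next(next_logits(context), temperature, rng)
        ids.append(token)
        if token == EOS_ID:
            break
    return ids
```

Temperatures below 1 sharpen the distribution toward the highest-scoring tokens, while temperatures above 1 flatten it and increase variety.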
3. Experiments
3.1 Training Progress
After one epoch (21,049 batches, 37 minutes), the average loss was 7.51, corresponding to a perplexity of exp(7.51) ≈ 1827. The training speed was 9.39 iterations per second on an NVIDIA GPU with 15.77GB memory (peak usage ~6-8GB).
3.2 Generation Examples
Given the starting prompt "天气" (weather), the model at epoch 1 generated five samples:
- 天气 是 主要 也 识别 的 最 形式 事情 、 , 一 农耕 重要 。 危机 水平 , 理性 会 兴趣
- 天气 3% 的 年 我 , 那边 文化 做 现在 往 层次 对 古方 相互 当时 限度 啊 产权 航空 全国
- 天气 是 的 全面 随着 呢 国家 的 一 的 所有 以上 如果 , 状况 导致 , 好坏 , 呢 ,
- 天气 这些 这个 的 碧绿 的 强 的 , 的 影响 , 高 产品 研究 , 这个 机构 公演 , 注意
- 天气 我 星际 重力 , 而 卫星 呢 先生 小 , , 言行 和 与 的 考虑 现在 比较 可以 ,
Observations:
- Frequent repetition of "的" and punctuation.
- Some meaningful bigrams appear ("天气 是" / "weather is", "农耕 重要" / "farming important", "危机 水平" / "crisis level").
- The model has not yet learned long-range coherence but shows statistical patterns beyond random noise.
3.3 Resource Usage
- GPU memory: ~6.5 GB (including activations, gradients, and optimizer states)
- CPU memory: ~4 GB (streaming dataset keeps minimal overhead)
- Disk storage: 520 MB for checkpoints (model weights only)
4. Discussion
4.1 Performance Analysis
The loss of 7.51 is well below the loss of a uniform guess over the vocabulary (ln(266216) ≈ 12.49), indicating that the model has learned basic word co-occurrence and local ordering. The generated text is not yet grammatical, but it contains recognizable Chinese words and simple syntactic fragments. With the remaining epochs the model should produce more coherent short sentences; the expected loss of about 5.0 after 5 epochs is an estimate, not a measured result.
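The comparison with random guessing follows from two identities: perplexity is the exponential of the cross-entropy loss (in nats), and a uniform distribution over V tokens has loss ln(V). A small sketch, with hypothetical helper names:

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is exp of the average cross-entropy, measured in nats."""
    return math.exp(cross_entropy_nats)

def uniform_baseline_loss(vocab_size):
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)

if __name__ == "__main__":
    loss, vocab_size = 7.51, 266_216
    baseline = uniform_baseline_loss(vocab_size)
    print(f"uniform baseline loss: {baseline:.2f} nats")
    print(f"perplexity at loss {loss}: {perplexity(loss):.0f}")
    print(f"improvement over baseline: {baseline - loss:.2f} nats")
```

Read this way, the perplexity says the model is on average about as uncertain as a uniform choice among a couple of thousand tokens, compared with 266,216 for a uniform guess.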
4.2 Limitations
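- The large vocabulary (266k tokens) makes the output projection layer alone about 68M parameters, roughly half of the model, and the token embedding table is of similar size. Vocabulary-sized matrices therefore dominate the parameter budget and limit scalability.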
- The sliding window size of 100 restricts long-distance dependencies; larger windows would improve context but increase memory.
- Training is done from scratch; fine-tuning a pre-trained Chinese GPT would achieve better results with less data.
4.3 Future Work
- Increase model capacity (d_model=512, layers=6) if GPU memory permits.
- Implement learning rate scheduling (cosine decay) to improve convergence.
- Add validation split to detect overfitting.
- Explore top-k/top-p sampling to enhance generation diversity.
- Convert the pipeline to support HuggingFace's transformers for easier integration.
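As an illustration of the top-k/top-p item above, the sketch below filters a next-token distribution before sampling. It is not part of the current system; the function names are hypothetical, and the nucleus rule used here (keep the smallest set of most likely tokens whose cumulative probability reaches top_p) is one common formulation.

```python
import math

def softmax(logits):
    """Convert logits to probabilities with the usual max-shift for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Zero out tokens outside the top-k / nucleus set and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]            # keep the k most likely tokens
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:          # nucleus reached
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered
```

The filtered list can be passed as weights to `random.choices` in place of the full softmax output, which removes the long tail of unlikely tokens from sampling.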
5. Conclusion
This report documents the design and initial training of a transformer-based Chinese text generation model on a 31MB corpus. The model, featuring 141.85 million parameters, achieved a loss of 7.51 after one epoch and demonstrated emerging linguistic patterns. With optimizations for memory efficiency (gradient accumulation, mixed precision, gradient checkpointing), the system runs within 8GB of GPU memory, and training can be resumed from saved checkpoints or the model used directly for inference. While the generation quality is still rudimentary, the framework provides a solid baseline for further experimentation and scaling.
References
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
- spaCy: Industrial-strength NLP (https://spacy.io)
- PyTorch: Automatic Mixed Precision (https://pytorch.org/docs/stable/amp.html)