PDF Parser Code Walkthrough: From File Structure to Cross-Reference Table Parsing

Contents

- Part 1: Overall structure of a PDF file
- Part 2: StartParse - the parsing entry point
- Part 3: Locating startxref and the cross-reference table
  - PDF file structure: the file trailer
  - Code walkthrough
- Part 4: LoadAllCrossRefV4 - loading the chain of classic XRef tables
  - PDF file structure: XRef table chain (incremental updates)
  - Code walkthrough
- Part 5: LoadCrossRefV4 - parsing a single XRef table
- Part 6: LoadAllCrossRefV5 - loading the XRef stream format
- Part 7: GetVarInt - reading variable-width integers
- Part 8: The object map (m_ObjectInfo)
  - Object type examples
- Part 9: A complete parsing example

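This article follows the actual layout of a PDF file to explain, step by step, how the parser works. Each part first introduces the relevant piece of the PDF format and then shows how the PDFium CPDF_Parser code parses it.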

Part 1: Overall structure of a PDF file

A standard PDF file consists of four parts:

┌─────────────────────────────────────┐
│ 1. Header                           │  ← "%PDF-1.4"
├─────────────────────────────────────┤
│ 2. Body                             │  ← indirect objects: 1 0 obj ... endobj
│    - page objects                   │     2 0 obj ... endobj
│    - resource objects               │     ...
├─────────────────────────────────────┤
│ 3. Cross-reference table (XRef)     │  ← "xref" + list of entries
├─────────────────────────────────────┤
│ 4. Trailer                          │  ← "trailer" + dictionary + "startxref" + offset + "%%EOF"
└─────────────────────────────────────┘

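Part 2: StartParse - the parsing entry point

PDF file structure: the file header

%PDF-1.4        ← bytes 1-8: "%PDF-" followed by the version number
%âãÏÓ           ← second line: a comment holding four or more bytes ≥ 128 (optional; marks the file as binary)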

Code walkthrough

CPDF_Parser::Error CPDF_Parser::StartParse(
    const CFX_RetainPtr<IFX_SeekableReadStream>& pFileAccess,
    CPDF_Document* pDocument) {
  
  // Step 1: find the PDF file header.
  int32_t offset = GetHeaderOffset(pFileAccess);
  // GetHeaderOffset() scans the first 1024 bytes of the file for "%PDF".
  // Some files carry junk before "%PDF" (mail attachments, HTTP headers).
  // The returned offset is the position of the '%' character.
  if (offset == -1)  // no PDF header found
    return FORMAT_ERROR;

  // Step 2: initialize the syntax parser, skipping the junk before the header.
  m_pSyntax->InitParser(pFileAccess, offset);
  
  // The version digits are the 6th and 8th characters of the header.
  // For "%PDF-1.4": '1' is at index 5 (0-based) and '4' is at index 7.
  uint8_t ch;
  m_pSyntax->GetCharAt(5, ch);   // major version digit
  if (std::isdigit(ch))
    m_FileVersion = FXSYS_DecimalCharToInt(ch) * 10;  // 1*10 = 10
  
  m_pSyntax->GetCharAt(7, ch);   // minor version digit
  if (std::isdigit(ch))
    m_FileVersion += FXSYS_DecimalCharToInt(ch);      // 10+4 = 14 (meaning 1.4)

  // (the function continues in Part 3)

Example: locating the file header

File contents (byte offsets):
0: "HTTP/1.1 200 OK\r\n"     ← junk data (here, an HTTP response header)
   "Content-Type: ...\r\n"
   "\r\n"
50: "%PDF-1.4\r\n"            ← GetHeaderOffset returns 50
    "%âãÏÓ\r\n"
    "1 0 obj\r\n"             ← the first object starts here

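The header search and the version encoding can be sketched in isolation. This is a minimal sketch, assuming the whole file is already in memory as a std::string; FindHeaderOffset and ParseFileVersion are hypothetical names chosen for illustration, not PDFium functions.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of "%PDF" if it starts within the
// first 1024 offsets of `data`, or -1 if it is absent.
int FindHeaderOffset(const std::string& data) {
  const std::size_t kMaxOffset = 1024;
  for (std::size_t off = 0; off <= kMaxOffset && off + 4 <= data.size(); ++off) {
    if (data.compare(off, 4, "%PDF") == 0)
      return static_cast<int>(off);
  }
  return -1;
}

// Hypothetical helper: encodes "%PDF-M.m" found at `header_offset` as
// M * 10 + m, the same encoding used for m_FileVersion above.
// Returns 0 when the header is too short or the digits are missing.
int ParseFileVersion(const std::string& data, int header_offset) {
  assert(header_offset >= 0);
  const std::size_t base = static_cast<std::size_t>(header_offset);
  if (base + 8 > data.size())
    return 0;
  const unsigned char major = static_cast<unsigned char>(data[base + 5]);
  const unsigned char minor = static_cast<unsigned char>(data[base + 7]);
  int version = 0;
  if (std::isdigit(major))
    version = (major - '0') * 10;
  if (std::isdigit(minor))
    version += minor - '0';
  return version;
}
```

All later offsets in the file are then interpreted relative to the header offset, which is why the parser is initialized with it.
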
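Part 3: Locating startxref and the cross-reference table

PDF file structure: the file trailer

trailer                        ← start of the trailer
<<
/Size 9                        ← highest object number + 1 (objects 0-8)
/Root 1 0 R                    ← reference to the document catalog
/Info 8 0 R                    ← document information dictionary
/ID [<...>]                    ← file identifier
>>
startxref                      ← keyword
54321                          ← byte offset of the last XRef section
%%EOF                          ← end-of-file marker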

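Code walkthrough

  // Step 3: move near the end of the file and prepare to search for "startxref".
  // The PDF Reference notes that Acrobat only requires "%%EOF" to appear within
  // the last 1024 bytes of the file; this code searches up to 4096 bytes back.
  // First make sure the file is long enough to hold the keyword.
  if (m_pSyntax->m_FileLen < m_pSyntax->m_HeaderOffset + 9)
    return FORMAT_ERROR;

  // Position 9 bytes before the end of the file (9 is the length of "startxref").
  m_pSyntax->SetPos(m_pSyntax->m_FileLen - m_pSyntax->m_HeaderOffset - 9);
  
  bool bXRefRebuilt = false;
  
  // Step 4: search backward for the "startxref" keyword, at most 4096 bytes.
  // BackwardsSearchToWord searches from the current position toward the start of the file.
  if (m_pSyntax->BackwardsSearchToWord("startxref", 4096)) {
    // Record the position of "startxref" in the sorted offset set.
    m_SortedOffset.insert(m_pSyntax->GetPos());
    
    // Consume the "startxref" keyword itself.
    m_pSyntax->GetKeyword();
    
    // Read the number that follows (the offset of the XRef section).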
    bool bNumber;
    CFX_ByteString xrefpos_str = m_pSyntax->GetNextWord(&bNumber);
    if (!bNumber)
      return FORMAT_ERROR;
    
    m_LastXRefOffset = (FX_FILESIZE)FXSYS_atoi64(xrefpos_str.c_str());
    
    // Try the classic table format (V4) first, then the XRef stream format (V5).
    if (!LoadAllCrossRefV4(m_LastXRefOffset) &&
        !LoadAllCrossRefV5(m_LastXRefOffset)) {
      // Both failed: rebuild the cross-reference data.
      if (!RebuildCrossRef())
        return FORMAT_ERROR;
      bXRefRebuilt = true;
      m_LastXRefOffset = 0;
    }
  }
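
The backward search can be illustrated without the syntax parser. This is a minimal sketch, assuming the file is held in a std::string and ignoring the header-offset adjustment; FindLastXRefOffset is a hypothetical name, and it does not guard against numeric overflow.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: finds the last "startxref" within the final `limit`
// bytes of `data` and returns the decimal number that follows it, or -1 if
// the keyword or the number is missing.
long long FindLastXRefOffset(const std::string& data, std::size_t limit = 4096) {
  assert(limit > 0);
  const std::string kKeyword = "startxref";
  const std::size_t window_start = data.size() > limit ? data.size() - limit : 0;
  const std::size_t found = data.rfind(kKeyword);
  if (found == std::string::npos || found < window_start)
    return -1;

  // Skip the whitespace between the keyword and the number.
  std::size_t i = found + kKeyword.size();
  while (i < data.size() && std::isspace(static_cast<unsigned char>(data[i])))
    ++i;
  if (i == data.size() || !std::isdigit(static_cast<unsigned char>(data[i])))
    return -1;  // corresponds to the !bNumber check above

  long long offset = 0;
  while (i < data.size() && std::isdigit(static_cast<unsigned char>(data[i]))) {
    offset = offset * 10 + (data[i] - '0');
    ++i;
  }
  return offset;
}
```

Searching from the end matters for incrementally updated files: they contain several startxref keywords, and only the last one points at the newest cross-reference section.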

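Part 4: LoadAllCrossRefV4 - loading the chain of classic XRef tables

PDF file structure: XRef table chain (incremental updates)

PDF supports incremental saving: each save appends new content, and the trailers form a linked list through /Prev:

First version of the file:
┌──────────────────┐
│ original objects │
├──────────────────┤
│ xref             │ ← XRef1
│ trailer          │    (no /Prev; the code reads it as 0)
└──────────────────┘

Second version (changes appended):
┌──────────────────┐
│ original objects │
├──────────────────┤
│ xref             │ ← XRef1
│ trailer          │    (no /Prev)
├──────────────────┤
│ new or modified  │
│ objects          │
├──────────────────┤
│ xref             │ ← XRef2
│ trailer /Prev    │    (/Prev = offset of XRef1)
└──────────────────┘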

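Code walkthrough

bool CPDF_Parser::LoadAllCrossRefV4(FX_FILESIZE xrefpos) {
  // Step 1: visit the newest XRef table with bSkip=true, which skips over the
  // entries and leaves the parser positioned at the trailer.
  if (!LoadCrossRefV4(xrefpos, 0, true))
    return false;

  // Step 2: load the trailer dictionary that follows the table.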
  m_pTrailer = LoadTrailerV4();
  if (!m_pTrailer)
    return false;

  // Step 3: read /Size (highest object number + 1) and trim the object map.
  int32_t xrefsize = GetDirectInteger(m_pTrailer.get(), "Size");
  if (xrefsize > 0 && xrefsize <= kMaxXRefSize)
    ShrinkObjectMap(xrefsize);  // drop object info beyond /Size

  // Step 4: prepare to walk the /Prev chain.
  std::vector<FX_FILESIZE> CrossRefList;   // offsets of all XRef tables
  std::vector<FX_FILESIZE> XRefStreamList; // /XRefStm offsets (hybrid-reference files)
  std::set<FX_FILESIZE> seen_xrefpos;      // guards against circular /Prev chains

  CrossRefList.push_back(xrefpos);
  XRefStreamList.push_back(GetDirectInteger(m_pTrailer.get(), "XRefStm"));
  seen_xrefpos.insert(xrefpos);

  // Step 5: follow the /Prev chain through all earlier XRef tables.
  xrefpos = GetDirectInteger(m_pTrailer.get(), "Prev");
  while (xrefpos) {
    // Reject circular chains.
    if (pdfium::ContainsKey(seen_xrefpos, xrefpos))
      return false;
    seen_xrefpos.insert(xrefpos);

    // Insert at the front so the list stays in chronological order (oldest → newest).
    CrossRefList.insert(CrossRefList.begin(), xrefpos);
    LoadCrossRefV4(xrefpos, 0, true);  // skip the entries to reach the trailer

    std::unique_ptr<CPDF_Dictionary> pDict(LoadTrailerV4());
    if (!pDict)
      return false;

    xrefpos = GetDirectInteger(pDict.get(), "Prev");
    XRefStreamList.insert(XRefStreamList.begin(),
                          pDict->GetIntegerFor("XRefStm"));
    m_Trailers.push_back(std::move(pDict));
  }

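  // Step 6: load the entries of every XRef table in chronological order
  // (oldest to newest), so that newer entries overwrite older ones.
  for (size_t i = 0; i < CrossRefList.size(); ++i) {
    if (!LoadCrossRefV4(CrossRefList[i], XRefStreamList[i], false))
      return false;
    // Verify once, after the first table in the list (index 0, the oldest) is loaded.
    if (i == 0 && !VerifyCrossRefV4())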
      return false;
  }
  return true;
}
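
The chain walk with its loop guard can be modelled on plain data. This is a minimal sketch, assuming each trailer has already been reduced to its /Prev value; CollectXRefChain and the prev_of map are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical model: `prev_of` maps the offset of an XRef section to the
// /Prev value of its trailer (0 when the trailer has no /Prev).
// Returns the offsets ordered oldest to newest, or an empty vector if the
// chain loops or points at an offset that is not in the map.
std::vector<long> CollectXRefChain(const std::map<long, long>& prev_of,
                                   long newest) {
  assert(newest > 0);
  std::vector<long> newest_first;
  std::set<long> seen;
  for (long pos = newest; pos != 0;) {
    if (!seen.insert(pos).second)
      return {};  // offset seen before: circular chain
    const auto it = prev_of.find(pos);
    if (it == prev_of.end())
      return {};  // dangling /Prev
    newest_first.push_back(pos);
    pos = it->second;
  }
  // Reverse once at the end to get chronological order.
  return std::vector<long>(newest_first.rbegin(), newest_first.rend());
}
```

The sketch appends and reverses once instead of inserting at the front on every step as the code above does; the resulting order is the same, and loading entries in that order lets later sections overwrite earlier ones.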

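Part 5: LoadCrossRefV4 - parsing a single XRef table

PDF file structure: XRef table format in detail

xref                    ← keyword
0 5                     ← subsection 1: first object 0, 5 entries
0000000000 65535 f      ← entry 0: free object (always free, generation 65535)
0000000016 00000 n      ← entry 1: offset 16, generation 0
0000000081 00000 n      ← entry 2: offset 81, generation 0
0000000146 00000 n      ← entry 3: offset 146, generation 0
0000000220 00000 n      ← entry 4: offset 220, generation 0

5 3                     ← subsection 2: first object 5, 3 entries
0000000330 00000 n      ← entry 5
0000000388 00000 n      ← entry 6
0000000430 00000 n      ← entry 7

trailer                 ← trailer keyword
<<                      ← start of the dictionary
/Size 8
/Root 1 0 R
>>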

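Entry format in detail (20 bytes)

Bytes   Example       Description
0-9     "0000000016"  10-digit decimal byte offset, padded with leading zeros
10      " "           single space
11-15   "00000"       5-digit generation number, padded with leading zeros
16      " "           single space
17      "n"           state: n = in use, f = free
18-19   "\r\n"        2-byte end-of-line: CR LF, space + LF, or space + CR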
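
The fixed layout makes a single entry easy to validate and decode. This is a minimal sketch of a strict decoder for one record; XRefEntry and ParseXRefEntry are hypothetical names, and the checks are stricter than the atoi-based code in this part, which tolerates some malformed entries.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

struct XRefEntry {            // hypothetical type for this sketch
  std::uint64_t offset;       // bytes 0-9
  std::uint32_t generation;   // bytes 11-15
  bool in_use;                // byte 17: 'n' = in use, 'f' = free
};

// Parses one 20-byte record. Returns false if the layout is violated.
bool ParseXRefEntry(const std::string& record, XRefEntry* out) {
  assert(out != nullptr);
  if (record.size() != 20 || record[10] != ' ' || record[16] != ' ')
    return false;
  if (record[17] != 'n' && record[17] != 'f')
    return false;

  std::uint64_t offset = 0;
  for (int i = 0; i < 10; ++i) {
    if (!std::isdigit(static_cast<unsigned char>(record[i])))
      return false;
    offset = offset * 10 + static_cast<std::uint64_t>(record[i] - '0');
  }
  std::uint32_t generation = 0;
  for (int i = 11; i < 16; ++i) {
    if (!std::isdigit(static_cast<unsigned char>(record[i])))
      return false;
    generation = generation * 10 + static_cast<std::uint32_t>(record[i] - '0');
  }

  out->offset = offset;
  out->generation = generation;
  out->in_use = record[17] == 'n';
  return true;
}
```

Because every record has the same length, entry i of a subsection starts exactly i * 20 bytes after the first one, which is what allows block reads and direct indexing.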

Code walkthrough

bool CPDF_Parser::LoadCrossRefV4(FX_FILESIZE pos,
                                 FX_FILESIZE streampos,
                                 bool bSkip) {
  // Step 1: seek to the start of the XRef table.
  m_pSyntax->SetPos(pos);
  
  // Step 2: check the keyword.
  if (m_pSyntax->GetKeyword() != "xref")
    return false;

  // Step 3: record the offsets in the sorted offset set.
  m_SortedOffset.insert(pos);
  if (streampos)
    m_SortedOffset.insert(streampos);

  // Step 4: loop over all subsections.
  while (1) {
    FX_FILESIZE SavedPos = m_pSyntax->GetPos();
    bool bIsNumber;
    CFX_ByteString word = m_pSyntax->GetNextWord(&bIsNumber);
    if (word.IsEmpty())
      return false;

    // A non-numeric word means the subsections are finished and the trailer follows.
    if (!bIsNumber) {
      m_pSyntax->SetPos(SavedPos);
      break;
    }

    // Step 5: parse the subsection header.
    uint32_t start_objnum = FXSYS_atoui(word.c_str());  // first object number
    if (start_objnum >= kMaxObjectNumber)
      return false;

    uint32_t count = m_pSyntax->GetDirectNum();         // number of entries in this subsection
    m_pSyntax->ToNextWord();  // skip whitespace
    SavedPos = m_pSyntax->GetPos();  // start of the entry data
    const int32_t recordsize = 20;   // every entry is exactly 20 bytes

    m_dwXrefStartObjNum = start_objnum;
    
    // Step 6: unless in skip mode, parse the entry data.
    if (!bSkip) {
      // Read up to 1024 entries at a time through a buffer.
      std::vector<char> buf(1024 * recordsize + 1);
      buf[1024 * recordsize] = '\0';

      int32_t nBlocks = count / 1024 + 1;
      for (int32_t block = 0; block < nBlocks; block++) {
        int32_t block_size = block == nBlocks - 1 ? count % 1024 : 1024;
        
        // Read one block of entries.
        m_pSyntax->ReadBlock(reinterpret_cast<uint8_t*>(buf.data()),
                             block_size * recordsize);

        // Walk the entries in the block.
        for (int32_t i = 0; i < block_size; i++) {
          uint32_t objnum = start_objnum + block * 1024 + i;
          char* pEntry = &buf[i * recordsize];
          
          // Step 7: check the entry state (18th character, index 17).
          if (pEntry[17] == 'f') {
            // Free object.
            m_ObjectInfo[objnum].pos = 0;
            m_ObjectInfo[objnum].type = 0;
          } else {
            // In-use object.
            // Step 8: parse the offset (first 10 bytes).
            FX_FILESIZE offset = (FX_FILESIZE)FXSYS_atoi64(pEntry);
            
            // Step 9: a result of 0 may mean a failed parse, so require all 10 characters to be digits.
            if (offset == 0) {
              for (int32_t c = 0; c < 10; c++) {
                if (!std::isdigit(pEntry[c]))
                  return false;
              }
            }

            // Step 10: store the offset.
            m_ObjectInfo[objnum].pos = offset;
            
            // Step 11: parse the generation number (bytes 12-16, indices 11-15).
            int32_t version = FXSYS_atoi(pEntry + 11);
            if (version >= 1)
              m_bVersionUpdated = true;
            m_ObjectInfo[objnum].gennum = version;
            
            // Step 12: add offsets that lie within the file to the sorted offset set.
            if (m_ObjectInfo[objnum].pos < m_pSyntax->m_FileLen)
              m_SortedOffset.insert(m_ObjectInfo[objnum].pos);
            
            m_ObjectInfo[objnum].type = 1;
          }
        }
      }
    }
    // Step 13: move on to the next subsection.
    m_pSyntax->SetPos(SavedPos + count * recordsize);
  }
  
  // Step 14: if an XRef stream offset was given (/XRefStm in a hybrid file), load it as well.
  return !streampos || LoadCrossRefV5(&streampos, false);
}
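
The block arithmetic in step 6 is easier to check on its own. This is a minimal sketch that reproduces only the nBlocks and block_size computation; BlockSizes is a hypothetical helper, not part of the parser.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper mirroring the arithmetic above: returns how many
// entries are read in each block for a subsection of `count` entries.
std::vector<std::uint32_t> BlockSizes(std::uint32_t count,
                                      std::uint32_t per_block = 1024) {
  assert(per_block > 0);
  std::vector<std::uint32_t> sizes;
  const std::uint32_t blocks = count / per_block + 1;
  for (std::uint32_t b = 0; b < blocks; ++b) {
    // Every block is full except the last, which holds the remainder.
    sizes.push_back(b == blocks - 1 ? count % per_block : per_block);
  }
  return sizes;
}
```

When count is an exact multiple of the block size, the last block is empty, so the final read is a harmless zero-length read rather than a missed block.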

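Part 6: LoadAllCrossRefV5 - loading the XRef stream format

PDF file structure: XRef streams

XRef streams, introduced in PDF 1.5, store the cross-reference data in a (usually compressed) stream object. They are more compact than the classic table and are needed to reference objects stored inside object streams:

9 0 obj                        ← the XRef stream object
<<
/Type /XRef                    ← type
/Size 10                       ← highest object number + 1
/W [1 4 1]                     ← field widths per entry: 1-byte type, 4-byte field 2, 1-byte field 3
/Index [0 5 6 4]               ← subsections: objects 0-4 and objects 6-9 (no entry for object 5)
/Prev 12345                    ← offset of the previous XRef section
/Root 2 0 R
>>
stream                         ← stream data begins (shown decoded)
┌─────────────────────────────┐
│ 0x00 0x00000000 0x00        │ ← object 0: type 0, free
│ 0x01 0x00000100 0x00        │ ← object 1: type 1, offset 256, generation 0
│ 0x01 0x00000200 0x00        │ ← object 2: type 1, offset 512, generation 0
│ 0x02 0x00000004 0x00        │ ← object 3: type 2, stored in object stream 4 at index 0
│ ...                         │
└─────────────────────────────┘
endstream
endobj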

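Field meanings

The W array, /W [1 4 1]:

Field 1 (type): 1 byte; 0 = free, 1 = uncompressed object, 2 = compressed object
Field 2: 4 bytes; for type 0 the next free object number, for type 1 the byte offset in the file, for type 2 the object number of the containing object stream
Field 3: 1 byte; for types 0 and 1 the generation number, for type 2 the index of the object within its object stream

The Index array, /Index [start1 count1 start2 count2 ...]:

Lists the ranges of object numbers that have entries in this stream
Object numbers outside these ranges are not stored, which saves space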
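
Decoding rows with a given /W array can be sketched as follows. This is a minimal sketch, assuming the stream data has already been decompressed and every width is at most 4 bytes; StreamEntry, ReadBigEndian and DecodeXRefRows are hypothetical names, and the mapping of rows to object numbers through /Index is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct StreamEntry {      // hypothetical type for this sketch
  std::uint32_t type;     // field 1
  std::uint32_t field2;   // offset, object stream number, or next free object
  std::uint32_t field3;   // generation number or index within the object stream
};

// Reads an n-byte big-endian unsigned integer (n <= 4).
std::uint32_t ReadBigEndian(const std::uint8_t* p, std::uint32_t n) {
  assert(n <= 4);
  std::uint32_t value = 0;
  for (std::uint32_t i = 0; i < n; ++i)
    value = (value << 8) | p[i];
  return value;
}

// Splits `data` into rows of w[0] + w[1] + w[2] bytes and decodes each row.
// Returns an empty vector if the widths or the data length are unusable.
std::vector<StreamEntry> DecodeXRefRows(const std::vector<std::uint8_t>& data,
                                        const std::array<std::uint32_t, 3>& w) {
  std::vector<StreamEntry> rows;
  if (w[0] > 4 || w[1] > 4 || w[2] > 4)
    return rows;
  const std::size_t row_width = w[0] + w[1] + w[2];
  if (row_width == 0 || data.size() % row_width != 0)
    return rows;
  for (std::size_t off = 0; off < data.size(); off += row_width) {
    const std::uint8_t* p = data.data() + off;
    StreamEntry e;
    e.type = w[0] ? ReadBigEndian(p, w[0]) : 1;  // a zero-width type field means type 1
    e.field2 = ReadBigEndian(p + w[0], w[1]);
    e.field3 = ReadBigEndian(p + w[0] + w[1], w[2]);
    rows.push_back(e);
  }
  return rows;
}
```

With /W [1 4 1] each row is 6 bytes, so a row such as 01 00 00 01 00 00 decodes to type 1, offset 256, generation 0.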

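Code walkthrough

The listing shows LoadCrossRefV5, which parses a single XRef stream; LoadAllCrossRefV5 itself is not listed here.

bool CPDF_Parser::LoadCrossRefV5(FX_FILESIZE* pos, bool bMainXRef) {
  // Step 1: parse the XRef stream object.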
  std::unique_ptr<CPDF_Object> pObject(
      ParseIndirectObjectAt(m_pDocument, *pos, 0));
  if (!pObject)
    return false;

  uint32_t objnum = pObject->m_ObjNum;
  if (!objnum)
    return false;

  // Step 2: hand the object to the document (an existing object is replaced only if the generation is higher).
  CPDF_Object* pUnownedObject = pObject.get();
  if (m_pDocument) {
    CPDF_Dictionary* pRootDict = m_pDocument->GetRoot();
    if (pRootDict && pRootDict->GetObjNum() == objnum)
      return false;
    if (!m_pDocument->ReplaceIndirectObjectIfHigherGeneration(
            objnum, std::move(pObject))) {
      return false;
    }
  }

  // Step 3: get the object as a stream.
  CPDF_Stream* pStream = pUnownedObject->AsStream();
  if (!pStream)
    return false;

  // Step 4: read the key fields from the stream dictionary.
  CPDF_Dictionary* pDict = pStream->GetDict();
  *pos = pDict->GetIntegerFor("Prev");  // report the previous section's offset to the caller
  int32_t size = pDict->GetIntegerFor("Size");
  if (size < 0)
    return false;

  // Step 5: keep a copy of the dictionary as a trailer.
  std::unique_ptr<CPDF_Dictionary> pNewTrailer = ToDictionary(pDict->Clone());
  if (bMainXRef) {
    m_pTrailer = std::move(pNewTrailer);
    ShrinkObjectMap(size);  // trim the object map to /Size
    // Reset the type of every object.
    for (auto& it : m_ObjectInfo)
      it.second.type = 0;
  } else {
    m_Trailers.push_back(std::move(pNewTrailer));
  }

  // Step 6: parse the Index array (subsection definitions).
  std::vector<std::pair<int32_t, int32_t>> arrIndex;
  CPDF_Array* pArray = pDict->GetArrayFor("Index");
  if (pArray) {
    for (size_t i = 0; i < pArray->GetCount() / 2; i++) {
      CPDF_Object* pStartNumObj = pArray->GetObjectAt(i * 2);
      CPDF_Object* pCountObj = pArray->GetObjectAt(i * 2 + 1);
      if (ToNumber(pStartNumObj) && ToNumber(pCountObj)) {
        int nStartNum = pStartNumObj->GetInteger();
        int nCount = pCountObj->GetInteger();
        if (nStartNum >= 0 && nCount > 0)
          arrIndex.push_back(std::make_pair(nStartNum, nCount));
      }
    }
  }

  // Without /Index, default to a single subsection from 0 to size.
  if (arrIndex.size() == 0)
    arrIndex.push_back(std::make_pair(0, size));

  // Step 7: parse the W array (field widths).
  pArray = pDict->GetArrayFor("W");
  if (!pArray)
    return false;

  std::vector<uint32_t> WidthArray;
  FX_SAFE_UINT32 dwAccWidth = 0;
  for (size_t i = 0; i < pArray->GetCount(); ++i) {
    WidthArray.push_back(pArray->GetIntegerAt(i));
    dwAccWidth += WidthArray[i];
  }

  if (!dwAccWidth.IsValid() || WidthArray.size() < 3)
    return false;

  uint32_t totalWidth = dwAccWidth.ValueOrDie();

  // Step 8: load the decoded stream data.
  auto pAcc = pdfium::MakeRetain<CPDF_StreamAcc>(pStream);
  pAcc->LoadAllData();

  const uint8_t* pData = pAcc->GetData();
  uint32_t dwTotalSize = pAcc->GetSize();
  uint32_t segindex = 0;  // number of entries consumed so far

  // Step 9: loop over all subsections.
  for (uint32_t i = 0; i < arrIndex.size(); i++) {
    int32_t startnum = arrIndex[i].first;
    if (startnum < 0)
      continue;

    m_dwXrefStartObjNum = pdfium::base::checked_cast<uint32_t>(startnum);
    uint32_t count = pdfium::base::checked_cast<uint32_t>(arrIndex[i].second);
    
    // Check that this subsection's entries fit within the stream data.
    FX_SAFE_UINT32 dwCaculatedSize = segindex;
    dwCaculatedSize += count;
    dwCaculatedSize *= totalWidth;
    if (!dwCaculatedSize.IsValid() ||
        dwCaculatedSize.ValueOrDie() > dwTotalSize) {
      continue;
    }

    const uint8_t* segstart = pData + segindex * totalWidth;
    
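    // Step 10: loop over the entries in the subsection.
    for (uint32_t j = 0; j < count; j++) {
      int32_t type = 1;  // default when the type field has width 0
      const uint8_t* entrystart = segstart + j * totalWidth;
      
      // Read the type field.
      if (WidthArray[0])
        type = GetVarInt(entrystart, WidthArray[0]);

      // Already marked as an object stream (type 255): only update its offset.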
      if (GetObjectType(startnum + j) == 255) {
        FX_FILESIZE offset =
            GetVarInt(entrystart + WidthArray[0], WidthArray[1]);
        m_ObjectInfo[startnum + j].pos = offset;
        m_SortedOffset.insert(offset);
        continue;
      }

      // An entry with a non-zero type already exists: keep it and skip this one.
      if (GetObjectType(startnum + j))
        continue;

      m_ObjectInfo[startnum + j].type = type;
      
      if (type == 0) {
        // Free object.
        m_ObjectInfo[startnum + j].pos = 0;
      } else {
        // Read field 2 (file offset or object stream number).
        FX_FILESIZE offset =
            GetVarInt(entrystart + WidthArray[0], WidthArray[1]);
        m_ObjectInfo[startnum + j].pos = offset;
        
        if (type == 1) {
          // Type 1: uncompressed object; record its offset.
          m_SortedOffset.insert(offset);
        } else if (type == 2) {
          // Type 2: compressed object; offset holds the object stream number.
          if (offset < 0 || !IsValidObjectNumber(offset))
            return false;
          // Mark the referenced object as an object stream (the container).
          m_ObjectInfo[offset].type = 255;
        }
      }
    }
    segindex += count;
  }
  return true;
}

Part 7: GetVarInt - reading variable-width integers

uint32_t GetVarInt(const uint8_t* p, int32_t n) {
  uint32_t result = 0;
  for (int32_t i = 0; i < n; ++i)
    result = result * 256 + p[i];
  return result;
}

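This function converts a big-endian integer of n bytes into a uint32_t. For n greater than 4 the high-order bytes no longer fit and the result wraps around.

Examples: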

p = [0x00, 0x01, 0x00], n=3 → 0*256*256 + 1*256 + 0 = 256
p = [0x01, 0x02], n=2 → 1*256 + 2 = 258
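
The examples can be checked mechanically, and the width limit can be made explicit. This is a minimal sketch: GetVarInt is copied from above so the block compiles on its own, while GetVarIntChecked is a hypothetical guarded variant that is not part of the original code.

```cpp
#include <cassert>
#include <cstdint>

// Copy of the function above, so that this sketch is self-contained.
std::uint32_t GetVarInt(const std::uint8_t* p, std::int32_t n) {
  std::uint32_t result = 0;
  for (std::int32_t i = 0; i < n; ++i)
    result = result * 256 + p[i];
  return result;
}

// Hypothetical guarded variant: refuses widths that cannot fit in 32 bits
// instead of silently wrapping around.
bool GetVarIntChecked(const std::uint8_t* p, std::int32_t n, std::uint32_t* out) {
  assert(p != nullptr && out != nullptr);
  if (n < 0 || n > 4)
    return false;
  *out = GetVarInt(p, n);
  return true;
}
```

A width of 0 yields 0, which is why the caller substitutes the default type 1 itself when the type field is absent.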

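Part 8: The object map (m_ObjectInfo)

After parsing, m_ObjectInfo holds the information for every object:

struct ObjectInfo {
  FX_FILESIZE pos;    // file offset (types 1 and 255) or object stream number (type 2)
  uint16_t gennum;    // generation number
  uint8_t type;       // 0 = free, 1 = in use, 2 = compressed, 255 = object stream
};

std::map<uint32_t, ObjectInfo> m_ObjectInfo;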

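Object type examples

ObjNum  pos   gennum  type  Meaning
0       0     0       0     free object; the 'f' branch of LoadCrossRefV4 stores no generation number
1       16    0       1     in-use object at file offset 16
2       81    0       1     in-use object at file offset 81
4       100   0       255   object stream 4 at file offset 100
5       4     0       2     compressed object stored in object stream 4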
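
A lookup over such a map might resolve compressed objects through their container. This is a minimal sketch using standard types in place of FX_FILESIZE; ObjectMap and ResolveFileOffset are hypothetical names and do not reflect how the parser actually reads objects out of an object stream.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct ObjectInfo {
  long long pos = 0;          // file offset, or object stream number for type 2
  std::uint16_t gennum = 0;   // generation number
  std::uint8_t type = 0;      // 0 = free, 1 = in use, 2 = compressed, 255 = object stream
};

using ObjectMap = std::map<std::uint32_t, ObjectInfo>;

// Hypothetical lookup: returns the file offset where reading must start for
// `objnum`. For a compressed object this is the offset of its object stream.
// Returns 0 for free or unknown objects.
long long ResolveFileOffset(const ObjectMap& objects, std::uint32_t objnum) {
  const auto it = objects.find(objnum);
  if (it == objects.end() || it->second.type == 0)
    return 0;
  if (it->second.type != 2)
    return it->second.pos;  // types 1 and 255 store a file offset directly

  // Type 2: pos is the object number of the containing object stream.
  assert(it->second.pos >= 0);
  const auto container = objects.find(static_cast<std::uint32_t>(it->second.pos));
  if (container == objects.end() || container->second.type != 255)
    return 0;
  return container->second.pos;
}
```

Using find instead of operator[] keeps the lookup from inserting empty entries for object numbers that were never listed.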

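Part 9: A complete parsing example

Suppose we have the following PDF file (offsets are rounded for illustration):

Offset     Content
0:         %PDF-1.4
10:        1 0 obj
20:        << /Type /Catalog /Pages 2 0 R >>
80:        endobj
90:        2 0 obj
100:       << /Type /Pages /Kids [] /Count 0 >>
150:       endobj
200:       xref
210:       0 3
220:       0000000000 65535 f
240:       0000000010 00000 n
260:       0000000090 00000 n
280:       trailer
290:       << /Size 3 /Root 1 0 R >>
320:       startxref
330:       200
340:       %%EOF

Parsing steps:

StartParse searches backward for "startxref" and finds it at offset 320
It reads the number that follows, "200", which is the offset of the XRef table
LoadAllCrossRefV4(200):
- First pass (bSkip=true): seek to 200, read "xref", skip the subsection "0 3" and its entries, and stop at the trailer
- Read the trailer: /Size is 3 and there is no /Prev (read as 0), so the chain has a single table
- Second pass (bSkip=false): read the three 20-byte entries and build the map:
  - object 0: pos=0, type=0
  - object 1: pos=10, type=1
  - object 2: pos=90, type=1
From then on, the map gives the location of an object by its number:
- GetObjectOffset(1) → 10
- GetObjectOffset(2) → 90

With that, the parser has built its index of the PDF file, and any object can then be accessed at random.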
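
The whole flow for a single classic table can be condensed into one function. This is a minimal sketch under several simplifying assumptions: the file is in memory, there is no /Prev chain and no XRef stream, and entries are split on whitespace instead of being read as fixed 20-byte records. BuildOffsetIndex is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// Hypothetical end-to-end sketch: maps object number -> file offset for the
// in-use entries of the XRef table that the last "startxref" points at.
// Returns an empty map if the keyword or the table cannot be found.
std::map<unsigned, long> BuildOffsetIndex(const std::string& file) {
  std::map<unsigned, long> index;
  const std::size_t sx = file.rfind("startxref");
  if (sx == std::string::npos)
    return index;
  const long xref_pos = std::strtol(file.c_str() + sx + 9, nullptr, 10);
  if (xref_pos < 0)
    return index;
  const std::size_t start = static_cast<std::size_t>(xref_pos);
  if (start + 4 > file.size() || file.compare(start, 4, "xref") != 0)
    return index;

  assert(start + 4 <= file.size());  // guaranteed by the check above
  std::istringstream in(file.substr(start + 4));
  std::string word;
  while (in >> word && word != "trailer") {
    // Subsection header: "<first object number> <entry count>".
    const unsigned first =
        static_cast<unsigned>(std::strtoul(word.c_str(), nullptr, 10));
    unsigned count = 0;
    if (!(in >> count))
      break;
    for (unsigned i = 0; i < count; ++i) {
      std::string offset, generation, flag;
      if (!(in >> offset >> generation >> flag))
        return index;
      if (flag == "n")  // free entries ('f') are not indexed
        index[first + i] = std::strtol(offset.c_str(), nullptr, 10);
    }
  }
  return index;
}
```

A real parser additionally validates each offset against the file length and follows /Prev, as Parts 4 and 5 describe.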