[pdf2md-3: Implementation Deep Dive] Foxit PDF SDK Python Development in Practice: From Per-Character Extraction to LR Layout Analysis

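The previous two articles covered the results and the architecture of the PDF-to-Markdown tool. This one focuses on the Foxit PDF SDK itself: how per-character text extraction works, how to use the LR layout analysis module, and which API caveats and pitfalls come up in real development. If you plan to use the Foxit SDK for PDF processing (not only Markdown conversion), treat this as a practical reference.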

Online demo: Try PDF to Markdown now
Source code: https://github.com/AmyLin2013/pdf2md

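1. SDK Initialization and Document Loading

1.1 Initializing the SDK

The Foxit PDF SDK must be initialized with a license serial number (SN) and key. This only needs to happen once per process: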

import FoxitPDFSDKPython3 as fsdk

error_code = fsdk.Library.Initialize(sn, key)
if error_code != fsdk.e_ErrSuccess:
    raise RuntimeError(f"SDK initialization failed: {error_code}")

In my project, a _sdk_initialized flag provides lazy initialization and avoids repeated calls:

_sdk_initialized = False

def _ensure_sdk():
    global _sdk_initialized
    if not _sdk_initialized:
        error_code = fsdk.Library.Initialize(FOXIT_SN, FOXIT_KEY)
        if error_code != fsdk.e_ErrSuccess:
            raise RuntimeError(f"Failed to initialize Foxit PDF SDK. Error code: {error_code}")
        _sdk_initialized = True
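
The flag above is enough for single-threaded scripts. In a server that handles requests on several threads, two requests could both see the flag as False at the same moment. The sketch below shows one way to guard the same lazy-initialization idea with a lock. LazyInitializer and init_fn are hypothetical names, and the lambda merely stands in for the real call to fsdk.Library.Initialize; treat it as an illustration under those assumptions, not as code from the project.

```python
import threading


class LazyInitializer:
    """Run init_fn at most once, even when ensure() is called from many threads."""

    def __init__(self, init_fn):
        self._init_fn = init_fn          # hypothetical: wraps the real SDK init call
        self._lock = threading.Lock()
        self._done = False

    def ensure(self):
        if self._done:                   # fast path once initialization succeeded
            return
        with self._lock:
            if not self._done:           # re-check while holding the lock
                self._init_fn()          # if this raises, _done stays False (retry later)
                self._done = True


# Stand-in for the real initialization, so the sketch runs without the SDK
calls = []
sdk = LazyInitializer(lambda: calls.append("init"))

threads = [threading.Thread(target=sdk.ensure) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If init_fn raises, the flag is left unset, so the next caller tries again instead of silently running with an uninitialized SDK.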

1.2 Loading the Document

doc = fsdk.PDFDoc(pdf_path)
error_code = doc.Load("")  # empty password

if error_code != fsdk.e_ErrSuccess:
    if error_code == fsdk.e_ErrPassword:
        raise ValueError("PDF is password-protected.")
    raise RuntimeError(f"Failed to load PDF. Error code: {error_code}")

Passing an empty string to Load("") means no password. If the PDF is password-protected, the call returns e_ErrPassword.

1.3 Getting and Parsing Pages

page_count = doc.GetPageCount()

for page_idx in range(page_count):
    page = doc.GetPage(page_idx)
    # This step matters: StartParse must be called before the page can be used correctly
    page.StartParse(fsdk.PDFPage.e_ParsePageNormal, None, False)

    width = page.GetWidth()
    height = page.GetHeight()

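Note: page.StartParse() is required. If you skip it and use the page object directly, later operations such as fetching graphics objects or creating a TextPage may fail or return empty data.

2. Metadata and Bookmark Extraction

2.1 Document Metadata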

metadata = fsdk.Metadata(doc)
title = metadata.GetValue("Title")    # document title
author = metadata.GetValue("Author")  # author

GetValue() accepts the standard PDF metadata field names, including Title, Author, Subject, Keywords, Creator, and Producer.

2.2 Bookmark (Outline) Extraction

PDF bookmarks form a tree, so they need to be traversed recursively:

root_bookmark = doc.GetRootBookmark()

if not root_bookmark.IsEmpty():
    first_child = root_bookmark.GetFirstChild()
    bookmarks = extract_bookmarks(first_child, doc, level=0)

The traversal pattern: GetFirstChild() descends into the child level, and GetNextSibling() walks the siblings at the same level:

def extract_bookmarks(bookmark, doc, level=0):
    items = []
    current = bookmark

    # Walk the sibling chain until an empty bookmark marks its end
    while not current.IsEmpty():
        title = current.GetTitle()

        # Resolve the page the bookmark points to (-1 if it has no destination)
        page_index = -1
        dest = current.GetDestination()
        if not dest.IsEmpty():
            page_index = dest.GetPageIndex(doc)  # ⚠️ SDK 11.0 requires doc

        item = BookmarkItem(title=title, level=level, page_index=page_index)

        # Recurse into child nodes
        first_child = current.GetFirstChild()
        if not first_child.IsEmpty():
            item.children = extract_bookmarks(first_child, doc, level + 1)

        items.append(item)

        # Move on to the next sibling
        current = current.GetNextSibling()

    return items

⚠️ SDK 11.0 note: Destination.GetPageIndex() requires the doc object as an argument in SDK 11.0, i.e. dest.GetPageIndex(doc).
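
Once the bookmark tree is collected, it can be rendered as a nested outline. The sketch below is one possible way to do that and runs without the SDK. The BookmarkItem dataclass is a stand-in whose fields are inferred from how the extraction code uses it, bookmarks_to_outline is a hypothetical helper, and the page index is assumed to be 0-based.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BookmarkItem:
    """Stand-in for the project's BookmarkItem (fields assumed from its usage)."""
    title: str
    level: int
    page_index: int = -1
    children: List["BookmarkItem"] = field(default_factory=list)


def bookmarks_to_outline(items: List[BookmarkItem]) -> List[str]:
    """Render the bookmark tree as indented Markdown list lines, depth first."""
    lines = []
    for item in items:
        indent = "  " * item.level
        # page_index is assumed 0-based; -1 means no destination was resolved
        page = f" (p. {item.page_index + 1})" if item.page_index >= 0 else ""
        lines.append(f"{indent}- {item.title}{page}")
        lines.extend(bookmarks_to_outline(item.children))
    return lines


tree = [
    BookmarkItem("Chapter 1", 0, 0, [BookmarkItem("1.1 Background", 1, 2)]),
    BookmarkItem("Appendix", 0),
]
outline = bookmarks_to_outline(tree)
print("\n".join(outline))
```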

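3. Per-Character Text Extraction: The Price of Precision

3.1 Why Not GetText()

The Foxit SDK provides TextPage.GetText(), which returns the text of a whole page in one call. But that gives you plain text only; all style information is lost.

PDF-to-Markdown conversion needs the font size, font, bold flag, and exact coordinates of every character. Heading detection, paragraph merging, and header/footer filtering all build on that information, so the characters have to be walked one by one.

3.2 The Per-Character Traversal API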

text_page = fsdk.TextPage(page, fsdk.TextPage.e_ParseTextNormal)
char_count = text_page.GetCharCount()

for i in range(char_count):
    # Get the character's metadata
    char_info = text_page.GetCharInfo(i)

    # Get the character's text
    ch = text_page.GetChars(i, 1)

    # Font information
    font = char_info.font
    font_size = char_info.font_size

    if not font.IsEmpty():
        font_name = font.GetName()
        is_bold = font.IsBold()
        is_italic = font.IsItalic()

    # Exact coordinates (left, bottom, right, top)
    char_box = char_info.char_box
    left = char_box.left
    bottom = char_box.bottom
    right = char_box.right
    top = char_box.top

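Every character exposes its full set of attributes. char_box gives the character's exact rectangle on the page, in PDF points (1 point = 1/72 inch), with the origin at the bottom-left corner.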
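
Because char_box uses a bottom-left origin while most image libraries put the origin at the top-left, overlaying character boxes on a rendered page needs a flip of the y axis plus a points-to-pixels scale. The following sketch shows that arithmetic. It assumes an unrotated page whose origin sits at (0, 0); pdf_box_to_pixels and its parameters are illustrative names, not SDK APIs.

```python
def pdf_box_to_pixels(left, bottom, right, top, page_height, dpi=144):
    """Convert a PDF-point box (bottom-left origin) to pixel coords (top-left origin)."""
    scale = dpi / 72.0                   # 72 points per inch
    x0 = left * scale
    x1 = right * scale
    y0 = (page_height - top) * scale     # the PDF top edge becomes the smaller pixel y
    y1 = (page_height - bottom) * scale
    return (x0, y0, x1, y1)


# A 72x20 pt box near the top of a US Letter page (792 pt tall), rendered at 144 DPI
box = pdf_box_to_pixels(72.0, 700.0, 144.0, 720.0, page_height=792.0, dpi=144)
print(box)  # (144.0, 144.0, 288.0, 184.0)
```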

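3.3 Aggregating Characters into TextBlocks

Per-character traversal yields atomic data; adjacent characters with a consistent style then need to be merged into TextBlocks. Below is a selected excerpt of the core loop of _extract_text_blocks(), showing how the steps cooperate within one character-processing cycle: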

# ---- pdf_parser.py: _extract_text_blocks() core loop ----

for i in range(char_count):
    char_info = text_page.GetCharInfo(i)

    # ① Skip characters inside table regions (tables are handled by a separate function)
    if exclude_bboxes:
        cbox = char_info.char_box
        cx = (cbox.left + cbox.right) / 2.0
        cy = (cbox.bottom + cbox.top) / 2.0
        if _point_in_any_table(cx, cy, exclude_bboxes):
            continue

    # ② Get the character text and font information
    ch = text_page.GetChars(i, 1)
    font = char_info.font
    font_size = char_info.font_size
    is_bold = font.IsBold() if not font.IsEmpty() else False

    # ③ Update the bounding box of the current block
    cbox = char_info.char_box
    current_bbox[0] = min(current_bbox[0], cbox.left)
    current_bbox[1] = min(current_bbox[1], cbox.bottom)
    current_bbox[2] = max(current_bbox[2], cbox.right)
    current_bbox[3] = max(current_bbox[3], cbox.top)

    # ④ Superscript/subscript detection (does not trigger a split)
    ref_size = dominant_font_size if dominant_font_size > 0 else current_font_size
    is_super_or_sub = (
        ref_size > 0 and font_size > 0 and
        font_size < ref_size * 0.82
    )

    # ⑤ Structural style-change detection (triggers a split)
    style_changed = False
    if current_text and not is_super_or_sub:
        size_diff = abs(font_size - ref_size) if ref_size > 0 else 0
        if size_diff > 2.0 and font_size > ref_size * 1.2:
            style_changed = True            # font size grows significantly → heading boundary
        elif is_bold != current_is_bold and size_diff > 2.0:
            style_changed = True            # bold change + size difference → structural boundary

    # ⑥ Line break → flush the block immediately
    if ch in ('\n', '\r'):
        _flush_block()
        continue

    # ⑦ Style change → flush, then start a new block
    if style_changed:
        _flush_block()

    current_text += ch

    # ⑧ Only non-superscript characters update the block's "dominant style"
    if not is_super_or_sub:
        current_font_size = font_size
        current_is_bold = is_bold
        if font_size > dominant_font_size:
            dominant_font_size = font_size

_flush_block()  # emit the last block

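The key design point of this loop: superscripts and subscripts (step ④) do not trigger a split; only structural changes (step ⑤) and line breaks (step ⑥) do. That way a superscript citation marker, such as the 1 in "reference $^{1}$", does not break a text block apart.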
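
To make the splitting rules easier to experiment with, here is a simplified, SDK-free sketch of the same idea. It works on plain (text, font_size, is_bold) tuples instead of TextPage characters, ignores bounding boxes and table exclusion, and reuses the thresholds from the excerpt (0.82, 2.0, 1.2). group_chars is a hypothetical name and this is not the project's actual function.

```python
from typing import List, Tuple

Char = Tuple[str, float, bool]  # (text, font_size, is_bold): assumed input shape


def group_chars(chars: List[Char], sub_ratio: float = 0.82) -> List[str]:
    """Group characters into blocks; small super/subscripts never split a block."""
    blocks: List[str] = []
    text, cur_size, cur_bold, dominant = "", 0.0, False, 0.0

    def flush():
        nonlocal text, cur_size, cur_bold, dominant
        if text.strip():
            blocks.append(text)
        text, cur_size, cur_bold, dominant = "", 0.0, False, 0.0

    for ch, size, bold in chars:
        if ch in ("\n", "\r"):            # a line break always ends the block
            flush()
            continue
        ref = dominant if dominant > 0 else cur_size
        is_small = ref > 0 and 0 < size < ref * sub_ratio
        if text and not is_small:
            diff = abs(size - ref) if ref > 0 else 0.0
            grew = diff > 2.0 and size > ref * 1.2
            bold_shift = bold != cur_bold and diff > 2.0
            if grew or bold_shift:        # structural change starts a new block
                flush()
        text += ch
        if not is_small:                  # super/subscripts do not update the style
            cur_size, cur_bold = size, bold
            dominant = max(dominant, size)
    flush()
    return blocks


chars = (
    [(c, 12.0, False) for c in "Ref"] + [("1", 7.0, False)]   # "1" is a superscript
    + [(c, 12.0, False) for c in " ok"] + [("\n", 12.0, False)]
    + [(c, 10.0, False) for c in "Body"] + [(c, 16.0, True) for c in "Title"]
)
blocks = group_chars(chars)
print(blocks)  # ['Ref1 ok', 'Body', 'Title']
```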

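3.4 Excluding Table Regions

Text extraction has to skip characters inside table regions (table content is handled by a separate table extraction function); otherwise the content would be duplicated: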

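# Check whether the character's center falls inside any table region
cx = (char_box.left + char_box.right) / 2.0
cy = (char_box.bottom + char_box.top) / 2.0

in_table = any(
    left - 2 <= cx <= right + 2 and bottom - 2 <= cy <= top + 2
    for (left, bottom, right, top) in table_bboxes
)
if in_table:
    continue  # inside a table: skip this character (runs inside the per-character loop)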

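Testing the character's center point (rather than the whole char_box) avoids wrongly excluding characters at the edge. The ±2 tolerance handles boundary cases.

4. Image Extraction

Image extraction goes through the GraphicsObject path, which is entirely separate from text: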

pos = page.GetFirstGraphicsObjectPosition(fsdk.GraphicsObject.e_TypeImage)

while pos:
    gfx_obj = page.GetGraphicsObject(pos)
    pos = page.GetNextGraphicsObjectPosition(pos, fsdk.GraphicsObject.e_TypeImage)

    # Get the ImageObject
    img_obj = gfx_obj.GetImageObject()

    # Get the bitmap
    bitmap = img_obj.CloneBitmap(page)  # ⚠️ SDK 11.0 caveat

    if bitmap is None or bitmap.IsEmpty():
        continue

    w = bitmap.GetWidth()
    h = bitmap.GetHeight()

    # Skip decorative icons smaller than 20×20
    if w < 20 or h < 20:
        continue

    # Get the image's position on the page
    rect = gfx_obj.GetRect()
    bbox = (rect.left, rect.bottom, rect.right, rect.top)

    # Save as PNG
    image = fsdk.Image()
    image.AddFrame(bitmap)
    image.SaveAs(img_path)

⚠️ About fetching images:

About the CloneBitmap(page) argument: CloneBitmap() needs a GraphicsObjects object, and PDFPage inherits from GraphicsObjects, so passing page directly works. This inheritance is easy to miss in the SDK documentation and can be confusing at first.

Image extraction flow: GetFirstGraphicsObjectPosition → loop over GetGraphicsObject → GetImageObject → CloneBitmap(page) → size filter → SaveAs.

Item	Relevant API
Get the bitmap	`img_obj.CloneBitmap(page)`
Create an ImageObject	`fsdk.ImageObject(gfx_obj)`
CloneBitmap argument	Requires a `GraphicsObjects` object

5. Hyperlink Extraction

text_page = fsdk.TextPage(page, fsdk.TextPage.e_ParseTextNormal)
page_links = fsdk.PageTextLinks(text_page)

link_count = page_links.GetTextLinkCount()
for i in range(link_count):
    text_link = page_links.GetTextLink(i)

    if text_link.IsEmpty():
        continue

    url = text_link.GetURI()

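    # Get the start and end character indices of the link text
    start = text_link.GetStartCharIndex()
    end = text_link.GetEndCharIndex()

    # Extract the link text from the index range (the end index is treated as inclusive)
    text = text_page.GetChars(start, end - start + 1)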

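PageTextLinks automatically recognizes URL patterns in the text (http://, www., and so on) and returns each link's URL together with its character index range in the TextPage.

Note that PageTextLinks only recognizes URL patterns in the text; it does not pick up annotation-based links (Link annotations). To extract those, use the page.GetAnnot() family of APIs.

6. The LR Layout Analysis Module in Detail

This is the Foxit SDK module with the most value for PDF-to-Markdown conversion: Layout Recognition (LR). It analyzes the page layout and outputs a tree of structured elements, marking which regions are tables, headings, and paragraphs.

6.1 Initialization and Parsing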

ctx = fsdk.LRContext(page)
ctx.StartParse()  # run layout analysis (synchronous)
root = ctx.GetRootElement()

if root.IsEmpty():
    return  # no LR result for this page

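StartParse() is a synchronous call and blocks until analysis completes. Complex pages may take tens of milliseconds.

6.2 Element Types

Every element in the LR output has a type (GetElementType()). The core types are:

Type constant	Meaning
`e_ElementTypeTable`	Table
`e_ElementTypeTableRow`	Table row (TR)
`e_ElementTypeTableDataCell`	Data cell (TD)
`e_ElementTypeTableHeaderCell`	Header cell (TH)
`e_ElementTypeTableBodyGroup`	Table body group (TBody)
`e_ElementTypeTableHeaderGroup`	Table header group (THead)
`e_ElementTypeTableFootGroup`	Table footer group (TFoot)
`e_ElementTypeParagraph`	Paragraph
`e_ElementTypeHeading`	Heading (generic)
`e_ElementTypeHeading1` ~ `e_ElementTypeHeading6`	Headings H1 to H6
`e_ElementTypeHeadingN`	Heading at a non-standard level
`e_ElementTypeTitle`	Document title

6.3 Recursively Traversing the Element Tree

The LR output is a tree and has to be traversed recursively. The key design decision: when a Table element is encountered, collect it and return immediately without recursing into it. The table's internal row and column structure is handled by a separate table extraction function: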

def collect_lr_elements(elem, tables, headings, paragraphs):
    et = elem.GetElementType()

    if et == e_ElementTypeTable:
        tables.append(elem)
        return  # ⚠️ do not recurse into the table

    if et in HEADING_TYPES:
        headings.append(elem)
        return

    if et == e_ElementTypeParagraph:
        se = fsdk.LRStructureElement(elem)
        bbox = se.GetBBox()
        paragraphs.append((bbox.left, bbox.bottom, bbox.right, bbox.top))
        return

    # Other container elements: recurse into children
    if elem.IsStructureElement():
        se = fsdk.LRStructureElement(elem)
        for i in range(se.GetChildCount()):
            child = se.GetChild(i)
            if not child.IsEmpty():
                collect_lr_elements(child, tables, headings, paragraphs)

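Why not recurse into a Table? If its children (TR, TD) were collected individually, they could be misclassified as headings or paragraphs. Table data has to keep its row-column structure, so a dedicated _extract_lr_table() function processes it level by level.

📷 [Figure 4: LR element tree structure] A tree diagram showing the hierarchy of LR elements: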

Root Element
├── Table
│   ├── THead
│   │   └── TR → TH, TH, TH
│   └── TBody
│       ├── TR → TD, TD, TD
│       └── TR → TD, TD, TD
├── Heading1 → "Chapter 1 Introduction"
├── Paragraph → BBox(10, 500, 400, 520)
├── Paragraph → BBox(10, 475, 400, 495)
└── Heading2 → "1.1 Research Background"

6.4 LRStructureElement: Getting the BBox and Attributes

An LR element has to be converted to an LRStructureElement before its richer attributes become accessible:

se = fsdk.LRStructureElement(elem)

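# Get the bounding box
bbox = se.GetBBox()  # returns a RectF with left, bottom, right, top attributes

# Get the number of children
child_count = se.GetChildCount()

# Get a child element (i is an index in range(child_count))
child = se.GetChild(i)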

6.5 Reading ColSpan / RowSpan of Table Cells

Merged-cell information is read by enumerating attributes with GetSupportedAttribute:

cell_se = fsdk.LRStructureElement(cell_elem)

colspan = 1
rowspan = 1

for ai in range(cell_se.GetSupportedAttributeCount()):
    attr = cell_se.GetSupportedAttribute(ai)

    if attr == fsdk.LRStructureElement.e_AttributeTypeColSpan:
        colspan = cell_se.GetAttributeValueInt32(attr)

    elif attr == fsdk.LRStructureElement.e_AttributeTypeRowSpan:
        rowspan = cell_se.GetAttributeValueInt32(attr)

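Note that GetSupportedAttributeCount() returns the total number of attributes the element supports; not every one of them necessarily has a value. After enumerating, read integer values with GetAttributeValueInt32().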
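
Markdown tables have no merged cells, so the colspan and rowspan values eventually have to be flattened into a rectangular grid. The sketch below shows one common way to do that, under my own assumptions: each cell arrives as a (text, colspan, rowspan) tuple, the text goes into the top-left slot of its span, and the remaining covered slots stay empty. expand_spans is a hypothetical helper, not the project's _extract_lr_table().

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, colspan, rowspan): hypothetical input shape


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Place each cell at its top-left grid slot; slots covered by a span stay empty."""
    occupied: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already claimed by a rowspan from an earlier row
            while (r, c) in occupied:
                c += 1
            for dr in range(max(1, rowspan)):
                for dc in range(max(1, colspan)):
                    occupied[(r + dr, c + dc)] = text if dr == 0 and dc == 0 else ""
            c += max(1, colspan)
    if not occupied:
        return []
    n_rows = max(r for r, _ in occupied) + 1
    n_cols = max(c for _, c in occupied) + 1
    return [[occupied.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [("Name", 1, 2), ("Score", 2, 1)],   # "Name" spans 2 rows, "Score" spans 2 columns
    [("Math", 1, 1), ("Art", 1, 1)],
    [("Amy", 1, 1), ("90", 1, 1), ("85", 1, 1)],
]
grid = expand_spans(rows)
for line in grid:
    print("| " + " | ".join(line) + " |")
```

Leaving covered slots empty is only one policy; repeating the text in every covered slot is an equally simple alternative when the table is meant for data processing rather than display.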

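6.6 Extracting Region Text with a BBox

Once LR provides a BBox, the text inside that region has to be pulled from the TextPage. The approach is to iterate over all characters and check whether each center point lies inside the BBox: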

def text_in_rect(text_page, bbox, char_count):
    chars = []
    for i in range(char_count):
        info = text_page.GetCharInfo(i)
        cx = (info.char_box.left + info.char_box.right) / 2.0
        cy = (info.char_box.bottom + info.char_box.top) / 2.0

        if (bbox.left - 2 <= cx <= bbox.right + 2 and
                bbox.bottom - 2 <= cy <= bbox.top + 2):
            ch = text_page.GetChars(i, 1)
            if ch:
                chars.append(ch)

    return "".join(chars).strip()

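This is far more reliable than calling GetText() and slicing the string, because the text order from GetText() may not match the visual order, especially in multi-column layouts.

7. SDK 11.0 Pitfall Checklist

I ran into quite a few SDK-related pitfalls during development. Here is a summary: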

7.1 CloneBitmap vs GetBitmap

	Earlier versions	SDK 11.0
API	`img_obj.GetBitmap()`	`img_obj.CloneBitmap(page)`
Arguments	None	Requires a `GraphicsObjects` object
Notes	N/A	`PDFPage` inherits from `GraphicsObjects`; pass `page` directly

Using the wrong API raises a missing-attribute or argument error.

7.2 GetPageIndex Requires the doc Argument

# ❌ Earlier-version usage
page_index = dest.GetPageIndex()

# ✅ SDK 11.0 usage
page_index = dest.GetPageIndex(doc)

7.3 How to Obtain an ImageObject

# How to write it
img_obj = gfx_obj.GetImageObject()

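7.4 Crash Caused by Static Destructors

This pitfall is fairly well hidden. The Foxit SDK's static destructors run when the process exits normally, but in some scenarios (especially when they conflict with the cleanup order of Python's atexit), the process exits with 0xC0000005 (Access Violation).

Symptom: the script runs to completion and produces its results, but crashes on exit.

Impact: if your diagnostic script prints results with print(), the crash can leave the stdout buffer unflushed, so you never see the output!

Workarounds: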

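import os
import sys

# Option 1: write results to a file instead of relying on print
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(result)

# Option 2: force-exit with os._exit(0). This bypasses the destructors,
# but it also bypasses the stdout flush, so flush stdout explicitly first.
sys.stdout.flush()
os._exit(0)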

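My web service scenario needs no workaround, because FastAPI runs as a long-lived process. But this issue cost me quite some time while writing diagnostic scripts.

8. Limitations of the LR Module and How to Handle Them

The LR module is powerful, but not a cure-all. Here are three typical limitations I ran into and how I handled them:

Problem	Symptom	Mitigation
Pseudo-tables	A long heading that wraps is misdetected as a 1-row, 2-column table	Pseudo-table detection: a single row with few columns plus a numbering-pattern match → reject as a table and promote to a heading
Missed tables	Borderless tables are not recognized at all (0 tables)	Heuristic reconstruction: locate the table caption → cluster a column grid → adaptive row grouping
Body text misclassified as a heading	A whole page of body text is tagged as one huge Heading element	Text-length guard: a "heading" of ≥120 characters is treated as a misclassification and falls back to body-text handling

Core principle: do not trust SDK output unconditionally. The LR results are a good starting point and signal source, but they need additional quality checks (post-hoc correction) before they can be relied on.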
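
Two of these checks are simple enough to sketch in a few lines. The version below is an illustration, not the project's code: the 120-character threshold comes from the table, while the numbering regex, the two-column limit, and both function names are assumptions made for the example.

```python
import re
from typing import List

MAX_HEADING_CHARS = 120  # threshold quoted in the table above
# Assumed numbering pattern: matches prefixes such as "2.3 " or "1."
_NUMBERING = re.compile(r"^\s*\d+(\.\d+)*[\s.、]")


def is_plausible_heading(text: str) -> bool:
    """Text-length guard: empty or very long 'headings' are treated as misclassified."""
    stripped = text.strip()
    return 0 < len(stripped) < MAX_HEADING_CHARS


def looks_like_pseudo_table(rows: List[List[str]], max_cols: int = 2) -> bool:
    """A single row with few columns whose first cell starts with a section number."""
    if len(rows) != 1 or not 1 <= len(rows[0]) <= max_cols:
        return False
    return bool(_NUMBERING.match(rows[0][0]))


wrapped_heading = [["2.3 Experimental setup and", "evaluation metrics"]]
real_table = [["Name", "Score"], ["Amy", "90"]]
print(looks_like_pseudo_table(wrapped_heading))  # True
print(looks_like_pseudo_table(real_table))       # False
print(is_plausible_heading("x" * 120))           # False
```

A table flagged by looks_like_pseudo_table can have its cells joined back into one line and handed to the heading pipeline, which is the "promote to a heading" step described in the table.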

9. The Complete Page-Processing Flow

Stringing all the SDK API calls together, the processing flow for one page is:

# ---- pdf_parser.py: parse_pdf() per-page processing loop ----

for page_idx in range(result.page_count):
    page = doc.GetPage(page_idx)
    page.StartParse(fsdk.PDFPage.e_ParsePageNormal, None, False)

    page_content = PageContent(page_index=page_idx)
    page_content.width = page.GetWidth()
    page_content.height = page.GetHeight()

    # ① LR layout analysis (runs first): outputs tables, headings, and paragraph BBoxes
    page_content.table_blocks, page_content.lr_headings, \
        page_content.lr_paragraphs = \
        _extract_tables_and_lr_headings(page, page_idx)
    table_bboxes = [tb.bbox for tb in page_content.table_blocks]

    # ② Text extraction (skips table regions to avoid duplicated content)
    text_blocks, raw_text = _extract_text_blocks(
        page, page_idx, exclude_bboxes=table_bboxes,
    )
    page_content.text_blocks = text_blocks

    # ③ Fix row-structure errors in LR tables
    if page_content.table_blocks and page_content.text_blocks:
        page_content.table_blocks = _fix_misstructured_lr_tables(
            page, page_content.table_blocks,
            page_content.text_blocks, page_idx,
        )

    # ④ Heuristically reconstruct tables that LR missed
    if page_content.text_blocks:
        new_tables, new_headings = _reconstruct_missed_tables(
            page_content.text_blocks,
            page_content.lr_headings,
            page_content.table_blocks,
            page_idx,
        )
        if len(new_tables) > len(page_content.table_blocks):
            # Remove text blocks that fall inside the new table regions (avoids duplicates)
            added_bboxes = [
                tb.bbox for tb in new_tables
                if tb not in page_content.table_blocks
            ]
            page_content.text_blocks = [
                blk for blk in page_content.text_blocks
                if not _point_in_any_table(
                    (blk.bbox[0]+blk.bbox[2])/2,
                    (blk.bbox[1]+blk.bbox[3])/2,
                    added_bboxes,
                )
            ]
            page_content.table_blocks = new_tables
            page_content.lr_headings = new_headings

    # ⑤ Image extraction
    page_content.image_blocks = _extract_images(page, page_idx, pdf_name)

    # ⑥ Link extraction
    tp = fsdk.TextPage(page, fsdk.TextPage.e_ParseTextNormal)
    page_content.link_blocks = _extract_links(page, tp, page_idx)

    result.pages.append(page_content)

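Execution order matters: LR layout analysis (①) must finish before text extraction (②), so that text extraction knows which regions are tables and need to be excluded. Post-processing steps ③ and ④ run after both, when table data and text data are both available for cross-validation and repair.

10. Handy Code Snippets

10.1 Getting All Text in a Document (Quick Version)

If you do not need font information and only want plain text, GetText() is the fastest route: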

for page_idx in range(doc.GetPageCount()):
    page = doc.GetPage(page_idx)
    page.StartParse(fsdk.PDFPage.e_ParsePageNormal, None, False)
    text_page = fsdk.TextPage(page, fsdk.TextPage.e_ParseTextNormal)
    text = text_page.GetText(fsdk.TextPage.e_TextDisplayOrder)
    print(f"--- Page {page_idx + 1} ---")
    print(text)

10.2 Extracting Text from a Given Rectangle

# Create a RectF object
rect = fsdk.RectF()
rect.left = 100.0
rect.bottom = 200.0
rect.right = 400.0
rect.top = 500.0

# Option 1: use GetTextInRect (if your SDK version supports it)
# Option 2: filter character by character (generic; the approach this project uses)
result = text_in_rect(text_page, rect, text_page.GetCharCount())

10.3 Checking Whether a Font Is Bold/Italic

char_info = text_page.GetCharInfo(i)
font = char_info.font

if not font.IsEmpty():
    is_bold = font.IsBold()
    is_italic = font.IsItalic()
    font_name = font.GetName()  # e.g. "SimSun", "Arial-Bold"

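Note that some PDFs do not mark bold through the font attribute (IsBold() returns False) but instead include "Bold" in the font name (e.g. "TimesNewRoman-Bold"). For more accurate bold detection, check the font name as well.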
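
A combined check can be as small as the sketch below. It takes the values already read from the SDK (the IsBold() result and the GetName() string) as plain arguments, so it runs without the SDK. is_effectively_bold and the hints parameter are hypothetical names; the only hint grounded in the text above is "bold", and any extra hints you add are your own assumption.

```python
def is_effectively_bold(flag_bold: bool, font_name: str, hints=("bold",)) -> bool:
    """Treat a character as bold if the font flag says so or the font name hints at it."""
    name = (font_name or "").lower()     # tolerate empty or missing names
    return bool(flag_bold) or any(hint in name for hint in hints)


print(is_effectively_bold(False, "TimesNewRoman-Bold"))   # True: name contains "Bold"
print(is_effectively_bold(False, "SimSun"))               # False
print(is_effectively_bold(True, "SimSun"))                # True: the flag wins
```

Substring matching also covers subset-prefixed names such as "ABCDEF+Arial-BoldMT", at the cost of possible false positives for fonts whose family name happens to contain the hint.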

Summary

This article covered the most commonly used APIs in the Foxit PDF SDK Python bindings, along with practical caveats:

Capability	Core API
Initialization	`Library.Initialize(sn, key)`
Document loading	`PDFDoc(path)` → `Load("")`
Metadata	`Metadata(doc).GetValue("Title")`
Bookmarks	`GetRootBookmark()` → `GetFirstChild()` / `GetNextSibling()`
Per-character text	`TextPage(page, ...)` → `GetCharInfo(i)` / `GetChars(i, 1)`
Images	`GetFirstGraphicsObjectPosition(e_TypeImage)` → `CloneBitmap(page)`
Hyperlinks	`PageTextLinks(text_page)` → `GetTextLink(i)`
LR layout analysis	`LRContext(page)` → `StartParse()` → `GetRootElement()`
LR table attributes	`LRStructureElement` → `GetSupportedAttribute()` → `GetAttributeValueInt32()`

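One last point to stress: the SDK is a tool, not the answer. The layout analysis results from the LR module are a good starting point, but production use requires post-hoc correction: pseudo-table detection, missed-table reconstruction, and a guard against misclassified headings. Only SDK capabilities combined with business-logic correction deliver reliable results.

Note: this article was generated with the help of AI.

Tech stack: Python 3.10 + Foxit PDF SDK 11.x + FastAPI + Jinja2

This is the third article in the PDF to Markdown series. Previous articles: [pdf2md-1: Introduction] High-Fidelity PDF to Markdown with Source Code (Headings/Tables/Images Fully Restored) | [pdf2md-2: Core Techniques] PDF to Markdown Technical Breakdown: Two-Stage Pipeline, Four-Level Heading Detection, and Smart Paragraph Merging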