【源码阅读】olmocr中的prompts

一、PDF转换为训练数据

让 ChatGPT-4 将文档（如 PDF 文件）转换为结构化的"银级"训练数据（silver training data）

bash 复制代码

# This is the prompt we use for getting chat gpt 4o to convert documents into our silver training data
def build_openai_silver_data_prompt(base_text: str) -> str:
    return (
        f"Below is the image of one page of a PDF document, as well as some raw textual content that was previously extracted for it that includes position information for each image and block of text (The origin [0x0] of the coordinates is in the lower left corner of the image). "
        f"Just return the plain text representation of this document as if you were reading it naturally.\n"
        f"Turn equations into a LaTeX representation, and tables into markdown format. Remove the headers and footers, but keep references and footnotes.\n"
        f"Read any natural handwriting.\n"
        f"This is likely one page out of several in the document, so be sure to preserve any sentences that come from the previous page, or continue onto the next page, exactly as they are.\n"
        f"If there is no text at all that you think you should read, you can output null.\n"
        f"Do not hallucinate.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )

以下是 PDF 文档某一页的图像，以及之前为该页提取的一些原始文本内容，其中包括每个图像和文本块的位置信息（坐标的起始点 [0x0] 位于图像的左下角）。请将此文档的纯文本表示形式原原本本地返回，就好像您是在自然状态下阅读它一样。将方程转换为 LaTeX 表示形式，将表格转换为 Markdown 格式。删除页眉和页脚，但保留引用和脚注。读取任何自然手写体。这很可能是文档中若干页中的一页，因此请务必保留来自前一页的任何句子，或者如果需要的话直接转到下一页，保持其原样。如果您认为文档中没有任何您认为应该阅读的文本，您可以输出 null。不要产生幻觉。

提示词写的很完整，清晰且具体，这样才能将模型输出得到的是几乎和原pdf一样格式正确的markdown

二、使用微调模型

bash 复制代码

# This is a base prompt that will be used for training and running the fine tuned model
# It's simplified from the prompt which was used to generate the silver data, and can change from dataset to dataset
def build_finetuning_prompt(base_text: str) -> str:
    return (
        f"Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. "
        f"Just return the plain text representation of this document as if you were reading it naturally.\n"
        f"Do not hallucinate.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )

生成一个基础提示（prompt），该提示将用于**微调模型（fine-tuning）**或在微调后的模型上运行任务
提示内容详解：生成的提示包含以下指令：告诉模型，输入是一个文档的某一页的图像及其提取的原始文本内容。返回文档的自然阅读文本。明确要求模型不要生成虚假内容（即不要"幻觉"）。使用 RAW_TEXT_START 和 RAW_TEXT_END 标记包裹原始文本内容，以便模型识别。

这应该是一个基础的prompt，不用于具体的执行步骤

三、比较差异

让模型比较两个不同模型从同一文档页面提取的文本内容，并找出差异，同时判断哪个模型的提取结果更准确。

bash 复制代码

def build_find_difference_prompt(base_text: str) -> str:
    return (
        f"Below is an image of a document page, along with raw textual content previously extracted using different models."
        f"Your goal is to carefully identify the differences between the extracted texts from both models and determine which one is more accurate by comparing them with the image."
        f"Only return the differences and specify which model extracted the text with higher accuracy.\n"
        f"Do not hallucinate.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    )

输入是一个文档页面的图像以及两段从不同模型提取的原始文本内容。仔细比较两段提取文本的差异。通过对比图像，确定哪一段文本提取得更准确。只输出两段文本之间的差异。明确指出哪个模型的提取结果更准确。明确要求模型不要生成虚假内容（即不要"幻觉"）。使用 RAW_TEXT_START 和 RAW_TEXT_END 标记包裹原始文本内容，以便模型识别。

总结

这个提示词确实一步步写的清晰具体，而且给了足够的思考过程，有一个思维链的感觉，让他慢慢地按步骤去推理。然后给的背景信息放在最后面

【源码阅读】olmocr中的prompts

目录

一、PDF转换为训练数据

二、使用微调模型

三、比较差异

总结