调用MinerU的API,实现PDF转markdown文件

文章目录


前言

这两天在学习、复现一些论文,涉及到把本地文献喂给大模型。小红书科研之后发现,直接把pdf喂给大模型不仅很浪费token,而且DOC/DOCX/PDF 格式也不是机器可读的。因此快速将这些文档转换为机器可读格式是非常重要的。

MinerU是上海人工智能创新中心 OpenDataLab 推出的将 PDF 转换为机器可读格式(例如 markdown、JSON)的大模型,获得了一致好评。这篇博文记录了我使用调用API批量转换文档的过程。后期有时间的话会考虑把MinerU大模型部署到课题组的本地服务器。


一、MinerU的API模式

MinerU 提供两种文档解析 API,满足不同场景需求:

  1. 精准解析 API:需申请 Token,支持单文件/批量、表格/公式/多格式输出;
  2. Agent 轻量解析 API:免登录,IP 限频防滥用,专为 AI Agent 工作流设计

和我一样需要解析学术论文的朋友,建议选择精准解析 API。具体的MinerU API文档,请阅读官网说明:MinerU API文档

二、如何调用API

1.注册、登录MinerU

首先进入MinerU官网,想获得精准解析 API,需要注册、登录。这里为了方便,我选择github账号登录。

进入API管理界面

2.申请API

点击创建token

给自己的API取一个名字,注意保存好token。

3.写一个脚本文件调用API

这是从博主码里奥Ziho的视频里面学到的方法(PDF导入obsidian,打造本地知识库)。使用大模型帮我们写脚本文件,提示词为

帮我开发一个脚本,能够批量转换pdf成markdown,要求:需要 Token;接口地址为/api/v4/extract/task 或 /api/v4/file-urls/batch;模型版本为pipeline (默认)/vlm(推荐)/MinerU-HTML;文件大小限制≤ 200MB;页数限制≤200页;批量支持(≤200个);输出格式Zip包,其中包含Markdown、JSON,且可导出为docx/html/latex;调用方式是异步(提交>轮询)。我的api是xxxxxx (这里填我们刚才申请的api)。

等待片刻后,大模型会返回一个写好的脚本文件。但是根据我的实践过程,第一次写好的脚步存在各种各样的问题,需要调试几轮才可以正常使用。我遇到了需要安装pypdf包的问题。

bash 复制代码
python -m pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pypdf

3.1 一个调用API解析本地论文的脚本文件

下面介绍一下我使用的batch_pdf_to_markdown.py文件,给大家一个参考。我在项目根目录下面建立了一个tools文件夹,存放batch_pdf_to_markdown.py文件。

python 复制代码
import http.client
import json
import os
import time
import urllib.error
import urllib.parse
import urllib.request
import zipfile
from pathlib import Path


BASE_URL = "https://mineru.net"
FILE_URLS_BATCH_ENDPOINT = "/api/v4/file-urls/batch"
URL_TASK_BATCH_ENDPOINT = "/api/v4/extract/task/batch"
BATCH_RESULTS_ENDPOINT = "/api/v4/extract-results/batch"

MAX_FILE_BYTES = 200 * 1024 * 1024
MAX_PDF_PAGES = 200
MAX_BATCH_FILES = 200

FINISHED_STATES = {"done", "failed"}
SUPPORTED_MODELS = {"pipeline", "vlm", "MinerU-HTML"}
SUPPORTED_EXPORT_FORMATS = {"docx", "html", "latex"}#可以转换的格式


def collect_pdf_paths(pdf_dir):
    if not pdf_dir.exists():
        raise FileNotFoundError(f"PDF 目录不存在: {pdf_dir}")
    if not pdf_dir.is_dir():
        raise NotADirectoryError(f"PDF_DIR 必须是目录: {pdf_dir}")

    return sorted(
        [path for path in pdf_dir.iterdir() if path.suffix.lower() == ".pdf"],
        key=lambda path: path.name.casefold(),
    )


def count_pdf_pages(pdf_path):
    try:
        from pypdf import PdfReader
    except ModuleNotFoundError as exc:
        raise RuntimeError("需要安装 pypdf 才能在上传前校验 PDF 页数: pip install pypdf") from exc

    reader = PdfReader(str(pdf_path))
    return len(reader.pages)


def validate_pdf_batch(pdf_paths):
    if not pdf_paths:
        raise ValueError("没有找到可上传的 PDF 文件。")
    if len(pdf_paths) > MAX_BATCH_FILES:
        raise ValueError(f"最多支持 200 个 PDF,当前数量: {len(pdf_paths)}")

    for pdf_path in pdf_paths:
        if not pdf_path.exists():
            raise FileNotFoundError(f"PDF 文件不存在: {pdf_path}")
        if pdf_path.suffix.lower() != ".pdf":
            raise ValueError(f"只支持 PDF 文件: {pdf_path}")

        file_size = pdf_path.stat().st_size
        if file_size > MAX_FILE_BYTES:
            raise ValueError(f"{pdf_path.name} 超过 200 MB: {file_size} bytes")

        page_count = count_pdf_pages(pdf_path)
        if page_count > MAX_PDF_PAGES:
            raise ValueError(f"{pdf_path.name} 超过 200 页: {page_count} 页")


def validate_model_version(model_version):
    if model_version not in SUPPORTED_MODELS:
        raise ValueError(f"model_version 必须是 {sorted(SUPPORTED_MODELS)} 之一,当前: {model_version}")


def validate_export_formats(export_formats):
    unknown_formats = set(export_formats) - SUPPORTED_EXPORT_FORMATS
    if unknown_formats:
        raise ValueError(f"extra_formats 只支持 docx/html/latex,当前不支持: {sorted(unknown_formats)}")


def build_extract_payload(
    urls,
    model_version,
    export_formats,
    enable_formula,
    enable_table,
    language="ch",
    no_cache=False,
):
    validate_model_version(model_version)
    validate_export_formats(export_formats)
    if len(urls) > MAX_BATCH_FILES:
        raise ValueError(f"最多支持 200 个 URL,当前数量: {len(urls)}")

    return {
        "enable_formula": enable_formula,
        "enable_table": enable_table,
        "language": language,
        "files": [{"url": url, "is_ocr": True, "data_id": f"{idx:04d}"} for idx, url in enumerate(urls, start=1)],
        "model_version": model_version,
        "extra_formats": list(export_formats),
        "no_cache": no_cache,
    }


def build_local_upload_payload(pdf_paths, model_version, export_formats, enable_formula, enable_table, language="ch"):
    validate_model_version(model_version)
    validate_export_formats(export_formats)

    files = [
        {"name": pdf_path.name, "data_id": f"{idx:04d}-{pdf_path.stem}", "is_ocr": True}
        for idx, pdf_path in enumerate(pdf_paths, start=1)
    ]
    return {
        "enable_formula": enable_formula,
        "enable_table": enable_table,
        "language": language,
        "files": files,
        "model_version": model_version,
        "extra_formats": list(export_formats),
    }


def mineru_headers(token, include_content_type=True):
    if not token:
        raise RuntimeError("未设置 MINERU_API_TOKEN 环境变量。")

    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "*/*",
    }
    if include_content_type:
        headers["Content-Type"] = "application/json"
    return headers


def load_env_value(env_path, key):
    if not env_path.exists():
        return None

    for line_number, raw_line in enumerate(env_path.read_text(encoding="utf-8").splitlines(), start=1):
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError(f"{env_path} 第 {line_number} 行格式错误,应为 KEY=VALUE。")

        current_key, current_value = line.split("=", 1)
        if current_key.strip() == key:
            value = current_value.strip()
            if (value.startswith('"') and value.endswith('"')) or (value.startswith("'") and value.endswith("'")):
                value = value[1:-1]
            return value

    return None


def get_api_token():
    '''
    获取API
    '''
    return os.environ.get("MINERU_API_TOKEN") or load_env_value(Path(".env"), "MINERU_API_TOKEN")


def request_json(method, endpoint, token, payload=None, timeout_seconds=60):
    url = urllib.parse.urljoin(BASE_URL, endpoint)
    data = None
    if payload is not None:
        data = json.dumps(payload, ensure_ascii=False).encode("utf-8")

    request = urllib.request.Request(
        url,
        data=data,
        headers=mineru_headers(token, include_content_type=payload is not None),
        method=method,
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            response_body = response.read().decode("utf-8")
    except urllib.error.HTTPError as exc:
        error_body = exc.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"HTTP {exc.code} 请求失败: {url}\n{error_body}") from exc

    return json.loads(response_body)


def ensure_success_response(response_json):
    if response_json.get("code") != 0:
        raise RuntimeError(f"MinerU API 返回错误: {response_json}")
    return response_json


def request_local_upload_urls(pdf_paths, token, model_version, export_formats, enable_formula, enable_table, language):
    payload = build_local_upload_payload(pdf_paths, model_version, export_formats, enable_formula, enable_table, language)
    response_json = request_json("POST", FILE_URLS_BATCH_ENDPOINT, token, payload)
    data = ensure_success_response(response_json)["data"]
    upload_urls = data["file_urls"]
    if len(upload_urls) != len(pdf_paths):
        raise RuntimeError(f"上传 URL 数量和 PDF 数量不一致: {len(upload_urls)} != {len(pdf_paths)}")
    return data["batch_id"], upload_urls


def submit_url_batch(urls, token, model_version, export_formats, enable_formula, enable_table, language, no_cache):
    payload = build_extract_payload(urls, model_version, export_formats, enable_formula, enable_table, language, no_cache)
    response_json = request_json("POST", URL_TASK_BATCH_ENDPOINT, token, payload)
    return ensure_success_response(response_json)["data"]["batch_id"]


def upload_bytes_to_signed_url(upload_url, file_bytes):
    parsed_url = urllib.parse.urlsplit(upload_url)
    if parsed_url.scheme not in {"http", "https"}:
        raise ValueError(f"签名上传 URL 必须是 http/https: {upload_url}")
    if not parsed_url.netloc:
        raise ValueError(f"签名上传 URL 缺少 host: {upload_url}")

    path_with_query = urllib.parse.urlunsplit(("", "", parsed_url.path or "/", parsed_url.query, ""))
    connection_cls = http.client.HTTPSConnection if parsed_url.scheme == "https" else http.client.HTTPConnection
    connection = connection_cls(parsed_url.netloc, timeout=300)

    try:
        connection.request("PUT", path_with_query, body=file_bytes, headers={})
        response = connection.getresponse()
        response_body = response.read()
    finally:
        connection.close()

    if response.status not in {200, 201, 204}:
        error_body = response_body.decode("utf-8", errors="replace")
        raise RuntimeError(f"签名 URL 上传失败,HTTP {response.status}: {error_body}")


def request_bytes_from_url(url, method, body=None, headers=None, timeout_seconds=300):
    parsed_url = urllib.parse.urlsplit(url)
    if parsed_url.scheme not in {"http", "https"}:
        raise ValueError(f"URL 必须是 http/https: {url}")
    if not parsed_url.netloc:
        raise ValueError(f"URL 缺少 host: {url}")

    path_with_query = urllib.parse.urlunsplit(("", "", parsed_url.path or "/", parsed_url.query, ""))
    connection_cls = http.client.HTTPSConnection if parsed_url.scheme == "https" else http.client.HTTPConnection
    connection = connection_cls(parsed_url.netloc, timeout=timeout_seconds)

    try:
        connection.request(method, path_with_query, body=body, headers=headers or {})
        response = connection.getresponse()
        response_body = response.read()
    finally:
        connection.close()

    if response.status not in {200, 201, 204}:
        error_body = response_body.decode("utf-8", errors="replace")
        raise RuntimeError(f"{parsed_url.netloc} 请求失败,HTTP {response.status}: {error_body}")
    return response_body


def download_bytes_from_url(url):
    return request_bytes_from_url(url, "GET", headers={"Accept": "*/*"})


def upload_file_to_signed_url(pdf_path, upload_url):
    try:
        request_bytes_from_url(upload_url, "PUT", body=pdf_path.read_bytes(), headers={})
    except RuntimeError as exc:
        raise RuntimeError(f"{pdf_path.name} 上传失败: {exc}") from exc


def is_task_finished(state):
    return state in FINISHED_STATES


def poll_batch_results(batch_id, token, poll_seconds, timeout_seconds):
    deadline = time.monotonic() + timeout_seconds
    endpoint = f"{BATCH_RESULTS_ENDPOINT}/{batch_id}"

    while True:
        response_json = request_json("GET", endpoint, token)
        data = ensure_success_response(response_json)["data"]
        extract_results = data.get("extract_result") or []

        if extract_results and all(is_task_finished(result["state"]) for result in extract_results):
            failed_results = [result for result in extract_results if result["state"] == "failed"]
            if failed_results:
                details = "\n".join(
                    f"{result.get('file_name')}: {result.get('err_msg', '未知错误')}"
                    for result in failed_results
                )
                raise RuntimeError(f"MinerU 解析失败:\n{details}")
            return extract_results

        if time.monotonic() >= deadline:
            raise TimeoutError(f"等待 MinerU batch_id={batch_id} 超时。")

        states = ", ".join(
            f"{result.get('file_name')}={result.get('state')}" for result in extract_results
        )
        print(f"等待解析完成: {states or '尚无结果'}")
        time.sleep(poll_seconds)


def download_file(url, output_path):
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_bytes(download_bytes_from_url(url))


def download_and_extract_results(extract_results, output_dir):
    zip_dir = output_dir / "zip"
    extracted_dir = output_dir / "extracted"
    saved_zip_paths = []

    for result in extract_results:
        file_name = Path(result["file_name"]).stem
        zip_url = result.get("full_zip_url")
        if not zip_url:
            raise RuntimeError(f"{result['file_name']} 缺少 full_zip_url,无法下载结果。")

        zip_path = zip_dir / f"{file_name}.zip"
        download_file(zip_url, zip_path)
        saved_zip_paths.append(zip_path)

        target_dir = extracted_dir / file_name
        target_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target_dir)

    return saved_zip_paths


if __name__ == "__main__":
    PDF_DIR = Path("papers/raw_papers")
    OUTPUT_DIR = Path("papers") / "mineru_outputs"
    MODEL_VERSION = "vlm"#解析模型参数,官方推荐"vlm",解析精度最高
    EXPORT_FORMATS = ["docx", "html", "latex"]#转换后的文件格式参数: 除了 MinerU 默认输出的 Markdown、JSON、ZIP 内容外,脚本还要求额外导出:docx, html, latex
    ENABLE_FORMULA = True
    ENABLE_TABLE = True
    LANGUAGE = "ch" #语言参数: Chinese+English
    POLL_SECONDS = 15
    TIMEOUT_SECONDS = 60 * 60

    api_token = get_api_token()
    pdf_paths = collect_pdf_paths(PDF_DIR)
    validate_pdf_batch(pdf_paths)

    print(f"准备上传 {len(pdf_paths)} 个 PDF。")
    batch_id, signed_upload_urls = request_local_upload_urls(
        pdf_paths,
        api_token,
        MODEL_VERSION,
        EXPORT_FORMATS,
        ENABLE_FORMULA,
        ENABLE_TABLE,
        LANGUAGE,
    )
    print(f"batch_id: {batch_id}")

    for current_pdf_path, signed_upload_url in zip(pdf_paths, signed_upload_urls):
        print(f"上传: {current_pdf_path.name}")
        upload_file_to_signed_url(current_pdf_path, signed_upload_url)

    print("上传完成,开始轮询解析结果。")
    completed_results = poll_batch_results(batch_id, api_token, POLL_SECONDS, TIMEOUT_SECONDS)
    saved_zips = download_and_extract_results(completed_results, OUTPUT_DIR)
    print("下载完成:")
    for saved_zip in saved_zips:
        print(saved_zip)

这里面有4个核心参数,分别是language,is_ocr,model_version,extra_formats。详细解释可以查阅官方文档:核心参数详解及注意事项和官方教程:MinerU保姆级教程|全面掌握 MinerU 在线 API 调用

因为我有把代码上传到github托管的习惯,我在项目根目录下的.vscode文件下编写了launch.json文件,在项目根目录下编写了.env文件,以防止API被 Git 跟踪。

①./.env文件:存放API

bash 复制代码
MINERU_API_TOKEN=API

②./.vscode/launch.json文件

json 复制代码
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Run MinerU Batch PDF",
      "type": "python",
      "request": "launch",
      "program": "${workspaceFolder}/tools/batch_pdf_to_markdown.py",
      "console": "integratedTerminal",
      "envFile": "${workspaceFolder}/.env"
    }
  ]
}

3.2 转换结果

每篇论文单独保存在一个文件夹中,文件夹名称是原pdf名称

4. 使用MCP server调用minerU API

还有一种调用API的方式,把现在的batch_pdf_to_markdown.py 从"直接运行的脚本"改造成一个 MinerU MCP server ,让 cursor、Codex、claude code等AI agent通过 MCP 工具调用它。这样我们可以使用自然语言把PDF转换为markdown。例如,

用 MinerU 把 papers/raw_papers 里的 PDF 转成 Markdown,输出到 papers/mineru_outputs,模型用 vlm。

4.1 一个调用API解析本地论文的脚本文件

总结

今天介绍的是如何在一个脚本文件中调用API,解析本地论文(PDF格式),转换为机器可读的markdown文件。

参考

1\] [MinerU保姆级教程|全面掌握 MinerU 在线 API 调用](https://www.bilibili.com/video/BV1gJDLBpEdN/?spm_id_from=333.337.search-card.all.click&vd_source=f1774d51fd2e9e2315b1939f6ccaa936) \[2\] [PDF导入obsidian,打造本地知识库](https://www.bilibili.com/video/BV1HvdvBrEXi/?spm_id_from=333.337.search-card.all.click&vd_source=f1774d51fd2e9e2315b1939f6ccaa936) \[3\] [MinerU API文档](https://mineru.net/apiManage/docs)

相关推荐
钱端工程师6 小时前
vue自定义一个在线查看文件的组件(.xlsx、.docx、.pdf、图片等)
javascript·vue.js·pdf
web打印社区21 小时前
2026最新Web静默打印解决方案,无插件无预览,完美替代Lodop
前端·javascript·vue.js·electron·pdf
其实防守也摸鱼1 天前
带你了解与配置phpmyadmin
笔记·安全·网络安全·pdf·编辑器·工具·调试
sensor_WU1 天前
【原创】PDF 加密、管控终极工具
pdf·pdf 加密·pdf 转 exe·pdf aes256·pdf 密码
AI刀刀1 天前
手机AI怎么导出pdf
人工智能·ai·智能手机·pdf·deepseek·ds随心转
eastyuxiao1 天前
OpenClaw 免费 PDF 技能组合 + 一键指令(零收费、无订阅)
pdf
eastyuxiao1 天前
OpenClaw的PDF处理Skill收费吗?
人工智能·pdf
Tdsay_1 天前
PDF 加水印怎么做?一次真实使用在线工具的体验分享(免费、本地处理)
pdf·pdf加水印·土豆丝在线工具
2501_930707781 天前
使用C#代码在PDF文件中添加图片水印
pdf