Batch-publishing Markdown posts with a Python script: the three pitfalls I hit before it worked

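Background

Here is how it started. I have a habit of writing technical notes locally in Markdown, and over the years I have piled up more than a hundred of them. Recently I wanted to tidy them up and publish them on my personal blog, partly as a backup and partly to make them easier to share. My blog's admin panel supports Markdown, but doing it by hand is painful: for every post I have to click "New post", copy the title and body, set categories and tags, upload a cover image... Three or five posts are manageable; dozens or hundreds are simply not going to happen.

As a lazy programmer, my first reaction was to write a script and automate the whole thing. It looked easy on paper: read the file, parse the content, call the API. Once I actually started, I found several pitfalls between a local Markdown file and a publishable online post. This article records the whole journey from "this should be simple" to "it finally runs".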

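Problem analysis

My first plan was simple and crude:

Walk through every .md file in a given folder.
Read each file with open().read().
Send the content and title to the blog platform's API.

The first version failed as soon as I ran it. First, my notes contain many images whose links are local relative paths, such as ![diagram](./images/my-pic.png). Sent to the server as-is, that string cannot render an image. Second, my blog's API expects categories and tags as IDs, while my notes use text tags like "#Python" and "#踩坑记录" ("pitfall notes"). Finally, every post needs an excerpt. I had planned to take the first 200 characters of the body, but some posts open with a code block, and cutting there breaks the formatting.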

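So a plain "read and send" approach does not work. There has to be a processing layer in the middle that turns the local, unstructured Markdown into structured data that matches what the API expects. That layer has to solve at least three problems: metadata extraction, image handling, and content formatting.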
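The excerpt problem described above (a post that opens with a code block) can be sketched with the standard library alone. This is only one possible approach, not part of the script in this post: make_excerpt and FENCED_BLOCK are hypothetical names, and the regexes cover common Markdown rather than every corner of the syntax.

```python
import re

# Fenced code blocks delimited by ``` or ~~~ (lazy match across lines).
FENCED_BLOCK = re.compile(r"^(```|~~~).*?^\1[ \t]*$", re.DOTALL | re.MULTILINE)


def make_excerpt(content: str, max_chars: int = 200) -> str:
    """Build a plain-text excerpt that skips fenced code blocks."""
    text = FENCED_BLOCK.sub("", content)                        # drop fenced code
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)            # drop images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)        # keep link text only
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)  # strip heading marks
    text = re.sub(r"\s+", " ", text).strip()                    # collapse whitespace
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."
```

Images are removed before links on purpose: the link pattern would otherwise match the bracketed part of an image and leave a stray "!" behind.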

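Core implementation

Step 1: Parse the Markdown and extract metadata and body

I need to separate the metadata (title, date, tags) from the body itself. Looking at my own notes, I usually record this information at the very top in YAML Front Matter (the part wrapped in ---). When there is no front matter, the title is the first level-1 heading.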

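I decided to use the python-frontmatter library to parse the front matter, and the markdown library to help with some Markdown syntax (in the end, the code in this post only relies on the former). One gotcha: the package is installed under the name python-frontmatter, but it is imported as import frontmatter.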

import os
import frontmatter
from markdown import Markdown
from io import StringIO

def parse_markdown_file(file_path):
    """
    Parse a Markdown file and extract its metadata and body.

    Args:
        file_path: path to the Markdown file

    Returns:
        dict: title, content, tags and other fields
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        # frontmatter handles files with or without front matter
        post = frontmatter.load(f)

    # Extract metadata
    # Prefer the 'title' field from the front matter; the fallback is handled below
    title = post.get('title', '')
    # Tags come from the 'tags' field, defaulting to an empty list
    tags = post.get('tags', [])
    # Other optional metadata such as date and category
    date = post.get('date', None)
    category = post.get('category', '')

    # Body text (post.content already excludes the front matter)
    content = post.content

    # No title in the front matter: look for the first "# " heading in the body
    if not title:
        lines = content.split('\n')
        for line in lines:
            if line.startswith('# '):
                # Drop only the "# " marker; lstrip('# ') would also eat a '#' that starts the title text
                title = line[2:].strip()
                break
        # Still nothing: use the file name without its extension
        if not title:
            title = os.path.splitext(os.path.basename(file_path))[0]
    
    return {
        'title': title,
        'content': content,
        'tags': tags,
        'date': date,
        'category': category,
        'source_path': file_path
    }
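One thing the line scan above does not guard against is a "# " line inside a fenced code block (a shell comment, for example) being picked up as the title. A possible refinement, written here as a standalone sketch with the hypothetical name extract_title and using only the standard library, is to track whether the scan is currently inside a fence:

```python
import os
import re


def extract_title(content: str, file_path: str) -> str:
    """Return the first level-1 heading outside fenced code, else the file stem."""
    in_fence = False
    for line in content.splitlines():
        if line.strip().startswith(("```", "~~~")):
            in_fence = not in_fence   # toggle on every fence line
            continue
        if in_fence:
            continue                  # ignore "# comment" lines inside code
        match = re.match(r"#\s+(\S.*)", line)
        if match:
            return match.group(1).strip()
    # No heading found: fall back to the file name without its extension
    return os.path.splitext(os.path.basename(file_path))[0]
```

The toggle assumes fences are balanced and not nested, which holds for typical notes but is not a full Markdown parser.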

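Step 2: Handle local images: upload them and replace the links

This is the trickiest step. The script has to recognize local image references in the content, upload the files to the blog's image host (or media library), and then replace the image links in the Markdown with the online URLs.

I wrote a function that finds all local image links. Note that the Markdown image syntax is ![alt](url), where url may be a relative path, an absolute path, or an image that is already on the network. I need to keep only the paths that point to local files.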

import re
import requests
from pathlib import Path

def find_local_images(content, md_file_path):
    """
    Find every reference to a local image in the Markdown content.

    Args:
        content: Markdown text
        md_file_path: path of the source Markdown file, used to resolve relative paths

    Returns:
        list: dicts of the form {'alt': ..., 'local_path': ..., 'markdown_ref': ...}
    """
    # Match the Markdown image syntax ![alt](url);
    # the second group captures the link inside the parentheses
    pattern = r'!\[(.*?)\]\((.*?)\)'
    matches = re.findall(pattern, content)

    local_images = []
    md_dir = os.path.dirname(md_file_path)

    for alt_text, img_path in matches:
        # Skip images that are already network URLs (http/https)
        if img_path.startswith(('http://', 'https://')):
            continue

        # Build the absolute path of the local image:
        # absolute paths are used as-is, anything else is relative to the .md file's directory
        if not os.path.isabs(img_path):
            abs_path = os.path.join(md_dir, img_path)
        else:
            abs_path = img_path

        # Make sure the file exists
        if os.path.exists(abs_path):
            local_images.append({
                'alt': alt_text,
                'local_path': abs_path,
                'markdown_ref': img_path  # path exactly as written in the Markdown, used for replacement later
            })
        else:
            print(f"Warning: image file not found {abs_path}")
    
    return local_images
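The filter above only skips http and https links. As a rough sketch of how the classification could be tightened, the variant below also skips protocol-relative URLs and data: URIs, and tolerates an optional quoted title after the path. IMAGE_PATTERN and iter_local_image_refs are hypothetical names, and the pattern is an approximation of the image syntax, not a full Markdown parser.

```python
import re
from urllib.parse import urlparse

# alt text, then the link target, then an optional quoted title
IMAGE_PATTERN = re.compile(r'!\[(.*?)\]\(\s*(\S+?)(?:\s+"[^"]*")?\s*\)')


def iter_local_image_refs(content: str):
    """Yield (alt, path) for image targets that look like local files."""
    for alt, target in IMAGE_PATTERN.findall(content):
        if target.startswith("//"):  # protocol-relative URL
            continue
        if urlparse(target).scheme in ("http", "https", "data"):
            continue
        yield alt, target
```

With the original pattern, a reference such as ![a](./x.png "Caption") would capture the title as part of the path and then fail the existence check; here the title is matched separately and dropped.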

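Once the images are found, they need to be uploaded. My blog platform provides a file upload API that returns an access URL. The code below simulates that process; the core of it is sending a multipart/form-data request with the requests library.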

def upload_image_to_blog(local_image_path, alt_text=''):
    """
    Simulate uploading a local image to the blog platform.
    In real use, replace this with your own blog platform's API.

    Args:
        local_image_path: path to the local image file
        alt_text: image description text

    Returns:
        str: the public URL of the uploaded image, or None on failure
    """
    # !!! Adapt this part to your own blog API !!!
    upload_url = "https://api.your-blog.com/v1/upload"  # replace with the real upload endpoint
    headers = {
        "Authorization": "Bearer YOUR_ACCESS_TOKEN"  # replace with your credentials
    }

    with open(local_image_path, 'rb') as f:
        files = {'file': (os.path.basename(local_image_path), f, 'image/png')}  # adjust the MIME type to the file type
        try:
            # Send the upload request
            resp = requests.post(upload_url, headers=headers, files=files, timeout=30)
            resp.raise_for_status()  # raises on 4xx/5xx status codes
            result = resp.json()
            # Assumes the API's JSON response has a 'url' field
            image_url = result.get('url')
            if not image_url:
                print(f"Warning: upload succeeded but no URL was returned, response: {result}")
                return None
            print(f"Image uploaded: {local_image_path} -> {image_url}")
            return image_url
        except (requests.exceptions.RequestException, ValueError) as e:
            # ValueError covers a response body that is not valid JSON
            print(f"Image upload failed {local_image_path}: {e}")
            return None
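The upload function hardcodes 'image/png' as the MIME type, which is wrong for JPEG or GIF files. A small helper built on the standard mimetypes module could pick the type from the file extension instead. guess_image_mime is a hypothetical name, and whether the server cares about the declared type depends on your platform.

```python
import mimetypes


def guess_image_mime(path: str, default: str = "application/octet-stream") -> str:
    """Guess the MIME type from the file extension, falling back to a default."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime or default
```

In the upload function this would replace the literal, along the lines of files = {'file': (os.path.basename(path), f, guess_image_mime(path))}.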

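With the upload function in place, the script can loop over all local images, upload each one, and replace its link in the original text. Note that the replacement must use the original path string exactly as written (markdown_ref) so that the match is exact.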

def process_and_replace_images(parsed_post):
    """
    Process local images in a post: upload them and replace their links.
    Modifies parsed_post['content'] in place.

    Args:
        parsed_post: the dict returned by parse_markdown_file

    Returns:
        dict: parsed_post with its content field updated
    """
    content = parsed_post['content']
    md_file_path = parsed_post['source_path']

    local_images = find_local_images(content, md_file_path)

    if not local_images:
        print(f"Post '{parsed_post['title']}': no local images found.")
        return parsed_post

    print(f"Post '{parsed_post['title']}': found {len(local_images)} local image(s), processing...")

    for img_info in local_images:
        online_url = upload_image_to_blog(img_info['local_path'], img_info['alt'])
        if online_url:
            # Key step: swap the old path for the new network URL.
            # Replacing the full image syntax with the original markdown_ref avoids accidental matches
            old_md_syntax = f'![{img_info["alt"]}]({img_info["markdown_ref"]})'
            new_md_syntax = f'![{img_info["alt"]}]({online_url})'
            content = content.replace(old_md_syntax, new_md_syntax)
        else:
            print(f"Image {img_info['local_path']} failed to upload; link left unchanged.")
    
    parsed_post['content'] = content
    return parsed_post
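The loop above uploads once per reference, so an image that appears twice in a post is uploaded twice. An alternative, sketched here under the assumption that you first build a mapping from each distinct markdown_ref to its uploaded URL, is to do all replacements in a single regex pass. replace_image_links and url_map are hypothetical names.

```python
import re

IMAGE_PATTERN = re.compile(r"!\[(.*?)\]\((.*?)\)")


def replace_image_links(content: str, url_map: dict) -> str:
    """Swap image targets found in url_map; leave every other image untouched."""
    def _swap(match):
        alt, target = match.group(1), match.group(2)
        new_target = url_map.get(target, target)  # unknown targets stay as-is
        return f"![{alt}]({new_target})"

    return IMAGE_PATTERN.sub(_swap, content)
```

Because the substitution only touches text matched as image syntax, the same path appearing in plain prose or in a code sample is left alone.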

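Step 3: Call the publish API and create the post

With the content processed, the last step is to call the blog's publish endpoint. Text values such as tags and categories have to be converted into the IDs the platform expects, so I usually keep a mapping dictionary in the script. To make the script friendlier, I also added simple progress messages.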

def publish_post_to_blog(post_data):
    """
    Simulate calling the blog platform's API to publish a post.

    Args:
        post_data: dict with title, content, tags and other fields

    Returns:
        bool: whether the post was published
    """
    # !!! Adapt this part to your own blog API !!!
    api_url = "https://api.your-blog.com/v1/posts"
    headers = {
        "Authorization": "Bearer YOUR_ACCESS_TOKEN",
        "Content-Type": "application/json"
    }

    # Convert tag names to platform IDs (simplified here; a real setup may need to query the API)
    tag_name_to_id = {"Python": 1, "踩坑记录": 2, "自动化": 3}  # example mapping
    tag_ids = []
    for tag_name in post_data.get('tags', []):
        if tag_name in tag_name_to_id:
            tag_ids.append(tag_name_to_id[tag_name])
        else:
            print(f"Warning: no ID found for tag '{tag_name}', skipping it.")

    # Build the API request body
    payload = {
        "title": post_data['title'],
        "content": post_data['content'],
        "status": "publish",  # publish immediately; 'draft' would save a draft instead
        "tags": tag_ids,
        "category": post_data.get('category', ''),
    }

    try:
        resp = requests.post(api_url, json=payload, headers=headers, timeout=30)
        resp.raise_for_status()
        print(f"Post published: {post_data['title']}")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Failed to publish '{post_data['title']}': {e}")
        # e.response is None when no response arrived (timeout, connection error)
        if e.response is not None:
            print(f"Response body: {e.response.text}")
        return False
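The tag loop assumes that tags is already a clean list of names. In practice the front matter may hold a single comma-separated string, and the names may carry the leading "#" used in the notes; iterating over a plain string would walk it character by character. The sketch below normalizes both cases before the lookup. normalize_tags and resolve_tag_ids are hypothetical helpers, and the mapping in the usage is made up.

```python
def normalize_tags(raw_tags) -> list:
    """Accept a list or a comma-separated string and strip leading '#' marks."""
    if raw_tags is None:
        return []
    if isinstance(raw_tags, str):
        raw_tags = raw_tags.split(",")
    cleaned = (str(tag).strip().lstrip("#").strip() for tag in raw_tags)
    return [tag for tag in cleaned if tag]


def resolve_tag_ids(raw_tags, name_to_id: dict):
    """Return (ids, unknown_names) so the caller decides how to report misses."""
    ids, unknown = [], []
    for name in normalize_tags(raw_tags):
        if name in name_to_id:
            ids.append(name_to_id[name])
        else:
            unknown.append(name)
    return ids, unknown
```

Returning the unknown names instead of printing inside the helper keeps the lookup easy to test and leaves the warning policy to the caller.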

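Complete code

Chaining the steps above together gives the main program. I added command-line argument support so the folder to publish can be passed in.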

#!/usr/bin/env python3
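"""
Batch publishing script for Markdown posts
Author: a full-stack developer who fell into the pits
Purpose: walk the given directory for .md files, parse them, upload images, and publish to the blog platform.
Before use, adapt upload_image_to_blog and publish_post_to_blog to your blog's API.
"""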

import os
import re
import argparse
import frontmatter
import requests
from pathlib import Path
from time import sleep

# ---------- Part 1: parse Markdown ----------
def parse_markdown_file(file_path):
    """Parse a Markdown file and extract its metadata and body."""
    with open(file_path, 'r', encoding='utf-8') as f:
        post = frontmatter.load(f)
    
    title = post.get('title', '')
    tags = post.get('tags', [])
    date = post.get('date', None)
    category = post.get('category', '')
    content = post.content
    
    if not title:
        lines = content.split('\n')
        for line in lines:
            if line.startswith('# '):
                title = line[2:].strip()  # drop only the "# " marker
                break
        if not title:
            title = os.path.splitext(os.path.basename(file_path))[0]
    
    return {
        'title': title,
        'content': content,
        'tags': tags,
        'date': date,
        'category': category,
        'source_path': file_path
    }

# ---------- Part 2: handle images ----------
def find_local_images(content, md_file_path):
    """Find every reference to a local image in the Markdown content."""
    pattern = r'!\[(.*?)\]\((.*?)\)'
    matches = re.findall(pattern, content)
    
    local_images = []
    md_dir = os.path.dirname(md_file_path)
    
    for alt_text, img_path in matches:
        if img_path.startswith(('http://', 'https://')):
            continue
            
        if not os.path.isabs(img_path):
            abs_path = os.path.join(md_dir, img_path)
        else:
            abs_path = img_path
        
        if os.path.exists(abs_path):
            local_images.append({
                'alt': alt_text,
                'local_path': abs_path,
                'markdown_ref': img_path
            })
        else:
            print(f"Warning: image file not found {abs_path}")
    
    return local_images

def upload_image_to_blog(local_image_path, alt_text=''):
    """
    Simulate uploading a local image to the blog platform.
    [Important] Adapt this function to your blog's API!
    """
    # !!! Example code, replace before use !!!
    upload_url = "https://api.your-blog.com/v1/upload"
    headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}
    
    with open(local_image_path, 'rb') as f:
        files = {'file': (os.path.basename(local_image_path), f, 'image/png')}
        try:
            resp = requests.post(upload_url, headers=headers, files=files, timeout=30)
            resp.raise_for_status()
            result = resp.json()
            image_url = result.get('url')
            if not image_url:
                print("Warning: upload succeeded but no URL was returned")
                return None
            print(f"Image uploaded: {local_image_path}")
            return image_url
        except Exception as e:
            print(f"Image upload failed {local_image_path}: {e}")
            return None

def process_and_replace_images(parsed_post):
    """Process local images in a post: upload them and replace their links."""
    content = parsed_post['content']
    md_file_path = parsed_post['source_path']
    
    local_images = find_local_images(content, md_file_path)
    
    if not local_images:
        return parsed_post
    
    print(f"Processing images, {len(local_images)} in total...")

    for img_info in local_images:
        online_url = upload_image_to_blog(img_info['local_path'], img_info['alt'])
        if online_url:
            old_syntax = f'![{img_info["alt"]}]({img_info["markdown_ref"]})'
            new_syntax = f'![{img_info["alt"]}]({online_url})'
            content = content.replace(old_syntax, new_syntax)
        else:
            print("Image upload failed; link left unchanged.")
    
    parsed_post['content'] = content
    return parsed_post

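# ---------- Part 3: publish the post ----------
def publish_post_to_blog(post_data):
    """
    Simulate calling the blog platform's API to publish a post.
    [Important] Adapt this function to your blog's API!
    """
    # !!! Example code, replace before use !!!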
    api_url = "https://api.your-blog.com/v1/posts"
    headers = {
        "Authorization": "Bearer YOUR_ACCESS_TOKEN",
        "Content-Type": "application/json"
    }
    
    tag_name_to_id = {"Python": 1, "踩坑记录": 2, "自动化": 3}
    tag_ids = []
    for tag_name in post_data.get('tags', []):
        if tag_name in tag_name_to_id:
            tag_ids.append(tag_name_to_id[tag_name])
        else:
            print(f"Warning: no ID found for tag '{tag_name}'")
    
    payload = {
        "title": post_data['title'],
        "content": post_data['content'],
        "status": "publish",
        "tags": tag_ids,
        "category": post_data.get('category', ''),
    }
    
    try:
        resp = requests.post(api_url, json=payload, headers=headers, timeout=30)
        resp.raise_for_status()
        print(f"Post published: {post_data['title']}")
        return True
    except Exception as e:
        print(f"Failed to publish '{post_data['title']}': {e}")
        return False

# ---------- Main program ----------
def main(markdown_dir):
    """Walk the directory, then process and publish every Markdown file."""
    if not os.path.isdir(markdown_dir):
        print(f"Error: directory does not exist {markdown_dir}")
        return
    
    md_files = []
    for root, dirs, files in os.walk(markdown_dir):
        for file in files:
            if file.lower().endswith('.md'):
                md_files.append(os.path.join(root, file))
    
    if not md_files:
        print(f"No .md files found in {markdown_dir}.")
        return

    print(f"Found {len(md_files)} Markdown post(s), starting...")
    
    success_count = 0
    for i, md_file in enumerate(md_files, 1):
        print(f"\n--- Processing {i}/{len(md_files)}: {os.path.basename(md_file)} ---")

        # 1. Parse
        parsed = parse_markdown_file(md_file)
        print(f"Title: {parsed['title']}")

        # 2. Handle images
        parsed = process_and_replace_images(parsed)

        # 3. Publish
        if publish_post_to_blog(parsed):
            success_count += 1

        # Optional: add a delay so requests are not sent too quickly
        sleep(1)

    print(f"\nDone! Published {success_count}/{len(md_files)} post(s).")

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Batch-publish Markdown posts to a blog')
    parser.add_argument('dir', help='path to the directory containing .md files')
    args = parser.parse_args()
    
    main(args.dir)
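The os.walk loop in main returns files in whatever order the file system reports them, so the publishing order can differ between runs. If a stable order matters, the collection step could be written with pathlib and sorted, as in this sketch (collect_markdown_files is a hypothetical name):

```python
from pathlib import Path


def collect_markdown_files(root: str) -> list:
    """Return every .md file under root (case-insensitive suffix), sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".md"
    )
```

The suffix check mirrors the file.lower().endswith('.md') test in the original loop, so files named with an uppercase .MD extension are still picked up.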

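Pitfalls

Encoding problems and garbled text: at first I did not specify encoding='utf-8', and on Windows opening some Markdown files containing Chinese failed outright with UnicodeDecodeError. Fix: pass utf-8 explicitly for every file operation.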
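Image paths parsed incorrectly: my first regex used a greedy (.*), so whenever a line contained more than one closing parenthesis (two images on the same line, or parenthesized text after an image), the match ran on to the last ) instead of stopping at the end of the link. Fix: switch to the lazy (.*?), which stops at the first closing parenthesis after the link opens. An alt text with parentheses, such as ![图(1)](./pic.png), is handled as well; a path that itself contains ) would still be cut short.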
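Collateral damage when replacing image links: at first I used content.replace(img_path, online_url). It turned out that when two posts used the same image name, or the same path string appeared elsewhere in the body, the wrong text got replaced. Fix: replace the complete Markdown image syntax, ![alt](old_path) -> ![alt](new_url), so that each replacement is an exact one-to-one swap.
API rate limiting and timeouts: while uploading dozens of images in a row, the rapid back-to-back requests were rate limited by the server, which returned 429 errors. Fix: add a short delay with time.sleep(1) after each post or each image upload to mimic manual pacing, which solved the problem. I also pass the timeout parameter to every requests call so the script cannot hang forever on a bad network.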
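A fixed sleep(1) is the simplest answer to rate limiting. A somewhat more forgiving option is to retry a failed call with an increasing delay, sketched below with the standard library only. retry_with_backoff, should_retry and the injectable sleep parameter are hypothetical, and the right retry count and delays depend on your platform's limits.

```python
import time


def retry_with_backoff(func, retries: int = 3, base_delay: float = 1.0,
                       should_retry=lambda exc: True, sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or when the error is not retryable
            if attempt == retries or not should_retry(exc):
                raise
            sleep(base_delay * (2 ** attempt))
```

With requests, should_retry could for instance check that the exception carries a response whose status code is 429, so that other errors fail immediately. The sleep argument exists mainly so the waiting can be replaced in tests.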

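Summary

My biggest takeaway from this project is that "automation" is not just calling an API; the data conversion and error handling matter more. Shaping messy local data into a format the API can "swallow" took about 80% of the work. Once the script ran, I published 50-odd old notes in one go and saved at least a dozen hours of manual work. Possible next steps include retrying on failure, a preview before publishing, and more sophisticated metadata matching, to make the tool more robust.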
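As a rough idea of what the pre-publish preview could look like, the sketch below adds a --dry-run flag to the argument parser and formats a one-line summary per post instead of sending anything. build_parser, preview and the flag name are hypothetical; wiring them into main is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line parser with an optional dry-run switch."""
    parser = argparse.ArgumentParser(description="Batch-publish Markdown posts")
    parser.add_argument("dir", help="directory containing .md files")
    parser.add_argument("--dry-run", action="store_true",
                        help="print what would be sent without making HTTP calls")
    return parser


def preview(post: dict, width: int = 60) -> str:
    """Format a one-post summary for the dry-run output."""
    body = " ".join(post["content"].split())  # collapse whitespace
    snippet = body[:width] + ("..." if len(body) > width else "")
    return f"[dry-run] {post['title']} | tags={post.get('tags', [])} | {snippet}"
```

In main, the flag would be checked before the upload and publish steps, printing preview(parsed) and skipping the network calls when it is set.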