Book slicing example: generating an Excel sheet with Python, then converting it to a SQLite database

Procedure:

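Workflow: epub -> html -> xls -> db
First convert the book to an HTML file, then inspect its structure and its distinctive markers, such as <h1>, [], {}, <>, 《》, and 【】, to decide which columns to define.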
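One way to do this inspection is to count which tag and class combinations occur in the HTML. The following is a minimal sketch using only the standard library. The class name `TagCounter` and the sample snippet are illustrative assumptions; the snippet simply mirrors the `<h2 class="sect2">` and `<span class="emphasis_bold">` markup that the scripts in this post rely on.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count (tag, class) pairs to show which markup marks headings and fields."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a missing class becomes ""
        cls = dict(attrs).get("class") or ""
        self.counts[(tag, cls)] += 1


# Assumed sample markup; replace with the contents of the real HTML file
snippet = (
    '<h2 class="sect2">王绵之：加味香苏散</h2>'
    '<p><span class="emphasis_bold">【组成】</span>紫苏叶5g</p>'
    '<p><span class="emphasis_bold">【功效】</span>发汗解表。</p>'
)

counter = TagCounter()
counter.feed(snippet)
for (tag, cls), n in counter.counts.most_common():
    print(tag, cls, n)
```

Combinations that appear once per record (here the `h2` with class `sect2`) are good candidates for the record boundary, and those that repeat within a record are candidates for columns.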
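Implement in Python: read the attached book in HTML format, starting from the heading "王绵之：加味香苏散", which is introduced by a tag beginning <h2 class="sect2" id="text0. The heading goes into the column 方剂名称 (formula name), and the content that follows goes into the columns 组成 (composition), 功效 (effects), 主治 (indications), 用法 (usage), and 经验 (clinical experience). Extract this content into an Excel file. A complete record looks like this: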

王绵之：加味香苏散

【组成】紫苏叶5g，陈皮、香附各4g，炙甘草2.5g，荆芥、秦艽、防风、蔓荆子各3g，川芎1.5g，生姜3片。

【功效】发汗解表。

【主治】四时感冒之风寒表证，可见头痛项强、鼻塞流涕、身体疼痛、发热恶寒或恶风、无汗、舌苔薄白、脉浮等。

【用法】水煎服，每日1剂。

【经验】本方王老多用于治疗四时感冒之风寒表证的轻症。方中紫苏叶、荆芥解表，秦艽、防风解散肌表风寒，川芎助紫苏叶、蔓荆子上行，散风止痛。对四时感冒之风寒表证，加味香苏散是一个基本方，适于体质较弱的老人或小孩，以及妇女经期的感冒。此外，方中有香附、陈皮、紫苏叶，所以素有胃脘痛、胃气痛、胃寒痛，又外感寒邪，除感冒症状外尚有胃脘不舒者，可用此方，因为其有理气健胃的功效。〔《健康时报》编辑部.全家人的小药方2［M］.北京：北京科学技术出版社，2012，4-5〕
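To make the column mapping concrete, here is a minimal sketch that splits a plain-text record like the one above into fields by its 【...】 markers. The helper name `split_record` and the shortened sample text are assumptions made for illustration; the scripts in this post work on the HTML paragraphs instead.

```python
import re

# Hypothetical helper: the five markers correspond to the five content columns
MARKER = re.compile(r"【(组成|功效|主治|用法|经验)】")


def split_record(text):
    """Split one record's text into a {marker name: content} dict."""
    fields = {}
    # With a capturing group, re.split returns
    # [text before first marker, name1, content1, name2, content2, ...]
    parts = MARKER.split(text)
    for name, content in zip(parts[1::2], parts[2::2]):
        fields[name] = content.strip()
    return fields


# Shortened sample, not the full record
sample = "【组成】紫苏叶5g，生姜3片。\n【功效】发汗解表。\n【用法】水煎服，每日1剂。"
record = split_record(sample)
print(record["功效"])  # prints 发汗解表。
```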

Python source code

This script parses the HTML file and extracts the formula information into an Excel file.


import pandas as pd
from bs4 import BeautifulSoup

# Read the HTML file; for other books with the same structure, only the file name below needs to change
with open('国医大师专科专病用方经验.第1辑.肺系病分册.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find all <h2 class="sect2"> tags
h2_tags = soup.find_all('h2', class_='sect2')

data = []

for h2 in h2_tags:
    # Formula name: the heading text with HTML tags removed
    formula_name = h2.get_text(strip=True)

    # Skip non-formula headings (formula headings normally contain a full-width colon "：")
    if "：" not in formula_name:
        continue

    # Dictionary holding the fields of the current formula
    formula_data = {
        "方剂名称": formula_name,
        "组成": "",
        "功效": "",
        "主治": "",
        "用法": "",
        "经验": ""
    }

    # Walk the siblings after this h2 until the next <h2 class="sect2"> or the end of the document
    current_element = h2.next_sibling

    while current_element is not None and (current_element.name != 'h2' or 'sect2' not in current_element.get('class', [])):
        if current_element.name == 'p':
            # Paragraph text
            text = current_element.get_text(strip=True)

            # Identify the section by its 【...】 marker and strip the marker from the text
            if "【组成】" in text:
                formula_data["组成"] = text.replace("【组成】", "").strip()
            elif "【功效】" in text:
                formula_data["功效"] = text.replace("【功效】", "").strip()
            elif "【主治】" in text:
                formula_data["主治"] = text.replace("【主治】", "").strip()
            elif "【用法】" in text:
                formula_data["用法"] = text.replace("【用法】", "").strip()
            elif "【经验】" in text:
                formula_data["经验"] = text.replace("【经验】", "").strip()

        current_element = current_element.next_sibling

    # Add the current formula to the list
    data.append(formula_data)

# Build the DataFrame
df = pd.DataFrame(data)

# Save to an Excel file
df.to_excel('国医大师方剂提取结果.xlsx', index=False, engine='openpyxl')

print(f"Extracted {len(data)} formulas and saved them to the Excel file")

If you need more precise extraction (the HTML structure may be more complex than the first script assumes), here is an enhanced version:


import pandas as pd
from bs4 import BeautifulSoup

# Read the HTML file
with open('国医大师专科专病用方经验.第1辑.肺系病分册.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find all <h2 class="sect2"> tags
h2_tags = soup.find_all('h2', class_='sect2')

data = []

for h2 in h2_tags:
    # Formula name
    formula_name = h2.get_text(strip=True)

    # Only process headings containing a full-width colon (format "physician：formula name")
    if "：" not in formula_name:
        continue

    # Dictionary holding the fields of the current formula
    formula_data = {
        "方剂名称": formula_name,
        "组成": "",
        "功效": "",
        "主治": "",
        "用法": "",
        "经验": ""
    }

    # Walk the sibling tags after this h2 until the next h2.sect2
    current = h2.find_next_sibling()

    while current and not (current.name == 'h2' and 'sect2' in current.get('class', [])):
        if current.name == 'p':
            # Look for a <span class="emphasis_bold"> holding the section marker
            bold_span = current.find('span', class_='emphasis_bold')
            if bold_span:
                section_type = bold_span.get_text(strip=True)
                # Text of the whole paragraph
                full_text = current.get_text(strip=True)

                # Remove the marker (first occurrence only) and keep the content
                content = full_text.replace(section_type, "", 1).strip()

                # Assign the content to the field that matches the marker
                if "【组成】" in section_type:
                    formula_data["组成"] = content
                elif "【功效】" in section_type:
                    formula_data["功效"] = content
                elif "【主治】" in section_type:
                    formula_data["主治"] = content
                elif "【用法】" in section_type:
                    formula_data["用法"] = content
                elif "【经验】" in section_type:
                    formula_data["经验"] = content

        current = current.find_next_sibling()

    # Add the current formula to the list
    data.append(formula_data)

# Build the DataFrame
df = pd.DataFrame(data)

# Save to an Excel file
output_file = '国医大师方剂提取结果.xlsx'
df.to_excel(output_file, index=False, engine='openpyxl')

print(f"Extracted {len(data)} formulas")
print(f"Saved to file: {output_file}")

# Show the first few rows
print("\nFirst 5 formulas:")
print(df.head())

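Usage:

Save the code above as a .py file (for example extract_formulas.py)
Make sure the HTML file is in the same directory as the Python script, or change the file path

Install the required libraries: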


pip install pandas openpyxl beautifulsoup4

Run the script

Output:

The program creates an Excel file named 国医大师方剂提取结果.xlsx
The file has 6 columns: 方剂名称, 组成, 功效, 主治, 用法, 经验
The program prints the number of formulas extracted
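The workflow ends with xls -> db, a step the scripts above do not cover. What follows is a minimal sketch of that step using only the standard `sqlite3` module, assuming the records are available as the same list of dicts that the scripts build in `data`. The function name `rows_to_sqlite`, the table name `formulas`, and the shortened demo record are assumptions for illustration, not the author's actual conversion method.

```python
import sqlite3

# Same six columns as the Excel output
COLUMNS = ["方剂名称", "组成", "功效", "主治", "用法", "经验"]


def rows_to_sqlite(rows, db_path, table="formulas"):
    """Write a list of dicts (one per formula) into a SQLite table."""
    col_defs = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            # Missing keys are stored as empty strings
            [tuple(row.get(c, "") for c in COLUMNS) for row in rows],
        )
    return conn


# Demo with an in-memory database and one shortened record
demo_rows = [{"方剂名称": "王绵之：加味香苏散", "功效": "发汗解表。", "用法": "水煎服，每日1剂。"}]
conn = rows_to_sqlite(demo_rows, ":memory:")
result = conn.execute('SELECT "方剂名称", "功效", "组成" FROM formulas').fetchall()
print(result)
conn.close()
```

To load from the generated workbook instead of `data`, `pd.read_excel('国医大师方剂提取结果.xlsx').fillna("").to_dict("records")` should yield an equivalent list of dicts, and passing a file path such as a hypothetical `formulas.db` in place of `":memory:"` writes the database to disk.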

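Notes:

The code assumes that every formula section starts with <h2 class="sect2">
The enhanced version also assumes that each part of a formula (组成, 功效, and so on) sits in a <p> tag containing a <span class="emphasis_bold"> tag
If the HTML structure differs, the parsing logic may need to be adjusted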
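Because the extraction depends on these structural assumptions, it can help to check the result for records with empty fields, which usually point to paragraphs the parser did not recognize. A minimal sketch, assuming the list of dicts built in `data`; the helper name `find_incomplete` and the placeholder records are made up for illustration.

```python
# The five content columns that every formula is expected to fill
FIELDS = ["组成", "功效", "主治", "用法", "经验"]


def find_incomplete(rows):
    """Return (formula name, list of empty fields) for records missing any field."""
    report = []
    for row in rows:
        missing = [f for f in FIELDS if not row.get(f, "").strip()]
        if missing:
            report.append((row.get("方剂名称", ""), missing))
    return report


# Placeholder records, not taken from the book
rows = [
    {"方剂名称": "A：甲方", "组成": "x", "功效": "x", "主治": "x", "用法": "x", "经验": "x"},
    {"方剂名称": "B：乙方", "组成": "x", "功效": "", "主治": "x", "用法": "x", "经验": ""},
]
for name, missing in find_incomplete(rows):
    print(name, "missing:", ", ".join(missing))
```

Running `find_incomplete(data)` after extraction gives a list of formulas to inspect by hand in the HTML.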

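Sample download

Includes the HTML and xlsx files for 国医大师专科专病用方经验 and the source code. If you need the book-to-HTML conversion tool, contact the author 高山羊止 or leave a comment.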
https://wwbur.lanzoul.com/ia1D23i1tzfc