Dify Agent Platform Source Code Secondary Development Notes (6) - Optimizing PDF Document Recognition in the Knowledge Base

Contents

Preface

Adding the PdfNewExtractor class

Replacing the extractor in ExtractProcessor

Final results


Preface

Dify 1.1.3 implements knowledge-base PDF parsing with pypdfium2 for text extraction, which has the following main problems:

  1. Limited text extraction, with poor support for tables and images

  2. No optimization specifically for Chinese text

  3. No document structure analysis

  4. No document quality assessment

Proposed improvements:

  1. Replace pypdfium2 with pdfplumber

  2. Add OCR support

  3. Improve the Chinese text handling logic

  4. Add document structure analysis

  5. Implement intelligent table recognition

  6. Add a caching mechanism

  7. Improve handling of large files

Install the pdfplumber and pytesseract packages:

pip install pdfplumber
pip install pytesseract
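
pytesseract is only a wrapper, so the Tesseract OCR engine and the chi_sim / chi_tra language data used later in the OCR call must also be installed on the machine (for example via the tesseract-ocr, tesseract-ocr-chi-sim and tesseract-ocr-chi-tra packages on Debian/Ubuntu). A quick optional sanity check from Python, assuming a reasonably recent pytesseract:

import pytesseract

# Confirms the Tesseract binary is on PATH and the Chinese language packs are visible.
print(pytesseract.get_tesseract_version())
print([lang for lang in pytesseract.get_languages(config="") if lang.startswith("chi")])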

Adding the PdfNewExtractor class

Add a new PdfNewExtractor processing class to replace the old PdfExtractor:

from collections.abc import Iterator
from typing import Optional, cast
import pdfplumber
import pytesseract
from PIL import Image
import io

from core.rag.extractor.blob.blob import Blob
from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document
from extensions.ext_storage import storage

class PdfNewExtractor(BaseExtractor):
    """Enhanced PDF loader with improved text extraction, OCR support, and structure analysis.

    Args:
        file_path: Path to the PDF file to load.
        file_cache_key: Optional cache key for storing extracted text.
        enable_ocr: Whether to enable OCR for text extraction from images.
    """

    def __init__(self, file_path: str, file_cache_key: Optional[str] = None, enable_ocr: bool = False):
        """Initialize with file path and optional settings."""
        self._file_path = file_path
        self._file_cache_key = file_cache_key
        self._enable_ocr = enable_ocr

    def extract(self) -> list[Document]:
        """Extract text from PDF with caching support."""
        plaintext_file_exists = False
        if self._file_cache_key:
            try:
                text = cast(bytes, storage.load(self._file_cache_key)).decode("utf-8")
                plaintext_file_exists = True
                return [Document(page_content=text)]
            except FileNotFoundError:
                pass

        documents = list(self.load())
        text_list = []
        for document in documents:
            text_list.append(document.page_content)
        text = "\n\n".join(text_list)

        # Save plaintext file for caching
        if not plaintext_file_exists and self._file_cache_key:
            storage.save(self._file_cache_key, text.encode("utf-8"))

        return documents

    def load(self) -> Iterator[Document]:
        """Lazy load PDF pages with enhanced text extraction."""
        blob = Blob.from_path(self._file_path)
        yield from self.parse(blob)

    def parse(self, blob: Blob) -> Iterator[Document]:
        """Parse PDF with enhanced features including OCR and structure analysis."""
        with blob.as_bytes_io() as file_obj:
            with pdfplumber.open(file_obj) as pdf:
                for page_number, page in enumerate(pdf.pages):
                    # Extract text with layout preservation. pdfplumber already returns
                    # decoded Unicode strings, so no re-encoding pass is needed; just
                    # guard against pages with no extractable text.
                    content = page.extract_text(layout=True) or ""
                    
                    # Extract tables if present
                    tables = page.extract_tables()
                    if tables:
                        table_text = "\n\nTables:\n"
                        for table in tables:
                            # Convert table to text format
                            table_text += "\n" + "\n".join(
                                ["\t".join([str(cell) if cell else "" for cell in row]) 
                                 for row in table]
                            )
                        content += table_text

                    # Perform OCR if enabled and text content is limited or contains potential encoding issues
                    if self._enable_ocr and (len(content.strip()) < 100 or any('\ufffd' in line for line in content.splitlines())):
                        image = page.to_image()
                        img_bytes = io.BytesIO()
                        image.original.save(img_bytes, format='PNG')
                        img_bytes.seek(0)
                        pil_image = Image.open(img_bytes)
                        # Use multiple language models and improve OCR accuracy
                        ocr_text = pytesseract.image_to_string(
                            pil_image,
                            lang='chi_sim+chi_tra+eng',  # simplified and traditional Chinese plus English
                            config='--psm 3 --oem 3'  # psm 3: automatic page segmentation; oem 3: default engine mode
                        )
                        if ocr_text.strip():
                            # Clean and normalize OCR text
                            ocr_text = ocr_text.replace('\x0c', '').strip()
                            content = f"{content}\n\nOCR Text:\n{ocr_text}"

                    metadata = {
                        "source": blob.source,
                        "page": page_number,
                        "has_tables": bool(tables)
                    }
                    
                    yield Document(page_content=content, metadata=metadata)
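
For a quick standalone check before wiring the class into Dify, something like the sketch below exercises the extractor directly (run it inside the Dify api project so the imports resolve; the PDF path is a placeholder, and leaving file_cache_key unset keeps the run independent of the storage backend):

extractor = PdfNewExtractor("/path/to/sample.pdf", enable_ocr=True)
for doc in extractor.extract():
    # metadata is populated in parse(): source, page index and a has_tables flag
    print(f"page {doc.metadata['page']} (tables: {doc.metadata['has_tables']})")
    print(doc.page_content[:200])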

Replacing the extractor in ExtractProcessor

In ExtractProcessor, replace both occurrences of extractor = PdfExtractor(file_path) with extractor = PdfNewExtractor(file_path).

In this version of the code they appear at lines 144 and 148.
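
For orientation, both call sites sit in the file-extension dispatch of ExtractProcessor.extract(), one per ETL branch; after the change the PDF branch reads roughly as in the abridged, hypothetical sketch below (the real method handles many more file types and may differ between Dify versions). PdfNewExtractor also needs to be imported at the top of extract_processor.py alongside the existing extractor imports.

# Hypothetical, abridged sketch of the PDF branch only - not the full Dify dispatch.
def _pick_pdf_extractor(file_extension: str, file_path: str):
    if file_extension == ".pdf":
        return PdfNewExtractor(file_path)  # previously: PdfExtractor(file_path)
    raise NotImplementedError(f"no extractor sketched for {file_extension}")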

Final results

In testing, the optimized extraction works well.
