【BUG】程序卡死,无法捕获异常,无法设置超时,无法使用线程池管理

目录

报错内容

在使用pymupdf解析PDF时,出现报错

bash 复制代码
MuPDF error: format error: object is not a stream
MuPDF error: syntax error: invalid ICC colorspace
MuPDF error: syntax error: unknown cid font type
MuPDF error: format error: object is not a stream
MuPDF error: syntax error: invalid ICC colorspace
MuPDF error: syntax error: unknown cid font type
MuPDF error: syntax error: unknown cid font type
MuPDF error: syntax error: unknown cid font type
MuPDF error: library error: zlib error: invalid stored block lengths

试错方案

这些是尝试解决问题的过程,都是无效方案,可以跳过

因为我是记录bug笔记,所以写在这里

捕获更具体的异常

失败

bash 复制代码
import fitz
def pdf2text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file) # open a document
        for page in doc: # iterate the document pages
            text = page.get_text() # get plain text (is in UTF-8)
            texts += text
        data['document'] = texts
        data['metadata'] =  pdf_file
        data['format'] =  "pdf_text"
        return data
    except fitz.FitzError as fe:
        print("MuPDF specific error:", fe)
        return False
    except Exception as e:
        print(e)
        return False

设置超时

失败

bash 复制代码
import fitz
import signal

def handler(signum, frame):
    raise TimeoutError("Operation timed out")

signal.signal(signal.SIGALRM, handler)

def pdf2text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file) # open a document
        for page in doc: # iterate the document pages
            signal.alarm(10)  # 设置超时为 10 秒
            text = page.get_text() # get plain text (is in UTF-8)
            signal.alarm(0)  # 取消超时
            texts += text
        data['document'] = texts
        data['metadata'] =  pdf_file
        data['format'] =  "pdf_text"
        return data
    except TimeoutError as te:
        print("Timeout error:", te)
        return False
    except Exception as e:
        print(e)
        return False

使用子进程

单独运行文件成功了,但是在服务调用时仍然卡死

bash 复制代码
import fitz
import multiprocessing

def extract_text(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file)
        for page in doc:
            texts += page.get_text()
        data['document'] = texts
        data['metadata'] = pdf_file
        data['format'] = "pdf_text"
        return data
    except Exception as e:
        print(f"Error in subprocess: {e}")
        return None

def pdf2text(pdf_file):
    try:
        with multiprocessing.Pool(1) as pool:
            result = pool.apply_async(extract_text, (pdf_file,))
            return result.get(timeout=5)  # 设置超时为30秒
    except multiprocessing.TimeoutError:
        print("Subprocess timed out.")
        return False
    except Exception as e:
        print(f"Main process error: {e}")
        return False

使用线程池

失败

bash 复制代码
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import fitz

def parse_pdf_page(pdf_file):
    try:
        data = dict()
        texts = ''
        doc = fitz.open(pdf_file)
        for page in doc:
            text = page.get_text()
            texts += text
        data['document'] = texts
        data['metadata'] = pdf_file
        data['format'] = "pdf_text"
        return data
    except Exception as e:
        return str(e)

def pdf2text_with_threadpool(pdf_file, timeout=30):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(parse_pdf_page, pdf_file)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            print("PDF parsing timed out.")
            return False

# 服务中调用示例
result = pdf2text_with_threadpool('your_pdf_file.pdf')

成功方案

使用 subprocess 调用外部脚本

脚本中设置子进程超时反馈

主进程

main.py

bash 复制代码
import subprocess
import json
pdf_file_path = "your_pdf_file"
result = subprocess.run([
        'python', 
        'pdf_demo.py',
          pdf_file_path
          ], 
          capture_output=True, 
          text=True)
try:
    result = json.loads(result.stdout)
    print(result)
except json.JSONDecodeError:
    print("未解析",repr(result.stdout))

外部脚本

pdf_demo.py

bash 复制代码
import fitz
import multiprocessing
import sys
import json
def pdf2text(pdf_file):
    data = dict()
    texts = ''

    doc = fitz.open(pdf_file)
    for page in doc:
        text = page.get_text()
        texts += text

    data['document'] = texts
    data['metadata'] =  pdf_file
    data['format'] =  "pdf_text"
    return data


if __name__ == "__main__":
    pdf_file_path = sys.argv[1]
    pool = multiprocessing.Pool(1)
    result = pool.apply_async(pdf2text, (pdf_file_path,))
    try:
        data = result.get(timeout=5)
        print(json.dumps(data, ensure_ascii=False, indent=4))
    except multiprocessing.TimeoutError:
        pool.terminate()
        print("Subprocess timed out.")
    except Exception as e:
        print(f"Main process error: {e}")
    finally:
        pool.close()
        pool.join()
相关推荐
可乐鸡翅好好吃7 小时前
通过BUG(prvIdleTask、pxTasksWaitingTerminatio不断跳转问题)了解空闲函数(prvIdleTask)和TCB
c语言·stm32·单片机·嵌入式硬件·bug·keil
神膘护体小月半14 小时前
bug 记录 - 使用 el-dialog 的 before-close 的坑
前端·javascript·bug
顽强d石头19 小时前
bug:undefined is not iterable (cannot read property Symbol(Symbol.iterator))
前端·bug
阿松のblog1 天前
opencv使用经典bug
人工智能·opencv·bug
学习啷个办2 天前
centos挂载目录满但实际未满引发系统宕机
bug
我们的五年3 天前
【Qt】Bug:findChildren找不到控件
开发语言·qt·bug
seiyaaa3 天前
Claude Opus solved my white whale bug today that I couldn‘t find in 4 years
bug
六天测试工程师3 天前
做好 4个基本动作,拦住性能优化改坏原功能的bug
服务器·性能优化·bug
良辰美景好时光4 天前
keepalived定制日志bug
linux·运维·bug
CYRUS STUDIO4 天前
FART 自动化脱壳框架一些 bug 修复记录
android·bug·逆向·fart·脱壳