Recognizing text in an image with Python and saving it to Word

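Python can use the third-party library pytesseract to recognize text in an image and save the result to a Word document. The code itself is not complicated, but setting up the pytesseract environment is a bit of a hassle, so this post summarizes the steps.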

1. Introduction

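Tesseract is an open-source optical character recognition (OCR) engine that was originally developed at HP Labs and later maintained by Google. It is available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract text from images. It is designed mainly for printed text; results on handwritten text are usually far less reliable.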

2. Package installation


pip install pytesseract pillow python-docx
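The code in this post imports three modules: pytesseract, PIL (provided by the Pillow package) and docx (provided by the python-docx package). A minimal sketch of checking that all three can be imported before going further; the helper name `missing_modules` is made up for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names used by the code in this post (not the pip package names)
    missing = missing_modules(["pytesseract", "PIL", "docx"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed")
```

If the check reports PIL or docx as missing, installing pillow or python-docx with pip should resolve it.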

3. Code


import pytesseract
from PIL import Image
from docx import Document


def convert_image_to_editable_docx(image_file, docx_file):
    # Open the image and run OCR on it
    image = Image.open(image_file)
    # Call pytesseract's image_to_string on the image; lang='chi_sim' selects Simplified Chinese
    text = pytesseract.image_to_string(image, lang='chi_sim')

    # Create a Word document and insert the recognized text
    doc = Document()
    doc.add_paragraph(text)
    doc.save(docx_file)

# Example usage
input_image = "1.png"   # path to the input image
output_docx = "output.docx"   # path to the output Word document

convert_image_to_editable_docx(input_image, output_docx)
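The function above puts all recognized text into a single Word paragraph, and python-docx turns each newline inside that paragraph into a line break. If separate paragraphs are preferred, the OCR output can be split on blank lines first. The sketch below is one possible way to do that; the helper name `text_to_paragraphs` is hypothetical, and it assumes that blank lines in the OCR output mark paragraph boundaries. It also removes control characters such as the form feed that Tesseract may append as a page separator, since such characters are not valid in the document's XML.

```python
import re


def text_to_paragraphs(text):
    """Split OCR output into a list of paragraph strings."""
    # Remove control characters other than tab, newline and carriage return
    cleaned = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Treat one or more blank lines as a paragraph boundary
    blocks = re.split(r"\n\s*\n", cleaned)
    # Trim surrounding whitespace and drop empty blocks
    return [block.strip() for block in blocks if block.strip()]


# Possible use inside convert_image_to_editable_docx, replacing doc.add_paragraph(text):
#     for paragraph in text_to_paragraphs(text):
#         doc.add_paragraph(paragraph)
```

Lines within one block stay together as a single paragraph, so the layout of each recognized block is preserved.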

If the Tesseract engine itself is not installed, running the code raises an error: pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
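This error means that pytesseract could not launch the tesseract executable. A minimal sketch, using only the standard library, of checking whether the executable is reachable through PATH before running OCR; the helper name `find_tesseract` is made up for illustration.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the executable found on PATH, or None if absent."""
    return shutil.which(command)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; install it or add its folder to PATH")
    else:
        print("tesseract found at", path)
```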

4. Useful Tesseract links

Download: https://digi.bib.uni-mannheim.de/tesseract/

Official repository: https://github.com/tesseract-ocr/tesseract

Official documentation: https://github.com/tesseract-ocr/tessdoc

Language data: https://github.com/tesseract-ocr/tessdata

Language data (mirror in China): https://gitcode.com/mirrors/tesseract-ocr/tessdata/tree/main?utm_source=csdn_github_accelerator&isLogin=1
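The code in this post passes lang='chi_sim', so Tesseract needs the file chi_sim.traineddata from the language data repository. A small sketch of checking that the file is in place follows. It assumes the language files live in a tessdata folder, which on Windows installs is typically inside the installation directory; the helper name `has_language` and the example path are illustrative only.

```python
from pathlib import Path


def has_language(tessdata_dir, lang="chi_sim"):
    """Return True if <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # Example path only; adjust it to the actual tessdata folder
    tessdata = r"D:\Program Files\Tesseract-OCR\tessdata"
    if has_language(tessdata):
        print("chi_sim language data found")
    else:
        print("chi_sim.traineddata is missing; download it and copy it into", tessdata)
```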

5. Installation

The installer I downloaded: tesseract-ocr-w64-setup-v5.1.0.20220510.exe

My installation directory: D:\Program Files\Tesseract-OCR

6. Setting the environment variable

Add to Path
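As an alternative to editing Path, pytesseract can be told where the executable is by setting pytesseract.pytesseract.tesseract_cmd in the script. The sketch below assumes the installation directory mentioned in this post and an executable named tesseract.exe inside it; the helper names `tesseract_exe` and `configure_pytesseract` are made up for illustration.

```python
from pathlib import PureWindowsPath

# Installation directory used in this post; change it to match the local install
INSTALL_DIR = r"D:\Program Files\Tesseract-OCR"


def tesseract_exe(install_dir=INSTALL_DIR):
    """Build the full Windows path of tesseract.exe inside the install directory."""
    return str(PureWindowsPath(install_dir) / "tesseract.exe")


def configure_pytesseract(install_dir=INSTALL_DIR):
    """Point pytesseract at the executable instead of relying on Path."""
    import pytesseract  # imported here so tesseract_exe works without pytesseract
    pytesseract.pytesseract.tesseract_cmd = tesseract_exe(install_dir)
```

Calling configure_pytesseract() once, before image_to_string, should be enough. This setting only affects the current script, whereas the Path change applies to every program.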