Tesseract OCR: installation, usage, and training

Installation

The jTessBoxEditor training tool ships with its own copy of tesseract-ocr.
Download page: VietOCR - Browse /jTessBoxEditor at SourceForge.net

jTessBoxEditor depends on a JDK, for example jdk-8u201-windows-x64.exe. Download page: Java Downloads | Oracle

Tesseract source repository: https://github.com/tesseract-ocr/tesseract

PHP wrapper package for Tesseract: composer require thiagoalessio/tesseract_ocr

Windows system environment variable: TESSDATA_PREFIX = D:\soft\jTessBoxEditor-2.5.0\tesseract-ocr\tessdata\

jTessBoxEditor-2.2.0 bundles the Tesseract V4 engine.

jTessBoxEditor-2.5.0 bundles the Tesseract V5 engine.

Usage

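tesseract 1.png output -l eng
This calls tesseract.exe directly and writes output.txt to the current directory. The -l option selects the language; Tesseract looks for the matching xxxx.traineddata file in the tessdata directory.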
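
Before calling tesseract with a custom language, it can help to check that the model file is where the -l lookup expects it. The following is a minimal sketch. The helper name traineddataPath is hypothetical, and it assumes TESSDATA_PREFIX names the tessdata folder itself, as in the setup above.

```php
<?php
/**
 * Build the path of the model file for a language.
 * Assumption: $tessdataDir is the tessdata folder itself.
 */
function traineddataPath(string $tessdataDir, string $lang): string
{
    // Drop a trailing slash or backslash before appending the file name.
    return rtrim($tessdataDir, "/\\") . DIRECTORY_SEPARATOR . $lang . '.traineddata';
}

// Fall back to the path used in this article when the variable is not set.
$dir = getenv('TESSDATA_PREFIX') ?: 'D:\\soft\\jTessBoxEditor-2.5.0\\tesseract-ocr\\tessdata\\';

foreach (['eng', 'luoma'] as $lang) {
    $path = traineddataPath($dir, $lang);
    echo $lang, ': ', (is_file($path) ? 'found' : 'missing'), ' (', $path, ")\n";
}
```

If a language is reported as missing, a call with -l for that language is likely to fail, so this is a cheap check to run first.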

Calling from PHP

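require '../composer/vendor/autoload.php'; // load the Composer autoloader
use thiagoalessio\TesseractOCR\TesseractOCR;
use thiagoalessio\TesseractOCR\UnsuccessfulCommandException;

$imagePath = './temp/t440_546.png'; // image to recognize

try {
    $tesseract = new TesseractOCR($imagePath);
    // Use the tesseract.exe bundled with jTessBoxEditor
    $tesseract->executable('D:/soft/jTessBoxEditor-2.5.0/tesseract-ocr/tesseract.exe');
    $tesseract->lang('luoma'); // other options: eng, osd, digits
    $text = $tesseract->run();
    echo "Result: " . $text;
} catch (UnsuccessfulCommandException $e) {
    // The tesseract command ran but reported a failure
    echo "Error while recognizing image {$imagePath}: {$e->getMessage()}\n";
} catch (Exception $e) {
    // Catch any other exception and print its message
    echo "Unknown error: {$e->getMessage()}\n";
}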

Training

jTBE (short for jTessBoxEditor, used from here on)

jTBE > Trainer settings

Tesseract Executables = D:/soft/jTessBoxEditor-2.5.0/tesseract-ocr/tesseract.exe

Training Data = E:\www\test\ocr\luoma.tiff # in your own temporary working directory

Language = luoma # custom language name

Bootstrap Language = eng # existing language to build on

jTBE > Tools > MergeTIFF merges multiple images into a single .tiff file. The V4 engine needs black text on a white background.

There are two ways to create the box file: with jTBE or from the command line.
1 jTBE > Trainer > Make Box File > Run
2 Command line

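rem First file, 1.bat: creates the box file. luoma is the name of the new custom language.
rem Expects %font%.tif in the current directory; adjust the extension if your merged file is .tiff.
set ExePath=D:\soft\jTessBoxEditor-2.5.0\tesseract-ocr
set font=luoma
%ExePath%\tesseract %font%.tif %font% -l eng --psm 7 batch.nochop makebox
echo %font% 0 0 0 0 0 > font_properties
pause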
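
The five zeros written to font_properties are per-font flags. In Tesseract's legacy training format each line reads fontname italic bold fixed serif fraktur, with 0 or 1 for each flag. As a minimal sketch, a hypothetical helper (fontPropertiesLine is not part of any library) that builds such a line:

```php
<?php
/**
 * Build one font_properties line: "<font> <italic> <bold> <fixed> <serif> <fraktur>".
 * Each flag is written as 0 or 1.
 */
function fontPropertiesLine(
    string $font,
    bool $italic = false,
    bool $bold = false,
    bool $fixed = false,
    bool $serif = false,
    bool $fraktur = false
): string {
    // Convert the booleans to 0/1 in the order the file format expects.
    $flags = array_map('intval', [$italic, $bold, $fixed, $serif, $fraktur]);
    return $font . ' ' . implode(' ', $flags);
}

echo fontPropertiesLine('luoma'), "\n"; // prints "luoma 0 0 0 0 0"
```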

In jTBE > Box Editor, open the xxx.tiff file. Use an image tool such as Photoshop (PS) to read off the box coordinates, then enter the characters.

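Box file format: A 12 11 18 25 0. The fields are character, left, bottom, right, top, and page, where the final 0 is the page index within the tiff.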
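
To make the field order concrete, here is a minimal sketch of a parser for one box line. The function name parseBoxLine and the array keys are hypothetical, chosen for illustration. It assumes the character itself contains no whitespace.

```php
<?php
/**
 * Parse one box-file line such as "A 12 11 18 25 0".
 * Returns null unless the line has one character field plus five integers.
 */
function parseBoxLine(string $line): ?array
{
    $parts = preg_split('/\s+/', trim($line));
    if (count($parts) !== 6) {
        return null;
    }
    // Every field after the character must be a non-negative integer.
    foreach (array_slice($parts, 1) as $number) {
        if (!ctype_digit($number)) {
            return null;
        }
    }
    [$char, $left, $bottom, $right, $top, $page] = $parts;
    return [
        'char'   => $char,
        'left'   => (int) $left,
        'bottom' => (int) $bottom,
        'right'  => (int) $right,
        'top'    => (int) $top,
        'page'   => (int) $page,
    ];
}

print_r(parseBoxLine('A 12 11 18 25 0'));
```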

Mapping between box-file coordinates and Photoshop pixel coordinates. Example: in PS, X=12, Y=10, Width=6, Height=14, and the image height is ImgHeight=35.

Rule: X, ImgHeight - Y - Height, X + Width, ImgHeight - Y = 12 11 18 25
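
The rule flips the Y axis: Photoshop measures Y from the top of the image, while the box file measures it from the bottom. A minimal sketch of the conversion follows. The function name psRectToBox is hypothetical.

```php
<?php
/**
 * Convert a Photoshop-style rectangle (origin at top-left) into
 * box-file coordinates (origin at bottom-left).
 */
function psRectToBox(int $x, int $y, int $width, int $height, int $imgHeight): array
{
    return [
        'left'   => $x,
        'bottom' => $imgHeight - $y - $height,
        'right'  => $x + $width,
        'top'    => $imgHeight - $y,
    ];
}

// The example from the text: X=12, Y=10, Width=6, Height=14, ImgHeight=35.
$box = psRectToBox(12, 10, 6, 14, 35);
echo 'A ', implode(' ', $box), " 0\n"; // prints "A 12 11 18 25 0"
```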

There are two ways to create the traineddata file: with jTBE or from the command line.

1 jTBE > Trainer > Train With Existing Box > Run
2 Command line

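rem Second file, 2.bat: creates luoma.traineddata
rem The training tools below are called by bare name, so they must be reachable via PATH.
set ExePath=D:\soft\jTessBoxEditor-2.5.0\tesseract-ocr
set font=luoma
rem Generate the .tr feature file from the image and its box file
%ExePath%\tesseract %font%.tif %font% nobatch box.train
rem Extract the character set from the box file
unicharset_extractor %font%.box
rem Cluster shapes and train the classifiers
shapeclustering -F font_properties -U unicharset -O unicharset %font%.tr
mftraining -F font_properties -U unicharset %font%.tr
cntraining %font%.tr

rem Prefix each output with the language name so combine_tessdata can find it
rename normproto %font%.normproto
rename inttemp %font%.inttemp
rename pffmtable %font%.pffmtable
rename unicharset %font%.unicharset
rename shapetable %font%.shapetable

rem Pack the prefixed files into %font%.traineddata
combine_tessdata %font%.
pause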

Copy luoma.traineddata to D:\soft\jTessBoxEditor-2.5.0\tesseract-ocr\tessdata\