不可复制的PDF转成双层可复制PDF

有些PDF是通过扫描或者虚拟打印机生成的，这些PDF不可复制里边的内容

市面上的工具一般都是收费或者有水印，所以就萌生了自己搞一个的想法：

使用了以下三个开源库

PdfiumViewer PDF预览及可编辑PDF的提取
PDFsharp 生成PDF
PaddleSharp 对图片OCR识别

大概思路是：

用PdfiumViewer 渲染显示，并转PDF为图片；

使用PaddleSharp 对提取图片的内容及bbox坐标；

把坐标根据缩放比转成相对于PDF的坐标，并使用PDFsharp 重新生成PDF，如需要保持原有格式需要把1转成的图片重新回写到生成的pdf，文字层为ocg层；

实现双层pdf的效果；

读取PDF到到内存流：

复制代码

  private PdfDocument OpenDocument(string fileName)
  {
      try
      {
          return PdfDocument.Load(this, new MemoryStream(File.ReadAllBytes(fileName)));
      }
      catch (Exception ex)
      {
          MessageBox.Show(this, ex.Message, Text, MessageBoxButtons.OK, MessageBoxIcon.Error);
          return null;
      }
  }

PDF转图片：

复制代码

//图片dpi要相对大些，这样ocr识别的更清晰
int dpiX = 96 * 5;
 int dpiY = 96 * 5;
 var pdfWidth = (int)document.PageSizes[page].Width * 4 / 3;
 var pdfHeight = (int)document.PageSizes[page].Height * 4 / 3;
 var rotate = PdfRotation.Rotate0;
 var flags = PdfRenderFlags.Annotations | PdfRenderFlags.CorrectFromDpi;
 using (var image = document.Render(page, pdfWidth, pdfHeight, dpiX, dpiY, rotate, flags)){}

OCR识别：

复制代码

  byte[] sampleImageData = ImageToByte(image);
  FullOcrModel model = LocalFullModels.ChineseV3;
  using (PaddleOcrAll all = new PaddleOcrAll(model, PaddleDevice.Mkldnn())
  {
      AllowRotateDetection = true, /* 允许识别有角度的文字 */
      Enable180Classification = false, /* 允许识别旋转角度大于90度的文字 */
  })
  {
     // Load local file by following code:
     using (Mat src = Cv2.ImDecode(sampleImageData, ImreadModes.Color))
      {
          PaddleOcrResult result = all.Run(src);
          return result;
      }
  }

图片转成流：

复制代码

 private byte[] ImageToByte(System.Drawing.Image image)
 {
     MemoryStream ms = new MemoryStream();
     if (image == null)
         return new byte[ms.Length];
     image.Save(ms, System.Drawing.Imaging.ImageFormat.Png);
     byte[] BPicture = new byte[ms.Length];
     BPicture = ms.GetBuffer();
     return BPicture;
 }

转换bbox坐标到PDF坐标：bbox坐标是相对于图片的坐标，所以缩放比是pdf和图片的之间的转换，图片大小变化，此函数需跟着调整

复制代码

 private System.Drawing.RectangleF ConvertToPDFSize(System.Drawing.RectangleF rectangle, float dpiX, float dpiY, PdfRotation rotate, PdfRenderFlags flags)
 {
     var width = rectangle.Width;
     var height = rectangle.Height;
     var x = rectangle.X;
     var y = rectangle.Y;
     if ((flags & PdfRenderFlags.CorrectFromDpi) != 0)
     {
         width = (width / dpiX * 72);
         height = (height / dpiY * 72);
         x = (x / dpiX * 72);
         y = (y / dpiY * 72);
     }
     return new RectangleF(x, y, width, height);
 }

bbox坐标框选示例：

由于OCR的限制：转双层pdf只能在x64系统运行，不使用OCR可运行x86和x64，

工具功能：

可提取和框选提取可复制和不可复制pdf；
可转换不可复制的pdf为双层可复制pdf；
可转换不可复制的pdf为可复制pdf；
加载图片并提取标注提取内容；

直接下载地址：
https://cloud.189.cn/web/share?code=MZvMb2ZNRbMn（访问码：y9st）
也可按照下放代码自己编译使用，遵循MIT协议
欢迎Start、PR
源码地址：https://github.com/1000374/HM.PdfOcr