How can I use Python to convert an Edge PDF Document file into a TXT file and a docx file?

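To convert a PDF file to TXT or docx, I suggest using Python libraries. An "Edge PDF Document" is an ordinary PDF file that Windows labels that way because Microsoft Edge is set as the default viewer, so the standard PDF tools apply. Here are some commonly used libraries and approaches: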

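Using the pdfminer library:

First, install pdfminer. The maintained package on PyPI is pdfminer.six, which you can install with:
```
pip install pdfminer.six
```
Next, you can use the following code to convert a PDF file to a TXT file. It reads the PDF's embedded text layer, so it works for text-based PDFs but returns little or nothing for scanned pages: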

```
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams  # layout-analysis parameters
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page in the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    outfp = StringIO()  # in-memory buffer that collects the extracted text
    laparams = LAParams()  # default layout-analysis settings
    device = TextConverter(rsrcmgr, outfp, codec='utf-8', laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp, check_extractable=True):
            interpreter.process_page(page)
    text = outfp.getvalue()
    device.close()
    outfp.close()
    return text


pdf_path = 'path/to/pdf/file.pdf'
txt_path = 'path/to/txt/file.txt'
text = convert_pdf_to_txt(pdf_path)
with open(txt_path, 'w', encoding='utf-8') as file:
    file.write(text)
```
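
If you would rather not wire up the converter objects yourself, pdfminer.six also ships a high-level helper, `pdfminer.high_level.extract_text`, which runs the same kind of extraction in a single call. Here is a minimal sketch, assuming a text-based PDF. The function name `pdf_to_txt` is only an illustrative choice, not part of the library:

```python
from pathlib import Path


def pdf_to_txt(pdf_path, txt_path):
    """Extract the text layer of a PDF and write it to a UTF-8 text file.

    `pdf_to_txt` is a hypothetical helper name used for illustration.
    """
    # Imported here so the module still loads if pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    # extract_text opens the file, processes every page and returns one string.
    text = extract_text(str(pdf_path))
    Path(txt_path).write_text(text, encoding='utf-8')
    return text
```

You would call it as `pdf_to_txt('path/to/pdf/file.pdf', 'path/to/txt/file.txt')`. For a scanned PDF the returned string will usually be empty or nearly so, in which case an OCR-based approach is the better fit.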

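Using the pytesseract library:

This approach runs OCR on rendered page images, so it suits scanned PDFs that have no text layer. First, install pytesseract and the Tesseract OCR engine. The code below also imports pdf2image, which in turn needs Poppler installed on your system. The Python packages can be installed with:
```
pip install pytesseract pdf2image
```
You also need to download and install the Tesseract OCR engine, which is available from: https://github.com/tesseract-ocr/tesseract/wiki

Next, you can use the following code to convert a PDF file to a TXT file: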

```
import pytesseract
from pdf2image import convert_from_path


def convert_pdf_to_txt(pdf_path, txt_path):
    # Render each PDF page to a PIL image (pdf2image requires Poppler).
    images = convert_from_path(pdf_path)
    # image_to_string accepts PIL images directly, so no temporary files are needed.
    page_texts = [pytesseract.image_to_string(image) for image in images]
    with open(txt_path, 'w', encoding='utf-8') as file:
        file.write('\n'.join(page_texts))


pdf_path = 'path/to/pdf/file.pdf'
txt_path = 'path/to/txt/file.txt'
convert_pdf_to_txt(pdf_path, txt_path)
```
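
OCR is slow and less accurate than reading an existing text layer, so it can help to try plain text extraction first and fall back to OCR only when that yields almost nothing. The sketch below shows one rough way to do that. The names `needs_ocr` and `ocr_pdf`, the 20-character threshold, and the `lang` and `dpi` defaults are all my own assumptions that you should tune for your documents:

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat the PDF as scanned if extraction found almost no text.

    The threshold of 20 characters is an arbitrary assumption.
    """
    return len(extracted_text.strip()) < min_chars


def ocr_pdf(pdf_path, lang='eng', dpi=300):
    """OCR every page of a PDF and return the recognized text as one string."""
    # Imported here so the heuristic above works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    # A higher dpi gives Tesseract more pixels per character, at the cost of speed.
    pages = convert_from_path(pdf_path, dpi=dpi)
    # `lang` must name a Tesseract language pack that is installed locally.
    return '\n'.join(pytesseract.image_to_string(page, lang=lang) for page in pages)
```

For example, you could extract text with pdfminer first and call `ocr_pdf(pdf_path)` only when `needs_ocr(text)` returns `True`. For Chinese documents you would likely pass something like `lang='chi_sim'`, provided that language pack is installed.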

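Using the python-docx library:

First, install python-docx. The code below also relies on pdfminer.six for the text extraction step. You can install python-docx with:
```
pip install python-docx
```

Next, you can use the following code to convert a PDF file to a docx file. Note that only the text is carried over; the original layout, images and tables are not reproduced: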

```
from io import StringIO

from docx import Document
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams  # layout-analysis parameters
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_docx(pdf_path, docx_path):
    # Step 1: extract the text layer with pdfminer.
    rsrcmgr = PDFResourceManager()
    outfp = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, outfp, codec='utf-8', laparams=laparams)
    with open(pdf_path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp, check_extractable=True):
            interpreter.process_page(page)
    text = outfp.getvalue()
    device.close()
    outfp.close()

    # Step 2: write the text into a new Word document.
    doc = Document()
    # pdfminer ends each page with a form feed ('\f'). That character is not
    # valid in XML, so passing it to python-docx can raise an error; split on
    # it and add one paragraph per page instead.
    for page_text in text.split('\f'):
        if page_text.strip():
            doc.add_paragraph(page_text.strip())
    doc.save(docx_path)


pdf_path = 'path/to/pdf/file.pdf'
docx_path = 'path/to/docx/file.docx'
convert_pdf_to_docx(pdf_path, docx_path)
```
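
The conversion above leaves the text in very large paragraphs. If you want something closer to the original structure, one option is to split the extracted text into pages and paragraphs before writing it out. The following is a rough sketch under a few assumptions: pages are separated by form feeds (which is what pdfminer's text output uses), paragraphs are separated by blank lines, and the helper names `split_pages_into_paragraphs` and `write_docx` are hypothetical:

```python
import re

# Control characters that XML 1.0 does not allow (tab, newline and carriage
# return are fine). python-docx stores text as XML, so these are stripped.
_XML_INVALID = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def split_pages_into_paragraphs(text):
    """Return a list of pages, each page being a list of paragraph strings."""
    pages = []
    for page_text in text.split('\f'):  # assumed page separator
        paragraphs = []
        for block in re.split(r'\n\s*\n', page_text):  # blank line = new paragraph
            cleaned = _XML_INVALID.sub('', block)
            # Join hard-wrapped lines inside a paragraph into one line.
            cleaned = ' '.join(line.strip() for line in cleaned.splitlines() if line.strip())
            if cleaned:
                paragraphs.append(cleaned)
        if paragraphs:
            pages.append(paragraphs)
    return pages


def write_docx(pages, docx_path):
    """Write the pages to a .docx file, starting each PDF page on a new page."""
    from docx import Document  # imported here so the splitter works without python-docx

    doc = Document()
    for index, paragraphs in enumerate(pages):
        if index > 0:
            doc.add_page_break()
        for paragraph in paragraphs:
            doc.add_paragraph(paragraph)
    doc.save(docx_path)
```

You would feed it the string returned by the extraction step, for example `write_docx(split_pages_into_paragraphs(text), docx_path)`. Joining wrapped lines with a space works for English text; for Chinese text you would probably join with an empty string instead.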

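Please note that the paths in the code above are placeholders; change them to the actual location of your PDF file and to where you want the output files written.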
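
To avoid editing several path strings by hand, you could derive the output paths from the PDF path. This is only a small sketch using the standard library's `pathlib`. The helper name `output_paths` and the choice to write next to the PDF by default are my assumptions:

```python
from pathlib import Path


def output_paths(pdf_path, out_dir=None):
    """Return (txt_path, docx_path) for a given PDF, creating out_dir if needed.

    `output_paths` is a hypothetical helper, not part of any library.
    """
    pdf = Path(pdf_path)
    if pdf.suffix.lower() != '.pdf':
        raise ValueError(f'expected a .pdf file, got: {pdf.name}')
    if not pdf.is_file():
        raise FileNotFoundError(f'PDF not found: {pdf}')
    # Assumed default: write the outputs next to the source PDF.
    target_dir = Path(out_dir) if out_dir else pdf.parent
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / (pdf.stem + '.txt'), target_dir / (pdf.stem + '.docx')
```

With this, `txt_path, docx_path = output_paths('path/to/pdf/file.pdf')` gives you both output locations, and a mistyped input path fails early with a clear error instead of partway through the conversion.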