AI Study Notes: pdf-document-layout-analysis

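I have been studying AI for a while but never had an unbroken stretch of time to try things out. Now that I am finally unemployed, I have plenty of it to get hands-on.

The PC I set aside earlier, an I5-1400F with an RTX 3060 12G, is finally being put to use.

I have worked in wireless communications all along, so I got hold of a very long PDF on the areas where AI and communications might be combined.

I decided to start from that document.

The first project I found was this one: https://huggingface.co/HURIDOCS/pdf-document-layout-analysis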

PDF Document Layout Analysis

Models for extracting segments alongside with their types from a PDF

In this model card, we are providing the non-visual models we use in our pdf-document-layout-analysis service:

https://github.com/huridocs/pdf-document-layout-analysis

This service allows for the segmentation and classification of different parts of PDF pages, identifying the elements such as texts, titles, pictures, tables and so on. Additionally, it determines the correct order of these identified elements.

Even though everything from installation to use comes down to just this:

Quick Start

Clone the service:

git clone https://github.com/huridocs/pdf-document-layout-analysis.git
cd pdf-document-layout-analysis

Start the service:

make start

Get the segments of a PDF:

# With visual models
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060

# With non-visual models [with the models in this model card]
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' -F "fast=true" localhost:5060

To stop the server:

make stop

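In practice, though, it took me two full consecutive days to reach the last step.

For 90% of that I have our network to thank, of course.

bash, apt, docker and whatever else all have to go through a proxy to get past the firewall. I will not go into that here.

Sigh. It wastes far too much of everyone's energy.

There were too many proxy setups involved to write them all down here; you get the idea.

After that, just follow the instructions above.

There are actually quite a few things to watch out for. For example, you need to install CUDA and the Docker toolkit (the NVIDIA Container Toolkit), that is, the bridge that lets Docker containers reach the GPU.

The Dockerfile also has to be changed: add the proxy right after FROM, for example: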

FROM pytorch/pytorch:2.3.1-cuda11.8-cudnn8-runtime

ENV http_proxy=http://192.168.1.7:8089

ENV https_proxy=http://192.168.1.7:8089

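Nothing else needs changing.

In practice, though, I moved forward a little at a time.

That is, I first had the Dockerfile run only some of its lines.

This is not strictly necessary.

The hardest part is the last step, in the docker-compose file: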

services:
  pdf-document-layout-analysis:
    container_name: pdf-document-layout-analysis
    entrypoint: [ "gunicorn", "-k", "uvicorn.workers.UvicornWorker", "--chdir", "./src", "app:app", "--bind", "0.0.0.0:5060", "--timeout", "10000"]
    init: true
    restart: unless-stopped
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "5060:5060"

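I changed the entrypoint line, originally

entrypoint: [ "gunicorn", "-k", "uvicorn.workers.UvicornWorker", "--chdir", "./src", "app:app", "--bind", "0.0.0.0:5060", "--timeout", "10000"]

so that the container simply runs

tail -f /dev/null

Then I started the container and redid the remaining steps by hand, one at a time: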

python src/download_models.py

and

starting the service by hand:

gunicorn -k uvicorn.workers.UvicornWorker --chdir ./src app:app --bind 0.0.0.0:5060 --timeout 10000

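In fact, every line of the Dockerfile can be a struggle; whether it goes through depends on how good the network is at that moment.

So be sure to edit docker-compose to map a directory on the host into the container, which saves downloading the same things over and over.

First do whatever it takes to get the container to start, then work through the steps by hand, one by one.

Take this step, for example: COPY ./models/. ./models/

It is actually quite confusing.

That is because after

git clone https://github.com/huridocs/pdf-document-layout-analysis

there is no such directory, so you have to create an empty one by hand and leave it there.

But why is it done this way? Because of this step: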

python src/download_models.py

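may need to download a lot of data, and that only needs to happen once.

In other words, the Python script does not have to be run inside the container.

If you download the files by hand and put them in that directory, the Dockerfile build goes much faster.

src/download_models.py is reasonably smart about this: once a download has succeeded, it does not fetch it again. That is a nice touch.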
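
The skip-if-already-downloaded behaviour described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the actual code of src/download_models.py: ensure_file, fetch and the .part suffix are hypothetical names, and fetch stands for whatever function really pulls the file.

```python
from pathlib import Path
from typing import Callable


def ensure_file(dest: Path, fetch: Callable[[Path], None]) -> bool:
    """Fetch `dest` only if it is missing; return True if a download ran."""
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already there from an earlier run, nothing to do
    # Create the target directory if needed (e.g. the empty ./models directory).
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    fetch(tmp)         # hypothetical downloader writes to a temporary name
    tmp.replace(dest)  # rename only after the fetch finished without raising
    return True
```

Writing to a temporary name first means a download cut off halfway is not mistaken for a finished file on the next run.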

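The last step, the container's entrypoint, can be run by hand once the container is up.

I have already adjusted the command for you here; just note that it has to be run from /app inside the container.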

gunicorn -k uvicorn.workers.UvicornWorker --chdir ./src app:app --bind 0.0.0.0:5060 --timeout 10000

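This step also has to download a lot of data.

That is why running each step by hand matters: our network being what it is, the connection can drop at any moment.

And having to rerun make start from scratch every time gets very painful.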
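
When the connection keeps dropping, one way to avoid babysitting each step is a small retry wrapper. The sketch below is my own assumption about how that might look, not something the project ships; run_with_retries and its defaults are made-up names and values.

```python
import subprocess
import time
from typing import Sequence


def run_with_retries(cmd: Sequence[str], attempts: int = 5, delay: float = 10.0) -> int:
    """Run `cmd` until it exits with status 0 or `attempts` runs have failed.

    Returns the number of runs it took; raises RuntimeError if every run fails.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(list(cmd))
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)  # give the network a moment before trying again
    raise RuntimeError(f"{cmd!r} still failing after {attempts} attempts")
```

Inside the container, from /app, it could be used as run_with_retries(["python", "src/download_models.py"]). Because the download script skips what it already has, each retry only needs to fetch what is still missing.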

Once the service is up, using it is simple, for example:

curl -X POST -F 'file=@/app/tests/6G_phy_AI_keypoint.pdf' localhost:5060

curl -X POST -F 'file=@/app/tests/test.pdf' localhost:5060

curl -X POST -F file='@/app/tests/test.pdf' -F "fast=true" localhost:5060

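Note that commands like the ones above are run on the host. So be sure to remove the proxy settings from the host's current bash session, or find some way to add an exception; otherwise requests to localhost get sent through the proxy too. That was of course my own misconfiguration: I should have set up something like no_proxy.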
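
As an alternative to unsetting the proxy in the shell, the same request can be sent from Python with proxies explicitly ignored. This is a minimal sketch using only the standard library; build_multipart and analyze_pdf are names I made up, and the form fields simply mirror the curl commands above (file, plus fast=true for the non-visual models).

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(pdf_name: str, pdf_bytes: bytes, fast: bool = False):
    """Encode the form fields the curl commands send: file=@... and fast=true."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{pdf_name}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    body = head + pdf_bytes + b"\r\n"
    if fast:
        body += (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="fast"\r\n\r\n'
            "true\r\n"
        ).encode()
    body += f"--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"


def analyze_pdf(pdf_path: str, url: str = "http://localhost:5060",
                fast: bool = False, timeout: float = 600.0) -> str:
    """POST a PDF to the service and return the raw response text."""
    path = Path(pdf_path)
    body, content_type = build_multipart(path.name, path.read_bytes(), fast)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # An empty ProxyHandler makes urllib ignore http_proxy/https_proxy,
    # so the request goes straight to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A call such as analyze_pdf("/PATH/TO/PDF/pdf_name.pdf", fast=True) corresponds to the curl command with -F "fast=true"; it needs the service to be listening on localhost:5060.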

The logs from the hardest steps are here:

https://download.csdn.net/download/haoyujie/89639546

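They may not be of much use.

As for what the project actually does, I am still working through it. Its dependencies pull in libraries from many sources. The business-logic part comes from Alibaba, which I admit makes me a little proud: we in China are doing all right. The foundational pieces come from many places: Facebook, MS, and so on.

I did not learn much, but I did get it running, and it is fast. This is only on an ordinary 3060 card with 12G.

What I did get a feel for is how an application product is put together.

I also tried installing it directly on the host, without docker, and did not succeed. The torch library may have been too new. As for docker, how to put it: I love it and hate it. My advice is to avoid it when you can, but sometimes there really is no other way. Call it half and half.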

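To summarize:

1. Everything needs a proxy.

2. Remember to install CUDA and the bridge between Docker containers and the GPU.

3. The proxy has to be specified in the Dockerfile. This step is very important.

4. Change the entrypoint in docker-compose to something that is sure to succeed and then just hangs there.

5. Something I forgot above: you need the -u option to log in as root before you can install software such as sudo or vim. You need those because, if you do not want the download script to do its work, you have to comment out the body of its __main__, and once the container is up you have to restore it. I am not fluent enough in sed for that.

6. Setup, download, and starting the web service: see the log mentioned above.

Now try it out. Turn the proxy off before you do.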
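
On point 5, a possible way to avoid commenting the __main__ body out and back in would be to let an environment variable decide whether the download runs. This is purely a sketch of that idea: SKIP_MODEL_DOWNLOAD and download_enabled are names I invented, and nothing here is taken from the project's code.

```python
import os
from typing import Mapping


def download_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return False when SKIP_MODEL_DOWNLOAD (a made-up variable) is 1, true or yes."""
    value = environ.get("SKIP_MODEL_DOWNLOAD", "").strip().lower()
    return value not in {"1", "true", "yes"}


# Hypothetical use at the bottom of a download script:
#
# if __name__ == "__main__" and download_enabled():
#     ...  # whatever the script currently does in its __main__ block
```

With a guard like this, the build could set the variable to skip the download, and running the script later without it would behave as before.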