Deploying the Qwen2 Model with Docker in an Intranet Environment

Background

In a state-owned-enterprise project I worked on, we built several applications on top of large language models, but every resource the client provides sits on a fully isolated intranet. What made it more interesting is that the client already runs its own Baidu Machine Learning (BML) platform and required all LLM deployments to run on that existing platform rather than on newly built infrastructure, and resource requests were approved very sparingly. Under these constraints, our only practical option was to deploy the large language model with Docker containers.

Preparation

1. First, install Docker in the intranet environment:

I won't repeat that part here; see the tutorial I wrote earlier:

https://zyn1994.blog.csdn.net/article/details/109516191

2. Next, on a machine with internet access, pull the Ollama base image:

bash
docker pull ollama/ollama:latest
# If the pull fails, use this mirror instead
docker pull dhub.kubesre.xyz/ollama/ollama:latest
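
Since the intranet machine cannot reach any registry, one common approach is to export the pulled image to a tar archive and carry it across; a minimal sketch (the file name ollama_latest.tar is just an example):

bash
# On the machine with internet access: export the image to a tar archive
docker save -o ollama_latest.tar ollama/ollama:latest
# Copy ollama_latest.tar to the intranet machine (USB drive, internal file share, etc.)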

3. Download the Qwen2 GGUF model. For the sake of demonstration, this walkthrough uses the 0.5B model.

Download it from https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF or https://modelscope.cn/models/qwen/Qwen2-0.5B-Instruct-GGUF
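
If you prefer a command-line download on the connected machine, the Hugging Face CLI is one option (just an alternative; a browser download works equally well); a sketch using the file name referenced later in this post:

bash
# Requires: pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen2-0.5B-Instruct-GGUF \
    qwen2-0_5b-instruct-q4_0.gguf --local-dir .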

4. Write the Modelfile:

bash
# Note: the GGUF model file path must match the path used in the Dockerfile
FROM /tmp/qwen2-0_5b-instruct-q4_0.gguf
TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

You should now have the GGUF model file and the Modelfile:

bash
-rw-r--r--. 1 root root       290 Jun 21 14:00 Modelfile
-rw-r--r--. 1 root root 352969408 Jun 21 13:44 qwen2-0_5b-instruct-q4_0.gguf

Building the Image

1. Import the previously pulled base image on the intranet machine.
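
A minimal sketch of the import step, assuming the image was exported to ollama_latest.tar as above; docker load keeps the original repository:tag, so retag it if it does not match the FROM line used below:

bash
docker load -i ollama_latest.tar
# Retag so the name matches the Dockerfile's "FROM ollama:latest"
docker tag ollama/ollama:latest ollama:latest

Then write the Dockerfile: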

bash
FROM ollama:latest
EXPOSE 11434

ADD Modelfile /tmp/Modelfile
ADD qwen2-0_5b-instruct-q4_0.gguf /tmp/qwen2-0_5b-instruct-q4_0.gguf

ENTRYPOINT ["sh","-c","/bin/ollama serve"]

2. Build the Docker image by running docker build -t ollama_qwen2-0_5b:1.0 -f Dockerfile . (mind the trailing dot):

bash
(base) [root@localhost docker-qwen2]# docker build -t ollama_qwen2-0_5b:1.0 -f Dockerfile .
[+] Building 1.7s (8/8) FINISHED                                                                                                                                                docker:default
 => [internal] load .dockerignore                                                                                                                                                         0.4s
 => => transferring context: 2B                                                                                                                                                           0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                      0.5s
 => => transferring dockerfile: 303B                                                                                                                                                      0.0s
 => [internal] load metadata for docker.io/library/ollama:latest                                                                                                                          0.0s
 => [1/3] FROM docker.io/library/ollama:latest                                                                                                                                            0.0s
 => [internal] load build context                                                                                                                                                         0.1s
 => => transferring context: 201B                                                                                                                                                         0.0s
 => CACHED [2/3] ADD Modelfile /tmp/Modelfile                                                                                                                                             0.0s
 => CACHED [3/3] ADD qwen2-0_5b-instruct-q4_0.gguf /tmp/qwen2-0_5b-instruct-q4_0.gguf                                                                                                     0.0s
 => exporting to image                                                                                                                                                                    0.1s
 => => exporting layers                                                                                                                                                                   0.0s
 => => writing image sha256:a6a949928f9bffffe1fbc5ee2c1002bd76afd9a9579dc10c6598faebb57a4885                                                                                              0.0s
 => => naming to docker.io/library/ollama_qwen2-0_5b:1.0 

Running the Container

1. Create and start the container by running docker run -itd --name ollama_qwen2 -p 11434:11434 ollama_qwen2-0_5b:1.0

bash
(base) [root@localhost docker-qwen2]# docker run -itd --name ollama_qwen2 -p 11434:11434 ollama_qwen2-0_5b:1.0
b034390bf79ceca1ec67bb4f9898c930c2a6efe8260bb8ba0fcbe5ffd2634f1a
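
Optionally, and not part of the setup above: if you want the model created in the next section to survive container re-creation, or the host has NVIDIA GPUs with the NVIDIA Container Toolkit installed, the run command can be extended; a sketch with an example host path:

bash
# -v persists Ollama's model store on the host (example path);
# --gpus=all only works if the NVIDIA Container Toolkit is installed
docker run -itd --name ollama_qwen2 \
    -p 11434:11434 \
    -v /data/ollama:/root/.ollama \
    --gpus=all \
    ollama_qwen2-0_5b:1.0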

2. Verify that the container started successfully:

bash
(base) [root@localhost docker-qwen2]# docker ps -a
CONTAINER ID   IMAGE                     COMMAND                  CREATED         STATUS            PORTS                                             NAMES
0341f573e41f   ollama_qwen2-0_5b:1.0     "sh -c '/bin/ollama ..."   6 seconds ago   Up 3 seconds      0.0.0.0:11434->11434/tcp, :::11434->11434/tcp     ollama_qwen2

At this point, the Dockerized Ollama service is deployed. However, it is not serving any model yet; we still need to enter the container and create and load one.
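
You can also confirm from the Docker host that the service responds before any model is loaded; Ollama's root endpoint simply returns a status string:

bash
curl http://localhost:11434/
# Expected output: Ollama is running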

Loading the Qwen2 Model

1. First, enter the container we just started:

bash
docker exec -it 0341f573e41f /bin/bash

2. Run the ollama create command to create and load the Qwen2 model:

bash
root@0341f573e41f:/# ollama create qwen:0.5b -f /tmp/Modelfile
transferring model data
using existing layer sha256:aca679832ded61145239ce7f5c5ebddb1c57ada786c9c23733899c3888e0596f
creating new layer sha256:62fbfd9ed093d6e5ac83190c86eec5369317919f4b149598d2dbb38900e9faef
creating new layer sha256:f02dd72bb2423204352eabc5637b44d79d17f109fdb510a7c51455892aa2d216
creating new layer sha256:a82ce90cbc26a4c59e0985d2bceffa1d2616a1579218ff3cb656a3252cafdcc0
writing manifest
success

3. The output above shows that the Qwen2 model has been successfully created in Ollama. You can run a quick check (see below), then type exit to leave the container.
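
A quick sanity check from inside the container might look like this:

bash
ollama list                   # the newly created qwen:0.5b should be listed
ollama run qwen:0.5b "你好"    # one-off prompt; the command exits after the reply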

Q&A Test

1. Using the open-ai-java framework I wrote myself, the code for calling the Ollama service looks like this:

bash
# Client code
public static void test0(){
    OpenAIChat openAIChat = OpenAIChat.builder()
            .endpointUrl("http://10.8.xxx.xxx:11434/v1")
            .model("qwen:0.5b")
            .build().init();
    String stringFlux = openAIChat.chat("0dbe1580-60ae-4440-9462-df0a8f629f2c","你好");
    System.out.println(stringFlux);
}

# Response log in IDEA
17:05:32.370 [main] INFO com.xxx.openai.llms.OpenAIChat - OpenAI 请求参数: {top_p=0.78, max_tokens=20000, temperature=0.9, messages=[Message(role=user, content=你好)], model=qwen:0.5b}
17:05:34.547 [main] INFO com.xxx.openai.llms.OpenAIChat - OpenAI 处理成功 响应结果为:
您好!有什么我可以帮助您的吗?
{"id":"chatcmpl-675","object":"chat.completion","created":1718960749,"model":"qwen:0.5b","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"您好!有什么我可以帮助您的吗?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":9,"completion_tokens":9,"total_tokens":18}}

2. Checking the Docker container's logs shows the service is running well:

bash
(base) [root@localhost docker-qwen2]# docker logs -f 0341f573e41f
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAID82w1ycq7Ej68YHdCJHVPh4xOy09uzzzCk2c9hvJcvg

2024/06/21 09:00:37 routes.go:1060: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-06-21T09:00:37.756Z level=INFO source=images.go:725 msg="total blobs: 0"
time=2024-06-21T09:00:37.756Z level=INFO source=images.go:732 msg="total unused blobs removed: 0"
time=2024-06-21T09:00:37.756Z level=INFO source=routes.go:1106 msg="Listening on [::]:11434 (version 0.1.45)"
time=2024-06-21T09:00:37.757Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1087015210/runners
time=2024-06-21T09:00:40.123Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60101]"
time=2024-06-21T09:00:40.125Z level=INFO source=types.go:98 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.6 GiB" available="41.1 GiB"
[GIN] 2024/06/21 - 09:02:27 | 404 |    2.961658ms |     10.8.10.196 | POST     "/v1/chat/completions"
[GIN] 2024/06/21 - 09:04:31 | 200 |      20.105µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/21 - 09:05:05 | 200 |      13.876µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/21 - 09:05:06 | 201 |  854.761714ms |       127.0.0.1 | POST     "/api/blobs/sha256:aca679832ded61145239ce7f5c5ebddb1c57ada786c9c23733899c3888e0596f"
[GIN] 2024/06/21 - 09:05:08 | 200 |   1.59888621s |       127.0.0.1 | POST     "/api/create"
time=2024-06-21T09:05:48.013Z level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[41.1 GiB]" memory.required.full="662.7 MiB" memory.required.partial="0 B" memory.required.kv="24.0 MiB" memory.required.allocations="[662.7 MiB]" memory.weights.total="217.0 MiB" memory.weights.repeating="79.1 MiB" memory.weights.nonrepeating="137.9 MiB" memory.graph.full="298.5 MiB" memory.graph.partial="405.0 MiB"
time=2024-06-21T09:05:48.013Z level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama1087015210/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-aca679832ded61145239ce7f5c5ebddb1c57ada786c9c23733899c3888e0596f --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 36844"
time=2024-06-21T09:05:48.047Z level=INFO source=sched.go:382 msg="loaded runners" count=1
time=2024-06-21T09:05:48.047Z level=INFO source=server.go:547 msg="waiting for llama runner to start responding"
time=2024-06-21T09:05:48.048Z level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="7c26775" tid="139630217590656" timestamp=1718960748
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139630217590656" timestamp=1718960748 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="36844" tid="139630217590656" timestamp=1718960748
llama_model_loader: loaded meta data with 26 key-value pairs and 290 tensors from /root/.ollama/models/blobs/sha256-aca679832ded61145239ce7f5c5ebddb1c57ada786c9c23733899c3888e0596f (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = qwen2-0_5b-instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2024-06-21T09:05:48.299Z level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = ../Qwen2/gguf/qwen2-0_5b-imatrix/imat...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = ../sft_2406.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 168
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 1937
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  165 tensors
llama_model_loader: - type q4_1:    3 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 0.9338 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 494.03 M
llm_load_print_meta: model size       = 330.95 MiB (5.62 BPW)
llm_load_print_meta: general.name     = qwen2-0_5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =   330.95 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    24.00 MiB
llama_new_context_with_model: KV self size  =   24.00 MiB, K (f16):   12.00 MiB, V (f16):   12.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   298.50 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="139630217590656" timestamp=1718960748
time=2024-06-21T09:05:48.801Z level=INFO source=server.go:590 msg="llama runner started in 0.75 seconds"
[GIN] 2024/06/21 - 09:05:49 | 200 |  1.930408217s |     10.8.10.196 | POST     "/v1/chat/completions"
