【AI】RTX2060 6G Ubuntu 22.04.1 LTS (Jammy Jellyfish) 部署Chinese-LLaMA-Alpaca-2

下载源码

cd ~/Downloads/ai
git clone --depth=1 https://gitee.com/ymcui/Chinese-LLaMA-Alpaca-2

创建venv

python3 -m venv venv
source venv/bin/activate

安装依赖

 pip install -r requirements.txt

已安装依赖列表

(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ pip list
Package                  Version
------------------------ ----------
accelerate               0.26.1
bitsandbytes             0.41.1
certifi                  2023.11.17
charset-normalizer       3.3.2
cmake                    3.28.1
filelock                 3.13.1
fsspec                   2023.12.2
huggingface-hub          0.17.3
idna                     3.6
Jinja2                   3.1.3
lit                      17.0.6
MarkupSafe               2.1.3
mpmath                   1.3.0
networkx                 3.2.1
numpy                    1.26.3
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
packaging                23.2
peft                     0.3.0
pip                      22.0.2
psutil                   5.9.7
PyYAML                   6.0.1
regex                    2023.12.25
requests                 2.31.0
safetensors              0.4.1
sentencepiece            0.1.99
setuptools               59.6.0
sympy                    1.12
tokenizers               0.14.1
torch                    2.0.1
tqdm                     4.66.1
transformers             4.35.0
triton                   2.0.0
typing_extensions        4.9.0
urllib3                  2.1.0
wheel                    0.42.0

下载编译llama.cpp

cd ~/Downloads/ai/
git clone --depth=1 https://gh.api.99988866.xyz/https://github.com/ggerganov/llama.cpp
cd llma.cpp
make -j6

编译成功

创建软链接

cd ~/Downloads/ai/Chinese-LLaMA-Alpaca-2/scripts/llama-cpp/
ln -s ~/Downloads/ai/llama.cpp/main .

下载模型

由于只有6G显存,只下载基础的对话模型chinese-alpaca-2-1.3b

浏览器打开地址:hfl/chinese-alpaca-2-1.3b at main

放到~/Downloads/ai 目录下

启动chat报错

继续折腾:

这两个文件需要手动在浏览器内下载到~/Downloads/ai/chinese-alpaca-2-1.3b

参考文档

转换模型

rm models/ -rf
mkdir models
cp ~/Downloads/ai/chinese-alpaca-2-1.3b models/ -v
python ~/Downloads/ai/llama.cpp/convert.py models/chinese-alpaca-2-1.3b/

转换日志

(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ python ~/Downloads/ai/llama.cpp/convert.py models/chinese-alpaca-2-1.3b/
/home/yeqiang/Downloads/ai/llama.cpp/gguf-py
Loading model file models/chinese-alpaca-2-1.3b/pytorch_model.bin
params = Params(n_vocab=55296, n_embd=4096, n_layer=4, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-05, n_experts=None, n_experts_used=None, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('models/chinese-alpaca-2-1.3b'))
Loading vocab file 'models/chinese-alpaca-2-1.3b/tokenizer.model', type 'spm'
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
model.embed_tokens.weight                        -> token_embd.weight                        | F16    | [55296, 4096]
model.layers.0.self_attn.q_proj.weight           -> blk.0.attn_q.weight                      | F16    | [4096, 4096]
model.layers.0.self_attn.k_proj.weight           -> blk.0.attn_k.weight                      | F16    | [4096, 4096]
model.layers.0.self_attn.v_proj.weight           -> blk.0.attn_v.weight                      | F16    | [4096, 4096]
model.layers.0.self_attn.o_proj.weight           -> blk.0.attn_output.weight                 | F16    | [4096, 4096]
skipping tensor blk.0.attn_rot_embd
model.layers.0.mlp.gate_proj.weight              -> blk.0.ffn_gate.weight                    | F16    | [11008, 4096]
model.layers.0.mlp.up_proj.weight                -> blk.0.ffn_up.weight                      | F16    | [11008, 4096]
model.layers.0.mlp.down_proj.weight              -> blk.0.ffn_down.weight                    | F16    | [4096, 11008]
model.layers.0.input_layernorm.weight            -> blk.0.attn_norm.weight                   | F16    | [4096]
model.layers.0.post_attention_layernorm.weight   -> blk.0.ffn_norm.weight                    | F16    | [4096]
model.layers.1.self_attn.q_proj.weight           -> blk.1.attn_q.weight                      | F16    | [4096, 4096]
model.layers.1.self_attn.k_proj.weight           -> blk.1.attn_k.weight                      | F16    | [4096, 4096]
model.layers.1.self_attn.v_proj.weight           -> blk.1.attn_v.weight                      | F16    | [4096, 4096]
model.layers.1.self_attn.o_proj.weight           -> blk.1.attn_output.weight                 | F16    | [4096, 4096]
skipping tensor blk.1.attn_rot_embd
model.layers.1.mlp.gate_proj.weight              -> blk.1.ffn_gate.weight                    | F16    | [11008, 4096]
model.layers.1.mlp.up_proj.weight                -> blk.1.ffn_up.weight                      | F16    | [11008, 4096]
model.layers.1.mlp.down_proj.weight              -> blk.1.ffn_down.weight                    | F16    | [4096, 11008]
model.layers.1.input_layernorm.weight            -> blk.1.attn_norm.weight                   | F16    | [4096]
model.layers.1.post_attention_layernorm.weight   -> blk.1.ffn_norm.weight                    | F16    | [4096]
model.layers.2.self_attn.q_proj.weight           -> blk.2.attn_q.weight                      | F16    | [4096, 4096]
model.layers.2.self_attn.k_proj.weight           -> blk.2.attn_k.weight                      | F16    | [4096, 4096]
model.layers.2.self_attn.v_proj.weight           -> blk.2.attn_v.weight                      | F16    | [4096, 4096]
model.layers.2.self_attn.o_proj.weight           -> blk.2.attn_output.weight                 | F16    | [4096, 4096]
skipping tensor blk.2.attn_rot_embd
model.layers.2.mlp.gate_proj.weight              -> blk.2.ffn_gate.weight                    | F16    | [11008, 4096]
model.layers.2.mlp.up_proj.weight                -> blk.2.ffn_up.weight                      | F16    | [11008, 4096]
model.layers.2.mlp.down_proj.weight              -> blk.2.ffn_down.weight                    | F16    | [4096, 11008]
model.layers.2.input_layernorm.weight            -> blk.2.attn_norm.weight                   | F16    | [4096]
model.layers.2.post_attention_layernorm.weight   -> blk.2.ffn_norm.weight                    | F16    | [4096]
model.layers.3.self_attn.q_proj.weight           -> blk.3.attn_q.weight                      | F16    | [4096, 4096]
model.layers.3.self_attn.k_proj.weight           -> blk.3.attn_k.weight                      | F16    | [4096, 4096]
model.layers.3.self_attn.v_proj.weight           -> blk.3.attn_v.weight                      | F16    | [4096, 4096]
model.layers.3.self_attn.o_proj.weight           -> blk.3.attn_output.weight                 | F16    | [4096, 4096]
skipping tensor blk.3.attn_rot_embd
model.layers.3.mlp.gate_proj.weight              -> blk.3.ffn_gate.weight                    | F16    | [11008, 4096]
model.layers.3.mlp.up_proj.weight                -> blk.3.ffn_up.weight                      | F16    | [11008, 4096]
model.layers.3.mlp.down_proj.weight              -> blk.3.ffn_down.weight                    | F16    | [4096, 11008]
model.layers.3.input_layernorm.weight            -> blk.3.attn_norm.weight                   | F16    | [4096]
model.layers.3.post_attention_layernorm.weight   -> blk.3.ffn_norm.weight                    | F16    | [4096]
model.norm.weight                                -> output_norm.weight                       | F16    | [4096]
lm_head.weight                                   -> output.weight                            | F16    | [55296, 4096]
Writing models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf, format 1
Ignoring added_tokens.json since model matches vocab size without it.
gguf: This GGUF file is for Little Endian only
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type pad to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
[ 1/39] Writing tensor token_embd.weight                      | size  55296 x   4096  | type F16  | T+   1
[ 2/39] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   1
[ 3/39] Writing tensor blk.0.attn_k.weight                    | size   4096 x   4096  | type F16  | T+   1
[ 4/39] Writing tensor blk.0.attn_v.weight                    | size   4096 x   4096  | type F16  | T+   1
[ 5/39] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   1
[ 6/39] Writing tensor blk.0.ffn_gate.weight                  | size  11008 x   4096  | type F16  | T+   1
[ 7/39] Writing tensor blk.0.ffn_up.weight                    | size  11008 x   4096  | type F16  | T+   1
[ 8/39] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  11008  | type F16  | T+   1
[ 9/39] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   2
[10/39] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   2
[11/39] Writing tensor blk.1.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   2
[12/39] Writing tensor blk.1.attn_k.weight                    | size   4096 x   4096  | type F16  | T+   2
[13/39] Writing tensor blk.1.attn_v.weight                    | size   4096 x   4096  | type F16  | T+   2
[14/39] Writing tensor blk.1.attn_output.weight               | size   4096 x   4096  | type F16  | T+   2
[15/39] Writing tensor blk.1.ffn_gate.weight                  | size  11008 x   4096  | type F16  | T+   2
[16/39] Writing tensor blk.1.ffn_up.weight                    | size  11008 x   4096  | type F16  | T+   2
[17/39] Writing tensor blk.1.ffn_down.weight                  | size   4096 x  11008  | type F16  | T+   2
[18/39] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   2
[19/39] Writing tensor blk.1.ffn_norm.weight                  | size   4096           | type F32  | T+   2
[20/39] Writing tensor blk.2.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   2
[21/39] Writing tensor blk.2.attn_k.weight                    | size   4096 x   4096  | type F16  | T+   2
[22/39] Writing tensor blk.2.attn_v.weight                    | size   4096 x   4096  | type F16  | T+   2
[23/39] Writing tensor blk.2.attn_output.weight               | size   4096 x   4096  | type F16  | T+   2
[24/39] Writing tensor blk.2.ffn_gate.weight                  | size  11008 x   4096  | type F16  | T+   2
[25/39] Writing tensor blk.2.ffn_up.weight                    | size  11008 x   4096  | type F16  | T+   2
[26/39] Writing tensor blk.2.ffn_down.weight                  | size   4096 x  11008  | type F16  | T+   2
[27/39] Writing tensor blk.2.attn_norm.weight                 | size   4096           | type F32  | T+   2
[28/39] Writing tensor blk.2.ffn_norm.weight                  | size   4096           | type F32  | T+   2
[29/39] Writing tensor blk.3.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   2
[30/39] Writing tensor blk.3.attn_k.weight                    | size   4096 x   4096  | type F16  | T+   2
[31/39] Writing tensor blk.3.attn_v.weight                    | size   4096 x   4096  | type F16  | T+   2
[32/39] Writing tensor blk.3.attn_output.weight               | size   4096 x   4096  | type F16  | T+   2
[33/39] Writing tensor blk.3.ffn_gate.weight                  | size  11008 x   4096  | type F16  | T+   3
[34/39] Writing tensor blk.3.ffn_up.weight                    | size  11008 x   4096  | type F16  | T+   3
[35/39] Writing tensor blk.3.ffn_down.weight                  | size   4096 x  11008  | type F16  | T+   4
[36/39] Writing tensor blk.3.attn_norm.weight                 | size   4096           | type F32  | T+   4
[37/39] Writing tensor blk.3.ffn_norm.weight                  | size   4096           | type F32  | T+   4
[38/39] Writing tensor output_norm.weight                     | size   4096           | type F32  | T+   4
[39/39] Writing tensor output.weight                          | size  55296 x   4096  | type F16  | T+   4
Wrote models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf
(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ 

进一步对FP16模型进行4-bit量化

~/Downloads/ai/llama.cpp/quantize models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin q4_0

日志

(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ ~/Downloads/ai/llama.cpp/quantize models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin q4_0
main: build = 1 (5c99960)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf' to 'models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin' as Q4_0
llama_model_loader: loaded meta data with 21 key-value pairs and 39 tensors from models/chinese-alpaca-2-1.3b/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 4
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,55296]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,55296]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,55296]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:    9 tensors
llama_model_loader: - type  f16:   30 tensors
llama_model_quantize_internal: meta size = 1233920 bytes
[   1/  39]                    token_embd.weight - [ 4096, 55296,     1,     1], type =    f16, quantizing to q4_0 .. size =   432.00 MiB ->   121.50 MiB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.110 0.116 0.110 0.096 0.077 0.057 0.039 0.026 0.021 
[   2/  39]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.016 0.027 0.040 0.056 0.074 0.092 0.109 0.121 0.110 0.093 0.076 0.058 0.042 0.027 0.021 
[   3/  39]                  blk.0.attn_k.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.035 0.012 0.019 0.031 0.047 0.069 0.097 0.130 0.152 0.130 0.097 0.069 0.047 0.030 0.019 0.015 
[   4/  39]                  blk.0.attn_v.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.097 0.115 0.123 0.115 0.097 0.075 0.054 0.037 0.024 0.020 
[   5/  39]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.035 0.012 0.020 0.032 0.049 0.072 0.099 0.126 0.138 0.126 0.100 0.072 0.049 0.032 0.020 0.017 
[   6/  39]                blk.0.ffn_gate.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[   7/  39]                  blk.0.ffn_up.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021 
[   8/  39]                blk.0.ffn_down.weight - [11008,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[   9/  39]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  10/  39]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  11/  39]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.013 0.021 0.033 0.050 0.072 0.098 0.123 0.137 0.123 0.098 0.072 0.050 0.033 0.021 0.017 
[  12/  39]                  blk.1.attn_k.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.013 0.021 0.033 0.050 0.073 0.098 0.123 0.136 0.123 0.099 0.073 0.051 0.033 0.021 0.017 
[  13/  39]                  blk.1.attn_v.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.097 0.114 0.122 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
[  14/  39]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[  15/  39]                blk.1.ffn_gate.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  16/  39]                  blk.1.ffn_up.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  17/  39]                blk.1.ffn_down.weight - [11008,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  18/  39]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  19/  39]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  20/  39]                  blk.2.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.097 0.116 0.125 0.116 0.097 0.075 0.054 0.037 0.024 0.020 
[  21/  39]                  blk.2.attn_k.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.097 0.116 0.126 0.116 0.097 0.075 0.054 0.037 0.024 0.019 
[  22/  39]                  blk.2.attn_v.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  23/  39]             blk.2.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  24/  39]                blk.2.ffn_gate.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  25/  39]                  blk.2.ffn_up.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[  26/  39]                blk.2.ffn_down.weight - [11008,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[  27/  39]               blk.2.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  28/  39]                blk.2.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  29/  39]                  blk.3.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.121 0.113 0.097 0.076 0.055 0.038 0.025 0.020 
[  30/  39]                  blk.3.attn_k.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
[  31/  39]                  blk.3.attn_v.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  32/  39]             blk.3.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    32.00 MiB ->     9.00 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  33/  39]                blk.3.ffn_gate.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  34/  39]                  blk.3.ffn_up.weight - [ 4096, 11008,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  35/  39]                blk.3.ffn_down.weight - [11008,  4096,     1,     1], type =    f16, quantizing to q4_0 .. size =    86.00 MiB ->    24.19 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[  36/  39]               blk.3.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  37/  39]                blk.3.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  38/  39]                   output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  39/  39]                        output.weight - [ 4096, 55296,     1,     1], type =    f16, quantizing to q6_K .. size =   432.00 MiB ->   177.19 MiB
llama_model_quantize_internal: model size  =  2408.14 MB
llama_model_quantize_internal: quant size  =   733.08 MB
llama_model_quantize_internal: hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.096 0.112 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.021 

main: quantize time =  5131.57 ms
main:    total time =  5131.57 ms

启动chat.sh

mv scripts/llama-cpp/main .
bash scripts/llama-cpp/chat.sh models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin

启动成功了,日志

(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ bash scripts/llama-cpp/chat.sh models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin 
Log start
main: build = 1 (5c99960)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1705481300
llama_model_loader: loaded meta data with 22 key-value pairs and 39 tensors from models/chinese-alpaca-2-1.3b/ggml-model-q4_0.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 4
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,55296]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,55296]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,55296]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:    9 tensors
llama_model_loader: - type q4_0:   29 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 889/55296 vs 259/55296 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 55296
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.26 B
llm_load_print_meta: model size       = 733.08 MiB (4.87 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.01 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/5 layers to GPU
llm_load_tensors:        CPU buffer size =   733.08 MiB
..............................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model: graph splits (measure): 1
llama_new_context_with_model:        CPU compute buffer size =   288.00 MiB

system_info: n_threads = 8 / 6 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Input prefix with BOS
Input prefix: ' [INST] '
Input suffix: ' [/INST]'
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.900, min_p = 0.050, typical_p = 1.000, temp = 0.500
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 [INST] <<SYS>>
You are a helpful assistant. 你是一个乐于助人的助手。
<</SYS>>

 [/INST] 您好,有什么我可以帮助您的吗?
 [INST] 

这是完全基于CPU实现的?

编译llama.cpp项目没有启动cuda?


试试web

参考资料

安装gradio

pip install gradio

报错

git下载模型,报错

手动把之前的模型拷贝进目录

启动gradio

安装xformers

(venv) yeqiang@yeqiang-MS-7B23:~/Downloads/ai/Chinese-LLaMA-Alpaca-2$ pip install xformers scipy

崩溃了。。。

github加速参考:

FAST-GitHub | Fast-GitHub

huggingface加速参考

hfl/chinese-alpaca-2-1.3b at main

相关推荐
tntlbb13 分钟前
Ubuntu20.4 VPN+Docker代理配置
运维·ubuntu·docker·容器
热心市民运维小孙16 分钟前
Ubuntu重命名默认账户
linux·ubuntu·excel
PyAIGCMaster18 分钟前
文本模式下成功。ubuntu P104成功。
服务器·数据库·ubuntu
Jackey_Song_Odd1 小时前
解决Ubuntu下无法装载 Windows D盘的问题
linux·ubuntu
又困又爱睡2 小时前
LLaMA-Factory(一)环境配置及包下载
llama
每天八杯水D2 小时前
LLaMA-Factory GLM4-9B-CHAT LoRA 微调实战
lora·微调·llama·peft·glm4-9b-chat
一只敲代码的猪2 小时前
Llama 3 模型系列解析(一)
大数据·python·llama
猫猫的小茶馆4 小时前
【数据结构】数据结构整体大纲
linux·数据结构·算法·ubuntu·嵌入式软件
shelby_loo4 小时前
使用 Docker 在 Ubuntu 下部署 Cloudflared Tunnel 服务器
服务器·ubuntu·docker
咏颜5 小时前
Ubuntu离线安装Docker容器
linux·运维·服务器·经验分享·ubuntu·docker