Early this morning, the long-awaited Llama 3 was released. I rushed to apply for access first thing; the application is simple (just name, region, and contact details), and approval came through after a short wait. This release includes only four models: 8B, 8B-Instruct, 70B, and 70B-Instruct; larger versions are reportedly still in training and will be released on a later schedule. All four are at least runnable on my machine, so any of them will do for a first look.
The Llama 3 repository
I had originally just pulled the latest changes to the old Llama GitHub repository, but according to the official announcement this release lives in a separate new repository, github.com/meta-llama/llama3.
Running the code
Remember to upgrade transformers first: `pip install transformers --upgrade`.
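If you'd rather fail fast than hit a cryptic loading error, you can assert the version up front. A minimal sketch; I'm assuming Llama 3 support landed in transformers 4.40.0, which matches the release notes as I understand them:

```python
from packaging import version  # pip install packaging if it isn't already present

import transformers

# Llama 3 loading was added in transformers 4.40.0 (assumption noted above);
# older versions fail when resolving the model's tokenizer/config.
assert version.parse(transformers.__version__) >= version.parse("4.40.0"), \
    f"transformers {transformers.__version__} is too old for Llama 3"
```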
8B (Hugging Face, CUDA)
```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # the official example suggests "auto"; running it as-is raised an error asking me to choose, so I picked cuda (testing 70B on CPU was painfully slow)
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat into Llama 3's prompt format.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop on either the default EOS or Llama 3's end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
```

Output: Arrr, shiver me timbers! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas! I be here to swab the decks with me trusty keyboard and answer yer questions, savvy? So hoist the colors and let's set sail fer a swashbucklin' good time, matey!
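One detail from the snippet worth calling out: the terminators list exists because Llama-3-Instruct ends each assistant turn with <|eot_id|> rather than the tokenizer's default <|end_of_text|>, so stopping only on the default EOS tends to run past the answer. A quick check; the ids in the comments are what I'd expect from the Meta-Llama-3 tokenizer, so treat them as assumptions:

```python
tok = pipeline.tokenizer
print(tok.eos_token_id)                         # 128001, i.e. <|end_of_text|>
print(tok.convert_tokens_to_ids("<|eot_id|>"))  # 128009, the end-of-turn marker
```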
70B (llama.cpp)
Recommended: MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF (Hugging Face)
huggingface-cli download MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF --local-dir . --include '*Q5_K_M.gguf'
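If you'd rather fetch it from Python, huggingface_hub's snapshot_download does the same job. A sketch, assuming the quant you want is the single Q5_K_M file used in the runs below:

```python
from huggingface_hub import snapshot_download

# Download only the Q5_K_M quant (~46.5 GiB) into the current directory.
snapshot_download(
    repo_id="MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF",
    local_dir=".",
    allow_patterns=["*Q5_K_M.gguf"],
)
```

Then run the quantized model on CPU: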
```shell
$ llama.cpp/main -m Meta-Llama-3-70B-Instruct.Q5_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 1024 -e
Log start
main: build = 2647 (8228b66d)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed = 1713504125
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from Meta-Llama-3-70B-Instruct.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hub
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 8192
llama_model_loader: - kv 5: llama.block_count u32 = 80
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 64
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 17
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = hub
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: CPU buffer size = 47628.36 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 266.50 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 1024, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Choose a Domain Name
Step 2: Register Your Domain Name
Step 3: Choose a Web Host
Step 4: Plan Your Website
Step 5: Design Your Website
Step 6: Build Your Website
Step 7: Add Content to Your Website
Step 8: Launch Your Website
Step 9: Test and Refine Your Website
Step 10: Maintain and Update Your Website
Here are some additional tips to help you create a successful website:
* Make sure your website is mobile-friendly, as more and more users are accessing the internet through their mobile devices.
* Use search engine optimization (SEO) techniques to improve your website's visibility in search engines.
* Use social media to promote your website and engage with your audience.
* Regularly update your website with fresh content to keep your audience engaged.
* Use analytics tools to track your website's performance and make data-driven decisions.
* Make sure your website is secure and protected from cyber threats.
Remember, creating a successful website takes time and effort, but with the right planning and execution, it can be a valuable asset for your business or personal brand.assistant
Here are some additional tips to help you create a successful website:
1. **Define your target audience**: Understand who your website is for and what they are looking for. This will help you create content and design that resonates with them.
2. **Keep it simple and intuitive**: Make it easy for users to navigate your website and find what they're looking for.
3. **Use high-quality images and graphics**: Visual content can help engage users and make your website more appealing.
4. **Write compelling and concise content**: Use clear and concise language to communicate your message and keep users engaged.
5. **Make it fast and responsive**: Ensure that your website loads quickly and is optimized for different devices and screen sizes.
6. **Use calls-to-action (CTAs)**: Encourage users to take action by using clear and prominent CTAs.
7. **Use analytics and tracking tools**: Monitor your website's performance and make data-driven decisions to improve it.
8. **Keep it up-to-date and fresh**: Regularly update your website with new content and features to keep users engaged and coming back.
9. **Make it secure**: Ensure that your website is secure and protected from cyber threats by using HTTPS and keeping your software up-to-date.
10. **Test and iterate**: Contin and ** to to that to to ** that that **** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **** **** ** ****
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** **** ** ** ** ** ** ** ** ** ** ** ** **** ** **** **** ** ** ** ** **** ** ******** ** **** ** **
** ** ** ** ** ** **** ** ****** ** ** ** ** ** ** ** **** ** ** **** ** **** **
** ** ** ** **** ** ** ** ** ** **** ** ** ********.
** ******** **** ** ** **** **** ** ** **** **** **** ** ** **.
.
** **** **
llama_print_timings: load time = 16648.95 ms
llama_print_timings: sample time = 37.87 ms / 691 runs ( 0.05 ms per token, 18248.56 tokens per second)
llama_print_timings: prompt eval time = 5782.04 ms / 17 tokens ( 340.12 ms per token, 2.94 tokens per second)
llama_print_timings: eval time = 683846.34 ms / 690 runs ( 991.08 ms per token, 1.01 tokens per second)
llama_print_timings: total time = 690448.33 ms / 707 tokens
```
It's slow, and I didn't let it run to the end, but notice how the output degenerates into runs of asterisks, with a stray "assistant" marker leaking in partway through; the robustness doesn't feel like the Mistral family. My guess is this is less the model than the setup: the GGUF metadata pins EOS to <|end_of_text|> (128001) while the instruct model actually emits <|eot_id|> to end its turns, and I'm also feeding a raw completion prompt to an instruct-tuned model instead of its chat template.
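For reference, this is the turn structure Llama-3-Instruct expects (as documented in the model card); a small helper sketch that builds it by hand, so a llama.cpp completion run can at least start from a properly framed prompt:

```python
def llama3_prompt(system: str, user: str) -> str:
    """Build a single-turn Llama-3-Instruct prompt (format from the model card)."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(llama3_prompt("You are a helpful assistant.", "Who are you?"))
```

That said, I carried on in raw completion mode and tried a Chinese prompt next.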
```shell
$ llama.cpp/main -m Meta-Llama-3-70B-Instruct.Q5_K_M.gguf -p "上海是在一座" -n 256 -e
[model loading and sampling log identical to the first run, omitted; seed = 1713504870, generate: n_ctx = 512, n_batch = 2048, n_predict = 256]
上海是在一座小岛上,整个小岛被一层柔和的绿色光芒所笼罩着,岛上遍布着各种奇特的生物和植物,整个小岛充满着浓郁的神秘色彩。
在小岛的中心有一座高大的古树,树干上刻着一些古老的符号,符号似乎在发出着微弱的光芒,照亮着周围的环境。
小岛的主人是一个名叫 "莉莉丝" 的少女,她有着一头银发和一双碧绿色的眼睛,她总是穿着一件白色的长裙,裙子上绣着一些精美的花纹。她拥有着神奇的力量,可以控制小岛上的各种生物和植物,整个小岛都是她魔法的世界。
莉莉丝是一个温柔、善良的人,她总是帮助小岛上的生物和来访的客人,她的魔法也总是为了帮助他人和保护小岛。
然而,小岛上也隐藏着一些秘
llama_print_timings: load time = 1866.31 ms
llama_print_timings: sample time = 13.83 ms / 233 runs ( 0.06 ms per token, 16847.43 tokens per second)
llama_print_timings: prompt eval time = 1577.67 ms / 4 tokens ( 394.42 ms per token, 2.54 tokens per second)
llama_print_timings: eval time = 229418.17 ms / 232 runs ( 988.87 ms per token, 1.01 tokens per second)
llama_print_timings: total time = 231358.70 ms / 236 tokens
```
I mistyped the prompt here: "上海是在一座" ("Shanghai is on a...") instead of "上海是一座" ("Shanghai is a..."). Llama 3 handled it very nicely, spinning the fragment into a bit of fantasy fiction about an island bathed in green light, watched over by a silver-haired girl named "莉莉丝" (Lilith); quite an imaginative take on the "Magic City" (Shanghai's nickname).
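Prompting an instruct model in raw completion mode like this is really a misuse; to get proper chat behavior over the same GGUF from Python, llama-cpp-python can apply the chat format for you. A sketch, assuming your llama-cpp-python build already ships the "llama-3" chat format (support arrived shortly after release, so treat that as an assumption):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.Q5_K_M.gguf",
    n_ctx=512,
    chat_format="llama-3",  # assumption: your build includes this format
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "上海是一座什么样的城市?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

Retrying in completion mode with the typo fixed: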
```shell
$ llama.cpp/main -m Meta-Llama-3-70B-Instruct.Q5_K_M.gguf -p "上海是一座" -n 256 -e
[model loading and sampling log identical to the first run, omitted; seed = 1713505108, generate: n_ctx = 512, n_batch = 2048, n_predict = 256]
上海是一座城市的名称)
* The city of Shanghai is a metropolis. (上海是一个大都市)
* The city of Shanghai is situated at the mouth of the Yangtze River. (上海位于长江口)
* Shanghai is a global financial center. (上海是一个全球金融中心)
* Shanghai is a city with a rich cultural heritage. (上海是一个文化遗产丰富的城市)
Note that in English, it's common to use "Shanghai" as a standalone noun to refer to the city, whereas in Chinese, "上海" is typically used with a preceding noun or phrase to clarify what is being referred to.
In Chinese, when referring to a specific aspect of the city, you would typically use a phrase like "上海市" (Shanghai city) or "上海都市" (Shanghai metropolis). For example:
* 上海市是中国最大的城市。 (Shanghai city is the largest city in China.)
* 上海都市的经济发展很快。 (The economy of Shanghai metropolis is developing very quickly.)
However, when referring to the city in a more general sense, it's common to simply use "上海" on its own. For example:
* 我喜欢上海。 (I like Shanghai.)
* 上海是一个很
llama_print_timings: load time = 1894.65 ms
llama_print_timings: sample time = 14.08 ms / 256 runs ( 0.06 ms per token, 18180.53 tokens per second)
llama_print_timings: prompt eval time = 1308.29 ms / 3 tokens ( 436.10 ms per token, 2.29 tokens per second)
llama_print_timings: eval time = 251453.62 ms / 255 runs ( 986.09 ms per token, 1.01 tokens per second)
llama_print_timings: total time = 252938.58 ms / 258 tokens
```
Huh? Mixed Chinese-English generation. Meta does say the pretraining data includes high-quality non-English text covering more than 30 languages (while cautioning that performance in those languages won't match English), so could it really rival professional translation? Proper benchmarks will have to come from the community, but I think Llama 3 is a shot of "endorphins" for open-source AI. Let's wait and see!