Drawing on references from around the web, this post uses mlx-lm (0.28.1) to run the recently popular lightweight model gemma-3-270m-it-bf16. The underlying model, Gemma 3 270M, has 270 million parameters and is built for task-specific fine-tuning; it can be customized with very little additional training.
1 Gemma 3
1.1 Gemma 3
Gemma 3 is a family of lightweight multimodal LLMs developed by Google, available in 1B, 4B, 12B, and 27B versions trained on roughly 2T, 4T, 12T, and 14T tokens respectively. Gemma 3 understands more than 140 languages, accepts image and text input with text output, and supports structured output and function calling. The 27B model is even competitive with DeepSeek V3, o1-preview, and o3-mini-high.
Gemma 3 marks a decisive leap forward in open vision-language models. The 27B instruction-tuned variant (Gemma-3-27B-IT) delivers a 1338 Elo score on LMSys Chatbot Arena, ranking in the top 10 across all models and outperforming most open-weight peers---including DeepSeek-V3, Qwen2.5-72B, and even Meta's 405B LLaMA 3.1. On standard benchmarks, it posts strong results: 67.5 on MMLU-Pro, 29.7 on LiveCodeBench, and 54.4 on Bird-SQL, showing robust reasoning and coding ability. Its 89.0 on MATH and 74.9 on FACTS Grounding reflect precision in symbolic tasks and factual alignment. This is enabled by a novel post-training pipeline blending distillation, RLHF (with reward models like BOND and WARP), and extensive multilingual tuning across 140+ languages using a 262K-token SentencePiece tokenizer. Architecturally, Gemma 3 introduces efficient long-context handling (up to 128K tokens) through RoPE scaling and 5:1 local-to-global attention layering---cutting KV cache memory by up to 85% vs. global-only designs without hurting perplexity. Multimodal inputs are powered by a frozen 400M SigLIP vision encoder and enhanced at inference with Pan & Scan, helping Gemma 3 excel on real-world image tasks (e.g., +17 points on InfoVQA with P&S). Its release spans 1B to 27B dense models with instruction-tuned and pre-trained variants---all deployable via Hugging Face, MLX, or llama.cpp. With day-zero support across tooling, near-SOTA performance, and strong safety benchmarks, Gemma 3 is a high-performing, accessible alternative to Gemini 1.5-Pro and a defining model in the open frontier.
https://deepranking.ai/llm-models/gemma-3-27b
1.2 Gemma 3 270M
Gemma 3 270M is a compact foundation model with 270 million parameters, built for efficient task-specific fine-tuning. It offers strong instruction-following and text-structuring capabilities, and can be deployed and customized with minimal additional training.
Gemma 3 270M is designed to be adapted to custom use cases, treating efficiency as more important than raw model capability. That trade-off matters for on-device AI, privacy-sensitive and compliance-heavy scenarios, and other efficiency-oriented tasks that call for customization and local deployment.
https://huggingface.co/google/gemma-3-270m-it
1.3 Gemma 3n
Gemma 3n is an ultra-lightweight LLM developed specifically for mobile devices.
https://ai.google.dev/gemma/docs/gemma-3n
https://deepmind.google/models/gemma/gemma-3n/
2 Environment setup
2.1 Runtime environment
mlx-lm is an LLM inference framework for Apple platforms; it supports Mac M-series (Apple Silicon) processors well and delivers excellent performance.
Once the Python environment is ready, install mlx-lm on top of it. For installation and usage of the MLX LLM framework on Mac, see
https://blog.csdn.net/liliang199/article/details/149440212
Installation here uses conda, with the following commands:
conda create -n gemma python=3.12
conda activate gemma
pip install mlx-lm
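As a quick sanity check of the environment, the sketch below prints the installed version and the default compute device. It assumes mlx_lm exposes __version__ and mlx.core exposes default_device(), as recent releases do:
import mlx.core as mx
import mlx_lm

# Confirm the install: print the mlx-lm version and the device MLX will compute on.
print("mlx-lm version:", mlx_lm.__version__)
print("default device:", mx.default_device())  # expected: the Apple Silicon GPU on an M-series Mac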
2.2 Model download
"mlx-community/gemma-3-270m-it-bf16" is the bf16 conversion of Gemma 3 270M for the MLX platform, suitable for compute-constrained devices such as the Mac M1; in testing, overall runtime memory usage stays below 1 GB.
https://huggingface.co/mlx-community/gemma-3-270m-it-bf16
The code below downloads the model locally. Because direct access to Hugging Face may be restricted, an HF mirror is configured first.
import os
# Configure the HF mirror (must be set before any huggingface_hub import)
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from mlx_lm import load, generate
# Download the model to the local cache and load it
model, tokenizer = load("mlx-community/gemma-3-270m-it-bf16")
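If you only want to pre-fetch the model files without loading the model, a sketch using huggingface_hub's snapshot_download (huggingface_hub is assumed to be present as a dependency of mlx-lm) looks like this:
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"  # set the mirror before importing huggingface_hub

from huggingface_hub import snapshot_download

# Download all files of the repo into the local HF cache; later calls to load() reuse them.
local_path = snapshot_download(repo_id="mlx-community/gemma-3-270m-it-bf16")
print("model cached at:", local_path)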
3 Model test
The following is the verification code for Gemma, covering the whole flow of model loading and text generation:
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"

from mlx_lm import load, generate

# Load the MLX-converted Gemma 3 270M model (downloads on first run)
model, tokenizer = load("mlx-community/gemma-3-270m-it-bf16")

prompt = "hello"
# Wrap the raw prompt with the model's chat template if one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
Program output:
generation_config.json: 100%|████████████████████| 173/173 [00:00<00:00, 983kB/s]
model.safetensors.index.json: 17.3kB [00:00, 28.6MB/s]
config.json: 1.50kB [00:00, 6.98MB/s]
special_tokens_map.json: 662B [00:00, 1.86MB/s]
tokenizer_config.json: 1.16MB [00:00, 1.72MB/s]
chat_template.jinja: 1.53kB [00:00, 3.60MB/s]
tokenizer.json: 100%|████████████████████| 33.4M/33.4M [00:07<00:00, 4.63MB/s]
model.safetensors: 100%|████████████████████| 872M/872M [02:40<00:00, 5.42MB/s]
Fetching 8 files: 100%|████████████████████| 8/8 [02:42<00:00, 20.32s/it]
==========
Hello! How can I help you today?
==========
Prompt: 10 tokens, 10.340 tokens-per-sec
Generation: 11 tokens, 97.320 tokens-per-sec
Peak memory: 0.892 GB
According to this output, peak memory usage is under 1 GB, a hardware requirement that most laptop-class machines can meet.
Generation also approaches 100 tokens per second, which is remarkably fast and overturns the earlier perception that LLM inference on ordinary consumer laptops is slow.
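Beyond the one-shot generate() call above, mlx-lm can also emit tokens incrementally. The sketch below assumes that stream_generate is exported by mlx_lm in this version and that each yielded chunk carries a .text field, as in recent releases:
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/gemma-3-270m-it-bf16")

messages = [{"role": "user", "content": "Introduce yourself in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print tokens as they are produced instead of waiting for the full response.
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)
print()
Streaming like this is convenient for chat-style interfaces, where latency to the first token matters more than total throughput.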
Reference
mlx-community/gemma-3-270m-it-bf16
https://model.aibase.com/zh/models/details/1958740542628302848
Installing and using the MLX LLM framework on Mac
https://blog.csdn.net/liliang199/article/details/149440212
Gemma 3 27B
https://deepranking.ai/llm-models/gemma-3-27b
gemma-3-270m-it
https://huggingface.co/google/gemma-3-270m-it
gemma-3-270m-it-bf16
https://huggingface.co/mlx-community/gemma-3-270m-it-bf16
gemma-3n
https://ai.google.dev/gemma/docs/gemma-3n