Drawing on references from around the web, this post uses mlx-lm (0.28.1) to run the recently popular lightweight model gemma-3-270m-it-bf16. The underlying model, Gemma 3 270M, has 270 million parameters and is built for task-specific fine-tuning; it can be customized with very little additional training.
1 Gemma 3
1.1 Gemma 3
Gemma 3 is a family of lightweight multimodal LLMs developed by Google, available in 1B, 4B, 12B, and 27B versions trained on roughly 2T, 4T, 12T, and 14T tokens respectively. Gemma 3 understands more than 140 languages, accepts image and text input with text output, and supports structured output and function calling. The 27B model is even competitive with DeepSeek V3, o1-preview, and o3-mini-high.
Gemma 3 marks a decisive leap forward in open vision-language models. The 27B instruction-tuned variant (Gemma-3-27B-IT) delivers a 1338 Elo score on LMSys Chatbot Arena, ranking in the top 10 across all models and outperforming most open-weight peers---including DeepSeek-V3, Qwen2.5-72B, and even Meta's 405B LLaMA 3.1. On standard benchmarks, it posts strong results: 67.5 on MMLU-Pro, 29.7 on LiveCodeBench, and 54.4 on Bird-SQL, showing robust reasoning and coding ability. Its 89.0 on MATH and 74.9 on FACTS Grounding reflect precision in symbolic tasks and factual alignment. This is enabled by a novel post-training pipeline blending distillation, RLHF (with reward models like BOND and WARP), and extensive multilingual tuning across 140+ languages using a 262K-token SentencePiece tokenizer. Architecturally, Gemma 3 introduces efficient long-context handling (up to 128K tokens) through RoPE scaling and 5:1 local-to-global attention layering---cutting KV cache memory by up to 85% vs. global-only designs without hurting perplexity. Multimodal inputs are powered by a frozen 400M SigLIP vision encoder and enhanced at inference with Pan & Scan, helping Gemma 3 excel on real-world image tasks (e.g., +17 points on InfoVQA with P&S). Its release spans 1B to 27B dense models with instruction-tuned and pre-trained variants---all deployable via Hugging Face, MLX, or llama.cpp. With day-zero support across tooling, near-SOTA performance, and strong safety benchmarks, Gemma 3 is a high-performing, accessible alternative to Gemini 1.5-Pro and a defining model in the open frontier.
https://deepranking.ai/llm-models/gemma-3-27b
1.2 Gemma 3 270M
Gemma 3 270M is a compact foundation model with 270 million parameters, built for efficient task-specific fine-tuning. It offers strong instruction-following and text-structuring capabilities, and can be deployed and customized with minimal additional training.
Gemma 3 270M is designed to be adapted to custom use cases, treating efficiency as more important than raw model capability. That trade-off matters for on-device AI, privacy-sensitive and compliance-heavy scenarios, and other efficiency-oriented tasks that call for customization and local deployment.
https://huggingface.co/google/gemma-3-270m-it
1.3 Gemma 3n
Gemma 3n is an ultra-lightweight LLM developed specifically for mobile devices.
https://ai.google.dev/gemma/docs/gemma-3n
https://deepmind.google/models/gemma/gemma-3n/
2 Environment setup
2.1 Runtime environment
mlx-lm is an LLM inference framework for Apple platforms; it supports Mac M-series (Apple Silicon) processors well and delivers excellent performance.
Once the Python environment is ready, install mlx-lm on top of it. For installation and usage of the MLX LLM framework on Mac, see
https://blog.csdn.net/liliang199/article/details/149440212
Installation here uses conda, with the following commands:
conda create -n gemma python=3.12
conda activate gemma
pip install mlx-lm
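As a quick sanity check of the environment, the sketch below prints the installed version and the default compute device. It assumes mlx_lm exposes __version__ and mlx.core exposes default_device(), as recent releases do:
import mlx.core as mx
import mlx_lm

# Confirm the install: print the mlx-lm version and the device MLX will compute on.
print("mlx-lm version:", mlx_lm.__version__)
print("default device:", mx.default_device())  # expected: the Apple Silicon GPU on an M-series Mac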
2.2 Model download
"mlx-community/gemma-3-270m-it-bf16" is the bf16 conversion of Gemma 3 270M for the MLX platform, suitable for compute-constrained devices such as the Mac M1; in testing, overall runtime memory usage stays below 1 GB.
https://huggingface.co/mlx-community/gemma-3-270m-it-bf16
The code below downloads the model locally. Because direct access to Hugging Face may be restricted, an HF mirror is configured first.
import os
# Configure the HF mirror (must be set before any huggingface_hub import)
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from mlx_lm import load, generate
# Download the model to the local cache and load it
model, tokenizer = load("mlx-community/gemma-3-270m-it-bf16")
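If you only want to pre-fetch the model files without loading the model, a sketch using huggingface_hub's snapshot_download (huggingface_hub is assumed to be present as a dependency of mlx-lm) looks like this:
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"  # set the mirror before importing huggingface_hub

from huggingface_hub import snapshot_download

# Download all files of the repo into the local HF cache; later calls to load() reuse them.
local_path = snapshot_download(repo_id="mlx-community/gemma-3-270m-it-bf16")
print("model cached at:", local_path)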
3 Model test
The following is the verification code for Gemma, covering the whole flow of model loading and text generation:
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"

from mlx_lm import load, generate

# Load the MLX-converted Gemma 3 270M model (downloads on first run)
model, tokenizer = load("mlx-community/gemma-3-270m-it-bf16")

prompt = "hello"
# Wrap the raw prompt with the model's chat template if one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
Program output:
generation_config.json: 100%|████████████████████| 173/173 [00:00<00:00, 983kB/s]
model.safetensors.index.json: 17.3kB [00:00, 28.6MB/s]
config.json: 1.50kB [00:00, 6.98MB/s]
special_tokens_map.json: 662B [00:00, 1.86MB/s]
tokenizer_config.json: 1.16MB [00:00, 1.72MB/s]
chat_template.jinja: 1.53kB [00:00, 3.60MB/s]
tokenizer.json: 100%|████████████████████| 33.4M/33.4M [00:07<00:00, 4.63MB/s]
model.safetensors: 100%|████████████████████| 872M/872M [02:40<00:00, 5.42MB/s]
Fetching 8 files: 100%|████████████████████| 8/8 [02:42<00:00, 20.32s/it]
==========
Hello! How can I help you today?
==========
Prompt: 10 tokens, 10.340 tokens-per-sec
Generation: 11 tokens, 97.320 tokens-per-sec
Peak memory: 0.892 GB
According to this output, peak memory usage is under 1 GB, a hardware requirement that most laptop-class machines can meet.
Generation also approaches 100 tokens per second, which is remarkably fast and overturns the earlier perception that LLM inference on ordinary consumer laptops is slow.
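Beyond the one-shot generate() call above, mlx-lm can also emit tokens incrementally. The sketch below assumes that stream_generate is exported by mlx_lm in this version and that each yielded chunk carries a .text field, as in recent releases:
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/gemma-3-270m-it-bf16")

messages = [{"role": "user", "content": "Introduce yourself in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print tokens as they are produced instead of waiting for the full response.
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)
print()
Streaming like this is convenient for chat-style interfaces, where latency to the first token matters more than total throughput.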
Reference
mlx-community/gemma-3-270m-it-bf16
https://model.aibase.com/zh/models/details/1958740542628302848
Installing and using the MLX LLM framework on Mac
https://blog.csdn.net/liliang199/article/details/149440212
Gemma 3 27B
https://deepranking.ai/llm-models/gemma-3-27b
gemma-3-270m-it
https://huggingface.co/google/gemma-3-270m-it
gemma-3-270m-it-bf16
https://huggingface.co/mlx-community/gemma-3-270m-it-bf16
gemma-3n
https://ai.google.dev/gemma/docs/gemma-3n