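Hermes Deployment Pitfalls

Moving Hermes Agent from Docker to a Mini PC, and fixing 7 bugs along the way

I had a small PC gathering dust at home: 8 GB of RAM, a 128 GB NVMe drive, doing nothing. Hermes Agent (the open-source self-evolving agent from NousResearch) had been running in Docker Desktop on my main computer, with these main pain points:

The bot is only online while the main computer is on, so it is often unreachable on Telegram
Docker Desktop takes about 3 GB of RAM just for its base footprint
Long-lived connections (Telegram getUpdates) over the WSL2 network layer drop every few hours

So I moved it. The machine runs Ubuntu 26.04, draws around 8 W, and costs about 30 yuan a year in electricity. I expected a one-evening job; it stretched over three days and 7 pitfalls.

Here are my notes.

Overall architecture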

┌─ Mini PC (Ubuntu 26.04, 192.168.3.200) ──────────────────────┐
│                                                              │
│  Mihomo (Clash core)  ── 127.0.0.1:7890 ──┐                  │
│       │                                   │                  │
│       │ subscription                      │                  │
│       ▼                                   ▼                  │
│ your proxy provider                hermes-gateway            │
│                                           │                  │
│                                           ▼                  │
│                                 hermes-dashboard:9119        │
└────────────┬──────────────────────────────┬──────────────────┘
             ▼                              ▼
        Telegram API                   LAN Web UI

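There are three systemd services: mihomo, hermes-gateway, and hermes-dashboard. Telegram is only reachable through a proxy from mainland China, so mihomo has to come up first.

Setting up the environment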

sudo apt update
sudo apt install -y python3 python3-venv python3-pip git curl
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs

Mihomo is pulled from GitHub releases, which will most likely time out from mainland China, so keep a ghproxy mirror as a fallback:

URL="https://github.com/MetaCubeX/mihomo/releases/latest/download/mihomo-linux-amd64.gz"
curl -fL "$URL" -o /tmp/mihomo.gz || curl -fL "https://ghproxy.net/$URL" -o /tmp/mihomo.gz
gunzip /tmp/mihomo.gz && sudo install /tmp/mihomo /usr/local/bin/

sudo mkdir -p /etc/mihomo
sudo curl -fL -o /etc/mihomo/Country.mmdb \
  https://github.com/MetaCubeX/meta-rules-dat/releases/latest/download/country.mmdb

Feed the subscription straight to Mihomo; it is compatible with Clash YAML:

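# /etc/mihomo is root-owned, so the download needs sudo as well
sudo curl -fL "YOUR_SUBSCRIPTION_URL" -o /etc/mihomo/config.yaml
sudo systemctl enable --now mihomo
curl -x http://127.0.0.1:7890 https://api.telegram.org   # a response means the proxy works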

Installing Hermes

git clone https://github.com/NousResearch/hermes-agent ~/hermes-agent
cd ~/hermes-agent
python3 -m venv venv && source venv/bin/activate

export HTTPS_PROXY=http://127.0.0.1:7890
export HTTP_PROXY=http://127.0.0.1:7890
pip install -e .

cd web && npm install && npm run build

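Then rsync the data over from the old Docker volume, mainly .env, config.yaml, sessions/, memories/, state.db, and SOUL.md under ~/.hermes/.

At this point hermes chat already works normally. The next step is to turn it into a service and hook it up to Telegram.

systemd unit (final version)

/etc/systemd/system/hermes-gateway.service: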

[Unit]
Description=Hermes Agent Gateway
After=network-online.target mihomo.service
Wants=network-online.target
# StartLimitIntervalSec= is a [Unit] option; 0 disables start rate limiting
StartLimitIntervalSec=0

[Service]
Type=simple
User=cn106
WorkingDirectory=/home/cn106/hermes-agent
EnvironmentFile=/home/cn106/.hermes/.env
Environment=HTTPS_PROXY=http://127.0.0.1:7890
Environment=HTTP_PROXY=http://127.0.0.1:7890
ExecStart=/home/cn106/hermes-agent/venv/bin/hermes gateway run

Restart=always
RestartSec=10
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target

The Restart=, RestartSec=, StartLimitIntervalSec=, and TimeoutStopSec= settings were not there at first; they were added after the pitfalls below.

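Pitfalls

1. All Telegram buttons broken: `Button_data_invalid`

When Hermes triggers the clarify tool, it is supposed to show 4 option buttons for me to pick from. The log said otherwise: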

WARNING [Telegram] send_clarify failed: Button_data_invalid

No buttons appeared, and the bot sat there waiting for my reply.

Digging through the code led to this line:

InlineKeyboardButton(c, callback_data=f"cq:{c}:{session_key}")

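c is the raw text of the option, in Chinese. Telegram caps callback_data at 64 bytes, and each Chinese character takes 3 bytes in UTF-8, so a handful of characters blows past the limit.

The fix is to send an index and keep the choices in an in-process cache: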

# When sending: cache the choices and put only the index in callback_data
self._clarify_state[session_key] = tuple(choices)
buttons = [
    InlineKeyboardButton(c, callback_data=f"cq:{i}:{session_key}")
    for i, c in enumerate(choices)
]

# In the callback: map the index back to the cached choice
parts = data.split(":", 2)
idx = int(parts[1])
cached = self._clarify_state.get(session_key) or []
choice = cached[idx] if 0 <= idx < len(cached) else parts[1]
self._clarify_state.pop(session_key, None)

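English-speaking users probably never hit this bug, since a few English words will hardly overflow 64 bytes. With Chinese options it reproduces every time.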
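
To make the byte arithmetic concrete, here is a minimal sketch comparing the two callback_data layouts. The helper names and the sample session key are made up for illustration and are not part of Hermes; the only thing assumed about Telegram is the 64-byte limit cited above.

```python
TELEGRAM_CALLBACK_LIMIT = 64  # bytes, the limit cited above


def callback_size(data: str) -> int:
    """Size of callback_data as counted in UTF-8 bytes, not characters."""
    return len(data.encode("utf-8"))


def text_payload(choice: str, session_key: str) -> str:
    # Original layout: the raw option text is embedded in the payload.
    return f"cq:{choice}:{session_key}"


def index_payload(index: int, session_key: str) -> str:
    # Patched layout: only the option's position is embedded.
    return f"cq:{index}:{session_key}"


session_key = "telegram:12345:67890"  # hypothetical session key
choice = "先把配置文件备份一下，然后再重启网关服务"  # a typical Chinese option

old = text_payload(choice, session_key)
new = index_payload(0, session_key)
print("text payload fits: ", callback_size(old) <= TELEGRAM_CALLBACK_LIMIT)
print("index payload fits:", callback_size(new) <= TELEGRAM_CALLBACK_LIMIT)
```

The session key still counts toward the limit, so a very long key could overflow even with an index; in that case the key itself would need shortening too.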

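2. The gateway "exits gracefully" and never comes back

After sudo systemctl restart hermes-gateway the service showed active (running), but Telegram never got a response.

Upstream Issue #11258 already discusses this: on SIGTERM the gateway drains gracefully and exits with code 0 once all tasks finish. systemd sees the 0 exit code and, under the default Restart=on-failure configuration, does not restart it.

The fix is to change the policy: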

Restart=always
StartLimitIntervalSec=0
TimeoutStopSec=30

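always restarts the service regardless of exit code. StartLimitIntervalSec=0 (a [Unit]-section setting) turns off restart rate limiting, which matters for the next pitfall.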
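
The difference between the two policies can be sketched as a tiny decision function. This is a deliberately simplified model that looks only at the exit code (real systemd also considers signals, timeouts, and the watchdog), and the function name is made up for illustration.

```python
def should_restart(policy: str, exit_code: int) -> bool:
    """Simplified model of the Restart= decision, based on exit code only."""
    if policy == "always":
        return True
    if policy == "on-failure":
        # A clean exit (code 0) is not a failure, so nothing is restarted.
        return exit_code != 0
    if policy == "no":
        return False
    raise ValueError(f"policy not covered by this sketch: {policy}")


# The gateway drains and exits 0 after SIGTERM:
for policy in ("on-failure", "always"):
    print(policy, "->", should_restart(policy, 0))
```

Under this model a graceful exit is exactly the case where on-failure stays down and always brings the gateway back.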

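3. The bot hallucinates and kills itself

Occasionally the bot sends "Restarting myself..." and then really runs sudo systemctl restart hermes-gateway. If that hallucinated message is still in the context, the bot sees it again after the restart and kills itself once more. A loop.

Upstream #8460 tracks this as well. The LLM has no awareness that it is running under systemd and treats systemctl like any other command.

My first attempt was a sudoers blacklist: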

Cmnd_Alias HERMES_KILL = \
    /usr/bin/systemctl restart hermes-gateway, \
    /usr/bin/systemctl restart hermes-gateway.service, \
    /usr/bin/systemctl stop hermes-gateway, \
    /usr/bin/systemctl stop hermes-gateway.service, \
    /usr/bin/systemctl kill hermes-gateway, \
    /usr/bin/systemctl kill hermes-gateway.service

cn106 ALL=(ALL) NOPASSWD: ALL, !HERMES_KILL

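The first draft blew up: sudoers rejected a wildcard in the command arguments, and hermes-gateway* failed with wildcards are not allowed in command arguments, invalidating the whole file. The .service suffix has to be spelled out for every entry.

After a day I noticed that the blacklist blocks me as well. Just to restart after editing .env I had to su - and type the root password. That was unlivable.

In the end I dropped the blacklist in favor of a looser combination:

Restart=always as the safety net for any crash; a hallucination loop burns some CPU at worst and cannot actually take the service down
A "self-restart prohibition" section in SOUL.md as a soft rule telling it not to mess around
Accepting the occasional cost

A week in, it has not triggered again.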

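4 & 5. Two abstractions added alongside the clarify fix

I had Hermes itself (in its "engineer persona") fill in two things that were missing upstream:

BasePlatformAdapter.send_clarify gained a default implementation that sends numbered text such as "1. Option A" "2. Option B", so platforms without buttons (plain IRC, for example) can run clarify too.

gateway/run.py::_clarify_callback is the tool bridge at the gateway layer. It picks the button path or the text path depending on whether the platform overrides send_clarify, and session state lives in a new tools/clarify_state.py module.

These two are companion changes rather than bugs: fixing the buttons alone is not enough, the whole call chain needed a refactor.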
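
One way to picture the default implementation and the override check is the sketch below. The class and function names are stand-ins invented for illustration, not the real Hermes classes, and everything is synchronous and returns strings for brevity, where the real adapters presumably send messages asynchronously.

```python
from typing import Sequence


def format_numbered_choices(question: str, choices: Sequence[str]) -> str:
    """Render a clarify prompt as plain numbered text."""
    lines = [question] + [f"{i}. {c}" for i, c in enumerate(choices, start=1)]
    return "\n".join(lines)


class BaseAdapter:
    """Stand-in for a platform adapter base class."""

    def send_clarify(self, question: str, choices: Sequence[str]) -> str:
        # Default: numbered text, usable on platforms without buttons.
        return format_numbered_choices(question, choices)


class ButtonAdapter(BaseAdapter):
    def send_clarify(self, question: str, choices: Sequence[str]) -> str:
        # A real adapter would build inline buttons here.
        return f"[buttons] {question}"


class TextOnlyAdapter(BaseAdapter):
    pass  # inherits the numbered-text default


def overrides_send_clarify(adapter: BaseAdapter) -> bool:
    """True if the adapter's class supplies its own send_clarify."""
    return type(adapter).send_clarify is not BaseAdapter.send_clarify


for adapter in (ButtonAdapter(), TextOnlyAdapter()):
    path = "buttons" if overrides_send_clarify(adapter) else "text"
    print(type(adapter).__name__, "->", path)
```

Comparing the method on the class against the base class's method keeps the check free of per-platform flags: any adapter that defines its own send_clarify is routed to the button path automatically.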

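6. Silent disconnects that go unnoticed

systemd shows active, but Telegram never replies. Every time it took journalctl to find out the bot was "alive but not really alive".

So I added a startup notification: after every successful connect() it sends me an "I'm back" message: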

_startup_chats = self.config.extra.get("startup_notify", [])
for chat_id in _startup_chats:
    await self._bot.send_message(
        chat_id=int(chat_id),
        text="🟢 Hermes Agent is back online!",
    )

The config goes in ~/.hermes/config.yaml:

telegram:
  reactions: false
  channel_prompts: {}
  extra:
    startup_notify:
      - "8583135718"

Restarted, and nothing showed up in Telegram. The log:

[Telegram] Checking startup_notify config...
[Telegram] startup_notify = []

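It was clearly configured, yet empty at runtime. Enter pitfall 7.

7. The config schema silently drops fields

Reading gateway/config.py showed that platform config has two loading paths with completely different behavior.

The first (lines 762-808) reads from a top-level telegram: key and goes through a whitelist bridge: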

for plat in Platform:
    platform_cfg = yaml_cfg.get(plat.value)
    bridged = {}
    if "free_response_channels" in platform_cfg: bridged["free_response_channels"] = ...
    if "mention_patterns"        in platform_cfg: bridged["mention_patterns"] = ...
    if "channel_prompts"         in platform_cfg: bridged["channel_prompts"] = ...
    # ...a long chain of ifs, with extra conspicuously absent
    extra.update(bridged)

telegram.extra is not in the whitelist at all, so it is silently dropped.

The second (lines 712-733) reads from a top-level platforms: key and keeps extra intact:

yaml_platforms = yaml_cfg.get("platforms")
if isinstance(yaml_platforms, dict):
    for plat_name, plat_block in yaml_platforms.items():
        merged_extra = {**existing.get("extra", {}), **plat_block.get("extra", {})}
        # extra is passed through in full here

Switching the YAML to the second path fixes it:

# Before (does not work)
telegram:
  extra:
    startup_notify: ["8583135718"]

# After (works)
platforms:
  telegram:
    extra:
      startup_notify: ["8583135718"]

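After the restart, the green dot popped up in Telegram right away.

Two paths coexist, yet nothing documents which one supports arbitrary extra, so users are bound to hit this. It deserves an upstream PR, or at least a comment.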
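
The contrast between the two paths can be reproduced with plain dictionaries. The sketch below is a toy model, not the code in gateway/config.py: the function names are invented, and the whitelist holds only the three keys quoted above, whereas the real one is longer.

```python
from typing import Any, Dict

# Only the keys quoted above; the real whitelist has more entries.
BRIDGED_KEYS = ("free_response_channels", "mention_patterns", "channel_prompts")


def load_via_bridge(yaml_cfg: Dict[str, Any], platform: str) -> Dict[str, Any]:
    """Top-level `<platform>:` path: only whitelisted keys survive."""
    platform_cfg = yaml_cfg.get(platform) or {}
    return {k: platform_cfg[k] for k in BRIDGED_KEYS if k in platform_cfg}


def load_via_platforms(yaml_cfg: Dict[str, Any], platform: str) -> Dict[str, Any]:
    """`platforms.<platform>.extra` path: the extra dict is kept as-is."""
    block = (yaml_cfg.get("platforms") or {}).get(platform) or {}
    return dict(block.get("extra") or {})


extra = {"startup_notify": ["8583135718"]}
before = {"telegram": {"channel_prompts": {}, "extra": extra}}
after = {"platforms": {"telegram": {"extra": extra}}}

print(load_via_bridge(before, "telegram").get("startup_notify", []))
print(load_via_platforms(after, "telegram").get("startup_notify", []))
```

The first lookup comes back empty, matching the startup_notify = [] line in the log, while the second returns the configured chat ID.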

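Wrap-up

The final anti-self-kill setup comes down to three layers:

| Layer | Measure | Enabled |
| --- | --- | --- |
| systemd | `Restart=always` + `StartLimitIntervalSec=0` | ✓ |
| sudoers | HERMES_KILL blacklist | ✗ (gets in the way of ops) |
| Behavior | Rule section in SOUL.md | ✓ |

Enable at boot: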

sudo systemctl enable mihomo hermes-gateway hermes-dashboard

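After a power outage, a kernel panic, or a yanked power cord, the whole stack recovers on its own in about 30 seconds. Monitoring for now is just systemd status plus the startup notification; Prometheus can wait until this grows.

Numbers

Hermes code-level bugs: 7
Environment/deployment pain points: 5
Helper scripts written: 8
Hermes source files changed: 4 (telegram.py, base.py, run.py, plus the new tools/clarify_state.py)
New systemd units: 3
Gateway steady-state memory: about 165 MB

One week running, zero manual interventions.

PRs I'd like to send upstream

fix(telegram): clarify callback_data length; index-based scheme, with a reproducer
feat(telegram): startup_notify config; proactively send a message after a restart
docs(config): document platforms.<name>.extra as the recommended path; at least add a one-line comment to spare the next person

Closing thoughts

Deploying an AI agent is, at bottom, no different from wrangling any distributed system: every problem eventually reduces to the same old three, process lifecycle, correct config propagation, and network reachability.

But an AI agent adds a new dimension: it changes its own config, restarts itself, and hallucinates. That makes for an interesting ops challenge: give it enough permission to be autonomous, yet fence it in so it cannot hurt itself.

The full scripts and all the patches above are available; leave a comment if you want them.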
