Fixing HermesAgent's Terminal Tool on Windows: Tracking Down and Resolving Two Bugs

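Hi everyone, I'm Zhang Dapeng. I have 10 years of full-stack development experience and currently work on online AI education and training. While building on top of HermesAgent recently, I ran into a strange problem in a native Windows environment: every terminal command the Agent ran returned empty output. This post records the full investigation and the fix, which involves two pitfalls: the Windows compatibility of Python's select.select(), and path-format conversion for Git Bash.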

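The symptom

In the previous post, I got HermesAgent installed and running in a native Windows 11 environment. Everything looked fine: API calls worked and the model replied normally.

But when I asked the Agent to run terminal commands, the problem showed up: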

● Do you know the absolute path of the current workspace?

┊ 💻 preparing terminal...
┊ 💻 $         pwd  0.9s
┊ 💻 preparing terminal...
┊ 💻 $         echo $PWD  0.3s
┊ 💻 preparing terminal...
┊ 💻 ls -la  (54.6s)

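After going around in circles, the Agent finally told me:

Sorry, the current environment has hit some anomalies: terminal commands all return empty output, and file reads are not working properly either.

The commands did run (they took time), but every one of them produced empty output. The Agent could not get the results either, so all it could do was tell me to "run it yourself in a terminal".

That won't do. One of HermesAgent's core capabilities is its terminal tool: it relies on shell commands to read files, install dependencies, and run tests. With the terminal tool broken, the Agent is a dud that can only chat.

Investigation

Step 1: Confirm Git Bash itself works

On Windows, HermesAgent runs commands through Git Bash (not PowerShell). First, check that Git Bash itself works: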

$ "C:\Program Files\Git\usr\bin\bash.EXE" -c "pwd"
/d/code/HermesAgent

No problem. Git Bash itself works fine.

Step 2: Call subprocess directly from Python

import subprocess

bash_path = r'C:\Program Files\Git\usr\bin\bash.EXE'
proc = subprocess.Popen(
    [bash_path, '-c', 'echo hello'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
output = proc.stdout.read()
print(repr(output))  # 'hello\n'

No problem here either. Python's subprocess captures Git Bash output correctly.

Step 3: Call through HermesAgent's terminal tool

from tools.environments.local import LocalEnvironment

env = LocalEnvironment(cwd=r'D:\code\HermesAgent', timeout=10)
result = env.execute('echo hello')
print(repr(result.get('output', '')))  # ''
print(repr(result.get('returncode')))  # 0

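The output is empty, but the return code is 0! The command succeeded; its output just never arrived.

That is odd: the same bash and the same command work when subprocess is called directly, but the output is lost when going through HermesAgent's terminal tool.

Step 4: Read the terminal tool's execution flow

HermesAgent's terminal tool runs a command in three steps:

1. init_session: start a bash login shell and snapshot the environment variables to a temporary file
2. execute: for each command, source the snapshot file, cd to the working directory, then run the command
3. _wait_for_process: poll the pipe with select.select() and collect the output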
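
To make the execute step concrete, here is a minimal sketch of how such a wrapped shell script could be composed. This is not HermesAgent's actual code: the function name build_wrapped_command, its parameters, the __status variable, and the idea of rewriting the CWD file after every command are all assumptions made for illustration.

```python
import shlex


def build_wrapped_command(command: str, cwd: str, snapshot_file: str, cwd_file: str) -> str:
    """Compose the bash script for one execute() call (hypothetical helper)."""
    lines = [
        # Restore the environment captured by the login shell, if the file exists.
        f"source {shlex.quote(snapshot_file)} 2>/dev/null || true",
        # Enter the tracked working directory; give up if it cannot be entered.
        f"cd {shlex.quote(cwd)} || exit 1",
        # Run the user's command unchanged.
        command,
        # Remember its exit status before running anything else.
        "__status=$?",
        # Record where the shell ended up so the caller can track directory changes.
        f"pwd -P > {shlex.quote(cwd_file)}",
        "exit $__status",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_wrapped_command("echo hello", "/d/code/HermesAgent",
                                "/tmp/env_snapshot.sh", "/tmp/cwd.txt"))
```

The resulting string would be passed to bash -c. Note that the directory recorded by pwd -P is whatever bash reports, which becomes relevant later in this post.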

The key is the third step. Inside _wait_for_process, a _drain() function reads the output from the pipe:

def _drain():
    fd = proc.stdout.fileno()
    idle_after_exit = 0
    try:
        while True:
            try:
                ready, _, _ = select.select([fd], [], [], 0.1)
            except (ValueError, OSError):
                break  # fd already closed  <- the problem is here!
            if ready:
                chunk = os.read(fd, 4096)
                if not chunk:
                    break
                output_chunks.append(decoder.decode(chunk))
                idle_after_exit = 0
            elif proc.poll() is not None:
                idle_after_exit += 1
                if idle_after_exit >= 3:
                    break
    finally:
        # Flush buffered bytes
        ...

See that except (ValueError, OSError): break?

Step 5: Verify how select.select() behaves on Windows

import select
import subprocess

# Same Git Bash path as in step 2
bash_path = r'C:\Program Files\Git\usr\bin\bash.EXE'

proc = subprocess.Popen(
    [bash_path, '-c', 'echo hello'],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
fd = proc.stdout.fileno()
try:
    ready, _, _ = select.select([fd], [], [], 1.0)
    print('select works! ready:', ready)
except Exception as e:
    print('select FAILED:', type(e).__name__, e)

Output:

select FAILED: OSError [WinError 10093] Either the application has not called WSAStartup, or WSAStartup failed.

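Root cause found!

On Windows, Python's select.select() supports only sockets, not pipe file descriptors. This is a Windows limitation: the select() function on Windows comes from the WinSock library, so it works only on network sockets and not on ordinary file descriptors.

So on Windows, _drain() does this:

1. Calls select.select(), which raises OSError
2. Catches the exception and breaks out of the loop
3. Never reads the output

The command really did run and its output is sitting in the pipe, but nobody reads it.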
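
The platform difference can be checked without HermesAgent or Git Bash. The sketch below probes one socket and one pipe with select.select() and reports which of them the call accepts. The helper names select_accepts and probe are made up for this example. On Linux and macOS both entries should come back True; on Windows the pipe entry is expected to be False, in line with the behavior described above.

```python
import os
import select
import socket


def select_accepts(fd: int) -> bool:
    """Return True if select.select() accepts this descriptor on the current OS."""
    try:
        select.select([fd], [], [], 0)  # timeout 0: poll once, never block
    except (OSError, ValueError):
        return False
    return True


def probe() -> dict:
    """Check one socket and one pipe, and report which ones select() accepts."""
    sock_a, sock_b = socket.socketpair()
    read_fd, write_fd = os.pipe()
    try:
        return {
            "socket": select_accepts(sock_a.fileno()),
            "pipe": select_accepts(read_fd),
        }
    finally:
        # Release every descriptor opened for the probe.
        sock_a.close()
        sock_b.close()
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(probe())
```

Because the failure is a property of the descriptor type and not of the child process, any code path that hands a subprocess pipe to select.select() is affected in the same way.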

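Bug 1 fix: use blocking reads instead of select on Windows

The idea is simple: since select.select() cannot be used on pipes on Windows, call the blocking os.read() directly. The parent thread has a timeout and kills the process when it expires, which closes the process's end of the pipe, so the blocking read cannot hang forever.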

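def _drain():
    # Needs "import platform" at module level (see "Modified files" below).
    fd = proc.stdout.fileno()
    idle_after_exit = 0
    try:
        while True:
            try:
                ready, _, _ = select.select([fd], [], [], 0.1)
            except (ValueError, OSError):
                # On Windows, select() does not support pipe fds.
                # Fall back to a blocking read (the parent thread's timeout
                # kills the process, so this cannot block forever).
                if platform.system() == "Windows":
                    try:
                        chunk = os.read(fd, 4096)
                    except (ValueError, OSError):
                        break
                    if not chunk:
                        break
                    output_chunks.append(decoder.decode(chunk))
                    continue
                break  # not Windows: the fd is already closed
            if ready:
                # Normal read logic, unchanged from the original
                chunk = os.read(fd, 4096)
                if not chunk:
                    break
                output_chunks.append(decoder.decode(chunk))
                idle_after_exit = 0
            elif proc.poll() is not None:
                idle_after_exit += 1
                if idle_after_exit >= 3:
                    break
    finally:
        # Flush buffered bytes
        ...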

The core change is in one place: when select.select() raises OSError on Windows, the loop no longer breaks immediately and instead keeps reading the output with the blocking os.read().
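
The reasoning behind "a blocking read is safe as long as someone enforces a timeout" can be shown in isolation. The following is a standalone sketch, not the HermesAgent patch: the function run_with_timeout and its return shape are assumptions for this example. A worker thread does blocking os.read() calls with no select at all, while the calling thread enforces the deadline and kills the child when it expires.

```python
import os
import subprocess
import sys
import threading


def run_with_timeout(cmd: list, timeout: float) -> dict:
    """Run cmd, collect merged stdout/stderr, and kill it after `timeout` seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    fd = proc.stdout.fileno()
    chunks = []

    def drain() -> None:
        while True:
            try:
                chunk = os.read(fd, 4096)  # blocks until data arrives or EOF
            except OSError:
                break
            if not chunk:  # EOF: every write end of the pipe has been closed
                break
            chunks.append(chunk)

    reader = threading.Thread(target=drain, daemon=True)
    reader.start()

    timed_out = False
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        timed_out = True
        proc.kill()  # the child's pipe handle closes with it, so os.read() sees EOF
        proc.wait()

    reader.join(timeout=1.0)  # bounded, in case a grandchild still holds the pipe open
    if not reader.is_alive():
        proc.stdout.close()
    return {
        "output": b"".join(chunks).decode("utf-8", errors="replace"),
        "returncode": proc.returncode,
        "timed_out": timed_out,
    }


if __name__ == "__main__":
    print(run_with_timeout([sys.executable, "-c", "print('hello')"], timeout=10))
```

One caveat with this design: if the killed process has spawned children that inherited the pipe, the read only ends once they exit too, which is why the join above is bounded.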

I thought it was fixed, and then it blew up again

With the select.select() problem fixed, I ran the test full of confidence:

env = LocalEnvironment(cwd=r'D:\code\HermesAgent', timeout=10)
result = env.execute('echo hello')

Result:

NotADirectoryError: [WinError 267] The directory name is invalid.

Wait, it was working a moment ago. Why can't it even run a command now?

A new problem: wrong CWD path format

Adding some debug output showed:

env = LocalEnvironment(cwd=r'D:\code\HermesAgent', timeout=10)
print('cwd:', env.cwd)  # /d/code/HermesAgent

self.cwd changed from D:\code\HermesAgent to /d/code/HermesAgent!

This is a "side effect" of init_session. init_session runs pwd -P and writes the result to a temporary file, and _update_cwd then reads that file and updates self.cwd.

In Git Bash, pwd -P returns a Unix-style path:

$ pwd -P
/d/code/HermesAgent

But subprocess.Popen(cwd=...) on Windows does not understand /d/code/HermesAgent; it needs D:\code\HermesAgent.

So the execution flow is:

1. __init__ sets self.cwd = 'D:\code\HermesAgent' (correct)
2. init_session invokes bash, which runs pwd -P and writes /d/code/HermesAgent
3. _update_cwd reads the temporary file and updates self.cwd to /d/code/HermesAgent (wrong!)
4. On the next execute, subprocess.Popen(cwd='/d/code/HermesAgent') raises an error
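
The final step of this chain can be reproduced on any machine: when the cwd argument names something the host OS cannot use as a directory, the call fails before the child process ever starts. The helper try_cwd below is a hypothetical name for this sketch. On Windows the traceback in this post shows NotADirectoryError; on POSIX systems the same mistake typically surfaces as FileNotFoundError. Both are subclasses of OSError, which is what the sketch catches.

```python
import os
import subprocess
import sys
import tempfile
from typing import Optional


def try_cwd(path: str) -> Optional[str]:
    """Start a trivial child process in `path`; return the error name, or None on success."""
    try:
        subprocess.run([sys.executable, "-c", "pass"], cwd=path, check=True)
    except OSError as exc:
        # Raised before the child runs when the OS cannot use `path` as a directory.
        return type(exc).__name__
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as real_dir:
        print("real directory:", try_cwd(real_dir))
        print("unresolvable  :", try_cwd(os.path.join(real_dir, "no-such-dir")))
```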

Bug 2 fix: convert Git Bash paths to Windows paths automatically

_update_cwd needs to convert Git Bash's /d/code/... format into the Windows D:\code\... format:

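@staticmethod
def _git_bash_to_win_path(path: str) -> str:
    """Convert a Git Bash path (/d/code/...) to a Windows path (D:\\code\\...)."""
    import re
    # A single drive letter, then either end of string or the rest of the path
    m = re.match(r"^/([a-zA-Z])(/.*)?$", path)
    if m:
        drive = m.group(1).upper()
        # A bare "/d" maps to the drive root "D:\", not the drive-relative "D:"
        rest = (m.group(2) or "/").replace("/", "\\")
        return f"{drive}:{rest}"
    return path  # not a drive-style path (e.g. /tmp/...): leave unchanged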

Then call it in _update_cwd:

def _update_cwd(self, result: dict):
    try:
        # Use a context manager so the file handle is closed promptly
        with open(self._cwd_file) as f:
            cwd_path = f.read().strip()
        if cwd_path:
            if _IS_WINDOWS:
                cwd_path = self._git_bash_to_win_path(cwd_path)
            self.cwd = cwd_path
    except OSError:  # FileNotFoundError is a subclass of OSError
        pass

    self._extract_cwd_from_output(result)

    # _extract_cwd_from_output may also read a Git Bash path from the marker
    if _IS_WINDOWS:
        self.cwd = self._git_bash_to_win_path(self.cwd)

Note the last three lines: _extract_cwd_from_output also parses a path from the CWD marker in the command output and sets self.cwd, so the conversion has to be applied once more after it.
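
The conversion rule is easy to exercise on its own, on any OS, since it is pure string manipulation. The standalone sketch below uses the same regex as the method above, written as a plain function named git_bash_to_win_path for this example. Mapping a bare /d to the drive root D:\ is a choice made in this sketch. Paths that do not start with a single-letter component, such as /tmp/..., are deliberately left alone.

```python
import re

# "/" + one drive letter, followed by either end of string or "/rest/of/path"
_DRIVE_PATH = re.compile(r"^/([a-zA-Z])(/.*)?$")


def git_bash_to_win_path(path: str) -> str:
    """Convert /d/code/... to D:\\code\\...; return other paths unchanged."""
    m = _DRIVE_PATH.match(path)
    if not m:
        return path
    drive = m.group(1).upper()
    rest = (m.group(2) or "/").replace("/", "\\")
    return f"{drive}:{rest}"


if __name__ == "__main__":
    for sample in ["/d/code/HermesAgent", "/d", "/tmp/snapshot", "D:\\code"]:
        print(f"{sample!r} -> {git_bash_to_win_path(sample)!r}")
```

The function is idempotent: a path that is already in Windows form does not match the regex and passes through untouched, which is what makes it safe to apply twice in _update_cwd. One limitation to keep in mind is that a real single-letter directory at the Git Bash root would be mistaken for a drive.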

Verifying the fix

With both bugs fixed, the terminal tool finally works:

env = LocalEnvironment(cwd=r'D:\code\HermesAgent', timeout=10)

r1 = env.execute('pwd')
print('pwd:', r1['output'].strip())
# pwd: /d/code/HermesAgent

r2 = env.execute('echo hello world')
print('echo:', r2['output'].strip())
# echo: hello world

r3 = env.execute('ls pyproject.toml')
print('ls:', r3['output'].strip())
# ls: pyproject.toml

r4 = env.execute('python --version')
print('python:', r4['output'].strip())
# python: Python 3.13.13

r5 = env.execute('cd hermes_workspace && pwd')
print('cd:', env.cwd)
# cd: D:\code\HermesAgent\hermes_workspace

Command output is correct, and CWD tracking is correct as well (automatically converted to a Windows path).

A real test with hermes:

$ python -m hermes_cli.main -z "What is the absolute path of the current workspace? Confirm it with the pwd command"
The absolute path of the current workspace is: /d/code/HermesAgent

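The Agent can finally run terminal commands and get their output.

Technical summary

Bug 1: the Windows limitation of select.select()

Item	Description
Root cause	Python's `select.select()` on Windows supports only sockets, not pipe fds
Symptom	`_drain()` catches the `OSError` and exits immediately, so the output is lost
Impact	Every terminal command returns empty output; the Agent cannot perform any shell operation
Fix	On Windows, fall back to a blocking `os.read()`, relying on the parent thread's timeout to keep it from hanging forever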

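Bug 2: incompatible Git Bash path format

Item	Description
Root cause	Git Bash's `pwd -P` returns paths in `/d/code/...` form, which `subprocess.Popen` on Windows does not understand
Symptom	After `init_session`, `self.cwd` becomes a Git Bash path and later commands raise `NotADirectoryError`
Impact	Once `self.cwd` has been overwritten, every subsequent `execute` call fails
Fix	In `_update_cwd`, convert `/d/...` to `D:\...` with a regex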

Modified files

tools/environments/base.py: add import platform, modify the _drain() function
tools/environments/local.py: add the _git_bash_to_win_path() method, modify the _update_cwd() method

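Lessons learned

1. select.select() is half-crippled on Windows. It supports only sockets, not pipe or file descriptors. If your Python code polls pipe output with select, Windows needs special handling.
2. Git Bash's path format is not compatible with Windows. Git Bash writes D:\code\... as /d/code/..., so convert paths whenever a call crosses between the two environments.
3. "It runs" does not mean "it works". HermesAgent starts fine on Windows and API calls work, but the terminal tool, a core feature, was broken. A quick check like hermes -z "1+1=?" would never reveal this. The problem only surfaces when the Agent is actually asked to carry out a task.
4. Source-level debugging matters. Had I only used hermes.exe, I would have had no idea what was going on inside. Running from source (python -m hermes_cli.main) is what made it possible to set breakpoints, add logging, and pinpoint the problem.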
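
The "it runs" versus "it works" lesson can be turned into a small smoke test that would have caught Bug 1 right after installation. The sketch below is an illustration, not part of HermesAgent: check_terminal_tool is a made-up helper. It relies only on the result shape shown earlier in this post, a dict with output and returncode keys.

```python
from typing import Callable


def check_terminal_tool(execute: Callable[[str], dict]) -> list:
    """Run a marker command through `execute` and describe anything that looks wrong."""
    marker = "hermes-smoke-test"
    result = execute(f"echo {marker}")
    problems = []
    if result.get("returncode") != 0:
        problems.append(f"unexpected return code: {result.get('returncode')!r}")
    # A zero return code is not enough: the output itself must come back.
    if marker not in result.get("output", ""):
        problems.append(f"marker missing from output: {result.get('output', '')!r}")
    return problems


if __name__ == "__main__":
    # Fake executor reproducing the Bug 1 symptom: success code, empty output.
    def broken(command: str) -> dict:
        return {"output": "", "returncode": 0}

    print(check_terminal_tool(broken))
```

Against the real tool, the call would be check_terminal_tool(env.execute), with env being the LocalEnvironment instance from the earlier examples; an empty list means the round trip works.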
