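Symptoms
The PVE web panel suddenly stopped loading and the browser just kept spinning. SSH login still worked, and the virtual machines on the host were still reachable.
First reaction: check the pveproxy process.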
bash
systemctl status pveproxy
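The command produced no output at all: no error, no timeout, no message. It simply hung.
That alone shows the problem is not just a dead pveproxy service; systemd itself is being held up.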
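When a command may hang like this, it helps to put a time limit on it so the SSH session stays usable. Below is a minimal sketch, assuming GNU coreutils `timeout` is available; `probe` is a hypothetical helper name, not something PVE or systemd ships. If the command is stuck in uninterruptible sleep rather than waiting on systemd, `timeout` may not be able to stop it either.

```shell
#!/bin/sh
# probe SECONDS CMD [ARGS...]
# Runs CMD under a time limit and prints one word describing the outcome.
# "probe" is a hypothetical helper, not part of PVE or systemd.
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case $rc in
        0)   echo ok ;;
        124) echo hung ;;            # GNU timeout exits with 124 when the limit is hit
        *)   echo "failed($rc)" ;;   # the command finished on its own with an error
    esac
}

# Example (on the affected host):
#   probe 5 systemctl status pveproxy
#   probe 5 systemctl is-system-running
```

A result of `hung` for several unrelated `systemctl` calls points at systemd or the kernel rather than at one service.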
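Investigating processes and logs
When systemctl itself hangs, there are not many tools left to work with:
journalctl
: view the logs of systemd services
dmesg
: check for kernel anomalies, OOM kills, and call traces
tail -f /var/log/syslog
: watch for errors in real time
ps aux
: if it still works, find the hung processes
top / htop
: check whether CPU or IO usage is abnormal
cat /proc/<pid>/stack
: if a specific process is stuck, inspect its kernel stack directly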
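The last two items in that list can be combined: find the processes in uninterruptible sleep (state `D`) and dump their kernel stacks. This is a rough sketch, not a definitive procedure. `filter_dstate` and `dump_stacks` are hypothetical helper names, reading `/proc/<pid>/stack` normally requires root, and the `ps` column selection is my own choice (it asks for `comm` instead of the full command line, on the assumption that this touches less of each process's state).

```shell
#!/bin/sh
# filter_dstate: reads `ps -eo pid,stat,wchan:32,comm` output on stdin and
# prints the PIDs whose state starts with D (uninterruptible sleep).
# Hypothetical helper for illustration.
filter_dstate() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1 }'
}

# dump_stacks: reads PIDs on stdin and prints each one's kernel stack.
# Needs root; a process may exit between the two steps, hence the fallback.
dump_stacks() {
    while read -r pid; do
        echo "== PID $pid =="
        cat "/proc/$pid/stack" 2>/dev/null || echo "(stack not readable)"
    done
}

# Example (as root):
#   ps -eo pid,stat,wchan:32,comm | filter_dstate | dump_stacks
```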
During the investigation, netstat -anp hung as well, while netstat -an still worked.
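A plausible explanation, which I did not verify on this host, is that `-p` has to walk `/proc/<pid>/` for every process to map sockets to programs and blocks on a stuck one, whereas `-an` only reads the socket tables. Since `netstat -an` still answers, it can at least confirm whether pveproxy is still listening. The sketch below assumes the default PVE web port 8006; `listening_on` is a hypothetical helper name.

```shell
#!/bin/sh
# listening_on PORT: reads `netstat -an` output on stdin and prints "yes"
# if some TCP socket is in LISTEN state on that port, otherwise "no".
# Hypothetical helper for illustration.
listening_on() {
    awk -v suffix=":$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            # compare the tail of the local address column with ":PORT"
            n = length($4) - length(suffix) + 1
            if (n > 0 && substr($4, n) == suffix) found = 1
        }
        END { print (found ? "yes" : "no") }'
}

# Example:
#   netstat -an | listening_on 8006
```

A `yes` here while the browser keeps loading suggests the socket is still open but the process behind it is no longer serving requests.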
Logs:
Oct 17 18:36:43 dhclient[889]: DHCPDISCOVER on vmbr1 to 255.255.255.255 port 67 interval 9
Oct 17 18:36:52 dhclient[889]: DHCPDISCOVER on vmbr1 to 255.255.255.255 port 67 interval 13
Oct 17 18:37:05 dhclient[889]: DHCPDISCOVER on vmbr1 to 255.255.255.255 port 67 interval 14
Oct 17 18:37:19 dhclient[889]: DHCPDISCOVER on vmbr1 to 255.255.255.255 port 67 interval 15
Oct 17 18:37:34 dhclient[889]: DHCPDISCOVER on vmbr1 to 255.255.255.255 port 67 interval 2
Oct 17 18:37:36 dhclient[889]: No DHCPOFFERS received.
Oct 17 18:37:36 dhclient[889]: No working leases in persistent database - sleeping.
Oct 17 18:40:30 NetworkManager[704]: <info> [1760697630.7629] device (wlp91s0): set-hw-addr: set MAC address to A6:BF:29:46:77:35 (scanning)
Oct 17 18:40:30 NetworkManager[704]: <info> [1760697630.8327] device (wlp91s0): supplicant interface state: inactive -> interface_disabled
Oct 17 18:40:30 NetworkManager[704]: <info> [1760697630.8328] device (p2p-dev-wlp91s0): supplicant management interface state: inactive -> interface_disabled
Oct 17 18:40:30 NetworkManager[704]: <info> [1760697630.8425] device (wlp91s0): supplicant interface state: interface_disabled -> inactive
Oct 17 18:40:30 NetworkManager[704]: <info> [1760697630.8426] device (p2p-dev-wlp91s0): supplicant management interface state: interface_disabled -> inactive
Oct 17 18:42:38 kernel: general protection fault, probably for non-canonical address 0xa9f8096a4a3b2b1c: 0000 [#1] PREEMPT SMP NOPTI
Oct 17 18:42:38 kernel: CPU: 2 PID: 4107538 Comm: nft Tainted: P OE 6.8.12-9-pve #1
Oct 17 18:42:38 kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
Oct 17 18:42:38 kernel: RIP: 0010:vma_interval_tree_insert+0x37/0xd0
Oct 17 18:42:38 kernel: Code: 4c 8d 5f 40 48 8b 47 08 48 2b 07 48 c1 e8 0c 4d 8d 44 01 ff 48 8b 06 48 89 e5 48 85 c0 74 79 41 ba 01 00 00 00 eb 03 48 89 d0 <4c>>
Oct 17 18:42:38 kernel: RSP: 0018:ffffb4cbcc7fba90 EFLAGS: 00010282
Oct 17 18:42:38 kernel: RAX: a9f8096a4a3b2b04 RBX: ffffb4cbcc7fbad8 RCX: ffff9ec0b69c5d88
Oct 17 18:42:38 kernel: RDX: a9f8096a4a3b2b04 RSI: ffff9ec099ebb5c8 RDI: ffff9ec4198cc3c0
Oct 17 18:42:38 kernel: RBP: ffffb4cbcc7fba90 R08: 0000000000000015 R09: 0000000000000004
Oct 17 18:42:38 kernel: R10: 0000000000000000 R11: ffff9ec4198cc400 R12: ffffb4cbcc7fbca8
Oct 17 18:42:38 kernel: R13: ffff9ec1102a9600 R14: ffff9ec4198ccfc0 R15: 0000000000000000
Oct 17 18:42:38 kernel: FS: 0000000000000000(0000) GS:ffff9ec80f500000(0000) knlGS:0000000000000000
Oct 17 18:42:38 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 17 18:42:38 kernel: CR2: 000079616e5d7200 CR3: 000000049981c000 CR4: 0000000000f52ef0
Oct 17 18:42:38 kernel: PKRU: 55555554
Oct 17 18:42:38 kernel: Call Trace:
Oct 17 18:42:38 kernel: <TASK>
Oct 17 18:42:38 kernel: ? show_regs+0x6d/0x80
Oct 17 18:42:38 kernel: ? die_addr+0x37/0xa0
Oct 17 18:42:38 kernel: ? exc_general_protection+0x1db/0x480
Oct 17 18:42:38 kernel: ? asm_exc_general_protection+0x27/0x30
Oct 17 18:42:38 kernel: ? vma_interval_tree_insert+0x37/0xd0
Oct 17 18:42:38 kernel: vma_complete+0x45/0x2c0
Oct 17 18:42:38 kernel: __split_vma+0x270/0x2e0
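The trace shows a general protection fault in the nft process while the kernel was manipulating the VMA (virtual memory area) interval tree. This kind of crash is usually triggered by a kernel bug, a kernel module (third-party driver), or memory corruption; it is not simply a pveproxy problem.
I could not determine the specific cause; most likely the kernel itself had failed.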
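One hint worth noting is the `Tainted: P OE` field in the oops header. According to the kernel's tainted-kernels documentation, `P` means a proprietary module was loaded, `O` an out-of-tree module, and `E` an unsigned module, which fits the third-party driver possibility without proving it. The sketch below decodes those bits from the numeric value in `/proc/sys/kernel/tainted`. `decode_taint` is a hypothetical helper and only covers the four flags relevant here (bit 0 `P`, bit 7 `D`, bit 12 `O`, bit 13 `E`).

```shell
#!/bin/sh
# decode_taint VALUE: prints the taint letters set in VALUE, limited to
#   bit 0  P  proprietary module loaded
#   bit 7  D  kernel died recently (oops or BUG)
#   bit 12 O  out-of-tree module loaded
#   bit 13 E  unsigned module loaded
# Hypothetical helper; the kernel defines more bits than are handled here.
decode_taint() {
    val=$1
    out=""
    for pair in 0:P 7:D 12:O 13:E; do
        bit=${pair%%:*}
        flag=${pair##*:}
        if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$flag"
        fi
    done
    echo "${out:-none}"
}

# Example:
#   decode_taint "$(cat /proc/sys/kernel/tainted)"
```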
Recovery
I tried to reboot with reboot, but the command would not execute.
I then used the kernel's sysrq mechanism to force a reboot:
bash
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
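Note that this is an unbuffered, forced action: it bypasses systemd entirely, so services are not stopped and filesystems are not synced or unmounted first.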
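A slightly gentler variant, which is not what I ran here, is to ask the kernel to sync (`s`) and remount filesystems read-only (`u`) before the reboot (`b`). On a kernel that has already oopsed these steps may not complete, so treat this as a best-effort sketch. `sysrq_sequence` is a hypothetical helper; the trigger path is a required argument so the function can be tried against a scratch file first.

```shell
#!/bin/sh
# sysrq_sequence TRIGGER_FILE [DELAY_SECONDS]
# Writes the sysrq keys s (sync), u (remount read-only), b (reboot) to
# TRIGGER_FILE, pausing DELAY_SECONDS (default 2) after each key.
# Hypothetical helper; with /proc/sysrq-trigger it reboots the machine.
sysrq_sequence() {
    trigger=${1:-}
    delay=${2:-2}
    if [ -z "$trigger" ]; then
        echo "usage: sysrq_sequence TRIGGER_FILE [DELAY_SECONDS]" >&2
        return 2
    fi
    for key in s u b; do
        echo "$key" > "$trigger" || return 1
        sleep "$delay"
    done
}

# Dry run against a scratch file:
#   sysrq_sequence /tmp/sysrq-test 0
# Real use (as root, reboots immediately after the last key):
#   sysrq_sequence /proc/sysrq-trigger
```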
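After the reboot, it is worth checking the filesystem and reviewing the logs from the previous boot for the cause of the crash or reboot: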
journalctl -b -1 | less
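Paging through the whole previous boot is slow, so a small filter helps to jump to the crash markers. This is a sketch under two assumptions: the journal is persistent (otherwise `-b -1` has nothing to show), and the patterns are ones I picked from the oops above plus a few common kernel crash strings. `oops_summary` is a hypothetical helper.

```shell
#!/bin/sh
# oops_summary: reads kernel log lines on stdin and keeps only the lines
# that usually head a kernel crash report. Hypothetical helper; the pattern
# list is illustrative, not exhaustive.
oops_summary() {
    grep -E 'general protection fault|BUG:|Oops:|Kernel panic|RIP: |Comm: '
}

# Example:
#   journalctl -k -b -1 --no-pager | oops_summary
```

For the incident above this would surface the general protection fault line, the `Comm: nft` line, and the `RIP:` line pointing at `vma_interval_tree_insert`.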