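1: Concept
Android Watchdog is the mechanism Android uses to monitor the running state of critical system services. Its core goal is to detect when a system service has been unresponsive for a long time because of a deadlock, blocking, or another fault, and to trigger system recovery (such as a restart) when necessary.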
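2: Core functions
2.1 Service state monitoring
Watchdog periodically checks whether critical system services (such as ActivityManager and WindowManager) are still responding, so that a blocked service does not hang the whole system.
2.2 Timeout handling
If a service fails to complete its "heartbeat" check (monitor) within the allowed time, Watchdog treats it as timed out and starts the follow-up handling.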
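The heartbeat rule in 2.2 can be illustrated with a minimal sketch in plain Java. The class `HeartbeatSketch` and its methods are hypothetical names chosen for illustration and are not part of the Android framework; timestamps are passed in explicitly so the rule is easy to follow.

```java
/** Illustrative only: a hypothetical heartbeat tracker, not the framework's Watchdog. */
class HeartbeatSketch {
    private final long timeoutMillis;
    private long lastBeatMillis;

    HeartbeatSketch(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastBeatMillis = startMillis;
    }

    /** Called by the monitored service to report that it is still responsive. */
    synchronized void beat(long nowMillis) {
        lastBeatMillis = nowMillis;
    }

    /** True once no heartbeat has arrived for at least timeoutMillis. */
    synchronized boolean isTimedOut(long nowMillis) {
        return nowMillis - lastBeatMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatSketch hb = new HeartbeatSketch(60_000, 0);
        hb.beat(10_000);
        System.out.println(hb.isTimedOut(30_000)); // false: only 20s since the last beat
        System.out.println(hb.isTimedOut(70_000)); // true: 60s since the last beat
    }
}
```

The real Watchdog does not have services push heartbeats; it calls into them and waits for the call to return. The timeout comparison is the same idea.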
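2.3 Log collection and debugging
When a timeout occurs, Watchdog automatically collects stack information (including the call stacks of all threads) to help locate the root cause.
2.4 System recovery
On a severe timeout, Watchdog may force the system process (system_server) or the whole device to restart, so the user is not left facing an unresponsive screen for a long time.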
3: Implementation logic
3.1 Startup
Like the system services, Watchdog is started by SystemServer. The startup logic is as follows:
frameworks/base/services/java/com/android/server/SystemServer.java
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    ...
    // Create the Watchdog object and start its thread
    t.traceBegin("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    mDumper.addDumpable(watchdog);
    t.traceEnd();
    ...
    // Initialize the Watchdog
    t.traceBegin("InitWatchdog");
    watchdog.init(mSystemContext, mActivityManagerService);
    t.traceEnd();
    ...
}
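3.2 Initialization
- Watchdog runs on the "watchdog" thread.
- "watchdog.monitor" is a background thread that handles the tasks sent to it through a Handler.
- HandlerChecker wraps a Handler bound to the ServiceThread's Looper. The Handler posts messages and tasks to the ServiceThread, and whether they run in time shows whether that thread is still responsive.
- Critical system threads such as "monitor thread", "foreground thread", "main thread", "ui thread", "i/o thread", "display thread", "animation thread", and "surface animation thread" are added to monitoring.
- The Binder thread monitor is initialized.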
frameworks/base/services/core/java/com/android/server/Watchdog.java
private Watchdog() {
    // Create a thread named "watchdog"
    mThread = new Thread(this::run, "watchdog");
    ...
    // Start a background thread named "watchdog.monitor" that handles tasks sent through a Handler
    ServiceThread t = new ServiceThread("watchdog.monitor",
            android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
    t.start();
    // Create a Handler bound to the ServiceThread's Looper. It posts messages and tasks
    // to the ServiceThread, which is how the thread's responsiveness is checked
    mMonitorChecker = new HandlerChecker(new Handler(t.getLooper()), "monitor thread", mLock);
    // Monitor critical system threads: "monitor thread", "foreground thread", "main thread",
    // "ui thread", "i/o thread", "display thread", "animation thread", "surface animation thread", etc.
    mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
    ...
    // Initialize the Binder thread monitor
    addMonitor(new BinderThreadMonitor());
    ...
}

public void start() {
    // Start the "watchdog" thread
    mThread.start();
}

public void init(Context context, ActivityManagerService activity) {
    mActivity = activity;
    // Register the receiver for the reboot broadcast
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
    ...
}
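The idea behind HandlerChecker (post a task to the monitored thread and see whether it gets to run) can be sketched outside Android. The sketch below is an analogy under stated assumptions: a single-threaded `ExecutorService` stands in for the Looper/Handler pair, and `HandlerCheckerSketch`, `respondsWithin`, and `demo` are hypothetical names, not framework APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only: probes a worker thread by posting a no-op task to it. */
class HandlerCheckerSketch {

    /** Returns true if a no-op task posted to the worker runs within timeoutMillis. */
    static boolean respondsWithin(ExecutorService worker, long timeoutMillis) {
        Future<?> probe = worker.submit(() -> { });
        try {
            probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            probe.cancel(false); // the worker is busy; give up on this probe
            return false;
        } catch (ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Probes a healthy worker, a blocked worker, and the worker after it is released. */
    static String demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            boolean healthy = respondsWithin(worker, 5_000);
            // Occupy the worker's only thread until the latch is released.
            worker.submit(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            boolean whileBlocked = respondsWithin(worker, 200);
            release.countDown();
            boolean recovered = respondsWithin(worker, 5_000);
            return healthy + "," + whileBlocked + "," + recovered;
        } finally {
            release.countDown();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the worker executes tasks in order, a probe queued behind a stuck task cannot complete, which is the same reason a blocked Looper thread never runs the checker posted to it.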
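3.3 Periodic checking
- The check interval is 15s and the maximum wait (the timeout) is 60s.
- Different states represent different waiting times: WAITING (waited less than 15s), WAITED_UNTIL_PRE_WATCHDOG (waited 15-60s), OVERDUE (waited more than 60s), and COMPLETED (finished).
- The first time a wait is found in the 15-60s band (with a 15s check interval this happens at 15-30s), stack information is collected but nothing is restarted. Once the wait exceeds 60s, stack information is collected and the restart happens immediately.
- Detection method: the monitor function of each monitored service is called on the monitored thread, and whether and when it returns determines the timeout state.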
private void run() {
    boolean waitedHalf = false;
    while (true) {
        List<HandlerChecker> blockedCheckers = Collections.emptyList();
        ...
        boolean doWaitedPreDump = false;
        // Watchdog timeout (60s)
        final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
        // Watchdog check interval (15s)
        final long checkIntervalMillis = watchdogTimeoutMillis / PRE_WATCHDOG_TIMEOUT_RATIO;
        ...
        synchronized (mLock) {
            long sfHangTime;
            long timeout = checkIntervalMillis;
            ...
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                // Move the monitors in mMonitorQueue into mMonitors and clear mMonitorQueue
                hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                        .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
            }
            ...
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                ...
                try {
                    // Wait for one check interval (15s)
                    mLock.wait(timeout);
                    // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                ...
                timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
            }
            ...
            if (sfHangTime > TIME_SF_WAIT * 2) {
                ...
            } else {
                // Map each HandlerChecker's waiting time to a state: WAITING (less than 15s),
                // WAITED_UNTIL_PRE_WATCHDOG (15-60s), OVERDUE (more than 60s), COMPLETED (finished)
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitored thread answered in time: reset waitedHalf and start the next loop
                    ...
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // No answer yet, but less than 15s have passed: keep waiting, start the next loop
                    continue;
                } else if (waitState == WAITED_UNTIL_PRE_WATCHDOG) {
                    // No answer after 15-60s
                    if (!waitedHalf) {
                        // First time this is seen (15-30s): prepare a pre-watchdog dump
                        Slog.i(TAG, "WAITED_UNTIL_PRE_WATCHDOG");
                        waitedHalf = true;
                        // Collect the blocked checkers
                        blockedCheckers = getCheckersWithStateLocked(WAITED_UNTIL_PRE_WATCHDOG);
                        subject = describeCheckersLocked(blockedCheckers);
                        pids = new ArrayList<>(mInterestingJavaPids);
                        doWaitedPreDump = true;
                    } else {
                        // Already dumped once (30-60s): keep waiting
                        continue;
                    }
                } else {
                    // No answer within 60s
                    // something is overdue!
                    blockedCheckers = getCheckersWithStateLocked(OVERDUE);
                    subject = describeCheckersLocked(blockedCheckers);
                    allowRestart = mAllowRestart;
                    pids = new ArrayList<>(mInterestingJavaPids);
                }
            }
        } // END synchronized (mLock)
        // Write the stacks to the log
        logWatchog(doWaitedPreDump, subject, pids);
        if (doWaitedPreDump) {
            // Pre-watchdog dump only (15-30s): start the next loop without restarting
            continue;
        }
        ...
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            // Diagnose the blocked checkers and log the details
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            ...
            exceptionHang.WDTMatterJava(330);
            if (mSfHang) {
                ...
            } else {
                // Kill the current process, i.e. the process that hosts Watchdog (system_server)
                Process.killProcess(Process.myPid());
            }
            // Terminate the running JVM
            System.exit(10);
        }
        waitedHalf = false;
    }
}

public static class HandlerChecker implements Runnable {
    public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
        mWaitMaxMillis = handlerCheckerTimeoutMillis;
        if (mCompleted) {
            // Move the monitors in mMonitorQueue into mMonitors and clear mMonitorQueue
            mMonitors.addAll(mMonitorQueue);
            mMonitorQueue.clear();
        }
        ...
    }

    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (mLock) {
                mCurrentMonitor = mMonitors.get(i);
            }
            // Probe the monitored service; if it is stuck, this call blocks as well
            mCurrentMonitor.monitor();
        }
        synchronized (mLock) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}
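The four wait states can be summarized as a small classification function. This is a minimal sketch built from the 15s and 60s thresholds described in this section, not a copy of the framework code. `WaitStateSketch`, `classify`, and `worst` are hypothetical names, and taking the most severe state across all checkers is an assumption about how the per-checker results are combined.

```java
/** Illustrative only: maps a checker's waiting time to the states described in 3.3. */
class WaitStateSketch {
    /** Declared from least to most severe so ordinal() can be compared. */
    enum WaitState { COMPLETED, WAITING, WAITED_UNTIL_PRE_WATCHDOG, OVERDUE }

    static final long TIMEOUT_MILLIS = 60_000;                       // maximum wait
    static final long CHECK_INTERVAL_MILLIS = TIMEOUT_MILLIS / 4;    // 15s check interval

    /** State of a single checker given whether it finished and how long it has waited. */
    static WaitState classify(boolean completed, long waitedMillis) {
        if (completed) {
            return WaitState.COMPLETED;
        }
        if (waitedMillis < CHECK_INTERVAL_MILLIS) {
            return WaitState.WAITING;
        }
        if (waitedMillis < TIMEOUT_MILLIS) {
            return WaitState.WAITED_UNTIL_PRE_WATCHDOG;
        }
        return WaitState.OVERDUE;
    }

    /** Assumed aggregation: the overall state is the most severe state of any checker. */
    static WaitState worst(WaitState... states) {
        WaitState result = WaitState.COMPLETED;
        for (WaitState s : states) {
            if (s.ordinal() > result.ordinal()) {
                result = s;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify(false, 5_000));   // WAITING
        System.out.println(classify(false, 20_000));  // WAITED_UNTIL_PRE_WATCHDOG
        System.out.println(classify(false, 61_000));  // OVERDUE
        System.out.println(worst(classify(true, 0), classify(false, 20_000)));
        // WAITED_UNTIL_PRE_WATCHDOG
    }
}
```

With this ordering, one slow checker is enough to move the whole round into the pre-watchdog or overdue branch even when every other checker completed.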
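4: Summary
4.1 How to add a specific thread to Watchdog monitoring
Answer: follow the approach used by AMS (ActivityManagerService).
4.1.1 Implement the Watchdog.Monitor interface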
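public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback, ActivityManagerGlobalLock {
    public void monitor() {
        // Acquire and immediately release the AMS lock. If another thread
        // holds the lock and never releases it, this call never returns
        synchronized (this) { }
    }
}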
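Why an empty synchronized block works as a health probe can be shown in plain Java. In the sketch below, `MonitorProbeSketch`, its nested `Monitor` interface, and `FakeService` are hypothetical stand-ins for `Watchdog.Monitor` and a real service; the calling thread plays the role of a thread that holds the service lock for too long.

```java
/** Illustrative only: shows that monitor() cannot return while the service lock is held. */
class MonitorProbeSketch {
    /** Hypothetical stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    /** A toy service whose state is guarded by its own intrinsic lock. */
    static class FakeService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { } // returns only once the lock can be acquired
        }
    }

    /**
     * Holds the service lock for waitMillis while a probe thread calls monitor().
     * Returns true if the probe was still blocked at the end of that wait.
     */
    static boolean probeBlocksWhileLockHeld(long waitMillis) {
        if (waitMillis <= 0) {
            throw new IllegalArgumentException("waitMillis must be positive");
        }
        FakeService service = new FakeService();
        Thread probe = new Thread(service::monitor, "probe");
        boolean stillBlocked;
        try {
            synchronized (service) {
                probe.start();
                probe.join(waitMillis);        // waits on the probe thread, keeps the service lock
                stillBlocked = probe.isAlive();
            }
            probe.join();                      // lock released: the probe can now finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stillBlocked;
    }

    public static void main(String[] args) {
        System.out.println(probeBlocksWhileLockHeld(200)); // true
    }
}
```

The probe does no work of its own. All it measures is whether the lock guarding the service can still be acquired, which is what a deadlocked or long-blocked service fails to allow.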
4.1.2 Register itself with Watchdog during initialization
public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
    ...
    // AMS registers itself as a monitor and adds its handler thread to Watchdog monitoring
    Watchdog.getInstance().addMonitor(this);
    Watchdog.getInstance().addThread(mHandler);
    ...
}
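What happens to a monitor after registration can be sketched from the scheduleCheckLocked excerpt in 3.3: newly added monitors wait in a queue and only join the active list when the previous round has completed. The sketch below is a simplified model under that reading; `MonitorQueueSketch` and all of its members are hypothetical names, and details of the real class (posting to a Handler, timeouts) are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: models the queue-then-merge handling of registered monitors. */
class MonitorQueueSketch {
    /** Hypothetical stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    private final List<Monitor> active = new ArrayList<>();
    private final List<Monitor> queued = new ArrayList<>();
    private boolean completed = true;

    /** Registration only queues the monitor; it is not probed until the next round. */
    synchronized void addMonitor(Monitor m) {
        queued.add(m);
    }

    /** Starts a new round if the previous one finished; returns whether it started. */
    synchronized boolean scheduleCheck() {
        if (!completed) {
            return false; // previous round still running, leave the active list untouched
        }
        active.addAll(queued);
        queued.clear();
        completed = false;
        return true;
    }

    /** Runs one round: calls every active monitor, then marks the round completed. */
    void runCheck() {
        List<Monitor> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(active);
        }
        for (Monitor m : snapshot) {
            m.monitor(); // would block here if the monitored service were stuck
        }
        synchronized (this) {
            completed = true;
        }
    }

    synchronized int activeCount() {
        return active.size();
    }

    /** Registers a third monitor in the middle of round 1 and reports what each round saw. */
    static String demo() {
        MonitorQueueSketch checker = new MonitorQueueSketch();
        int[] calls = {0};
        Monitor counting = () -> calls[0]++;
        checker.addMonitor(counting);
        checker.addMonitor(counting);
        checker.scheduleCheck();                        // round 1 starts with 2 monitors
        checker.addMonitor(counting);                   // registered mid-round: queued only
        int activeInRound1 = checker.activeCount();
        boolean rescheduled = checker.scheduleCheck();  // refused, round 1 not finished
        checker.runCheck();
        int callsAfterRound1 = calls[0];
        checker.scheduleCheck();                        // round 2 merges the queued monitor
        int activeInRound2 = checker.activeCount();
        checker.runCheck();
        return activeInRound1 + "," + rescheduled + "," + callsAfterRound1
                + "," + activeInRound2 + "," + calls[0];
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2,false,2,3,5
    }
}
```

Queuing registrations this way keeps the list being iterated stable while a round is in flight.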
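4.2 Advantages
4.2.1 Avoiding false kills
Watchdog distinguishes mild timeouts from severe ones, so a brief period of high load does not cause an unnecessary restart.
4.2.2 Performance overhead
The check interval (15 seconds by default) balances detection latency against resource consumption.
4.2.3 Deadlock detection
Because the heartbeats of multiple services are checked together, deadlocks that span several services can be detected.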