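Android WatchDog

1: Concept

Android WatchDog is a mechanism in the Android system that monitors the running state of critical system services. Its core goal is to detect a system service that has been unresponsive for a long time because of a deadlock, a blocked call, or another fault, and to trigger system recovery (such as a restart) when necessary.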

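2: Core Functions

2.1 Service state monitoring

Periodically checks whether critical system services (such as ActivityManager and WindowManager) respond normally, so that a blocked service does not freeze the whole system.

2.2 Timeout handling

If a service does not answer its "heartbeat" check (the monitor call) within the allowed time, WatchDog treats it as timed out and starts the follow-up handling.

2.3 Log collection and debugging

When a timeout occurs, stack information (including the call stacks of all threads) is collected automatically to help locate the root cause.

2.4 System recovery

On a severe timeout, WatchDog may force a restart of the system process (system_server) or of the whole device, so that the user is not left facing an unresponsive interface for long.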

3: Implementation

3.1 Startup

Like a system service, WatchDog is started by SystemServer. The startup logic is as follows:

frameworks/base/services/java/com/android/server/SystemServer.java
    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        ...
        //Create the Watchdog object and start its thread
        t.traceBegin("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        mDumper.addDumpable(watchdog);
        t.traceEnd();
        ...
        //Initialize Watchdog
        t.traceBegin("InitWatchdog");
        watchdog.init(mSystemContext, mActivityManagerService);
        t.traceEnd();
        ...
    }

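3.2 Initialization

  • WatchDog runs on the "watchdog" thread

  • "watchdog.monitor" is a background thread that handles the tasks sent through a Handler

  • HandlerChecker wraps a Handler bound to the ServiceThread's Looper. The Handler posts messages and tasks to the ServiceThread for execution, which is how the responsiveness of that thread is monitored

  • Critical system threads such as "monitor thread", "foreground thread", "main thread", "ui thread", "i/o thread", "display thread", "animation thread", and "surface animation thread" are added to the watch list

  • The monitor for Binder threads is initialized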

    frameworks/base/services/core/java/com/android/server/Watchdog.java
      private Watchdog() {
          //Create a new thread named "watchdog"
          mThread = new Thread(this::run, "watchdog");

          ...
          //Start a background thread named "watchdog.monitor" that handles the tasks sent through the Handler
          ServiceThread t = new ServiceThread("watchdog.monitor",
                  android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
          t.start();
          //Create a HandlerChecker around a Handler bound to the ServiceThread's Looper. The Handler posts messages and tasks to the ServiceThread, which is how the thread's responsiveness is monitored
          mMonitorChecker = new HandlerChecker(new Handler(t.getLooper()), "monitor thread", mLock);
          //Add critical system threads to the watch list: "monitor thread", "foreground thread", "main thread", "ui thread", "i/o thread", "display thread", "animation thread", "surface animation thread", and so on
          mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
          ...
          //Initialize the monitor for Binder threads
          addMonitor(new BinderThreadMonitor());
          ...
      }
    
      public void start() {
          //Start the "watchdog" thread
          mThread.start();
      }
    
      public void init(Context context, ActivityManagerService activity) {
          mActivity = activity;
          //Register the receiver for the reboot broadcast
          context.registerReceiver(new RebootRequestReceiver(),
                  new IntentFilter(Intent.ACTION_REBOOT),
                  android.Manifest.permission.REBOOT, null);
          ...
      }
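
The idea behind HandlerChecker can be tried outside Android with a rough sketch. This is not the AOSP implementation: a single-thread `ExecutorService` stands in for the monitored thread's Looper/Handler, and the names `CheckerSketch`, `scheduleCheck`, and `probe` are hypothetical. The checker posts itself to the target thread; if the target is stuck, the posted task never runs and the completed flag stays false.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; an ExecutorService stands in for a Handler/Looper.
final class CheckerSketch implements Runnable {
    private final ExecutorService target;      // assumption: the "monitored thread"
    private volatile boolean completed = true; // true when no probe is pending

    CheckerSketch(ExecutorService target) {
        this.target = target;
    }

    // Post a probe to the monitored thread unless one is still pending.
    void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        target.execute(this);
    }

    // Runs on the monitored thread; reaching this point proves it is responsive.
    @Override
    public void run() {
        completed = true;
    }

    boolean isCompleted() {
        return completed;
    }

    // Demo helper: returns whether the probe completed within waitMillis.
    static boolean probe(boolean blockTarget, long waitMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            if (blockTarget) {
                // Simulate a stuck thread: this task occupies the worker until released.
                worker.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            CheckerSketch checker = new CheckerSketch(worker);
            checker.scheduleCheck();
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return checker.isCompleted();
        } finally {
            release.countDown();
            worker.shutdownNow();
        }
    }
}
```

Calling `CheckerSketch.probe(false, 200)` exercises a healthy worker, while `CheckerSketch.probe(true, 200)` exercises a worker that is blocked for the whole wait.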
    

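3.3 Periodic checks

  • The check interval is 15s and the timeout is 60s (the interval is the timeout divided by PRE_WATCHDOG_TIMEOUT_RATIO)

  • Different states stand for different wait times: WAITING (blocked for less than 15s) / WAITED_UNTIL_PRE_WATCHDOG (blocked for 15-60s) / OVERDUE (blocked for more than 60s) / COMPLETED (finished)

  • The first time a checker is found in WAITED_UNTIL_PRE_WATCHDOG, stack information is collected but nothing is restarted, and later checks that find the same state are skipped. Once a checker is OVERDUE (more than 60s), stack information is collected and the process is killed right away, unless a debugger is attached or restarts are not allowed

  • Detection method: each HandlerChecker runs on the monitored thread and calls monitor() on every registered Monitor; how long the checker takes to finish determines the wait state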
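
The thresholds in the list above can be made concrete with a small sketch. It is plain Java and not the AOSP source: the class `WatchdogStateSketch` and its `classify` and `worst` helpers are hypothetical, the 60s timeout and ratio of 4 are taken from the values quoted in this section, and combining checkers by taking the most severe state is an assumption.

```java
// Illustrative sketch only; names and constants are assumptions, not AOSP code.
final class WatchdogStateSketch {
    // Ordered from least to most severe.
    enum State { COMPLETED, WAITING, WAITED_UNTIL_PRE_WATCHDOG, OVERDUE }

    static final long TIMEOUT_MILLIS = 60_000L;   // 60s timeout quoted in the text
    static final long PRE_WATCHDOG_RATIO = 4L;    // 60s / 4 = 15s check interval

    // Map one checker's status to a wait state.
    static State classify(boolean completed, long blockedMillis) {
        if (completed) {
            return State.COMPLETED;
        }
        if (blockedMillis < TIMEOUT_MILLIS / PRE_WATCHDOG_RATIO) {
            return State.WAITING;
        }
        if (blockedMillis < TIMEOUT_MILLIS) {
            return State.WAITED_UNTIL_PRE_WATCHDOG;
        }
        return State.OVERDUE;
    }

    // Assumed combination rule: the overall state is the most severe one.
    static State worst(State... states) {
        State result = State.COMPLETED;
        for (State s : states) {
            if (s.ordinal() > result.ordinal()) {
                result = s;
            }
        }
        return result;
    }
}
```

With these constants, a checker blocked for 20s classifies as WAITED_UNTIL_PRE_WATCHDOG, and one such checker is enough to put the whole round into that state even if every other checker has completed.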

      private void run() {
          boolean waitedHalf = false;
    
          while (true) {
              List<HandlerChecker> blockedCheckers = Collections.emptyList();
              ...
              boolean doWaitedPreDump = false;
              //Watchdog timeout (60s)
              final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
              //Watchdog check interval (15s)
              final long checkIntervalMillis = watchdogTimeoutMillis / PRE_WATCHDOG_TIMEOUT_RATIO;
              ...
              synchronized (mLock) {
                  long sfHangTime;
                  long timeout = checkIntervalMillis;
                  ...
                  for (int i=0; i<mHandlerCheckers.size(); i++) {
                      HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                      //Schedule a check on each HandlerChecker; pending monitors in mMonitorQueue are moved into mMonitors and mMonitorQueue is cleared
                      hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                              .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
                  }
    
                  ...
                  long start = SystemClock.uptimeMillis();
                  while (timeout > 0) {
                      ...
                      try {
                          //Wait for one check interval (15s)
                          mLock.wait(timeout);
                          // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                      } catch (InterruptedException e) {
                          Log.wtf(TAG, e);
                      }
                      ...
                      timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
                  }
    
                  ...
    
                  if (sfHangTime > TIME_SF_WAIT * 2) {
                      ...
                  } else {
                      //Derive the wait state from how long the HandlerCheckers have been blocked (WAITING: less than 15s / WAITED_UNTIL_PRE_WATCHDOG: 15-60s / OVERDUE: more than 60s / COMPLETED: finished)
                      final int waitState = evaluateCheckerCompletionLocked();
                      if (waitState == COMPLETED) {//every monitored thread answered in time: reset waitedHalf and start the next loop
                          ...
                          waitedHalf = false;
                          continue;
                      } else if (waitState == WAITING) {//a thread has been blocked for less than 15s: keep waiting, take no action, start the next loop
                          continue;
                      } else if (waitState == WAITED_UNTIL_PRE_WATCHDOG) {//a thread has been blocked for 15-60s
                          if (!waitedHalf) {//first time this state is seen: prepare the pre-watchdog dump
                              Slog.i(TAG, "WAITED_UNTIL_PRE_WATCHDOG");
                              waitedHalf = true;
                              //Collect the blocked checkers
                              blockedCheckers = getCheckersWithStateLocked(WAITED_UNTIL_PRE_WATCHDOG);
                              subject = describeCheckersLocked(blockedCheckers);
                              pids = new ArrayList<>(mInterestingJavaPids);
                              doWaitedPreDump = true;
                          } else {//the pre-watchdog dump was already taken: keep waiting
                              continue;
                          }
                      } else {//a thread has been blocked for more than 60s
                          // something is overdue!
                          blockedCheckers = getCheckersWithStateLocked(OVERDUE);
                          subject = describeCheckersLocked(blockedCheckers);
                          allowRestart = mAllowRestart;
                          pids = new ArrayList<>(mInterestingJavaPids);
                      }
                  }
              } // END synchronized (mLock)
    
              //Write the stack traces to the log
              logWatchog(doWaitedPreDump, subject, pids);
    
              if (doWaitedPreDump) {
                  //This round was only a pre-watchdog dump: start the next loop without killing anything
                  continue;
              }
    
              ...
              if (debuggerWasConnected >= 2) {
                  Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
              } else if (debuggerWasConnected > 0) {
                  Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
              } else if (!allowRestart) {
                  Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
              } else {
                  Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                  //Diagnose the blocked checkers and log the details
                  WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                  Slog.w(TAG, "*** GOODBYE!");
                  ...
                  exceptionHang.WDTMatterJava(330);
                  if (mSfHang) {
                      ...
                  } else {
                      //Kill the current process, i.e. system_server, in which WatchDog runs
                      Process.killProcess(Process.myPid());
                  }
                  //Terminate the currently running JVM
                  System.exit(10);
              }
    
              waitedHalf = false;
          }
      }
    
      public static class HandlerChecker implements Runnable {
          public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
              mWaitMaxMillis = handlerCheckerTimeoutMillis;
    
              if (mCompleted) {
                  //Move the monitors in mMonitorQueue into mMonitors and clear mMonitorQueue
                  mMonitors.addAll(mMonitorQueue);
                  mMonitorQueue.clear();
              }
    
              ...
          }
    
          public void run() {
              final int size = mMonitors.size();
              for (int i = 0 ; i < size ; i++) {
                  synchronized (mLock) {
                      mCurrentMonitor = mMonitors.get(i);
                  }
                  //Check the monitored service; if it is stuck, this call blocks as well
                  mCurrentMonitor.monitor();
              }
    
              synchronized (mLock) {
                  mCompleted = true;
                  mCurrentMonitor = null;
              }
          }
      }
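
The branch structure of run() above boils down to a small state machine around the waitedHalf flag. The following is a simplified sketch under stated assumptions, not the real loop: `WatchdogLoopSketch`, `onCheck`, and the `Action` enum are hypothetical names, and the debugger and allowRestart checks that gate the kill in the real code are left out.

```java
// Illustrative sketch only; it models the waitedHalf logic, not the full loop.
final class WatchdogLoopSketch {
    enum State { COMPLETED, WAITING, WAITED_UNTIL_PRE_WATCHDOG, OVERDUE }
    enum Action { NONE, PRE_DUMP, DUMP_AND_KILL }

    // True once the pre-watchdog dump has been taken for the current stall.
    private boolean waitedHalf = false;

    // Decide what one loop iteration does for the given overall wait state.
    Action onCheck(State state) {
        switch (state) {
            case COMPLETED:
                waitedHalf = false;          // everything answered: reset
                return Action.NONE;
            case WAITING:
                return Action.NONE;          // blocked, but still under the interval
            case WAITED_UNTIL_PRE_WATCHDOG:
                if (!waitedHalf) {
                    waitedHalf = true;       // dump once per stall
                    return Action.PRE_DUMP;
                }
                return Action.NONE;
            default:                         // OVERDUE
                waitedHalf = false;
                return Action.DUMP_AND_KILL;
        }
    }
}
```

Feeding the sketch WAITING, WAITED_UNTIL_PRE_WATCHDOG, WAITED_UNTIL_PRE_WATCHDOG, OVERDUE in that order shows that the early dump happens only once before the final dump and kill.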
    

4: Summary

4.1 How to add a specific thread to WatchDog monitoring

Answer: AMS can be used as a reference.

4.1.1 Implement the Watchdog.Monitor interface

    public class ActivityManagerService extends IActivityManager.Stub
            implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback, ActivityManagerGlobalLock {
        public void monitor() {
            //Returns only once the AMS lock can be acquired; if the lock is never released, this call blocks
            synchronized (this) { }
        }
    }
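
Why an empty synchronized block works as a health check can be shown with a standalone sketch. Nothing here is Android API: `LockMonitorSketch`, `FakeService`, and `monitorReturnsWithin` are hypothetical names, and the nested `Monitor` interface is only a stand-in for Watchdog.Monitor. The probe thread calls monitor(); if another thread is holding the service lock, the call does not return within the deadline.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative sketch only; FakeService plays the role of a monitored service.
final class LockMonitorSketch {
    /** Hypothetical stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    static final class FakeService implements Monitor {
        private final Object lock = new Object();

        // Same trick as AMS: acquiring and releasing the lock proves it is free.
        @Override
        public void monitor() {
            synchronized (lock) { }
        }

        // Simulate a stuck service: hold the lock until 'release' is counted down.
        void holdLockUntil(CountDownLatch held, CountDownLatch release) {
            synchronized (lock) {
                held.countDown();
                awaitQuietly(release);
            }
        }
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if monitor() came back within 'millis'.
    static boolean monitorReturnsWithin(boolean lockHeld, long millis) {
        FakeService service = new FakeService();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        if (lockHeld) {
            Thread holder = new Thread(() -> service.holdLockUntil(held, release), "holder");
            holder.setDaemon(true);
            holder.start();
            awaitQuietly(held); // make sure the lock is taken before probing
        }
        Thread probe = new Thread(service::monitor, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            probe.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean returned = !probe.isAlive();
        release.countDown(); // let the holder (and then the probe) finish
        return returned;
    }
}
```

The design point is that monitor() does no work of its own: the time it takes is exactly the time needed to get the lock, so a long wait points at whoever holds that lock.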

4.1.2 Register with WatchDog during initialization

    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        ...
        //AMS registers itself as a monitor and adds its handler thread to WatchDog
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ...
    }

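4.2 Advantages

4.2.1 Avoiding false kills

Distinguishing mild from severe timeouts keeps a brief spell of high load from causing an unnecessary restart.

4.2.2 Performance overhead

The check interval (15 seconds by default) is a trade-off between timeliness and resource consumption.

4.2.3 Deadlock detection

Because the heartbeats of several services are checked together, deadlocks that span services can be found.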
