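Android Watchdog

1. Concept

Android Watchdog is the mechanism in the Android system that monitors the running state of key system services. Its core goal is to detect when a system service has been unresponsive for a long time because of a deadlock, blocking, or another fault, and to trigger system recovery (such as a restart) when necessary.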

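2. Core Functions

2.1 Service state monitoring

Watchdog periodically checks whether key system services (such as ActivityManager and WindowManager) respond normally, so that a blocked service does not freeze the whole system.

2.2 Timeout handling

If a service does not update its "heartbeat" (its monitor) within the allotted time, Watchdog treats it as timed out and starts the follow-up handling.

2.3 Log collection and debugging

When a timeout occurs, Watchdog automatically collects stack information (including the call stacks of all threads) to help locate the root cause.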
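
To make "the call stacks of all threads" concrete, here is a minimal sketch in plain Java. It is not how Watchdog collects its dumps internally; it only uses the standard `Thread.getAllStackTraces()` call to show what such a dump contains for a single process. The class name `StackDumpSketch` and the method `dumpAllThreads` are made up for this illustration.

```java
import java.util.Map;

/** Illustrative only: dumps the stacks of all live threads in the current process. */
final class StackDumpSketch {

    /** Returns a text dump with one section per live thread. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Header line: thread name and its current state (RUNNABLE, BLOCKED, WAITING, ...)
            sb.append('"').append(thread.getName()).append("\" state=")
                    .append(thread.getState()).append('\n');
            // One line per stack frame, innermost frame first
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

A thread that is stuck on a lock shows up in such a dump with state BLOCKED, and the top frames show where it is waiting, which is usually the starting point for analysis.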

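2.4 System recovery

On a severe timeout, Watchdog may force a restart of the system process (system_server) or of the whole device, so that the user is not left facing an unresponsive interface for long.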

3. Implementation Logic

3.1 Startup

Like the system services, Watchdog is started by SystemServer. The startup logic is as follows:

frameworks/base/services/java/com/android/server/SystemServer.java
    private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
        ...
        // Create the Watchdog object and start its thread
        t.traceBegin("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        mDumper.addDumpable(watchdog);
        t.traceEnd();
        ...
        // Initialize the Watchdog
        t.traceBegin("InitWatchdog");
        watchdog.init(mSystemContext, mActivityManagerService);
        t.traceEnd();
        ...
    }

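3.2 Initialization

  • Watchdog itself runs in the "watchdog" thread.

  • A background thread named "watchdog.monitor" handles the tasks sent to it through a Handler.

  • HandlerChecker wraps a Handler bound to the Looper of that ServiceThread. Messages and tasks posted through this Handler execute on the ServiceThread, which lets Watchdog check that the thread is still responsive.

  • Key system threads are added to monitoring, including "monitor thread", "foreground thread", "main thread", "ui thread", "i/o thread", "display thread", "animation thread", and "surface animation thread".

  • A monitor for the Binder threads is initialized.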

    frameworks/base/services/core/java/com/android/server/Watchdog.java
      private Watchdog() {
          // Create a thread named "watchdog"
          mThread = new Thread(this::run, "watchdog");
          ...
          // Start a background thread named "watchdog.monitor" that handles the tasks sent through a Handler
          ServiceThread t = new ServiceThread("watchdog.monitor",
                  android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
          t.start();
          // Create a Handler bound to the ServiceThread's Looper and wrap it in a HandlerChecker.
          // Tasks posted through this Handler run on the ServiceThread, so they show whether the thread still responds.
          mMonitorChecker = new HandlerChecker(new Handler(t.getLooper()), "monitor thread", mLock);
          // Monitor the key system threads: "monitor thread", "foreground thread", "main thread", "ui thread",
          // "i/o thread", "display thread", "animation thread", "surface animation thread", and so on
          // (only the "monitor thread" checker is shown here; the others are elided)
          mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
          ...
          // Initialize the monitor for the Binder threads
          addMonitor(new BinderThreadMonitor());
          ...
      }
    
      public void start() {
          // Start the "watchdog" thread
          mThread.start();
      }
    
      public void init(Context context, ActivityManagerService activity) {
          mActivity = activity;
          // Register a receiver for the reboot broadcast
          context.registerReceiver(new RebootRequestReceiver(),
                  new IntentFilter(Intent.ACTION_REBOOT),
                  android.Manifest.permission.REBOOT, null);
          ...
      }

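3.3 Periodic checks

  • The check interval is 15s and the maximum wait time is 60s.

  • Each wait time maps to a state: WAITING (waited less than 15s), WAITED_UNTIL_PRE_WATCHDOG (waited 15-60s), OVERDUE (waited more than 60s), and COMPLETED (the check has finished).

  • The first time a check finds a thread in WAITED_UNTIL_PRE_WATCHDOG (roughly 15-30s into the wait), Watchdog collects stack information but does not restart. Later checks in the same state only keep waiting. Once the wait exceeds 60s, Watchdog collects stack information and restarts immediately.

  • Detection method: a HandlerChecker runs on the monitored thread and calls the monitor() method of each registered monitor; the time it takes to return determines the state.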
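
The mapping from wait time to state can be sketched as a small classification function. This is a simplified model, not the AOSP source: the class name `CheckerStateSketch` and the method `classify` are hypothetical, and the ratio of 4 is an assumption inferred from the 60s timeout and the 15s interval given above.

```java
/** Simplified model of the four checker states described above (not the AOSP source). */
final class CheckerStateSketch {

    enum State { COMPLETED, WAITING, WAITED_UNTIL_PRE_WATCHDOG, OVERDUE }

    // Assumed value: 60s timeout / 4 = 15s check interval.
    static final int PRE_WATCHDOG_TIMEOUT_RATIO = 4;

    /**
     * @param completed      whether the checker has already finished its monitors
     * @param latencyMillis  time elapsed since the check was scheduled
     * @param waitMaxMillis  maximum wait time for this checker (60_000 by default)
     */
    static State classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return State.COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / PRE_WATCHDOG_TIMEOUT_RATIO) {
            return State.WAITING;                    // still within the first interval
        }
        if (latencyMillis < waitMaxMillis) {
            return State.WAITED_UNTIL_PRE_WATCHDOG;  // late, but not yet fatal
        }
        return State.OVERDUE;                        // reached the full timeout
    }

    public static void main(String[] args) {
        long timeout = 60_000;
        System.out.println(classify(true, 5_000, timeout));   // COMPLETED
        System.out.println(classify(false, 10_000, timeout)); // WAITING
        System.out.println(classify(false, 20_000, timeout)); // WAITED_UNTIL_PRE_WATCHDOG
        System.out.println(classify(false, 60_000, timeout)); // OVERDUE
    }
}
```

When several threads are monitored, the overall state of one check round is the worst state among all checkers, so a single overdue thread is enough to trigger the restart path.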

      private void run() {
          boolean waitedHalf = false;
    
          while (true) {
              List<HandlerChecker> blockedCheckers = Collections.emptyList();
              ...
              boolean doWaitedPreDump = false;
              // Watchdog timeout (60s)
              final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
              // Watchdog check interval (15s)
              final long checkIntervalMillis = watchdogTimeoutMillis / PRE_WATCHDOG_TIMEOUT_RATIO;
              ...
              synchronized (mLock) {
                  long sfHangTime;
                  long timeout = checkIntervalMillis;
                  ...
                  for (int i=0; i<mHandlerCheckers.size(); i++) {
                      HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                      // Schedule a check on each HandlerChecker; this also moves the monitors
                      // queued in mMonitorQueue into mMonitors and clears mMonitorQueue
                      hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                              .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
                  }
    
                  ...
                  long start = SystemClock.uptimeMillis();
                  while (timeout > 0) {
                      ...
                      try {
                          // Wait for one check interval (15s)
                          mLock.wait(timeout);
                          // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                      } catch (InterruptedException e) {
                          Log.wtf(TAG, e);
                      }
                      ...
                      timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
                  }
    
                  ...
    
                  if (sfHangTime > TIME_SF_WAIT * 2) {
                      ...
                  } else {
                      // Evaluate the wait time of every HandlerChecker and return the resulting state:
                      // WAITING (under 15s), WAITED_UNTIL_PRE_WATCHDOG (15-60s), OVERDUE (over 60s), COMPLETED (done)
                      final int waitState = evaluateCheckerCompletionLocked();
                      if (waitState == COMPLETED) { // every monitored thread answered in time: reset waitedHalf, next loop
                          ...
                          waitedHalf = false;
                          continue;
                      } else if (waitState == WAITING) { // no answer yet, but still under 15s: keep waiting, next loop
                          continue;
                      } else if (waitState == WAITED_UNTIL_PRE_WATCHDOG) { // no answer after 15s, but not yet 60s
                          if (!waitedHalf) { // first time this state is seen (about 15-30s in): prepare a pre-watchdog dump
                              Slog.i(TAG, "WAITED_UNTIL_PRE_WATCHDOG");
                              waitedHalf = true;
                              // Get the blocked checkers
                              blockedCheckers = getCheckersWithStateLocked(WAITED_UNTIL_PRE_WATCHDOG);
                              subject = describeCheckersLocked(blockedCheckers);
                              pids = new ArrayList<>(mInterestingJavaPids);
                              doWaitedPreDump = true;
                          } else { // a dump was already collected for this stall (about 30-60s in): keep waiting
                              continue;
                          }
                      } else { // no answer within 60s
                          // something is overdue!
                          blockedCheckers = getCheckersWithStateLocked(OVERDUE);
                          subject = describeCheckersLocked(blockedCheckers);
                          allowRestart = mAllowRestart;
                          pids = new ArrayList<>(mInterestingJavaPids);
                      }
                  }
              } // END synchronized (mLock)
    
              // Write the stacks to the log
              logWatchog(doWaitedPreDump, subject, pids);
    
              if (doWaitedPreDump) {
                  // Pre-watchdog dump only (first late check, about 15-30s in): start the next loop without restarting
                  continue;
              }
    
              ...
              if (debuggerWasConnected >= 2) {
                  Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
              } else if (debuggerWasConnected > 0) {
                  Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
              } else if (!allowRestart) {
                  Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
              } else {
                  Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                  // Diagnose the blocked checkers and log the related information
                  WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                  Slog.w(TAG, "*** GOODBYE!");
                  ...
                  exceptionHang.WDTMatterJava(330);
                  if (mSfHang) {
                      ...
                  } else {
                      // Kill the current process, i.e. system_server, which hosts the watchdog thread
                      Process.killProcess(Process.myPid());
                  }
                  // Terminate the currently running VM
                  System.exit(10);
              }
    
              waitedHalf = false;
          }
      }
    
      public static class HandlerChecker implements Runnable {
          public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
              mWaitMaxMillis = handlerCheckerTimeoutMillis;
    
              if (mCompleted) {
                  // Move the monitors queued in mMonitorQueue into mMonitors and clear mMonitorQueue
                  mMonitors.addAll(mMonitorQueue);
                  mMonitorQueue.clear();
              }
    
              ...
          }
    
          public void run() {
              final int size = mMonitors.size();
              for (int i = 0 ; i < size ; i++) {
                  synchronized (mLock) {
                      mCurrentMonitor = mMonitors.get(i);
                  }
                  // Check the monitored target; if it is stuck, this call blocks as well
                  mCurrentMonitor.monitor();
              }
    
              synchronized (mLock) {
                  mCompleted = true;
                  mCurrentMonitor = null;
              }
          }
      }
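
The interplay between the check loop and HandlerChecker can be tried out in a few dozen lines of plain Java. The sketch below is a simplified, hypothetical model and not the AOSP implementation: `MiniWatchdogSketch`, `Checker`, `checkOnce`, and `runScenario` are names invented here, a single-thread executor stands in for the monitored Handler thread, and the timeouts are shortened to milliseconds. Unlike the real loop, which sleeps for a full interval, this version wakes up as soon as the check completes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Simplified, hypothetical model of the check loop; not the AOSP implementation. */
final class MiniWatchdogSketch {

    /** Stand-in for HandlerChecker: its run() executes on the monitored thread. */
    static final class Checker implements Runnable {
        private final Object lock = new Object();
        private final Runnable monitor;
        private boolean completed = true;

        Checker(Runnable monitor) {
            this.monitor = monitor;
        }

        /** Posts this checker to the monitored thread unless an earlier check is still pending. */
        void scheduleCheck(ExecutorService monitoredThread) {
            synchronized (lock) {
                if (!completed) {
                    return; // the previous check has not returned yet
                }
                completed = false;
            }
            monitoredThread.execute(this);
        }

        @Override
        public void run() {
            monitor.run(); // blocks here for as long as the monitored lock is held
            synchronized (lock) {
                completed = true;
                lock.notifyAll();
            }
        }
    }

    /** Schedules a check and waits up to timeoutMillis; true means the thread responded in time. */
    static boolean checkOnce(Checker checker, ExecutorService monitoredThread, long timeoutMillis) {
        checker.scheduleCheck(monitoredThread);
        long deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (checker.lock) {
            while (!checker.completed) {
                long remainingMillis = (deadlineNanos - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return false;
                }
                try {
                    checker.lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    /** Runs three checks: healthy, blocked (lock held by another thread), and recovered. */
    static boolean[] runScenario() {
        final Object serviceLock = new Object();
        ExecutorService monitoredThread = Executors.newSingleThreadExecutor();
        Checker checker = new Checker(() -> { synchronized (serviceLock) { } });
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (serviceLock) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // fall through and release the lock
                }
            }
        });
        try {
            boolean healthy = checkOnce(checker, monitoredThread, 2_000);
            holder.start();
            held.await(); // serviceLock is now held by another thread
            boolean blocked = checkOnce(checker, monitoredThread, 200);
            release.countDown();
            holder.join(); // lock released, so the pending check can finish
            boolean recovered = checkOnce(checker, monitoredThread, 2_000);
            return new boolean[] {healthy, blocked, recovered};
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            monitoredThread.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] r = runScenario();
        System.out.println("healthy=" + r[0] + " blocked=" + r[1] + " recovered=" + r[2]);
    }
}
```

Running `main` should print `healthy=true blocked=false recovered=true`: the second check fails only because the monitor cannot acquire the held lock, which is the same signal the real Watchdog relies on.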

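4. Summary

4.1 How to add a specific thread to Watchdog monitoring

Answer: follow the approach used by AMS (ActivityManagerService).

4.1.1 Implement the Watchdog.Monitor interface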

    public class ActivityManagerService extends IActivityManager.Stub
            implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback, ActivityManagerGlobalLock {
        public void monitor() {
            // Acquire and immediately release the AMS lock; this blocks while another thread holds it
            synchronized (this) { }
        }
    }

4.1.2 Register itself with Watchdog during initialization

    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        ...
        // AMS registers itself as a monitor and adds its handler thread to Watchdog monitoring
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ...
    }
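
The same two steps can be reproduced outside system_server with stand-in types. Everything in the sketch below is hypothetical: `SketchMonitor` and `SketchWatchdog` only imitate the shape of `Watchdog.Monitor` and `addMonitor`, and `DemoService` is a made-up service, because the real Watchdog class is internal to the framework and cannot be used from a standalone program.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Watchdog.Monitor. */
interface SketchMonitor {
    void monitor();
}

/** Hypothetical stand-in for the Watchdog's monitor list. */
final class SketchWatchdog {
    private final List<SketchMonitor> monitors = new ArrayList<>();

    synchronized void addMonitor(SketchMonitor monitor) {
        monitors.add(monitor);
    }

    synchronized int monitorCount() {
        return monitors.size();
    }

    /** Calls every registered monitor in turn; hangs at the first one whose lock is stuck. */
    void runMonitors() {
        List<SketchMonitor> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(monitors);
        }
        for (SketchMonitor monitor : snapshot) {
            monitor.monitor();
        }
    }
}

/** A made-up service that follows the same two steps as AMS. */
final class DemoService implements SketchMonitor {
    private final Object serviceLock = new Object();
    private int requestsHandled;

    DemoService(SketchWatchdog watchdog) {
        // Step 2: register with the watchdog while the service is being constructed
        watchdog.addMonitor(this);
    }

    void handleRequest() {
        synchronized (serviceLock) {
            requestsHandled++;
        }
    }

    int requestsHandled() {
        synchronized (serviceLock) {
            return requestsHandled;
        }
    }

    @Override
    public void monitor() {
        // Step 1: take and release the service lock; this blocks if the lock is stuck
        synchronized (serviceLock) { }
    }

    public static void main(String[] args) {
        SketchWatchdog watchdog = new SketchWatchdog();
        DemoService service = new DemoService(watchdog);
        service.handleRequest();
        watchdog.runMonitors(); // returns promptly because serviceLock is free
        System.out.println("monitors=" + watchdog.monitorCount());
    }
}
```

The important design point is that `monitor()` guards on the same lock the service uses for its real work, so a thread that hangs while holding that lock is exactly what makes the check stall.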

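4.2 Advantages

4.2.1 Avoiding unnecessary kills

Light and severe timeouts are treated differently, which prevents a short burst of high load from causing an unnecessary restart.

4.2.2 Performance overhead

The check interval (15 seconds by default) balances detection latency against resource consumption.

4.2.3 Deadlock detection

Heartbeats from multiple services are checked together, which exposes deadlocks that span services.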
