Hands-on project: designing a smart thermal-control service

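I will walk you through designing a complete Android init service from scratch: thermal_guardian (a smart thermal-control daemon). The service monitors device temperature and adjusts the performance policy dynamically.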

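1. Service requirements

Functional requirements:

Monitor CPU, GPU and battery temperature in real time
Adjust the CPU frequency dynamically based on temperature
Throttle automatically at high temperature to prevent overheating
Restore performance once the temperature drops
Offer user-configurable temperature thresholds
Record a temperature log for later analysis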
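
The throttle-and-restore requirements above imply hysteresis: one limit to start throttling and a lower one to stop. Below is a minimal sketch of that decision in C. The `thermal_limits` type and the `next_throttled` function are names invented for this illustration, and the 80/60 values are borrowed from the daemon skeleton later in this article rather than from any real device.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical thresholds in degrees Celsius (assumed for this sketch).
typedef struct {
    int throttle_c;  // start throttling above this temperature
    int restore_c;   // stop throttling below this temperature
} thermal_limits;

// Returns the new throttling state for the given temperature.
// Between restore_c and throttle_c the previous state is kept, so the
// policy does not flap when the temperature hovers near one limit.
bool next_throttled(const thermal_limits *lim, bool throttled, int temp_c) {
    assert(lim->restore_c < lim->throttle_c);
    if (temp_c > lim->throttle_c) return true;
    if (temp_c < lim->restore_c) return false;
    return throttled;
}
```

A caller would keep the returned value and pass it back in on the next sample, changing the CPU frequency cap only when the state flips.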

Security requirements:

Principle of least privilege
Protection against misuse of the service
Secure inter-process communication

2. The complete .rc file

# ======================================================================
# thermal_guardian.rc - smart thermal-control daemon
# Location: /vendor/etc/init/thermal_guardian.rc
# Design principles: least privilege + error handling + logging + resource control + maintainability
# ======================================================================

# ==================== Service definition ====================
# Service name: thermal_guardian
# Executable: /vendor/bin/thermal_guardian
# Design principle: run with the minimum privileges required
service thermal_guardian /vendor/bin/thermal_guardian
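    # Class: late_start, so that the core services are already up
    class late_start

    # Least privilege: run as the system user, not root
    user system
    # Primary group followed by supplementary groups. init has no separate
    # supplementary_groups option, so any extra group goes on this line.
    group system shell graphics input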
    
    # Linux capabilities: grant only what is needed
    # SYS_NICE: adjust process priority
    # DAC_OVERRIDE: access the thermal device nodes
    # SYS_RESOURCE: change system resource limits
    capabilities SYS_NICE DAC_OVERRIDE SYS_RESOURCE
    
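    # SELinux context: the security label for the process
    # u:r:thermal_guardian:s0 - custom domain, which must be defined in the vendor sepolicy
    seclabel u:r:thermal_guardian:s0

    # I/O scheduling priority: best-effort class, level 5 (medium)
    # Design principle: do not block critical I/O
    ioprio be 5

    # OOM adjustment: -50 (less likely to be killed, but not critical)
    # Design principle: survive memory pressure without being unkillable
    oom_score_adjust -50

    # Process priority: nice value -5 (higher than normal processes)
    priority -5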
    
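    # Restart policy: oneshot + watchdog
    # oneshot: init does not restart the service after it exits
    # Error handling: restarts are driven by the watchdog action below
    oneshot

    # Do not start with the class: wait for the trigger conditions
    disabled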
    
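    # Create a Unix domain socket for inter-process communication
    # Mode 660: read/write for owner and group
    # Design principle: minimal access rights
    socket thermal_guardian stream 660 system system

    # Second socket for the debug interface (owner only)
    socket thermal_guardian_debug stream 600 system system

    # Write the PID into a cpuset to limit which CPU cores are used
    # Design principle: control resource use and avoid occupying every core
    writepid /dev/cpuset/system-background/tasks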
    
    # Environment variables: runtime parameters
    # The daemon reads its settings from the environment, so they are easy to tune
    setenv THERMAL_LOG_LEVEL 2
    setenv THERMAL_CONFIG_PATH /vendor/etc/thermal_config.json
    setenv THERMAL_DATA_PATH /data/vendor/thermal
    
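    # Resource limits: cap memory use
    # Address space: 64 MiB soft limit, 128 MiB hard limit
    # Resident set size: 32 MiB soft limit, 64 MiB hard limit
    # Design principle: contain memory leaks
    rlimit as 67108864 134217728
    rlimit rss 33554432 67108864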
    
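    # Logging: init cannot redirect stdout/stderr to logcat, so the daemon
    # logs through the Android logging API itself. stdio_to_kmsg sends any
    # stray stdout/stderr output to /dev/kmsg_debug on userdebug/eng builds.
    # Design principle: thorough logging
    stdio_to_kmsg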

# ==================== Triggers ====================
# Design principle: event-driven, no unnecessary starts

# 1. Start once boot has completed and the temperature sensors are ready
# (no `boot` event here: sys.boot_completed is still unset when `boot`
# fires, so a combined trigger would never run)
on property:sys.boot_completed=1 \
      && property:vendor.thermal.sensors.ready=1
    # Create the directory structure
    mkdir /data/vendor/thermal 0770 system system
    mkdir /data/vendor/thermal/logs 0770 system system
    mkdir /data/vendor/thermal/config 0755 system system

    # Set the correct ownership and mode
    chown system system /data/vendor/thermal
    chmod 0770 /data/vendor/thermal

    # Enable thermal monitoring
    write /sys/class/thermal/thermal_zone0/mode enabled

    # Set the initial trip points (millidegrees Celsius)
    write /sys/class/thermal/thermal_zone0/trip_point_0_temp 75000
    write /sys/class/thermal/thermal_zone0/trip_point_1_temp 85000

    # Start the service
    start thermal_guardian

    # init is not a shell, so $(date) and $(pidof) are unavailable here:
    # the daemon records its own start time, and init publishes the
    # service state in init.svc.thermal_guardian
    setprop vendor.thermal.guardian.active 1

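# 2. Adjust the thermal policy while charging
# The built-in `charger` trigger only fires when the device boots into
# off-mode charging, where this service never becomes active. The action
# therefore keys on vendor.thermal.charging, a hypothetical property that
# the daemon or the charger HAL would have to set.
on property:vendor.thermal.charging=1 \
      && property:vendor.thermal.guardian.active=1
    # Use a more conservative policy while charging
    exec -- /vendor/bin/thermal_guardian --set-mode conservative

    # Lower the trip points
    write /sys/class/thermal/thermal_zone0/trip_point_0_temp 70000
    write /sys/class/thermal/thermal_zone0/trip_point_1_temp 80000

    # Record the charging mode
    write /data/vendor/thermal/mode "charging"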

# 3. Gaming mode: performance first
on property:persist.vendor.performance_mode=1 \
      && property:vendor.thermal.guardian.active=1
    # Switch to performance mode
    exec -- /vendor/bin/thermal_guardian --set-mode performance

    # Raise the trip points
    write /sys/class/thermal/thermal_zone0/trip_point_0_temp 80000
    write /sys/class/thermal/thermal_zone0/trip_point_1_temp 90000

    # Record the mode switch
    write /data/vendor/thermal/mode "performance"

# 4. Power-save mode: temperature first
on property:persist.vendor.power_save=1 \
      && property:vendor.thermal.guardian.active=1
    # Switch to power-save mode
    exec -- /vendor/bin/thermal_guardian --set-mode powersave

    # Lower the trip points
    write /sys/class/thermal/thermal_zone0/trip_point_0_temp 65000
    write /sys/class/thermal/thermal_zone0/trip_point_1_temp 75000

    # Record the mode switch
    write /data/vendor/thermal/mode "powersave"

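# 5. Manual control: start/stop through a property
# ctl.start and ctl.stop are control messages that init handles itself, so
# they cannot serve as triggers, and init has no if/fi. A dedicated switch
# property (vendor.thermal.guardian.enable, a name chosen for this example)
# drives start and stop instead. `start` does nothing if the service is
# already running.
on property:vendor.thermal.guardian.enable=1
    start thermal_guardian
    setprop vendor.thermal.guardian.active 1

on property:vendor.thermal.guardian.enable=0
    # Update the state first so the exit is not treated as a crash
    setprop vendor.thermal.guardian.active 0

    # Ask the daemon to shut down gracefully
    exec -- /vendor/bin/thermal_guardian --shutdown

    # Make sure the service is stopped
    stop thermal_guardian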

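# 6. High-temperature emergency handling
on property:vendor.thermal.emergency=1
    # Cap the CPU frequency immediately
    write /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq 1400000

    # Stop all non-critical background services
    exec -- /vendor/bin/thermal_guardian --emergency

    # Record the event. init cannot expand $(date ...), so the file name is
    # fixed and the daemon is expected to log the timestamp.
    write /data/vendor/thermal/emergency.log "Emergency triggered"

    # Notify the user. Vendor init is normally restricted to vendor-owned
    # properties, hence the vendor. prefix.
    setprop vendor.thermal.alert "High temperature detected"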

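# 7. Watchdog: monitor the health of the service
# init has neither if/fi nor pidof. `start` does nothing when the service is
# already running, so the check reduces to an unconditional start.
on property:vendor.thermal.watchdog=1 \
      && property:vendor.thermal.guardian.active=1
    start thermal_guardian
    setprop vendor.thermal.watchdog 0

# 8. Periodic watchdog tick
# init has no built-in timer. An external periodic job (for example in a
# vendor HAL) has to set vendor.thermal.watchdog.timer=1, ideally about
# every 30 seconds; the property name is an assumption of this example.
on property:vendor.thermal.watchdog.timer=1
    # Trigger the watchdog check
    setprop vendor.thermal.watchdog 1
    setprop vendor.thermal.watchdog.timer 0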

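# ==================== Error handling and recovery ====================
# Design principle: thorough error handling and automatic recovery

# Handling after the service crashes
# init has no `on restart` trigger, shell variables, arithmetic, sleep or
# return. It does publish the service state in init.svc.thermal_guardian,
# so this action runs when the oneshot service exits while still marked active.
on property:init.svc.thermal_guardian=stopped \
      && property:vendor.thermal.guardian.active=1
    # Record the crash
    write /data/vendor/thermal/crash.log "Service crashed, restarting..."

    # Counting crashes, the limit of 5 and the growing restart delay need
    # real logic, so they are delegated to a helper mode of the binary
    # (--handle-crash is a flag assumed for this example). The helper updates
    # /data/vendor/thermal/crash_count, waits crash_count seconds, and then
    # sets either vendor.thermal.guardian.restart=1 or, after more than
    # 5 crashes, vendor.thermal.guardian.failed=1.
    exec -- /vendor/bin/thermal_guardian --handle-crash

# Restart requested by the crash helper
on property:vendor.thermal.guardian.restart=1
    setprop vendor.thermal.guardian.restart 0
    start thermal_guardian

# Too many crashes: leave the service stopped and record the fatal error
on property:vendor.thermal.guardian.failed=1
    setprop vendor.thermal.guardian.active 0
    write /data/vendor/thermal/fatal_error.log "Too many crashes, service disabled"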

# Handling low system memory
on property:sys.lowmem=1 && property:vendor.thermal.guardian.active=1
    # Reduce memory use
    exec -- /vendor/bin/thermal_guardian --reduce-memory

    # Lower the log level to cut I/O
    setprop vendor.thermal.log_level 1

# Memory back to normal
on property:sys.lowmem=0 && property:vendor.thermal.guardian.active=1
    # Restore normal memory use
    exec -- /vendor/bin/thermal_guardian --normal-memory

    # Restore the log level
    setprop vendor.thermal.log_level 2

# ==================== Debugging and monitoring ====================
# Design principle: easy to debug and maintain

# Debug mode: collect detailed information
on property:persist.vendor.thermal.debug=1
    # Enable verbose logging
    setprop vendor.thermal.log_level 3

    # Create the debug data directory
    mkdir /data/vendor/thermal/debug 0770 system system

    # Enable profiling. The daemon records the debug start time itself,
    # because init cannot expand $(date).
    exec -- /vendor/bin/thermal_guardian --enable-profiling

# Collect diagnostics
# ctl.* names are reserved for init control messages, so a vendor property
# (vendor.thermal.dump, a name chosen for this example) serves as the trigger
on property:vendor.thermal.dump=1
    # Create the diagnostics directory
    mkdir /data/vendor/thermal/diagnostics 0770 system system

    # The daemon's dump mode is expected to write its /proc/<pid>/status and
    # smaps snapshots into that directory and pack them into a timestamped
    # archive; init cannot do this because it has no $(...) substitution
    exec -- /vendor/bin/thermal_guardian --dump-diagnostics

    # Reset the trigger so it can fire again
    setprop vendor.thermal.dump 0
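
The crash-recovery rules in the .rc file (count crashes, give up after more than 5, wait longer after each crash) need arithmetic and branching that the init language does not offer, so they have to live in native code. The following is a minimal sketch of that logic in C. The names `MAX_CRASHES`, `restart_delay_seconds` and `bump_crash_count` are invented for this illustration; only the limit of 5 and the one-second-per-crash delay come from the design above.

```c
#include <stdio.h>

// Limit taken from the design above: give up after more than 5 crashes.
#define MAX_CRASHES 5

// Returns the delay in seconds before the next restart, or -1 when the
// service should stay stopped. The delay grows by one second per crash.
int restart_delay_seconds(int crash_count) {
    if (crash_count < 1) return 0;             // no crash recorded yet
    if (crash_count > MAX_CRASHES) return -1;  // too many crashes
    return crash_count;
}

// Reads the counter stored at `path`, increments it and writes it back.
// A missing or unreadable file counts as zero crashes so far.
// Returns the new count, or -1 if the file cannot be written.
int bump_crash_count(const char *path) {
    int count = 0;
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%d", &count) != 1) count = 0;
        fclose(f);
    }
    count++;
    f = fopen(path, "w");
    if (f == NULL) return -1;
    fprintf(f, "%d\n", count);
    fclose(f);
    return count;
}
```

A helper would call `bump_crash_count("/data/vendor/thermal/crash_count")`, pass the result to `restart_delay_seconds`, then either sleep for that many seconds and request a restart or mark the service as failed.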

3. Skeleton of the matching binary

The .rc file is the focus, but for completeness here is a basic skeleton of the service program:

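// thermal_guardian.c - daemon implementation
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <syslog.h>

// Paths
#define CONFIG_PATH "/vendor/etc/thermal_config.json"
#define DATA_PATH "/data/vendor/thermal"
// init creates this socket and passes its fd in ANDROID_SOCKET_thermal_guardian
#define SOCKET_PATH "/dev/socket/thermal_guardian"

// Helpers whose bodies are outside the scope of this skeleton
void reload_config(void);
void dump_diagnostics(void);
void setup_socket(void);
void adjust_policy(int temperature);

// Global state
static volatile sig_atomic_t running = 1;  // cleared by SIGTERM
static int log_level = 2;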

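// Flags set by the signal handler and consumed in the monitoring loop.
// The handler does nothing else: syslog() and the reload/dump helpers
// are not async-signal-safe.
static volatile sig_atomic_t reload_requested = 0;
static volatile sig_atomic_t dump_requested = 0;

// Signal handler
void signal_handler(int sig) {
    switch (sig) {
        case SIGTERM: running = 0; break;           // graceful shutdown
        case SIGUSR1: reload_requested = 1; break;  // reload configuration
        case SIGUSR2: dump_requested = 1; break;    // dump diagnostics
    }
}

// Temperature monitoring loop
void monitor_temperature() {
    int thermal_fd;
    char temp_buf[32];
    int temperature;
    ssize_t n;

    while (running) {
        // Handle requests recorded by the signal handler
        if (reload_requested) {
            reload_requested = 0;
            syslog(LOG_INFO, "Received SIGUSR1, reloading configuration");
            reload_config();
        }
        if (dump_requested) {
            dump_requested = 0;
            syslog(LOG_INFO, "Received SIGUSR2, dumping diagnostics");
            dump_diagnostics();
        }

        // Read the temperature (sysfs reports millidegrees Celsius)
        thermal_fd = open("/sys/class/thermal/thermal_zone0/temp", O_RDONLY);
        if (thermal_fd >= 0) {
            n = read(thermal_fd, temp_buf, sizeof(temp_buf) - 1);
            close(thermal_fd);
            if (n > 0) {
                temp_buf[n] = '\0';  // read() does not terminate the string
                temperature = atoi(temp_buf) / 1000;  // convert to degrees Celsius

                // Adjust the policy for the current temperature
                adjust_policy(temperature);

                // Log the temperature
                if (log_level >= 2) {
                    syslog(LOG_INFO, "Current temperature: %d°C", temperature);
                }
            }
        }

        // Sleep for 1 second (a signal wakes the loop early)
        sleep(1);
    }
}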

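// Writes a frequency string (kHz) to policy0's scaling_max_freq.
// `len` excludes the trailing NUL. Returns 0 on success, -1 on failure.
static int set_max_freq(const char *khz, size_t len) {
    int fd = open("/sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq", O_WRONLY);
    if (fd < 0) return -1;
    ssize_t n = write(fd, khz, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

// Adjust the CPU policy
void adjust_policy(int temperature) {
    if (temperature > 80) {
        // High temperature: cap the frequency
        if (set_max_freq("1400000", 7) == 0) {
            syslog(LOG_WARNING, "High temperature! Limiting CPU frequency");
        }
    } else if (temperature < 60) {
        // Normal temperature: restore the full frequency
        set_max_freq("2000000", 7);
    }
    // Between 60 and 80 the current limit is left unchanged (hysteresis)
}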

// Entry point
int main(int argc, char *argv[]) {
    struct sigaction sa;

    // Install the signal handlers
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = signal_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);
    sigaction(SIGUSR1, &sa, NULL);
    sigaction(SIGUSR2, &sa, NULL);

    // Initialize logging
    openlog("thermal_guardian", LOG_PID | LOG_CONS, LOG_DAEMON);
    syslog(LOG_INFO, "Thermal Guardian Service starting");

    // Set up the socket
    setup_socket();

    // Main loop
    monitor_temperature();

    // Clean up
    syslog(LOG_INFO, "Thermal Guardian Service stopping");
    closelog();

    return 0;
}
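
The skeleton calls `setup_socket()` without showing it. For a socket declared in the .rc file, init creates the socket before the service starts and exports its file descriptor in an environment variable named `ANDROID_SOCKET_<name>`. The sketch below shows only that lookup step, under the assumption that a plain `getenv` is acceptable; `parse_fd` and `control_socket_fd` are names invented here, and production code would normally call `android_get_control_socket()` from libcutils instead.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

// Parses a decimal fd number. Returns -1 for NULL, empty, malformed,
// negative or out-of-range input.
int parse_fd(const char *text) {
    if (text == NULL || *text == '\0') return -1;
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 0 || value > INT_MAX) return -1;
    return (int)value;
}

// Looks up the fd that init passed for `socket <name> ...`.
// Returns the fd, or -1 if the variable is missing or malformed.
int control_socket_fd(const char *name) {
    char key[128];
    int n = snprintf(key, sizeof(key), "ANDROID_SOCKET_%s", name);
    if (n < 0 || (size_t)n >= sizeof(key)) return -1;
    return parse_fd(getenv(key));
}
```

With the .rc file above, `control_socket_fd("thermal_guardian")` would yield a stream socket fd on which the daemon can call `listen()` and `accept()`.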

4. Build configuration

// Android.bp
cc_binary {
    name: "thermal_guardian",
    srcs: ["thermal_guardian.c"],
    vendor: true,
    cflags: [
        "-Wall",
        "-Werror",
        "-DLOG_TAG=\"thermal_guardian\"",
    ],
    shared_libs: [
        "libcutils",
        "liblog",
    ],
    init_rc: ["thermal_guardian.rc"],
}

prebuilt_etc {
    name: "thermal_config.json",
    src: "thermal_config.json",
    sub_dir: "",
    vendor: true,
}

5. Usage and debugging

5.1 Controlling the service manually

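# Start the service
adb shell start thermal_guardian

# Stop the service
adb shell stop thermal_guardian

# Send a custom signal (quoted so that pidof runs on the device; needs root)
adb shell 'kill -USR1 $(pidof thermal_guardian)'

# Check the service state
adb shell getprop | grep thermal

# Show the service process
adb shell ps -A | grep thermal_guardian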

5.2 Debugging commands

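# Enable debug mode, then restart the service
adb shell setprop persist.vendor.thermal.debug 1
adb shell stop thermal_guardian
adb shell start thermal_guardian

# Follow the live log
adb logcat -s thermal_guardian

# Check the temperatures (millidegrees Celsius)
adb shell "cat /sys/class/thermal/thermal_zone*/temp"

# Show the CPU frequency (kHz)
adb shell "cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq"

# Collect diagnostics (SIGUSR2 makes the daemon dump them; needs root)
adb shell 'kill -USR2 $(pidof thermal_guardian)'

# Check the socket
adb shell "netstat -anp | grep thermal_guardian"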

5.3 Monitoring script

#!/system/bin/sh
# monitor_thermal.sh - monitor the thermal-control service

LOG_FILE="/data/vendor/thermal/monitor.log"
INTERVAL=5

echo "=== Thermal Guardian Monitor ===" > "$LOG_FILE"
echo "Start time: $(date)" >> "$LOG_FILE"

while true; do
    echo "=== $(date) ===" >> "$LOG_FILE"

    # Check the service state (pidof avoids `ps | grep` matching itself)
    PID=$(pidof thermal_guardian)
    if [ -n "$PID" ]; then
        echo "Service: RUNNING" >> "$LOG_FILE"
        echo "PID: $PID" >> "$LOG_FILE"
    else
        echo "Service: STOPPED" >> "$LOG_FILE"
    fi

    # Check the temperature (sysfs reports millidegrees Celsius)
    TEMP=$(cat /sys/class/thermal/thermal_zone0/temp 2>/dev/null | awk '{print $1/1000}')
    echo "Temperature: ${TEMP}°C" >> "$LOG_FILE"

    # Check the CPU frequency
    FREQ=$(cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq 2>/dev/null)
    echo "CPU Frequency: ${FREQ} kHz" >> "$LOG_FILE"

    # Check the properties
    echo "Properties:" >> "$LOG_FILE"
    getprop | grep thermal >> "$LOG_FILE"

    sleep "$INTERVAL"
done
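
The script converts the sysfs reading from millidegrees with awk. Inside the daemon, the same step is easier to get right with `strtol` than with `atoi`, because malformed input can then be rejected instead of silently becoming 0. A minimal sketch follows; `parse_millicelsius` is a name invented for this illustration and is not part of the skeleton above.

```c
#include <errno.h>
#include <stdlib.h>

// Parses sysfs text such as "45123\n" into millidegrees Celsius.
// Trailing whitespace is allowed; anything else after the number is an error.
// Returns 0 on success and stores the value in *out_mc, or -1 on failure.
int parse_millicelsius(const char *text, long *out_mc) {
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (end == text || errno != 0) return -1;  // no digits, or overflow
    while (*end == '\n' || *end == '\r' || *end == ' ' || *end == '\t') end++;
    if (*end != '\0') return -1;               // trailing garbage
    *out_mc = value;
    return 0;
}
```

Dividing the stored value by 1000 gives whole degrees Celsius, matching the conversion in the monitoring loop.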

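6. Summary of design highlights

What stands out in this design:

Smart thermal policy:
- Adjusts the policy automatically for different scenarios (charging, gaming, power saving)
- Adjusts the temperature thresholds dynamically
Thorough error handling:
- Crash-count limit to prevent endless restarts
- Adaptive behavior when memory is low
- Watchdog monitoring
Rich debugging interface:
- Multiple log levels
- Packaged diagnostics
- Debug socket interface
Resource control:
- Memory limits
- CPU placement through a cpuset
- I/O priority adjustment
User-configurable:
- System behavior controlled through properties
- Configuration changes at runtime
- Multiple modes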

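7. Possible extensions

You can extend this framework with:

AI prediction: predict temperature changes from historical data
App awareness: adjust the policy per application
Cloud sync: upload temperature data for analysis
User feedback: tune the policy to the user's habits

This design applies the main principles of good init service design:

✅ Least privilege: runs as the system user and is granted only the capabilities it needs
✅ Error handling: crash recovery, watchdog monitoring, graceful degradation
✅ Logging: multiple log levels, log files, diagnostics
✅ Controlled resources: memory limits, CPU placement through a cpuset, I/O priority
✅ Easy debugging: socket interface, property control, diagnostic mode

After working through this complete example, you should be able to:

Understand how to design a complete init service
Handle each aspect of service configuration
Debug and maintain a service
Extend a service with new features
Apply these design principles to your own projects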

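Let us design another service: a simple system monitor that periodically collects system state (CPU, memory, battery and so on), writes it to a log file, and offers a socket interface through which other processes can query the current state.

Service name: sys_monitor

Location: /vendor/bin/sys_monitor (assumed to be on the vendor partition, since much MTK platform customization lives in vendor)

Requirements:

Least privilege: run as the system user; root is not needed.

Error handling: the service is not restarted after a crash (oneshot); a watchdog mechanism checks it periodically and restarts it through a property trigger if it is no longer running.

Thorough logging: the service writes its own log to logcat and to a file.

Controlled resource use: limit memory use and set CPU affinity.

Easy to debug and maintain: provide a socket interface for status queries and control the service through properties.

We will create two files:

sys_monitor.rc

sys_monitor (the executable; no concrete code is written here, only its function is described, because the focus is the .rc file)

Since we cannot write real C code here, only the .rc file is given, and the sys_monitor program is assumed to be already compiled and placed under /vendor/bin/.

Note: on MTK platforms custom services usually go on the vendor partition, so the .rc file is placed under vendor/etc/init/.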