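Introduction
This chapter explains how the Skill system integrates with the AgentScope Runtime Sandbox. Using a realistic AIOps scenario, it shows how to build Skills that execute scripts, access the file system, and call system tools.
Learning objectives:
- Understand the architecture of the Runtime Sandbox
- Understand how a Skill is tied to its file system layout
- Learn to implement script-executing Skills
- Integrate system tools (grep, cat, and so on)
- Understand the extension mechanism for custom tools
- Implement an AIOps scenario: fault diagnosis + report generation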
sandbox: github.com/agentscope-...
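1.1 What - Overall architecture
Background:
text
Capabilities a Skill needs:
├─ Execute shell scripts (fault diagnosis, log analysis)
├─ Call system tools (grep, awk, cat, etc.)
├─ Read the file system (logs, configuration, data)
├─ Manage an independent environment (isolated, secure, reproducible)
└─ Communicate across applications (Agent Runtime as Server)
Problems with the traditional approach:
❌ Security risk: system commands run directly on the host
❌ Environment pollution: execution affects the main application's environment
❌ Hard to manage: files end up scattered across the file system
❌ Hard to extend: integrating each new tool is tedious
Solution: Runtime Sandbox
text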
┌───────────────────────────────────────────────────────────────┐
│                    AgentScope Application                     │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │                  Skill Management System                  │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │ │
│ │  │  Diagnosis   │  │  Report Gen  │  │ Custom Tools │     │ │
│ │  │    Skill     │  │    Skill     │  │    Skill     │     │ │
│ │  └───────┬──────┘  └───────┬──────┘  └───────┬──────┘     │ │
│ │          │                 │                 │            │ │
│ │          └─────────────────┼─────────────────┘            │ │
│ │                            │                              │ │
│ │                  Runtime Client (gRPC)                    │ │
│ └────────────────────────────┼──────────────────────────────┘ │
│                              │                                │
│                    ┌─────────▼──────────┐                     │
│                    │   Runtime Server   │                     │
│                    │ (Sandbox Service)  │                     │
│                    └─────────┬──────────┘                     │
│                              │                                │
└──────────────────────────────┼────────────────────────────────┘
                               │
                  Network (may span servers)
                               │
        ┌──────────────────────▼──────────────────────┐
        │        Isolated Sandbox Environment         │
        │ ┌─────────────────────────────────────────┐ │
        │ │ File System (/tmp/skill_workspace)      │ │
        │ │ ├─ /diagnostics/                        │ │
        │ │ │  ├─ scripts/                          │ │
        │ │ │  │  ├─ check_cpu.sh                   │ │
        │ │ │  │  ├─ check_memory.sh                │ │
        │ │ │  │  └─ analyze_logs.sh                │ │
        │ │ │  ├─ reports/                          │ │
        │ │ │  │  └─ diagnosis_report.md            │ │
        │ │ │  └─ tools/                            │ │
        │ │ └─ /tools/                              │ │
        │ │    ├─ grep, cat, awk (system tools)     │ │
        │ │    ├─ custom_analyzer (custom tool)     │ │
        │ │    └─ ...                               │ │
        │ └─────────────────────────────────────────┘ │
        │                                             │
        │ Execution Engine                            │
        │ ├─ Script Executor                          │
        │ ├─ Tool Invoker                             │
        │ └─ Resource Monitor                         │
        └─────────────────────────────────────────────┘
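1.2 Why - Why this integration is needed
1. Security isolation
text
Without a Sandbox:
Agent runs a script → system commands execute directly on the host → critical paths such as /var or /etc can be deleted ❌
With a Sandbox:
Agent runs a script → it executes inside the Sandbox → the file system is isolated ✅
                    → a failing script does not affect the host system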
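One small piece of that isolation can be sketched in application code: refusing any path that would resolve outside the Skill workspace. The class and method names below (`WorkspacePaths`, `resolveInside`) are hypothetical and not part of AgentScope. This is only a lexical check; it does not follow symbolic links, so it complements OS-level isolation rather than replacing it.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: keeps every path a Skill touches inside its workspace. */
final class WorkspacePaths {
    private WorkspacePaths() {}

    /**
     * Resolves a requested path against the workspace root and rejects anything
     * that would land outside of it, such as "../../etc/passwd" or "/etc/passwd".
     */
    static Path resolveInside(Path workspaceRoot, String requested) {
        Path root = workspaceRoot.toAbsolutePath().normalize();
        // resolve() returns the argument itself when it is absolute, so the
        // startsWith() check below also catches absolute paths outside the root.
        Path candidate = root.resolve(requested).normalize();
        if (!candidate.startsWith(root)) {
            throw new SecurityException("Path escapes the sandbox workspace: " + requested);
        }
        return candidate;
    }

    public static void main(String[] args) {
        Path root = Paths.get("/tmp/skill_workspace");
        System.out.println(resolveInside(root, "skill_diagnosis/scripts/check_cpu.sh"));
        try {
            resolveInside(root, "../../etc/passwd");
        } catch (SecurityException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```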
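2. File system organization
text
The problem with ad hoc Skill file layouts:
- Skill A's script lives at /opt/scripts/a.sh
- Skill B's script lives at /opt/scripts/b.sh
- The report template lives at /var/templates/report.md
→ hard to manage, hard to version, prone to conflicts
Solution (Sandbox):
/tmp/skill_workspace/
├─ skill_diagnosis/
│  ├─ scripts/   (diagnostic scripts)
│  ├─ tools/     (diagnostic tools)
│  ├─ cache/     (diagnostic cache)
│  └─ reports/   (diagnostic reports)
├─ skill_reporting/
│  ├─ templates/ (report templates)
│  ├─ tools/     (reporting tools)
│  └─ outputs/   (generated reports)
→ clear, easy to version, container-friendly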
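As a minimal sketch of how such a per-Skill layout could be created, the helper below makes the four subdirectories shown for `skill_diagnosis`. `SkillWorkspaceLayout` is a hypothetical name, and the subdirectory list is taken from the tree above rather than from any AgentScope API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that creates one Skill's directory layout. */
final class SkillWorkspaceLayout {
    // Subdirectories taken from the skill_diagnosis example above.
    static final List<String> SUBDIRS = List.of("scripts", "tools", "cache", "reports");

    private SkillWorkspaceLayout() {}

    /** Creates <workspaceRoot>/<skillName>/{scripts,tools,cache,reports} and returns the Skill root. */
    static Path create(Path workspaceRoot, String skillName) throws IOException {
        Path skillRoot = workspaceRoot.resolve(skillName);
        for (String sub : SUBDIRS) {
            // createDirectories is a no-op when the directory already exists.
            Files.createDirectories(skillRoot.resolve(sub));
        }
        return skillRoot;
    }
}
```

Calling `SkillWorkspaceLayout.create(Paths.get("/tmp/skill_workspace"), "skill_diagnosis")` twice is safe, which makes the helper suitable for start-up code.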
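3. Tool management
text
Traditional approach:
- System tools (grep, cat, awk) are called directly on the host
- Custom tools must be installed into system directories
- Updates and rollbacks are difficult
Sandbox approach:
- The Sandbox ships with the system tools it needs
- Custom tools are mounted dynamically or built into the Sandbox
- Tool versions are bound to the Skill, which makes them easy to manage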
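4. Agent Runtime as Server
text
Distributed deployment:
┌──────────────┐         ┌──────────────┐
│ Agent App 1  │         │ Agent App 2  │
│ (Service A)  │         │ (Service B)  │
└───────┬──────┘         └───────┬──────┘
        │                        │
        └────────────┬───────────┘
                     │
         gRPC/HTTP (cross-network)
                     │
            ┌────────▼────────┐
            │ Runtime Server  │
            │    (Shared)     │
            └─────────────────┘
Advantages:
✅ Multiple applications share one Runtime Server
✅ Sandbox resources are managed centrally
✅ The file system is managed in one place
✅ Lower cost, because sandbox infrastructure is shared instead of duplicated per application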
1.3 Runtime Sandbox architecture in detail
java
import java.io.File;
import java.util.List;
import java.util.Map;

/**
 * Core architecture of the Runtime Sandbox (illustrative skeleton)
 */
// 1. Sandbox configuration
public class SandboxConfig {
    private String sandboxId;            // unique sandbox ID
    private String workspaceRoot;        // working directory, e.g. /tmp/skill_workspace
    private long maxMemory;              // memory limit, e.g. 512 MB
    private long maxCpuTime;             // CPU time limit, e.g. 30 s
    private List<String> allowedTools;   // tools the sandbox may run
    private Map<String, String> env;     // environment variables
}

// 2. File system isolation
public class IsolatedFileSystem {
    private File workspaceRoot;
    private Map<String, SkillFileSystemLayout> skillLayouts;

    // Each Skill has its own directory structure
    public static class SkillFileSystemLayout {
        public File scriptsDir;  // /workspace/skill_name/scripts/
        public File toolsDir;    // /workspace/skill_name/tools/
        public File cacheDir;    // /workspace/skill_name/cache/
        public File outputDir;   // /workspace/skill_name/output/
        public File configDir;   // /workspace/skill_name/config/
    }
}

// 3. Script executor
public class ScriptExecutor {
    public ExecutionResult execute(
            String sandboxId,
            String scriptPath,
            Map<String, String> env,
            long timeout) {
        // Runs the script inside the Sandbox and
        // returns exitCode, stdout, stderr, and executionTime
        throw new UnsupportedOperationException("architecture sketch only");
    }
}

// 4. Tool registry (simplified here; section 4.2 gives a fuller version)
public class ToolRegistry {
    private Map<String, ToolDefinition> tools;

    // System tools: grep, cat, awk, sed, tail, ...
    // Custom tools: custom_analyzer, log_parser, ...
    public ToolDefinition registerTool(String name, String path, String version) {
        // Registers the tool with the Sandbox
        throw new UnsupportedOperationException("architecture sketch only");
    }
}

// 5. Resource monitor
public class ResourceMonitor {
    public void monitorExecution(
            String sandboxId,
            long timeout,
            long maxMemory) {
        // Monitors CPU, memory, disk, and other resources;
        // kills the process automatically when a limit is exceeded
    }
}
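The timeout half of the Resource Monitor can be illustrated with the standard `Process` API: wait for a bounded time, then kill the process. This is a sketch under stated assumptions, not the Runtime Sandbox implementation. `TimedRunner` and `Outcome` are hypothetical names, the command runs as a plain local process, and memory limits are not enforced here (those would typically need containers or cgroups).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: run a command with a hard wall-clock timeout. */
final class TimedRunner {
    /** Result of one run; exitCode is -1 when the process was killed on timeout. */
    static final class Outcome {
        final int exitCode;
        final boolean timedOut;
        final String output;

        Outcome(int exitCode, boolean timedOut, String output) {
            this.exitCode = exitCode;
            this.timedOut = timedOut;
            this.output = output;
        }
    }

    private TimedRunner() {}

    static Outcome run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        // Redirect output to a file so a chatty process cannot block on a full pipe.
        Path log = Files.createTempFile("sandbox-run", ".log");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(log.toFile())
                    .start();
            boolean finished = process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!finished) {
                process.destroyForcibly(); // limit exceeded: kill the process
                process.waitFor();         // wait until it is really gone
            }
            String output = new String(Files.readAllBytes(log), StandardCharsets.UTF_8);
            return new Outcome(finished ? process.exitValue() : -1, !finished, output);
        } finally {
            Files.deleteIfExists(log);
        }
    }
}
```

For example, `TimedRunner.run(List.of("sleep", "5"), 200)` returns with `timedOut` set after roughly 200 ms instead of blocking for five seconds.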
2. File system organization for Skills
2.1 Standard Skill file structure
text
/tmp/skill_workspace/
│
├─ skill_aiops_diagnostics/            # AIOps diagnostics Skill
│  ├─ skill_manifest.yaml              # Skill metadata
│  ├─ scripts/                         # diagnostic scripts
│  │  ├─ cpu_diagnostics.sh            # CPU diagnostics
│  │  ├─ memory_diagnostics.sh         # memory diagnostics
│  │  ├─ disk_diagnostics.sh           # disk diagnostics
│  │  ├─ network_diagnostics.sh        # network diagnostics
│  │  ├─ log_analysis.sh               # log analysis
│  │  └─ correlation_analysis.py       # fault correlation analysis
│  │
│  ├─ tools/                           # tool set
│  │  ├─ system/                       # system tools
│  │  │  ├─ grep                       # text search
│  │  │  ├─ awk                        # text processing
│  │  │  ├─ sed                        # stream editor
│  │  │  └─ tail                       # end of file
│  │  │
│  │  └─ custom/                       # custom tools
│  │     ├─ log_parser                 # log parser
│  │     ├─ metric_aggregator          # metric aggregator
│  │     └─ anomaly_detector           # anomaly detector
│  │
│  ├─ cache/                           # cache directory
│  │  ├─ metrics_cache/                # metric cache
│  │  └─ analysis_cache/               # analysis cache
│  │
│  ├─ reports/                         # report output
│  │  ├─ diagnosis_reports/            # diagnostic reports
│  │  └─ analysis_results/             # analysis results
│  │
│  └─ config/                          # configuration files
│     ├─ diagnostic_rules.yaml         # diagnostic rules
│     ├─ thresholds.yaml               # alert thresholds
│     └─ tool_config.yaml              # tool configuration
│
├─ skill_report_generation/            # report generation Skill
│  ├─ skill_manifest.yaml
│  ├─ templates/                       # report templates
│  │  ├─ executive_summary.md          # executive summary template
│  │  ├─ detailed_analysis.md          # detailed analysis template
│  │  ├─ recommendations.md            # recommendations template
│  │  └─ styles/
│  │     └─ markdown.css               # styles
│  │
│  ├─ generators/                      # report generators
│  │  ├─ pdf_generator.py              # PDF generation
│  │  ├─ html_generator.py             # HTML generation
│  │  └─ markdown_generator.py         # Markdown generation
│  │
│  ├─ outputs/                         # generated reports
│  │  ├─ diagnosis_report_2025_01_04.pdf
│  │  ├─ diagnosis_report_2025_01_04.html
│  │  └─ diagnosis_report_2025_01_04.md
│  │
│  └─ config/
│     ├─ template_config.yaml          # template configuration
│     └─ style_config.yaml             # style configuration
│
└─ shared_tools/                       # shared tools
   ├─ log_processor                    # log processing tool
   ├─ data_analyzer                    # data analysis tool
   └─ report_formatter                 # report formatting tool
2.2 Skill Manifest definition
yaml
# skill_manifest.yaml - Skill metadata
apiVersion: v1
kind: SkillManifest
metadata:
  name: aiops_diagnostics
  version: "1.0.0"
  description: "AIOps fault-diagnosis Skill"
  author: "Platform Team"
  lastModified: "2025-01-04"
spec:
  # Basic Skill information
  skillType: "diagnostic"   # diagnostic / reporting / analysis
  # File system requirements
  filesystem:
    workspaceSize: "1GB"
    persistentDirs:
      - cache/
      - config/
    tempDirs:
      - reports/
  # Script definitions
  scripts:
    - name: "cpu_diagnostics"
      path: "scripts/cpu_diagnostics.sh"
      language: "bash"
      timeout: "30s"
      inputs:
        - name: "duration"
          type: "integer"
          description: "Diagnostic duration (seconds)"
      outputs:
        - name: "cpu_report"
          type: "json"
          description: "CPU diagnostic result"
    - name: "log_analysis"
      path: "scripts/log_analysis.sh"
      language: "bash"
      timeout: "60s"
      inputs:
        - name: "log_file"
          type: "string"
          description: "Path to the log file"
        - name: "keywords"
          type: "array"
          description: "List of keywords"
      outputs:
        - name: "analysis_result"
          type: "json"
          description: "Analysis result"
    - name: "correlation_analysis"
      path: "scripts/correlation_analysis.py"
      language: "python"
      timeout: "120s"
      inputs:
        - name: "metrics_data"
          type: "json"
          description: "Metric data"
      outputs:
        - name: "correlation_report"
          type: "json"
          description: "Correlation analysis report"
  # Tool dependencies
  tools:
    system:
      - name: "grep"
        version: "*"
        required: true
      - name: "awk"
        version: "*"
        required: true
      - name: "sed"
        version: "*"
        required: false
      - name: "tail"
        version: "*"
        required: true
    custom:
      - name: "log_parser"
        path: "tools/custom/log_parser"
        version: "1.0.0"
        required: true
      - name: "metric_aggregator"
        path: "tools/custom/metric_aggregator"
        version: "1.0.0"
        required: true
  # Environment variables
  environment:
    JAVA_HOME: "/usr/lib/jvm/java-17"
    LOG_LEVEL: "INFO"
    CACHE_DIR: "${WORKSPACE}/cache"
    REPORT_DIR: "${WORKSPACE}/reports"
  # Resource limits
  resources:
    cpu: "2"            # at most 2 cores
    memory: "512Mi"     # at most 512 MiB
    timeout: "300s"     # at most 5 minutes
    diskSpace: "1Gi"    # at most 1 GiB of disk
  # Security permissions
  security:
    readPaths:
      - "/var/log/"
      - "/proc/stat"
      - "/proc/meminfo"
    writePaths:
      - "${WORKSPACE}/cache/"
      - "${WORKSPACE}/reports/"
    networkAccess: false   # no network access
    executablePaths:
      - "scripts/"
      - "tools/"
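The manifest expresses limits as strings such as "30s", "512Mi", and "1Gi", so whatever loads it has to turn them into numbers. The sketch below shows one way to do that. `ManifestUnits` is a hypothetical helper, and it assumes the suffixes are binary (Ki/Mi/Gi, in the Kubernetes style). Decimal suffixes such as the "1GB" used for `workspaceSize` are deliberately rejected instead of guessed at.

```java
import java.time.Duration;

/** Hypothetical parser for the quantity strings used in skill_manifest.yaml. */
final class ManifestUnits {
    private ManifestUnits() {}

    /** Parses "500ms", "30s", or "5m" into a Duration. */
    static Duration parseTimeout(String text) {
        String s = text.trim();
        // "ms" must be checked before "s" because both end with 's'.
        if (s.endsWith("ms")) {
            return Duration.ofMillis(Long.parseLong(s.substring(0, s.length() - 2)));
        }
        if (s.endsWith("s")) {
            return Duration.ofSeconds(Long.parseLong(s.substring(0, s.length() - 1)));
        }
        if (s.endsWith("m")) {
            return Duration.ofMinutes(Long.parseLong(s.substring(0, s.length() - 1)));
        }
        throw new IllegalArgumentException("Unsupported duration: " + text);
    }

    /** Parses "512Mi" or "1Gi" (binary suffixes) into a byte count; a bare number is taken as bytes. */
    static long parseBytes(String text) {
        String s = text.trim();
        if (s.endsWith("Ki")) return number(s) * 1024L;
        if (s.endsWith("Mi")) return number(s) * 1024L * 1024L;
        if (s.endsWith("Gi")) return number(s) * 1024L * 1024L * 1024L;
        // Anything else, including "1GB", fails here with NumberFormatException.
        return Long.parseLong(s);
    }

    private static long number(String withTwoCharSuffix) {
        return Long.parseLong(withTwoCharSuffix.substring(0, withTwoCharSuffix.length() - 2));
    }
}
```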
3. Implementing the AIOps fault-diagnosis Skill
3.1 Skill base framework
java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Base class for AIOps fault-diagnosis Skills
 */
public abstract class AIOpsDiagnosticSkill extends AgentSkill {
    protected static final Logger logger = LoggerFactory.getLogger(AIOpsDiagnosticSkill.class);

    // Runtime Sandbox client
    protected final SandboxClient sandboxClient;
    // Skill file system
    protected final SkillFileSystem fileSystem;
    // Tool registry
    protected final ToolRegistry toolRegistry;
    // Diagnostic rules
    protected final DiagnosticRuleEngine ruleEngine;

    public AIOpsDiagnosticSkill(
            String skillName,
            SandboxClient sandboxClient,
            SkillFileSystem fileSystem,
            ToolRegistry toolRegistry) {
        super(skillName);
        this.sandboxClient = sandboxClient;
        this.fileSystem = fileSystem;
        this.toolRegistry = toolRegistry;
        this.ruleEngine = new DiagnosticRuleEngine(this.fileSystem);
    }

    /**
     * Runs a diagnostic script
     */
    protected ExecutionResult executeScript(
            String scriptName,
            Map<String, String> inputs,
            long timeout) throws Exception {
        logger.info("Executing diagnostic script: {}", scriptName);
        // Look up the script path
        String scriptPath = fileSystem.getScriptPath(scriptName);
        // Build the command line
        String command = buildCommand(scriptPath, inputs);
        // Run it inside the Sandbox
        ExecutionResult result = sandboxClient.executeScript(
                getSandboxId(),
                command,
                timeout,
                getEnvironmentVariables());
        logger.info("Script finished: exitCode={}, duration={}ms",
                result.getExitCode(), result.getExecutionTime());
        return result;
    }

    /**
     * Invokes a system tool
     */
    protected String invokeTool(
            String toolName,
            List<String> args) throws Exception {
        logger.debug("Invoking tool: {} with args: {}", toolName, args);
        ToolDefinition tool = toolRegistry.getTool(toolName);
        if (tool == null) {
            throw new ToolNotFoundException(toolName);
        }
        // Build the full command
        List<String> command = new ArrayList<>();
        command.add(tool.getPath());
        command.addAll(args);
        // Run the command.
        // NOTE: the arguments are joined without quoting, so any value that contains
        // spaces or shell metacharacters must be escaped before it reaches this point.
        ExecutionResult result = sandboxClient.execute(
                getSandboxId(),
                String.join(" ", command),
                30000 // 30-second timeout
        );
        if (result.getExitCode() != 0) {
            logger.error("Tool execution failed: {}", result.getStderr());
            throw new ToolExecutionException(toolName, result.getStderr());
        }
        return result.getStdout();
    }

    /**
     * Reads a file
     */
    protected String readFile(String filePath) throws Exception {
        String fullPath = fileSystem.resolvePath(filePath);
        // Read the file with the cat tool
        return invokeTool("cat", List.of(fullPath));
    }

    /**
     * Analyzes a log file
     */
    protected LogAnalysisResult analyzeLog(
            String logFilePath,
            List<String> keywords) throws Exception {
        logger.info("Analyzing log file: {}", logFilePath);
        // Run the log analysis script
        Map<String, String> inputs = Map.of(
                "log_file", logFilePath,
                "keywords", String.join(",", keywords));
        ExecutionResult result = executeScript("log_analysis", inputs, 60000);
        // Parse the result
        return parseAnalysisResult(result.getStdout());
    }

    protected abstract String buildCommand(String scriptPath, Map<String, String> inputs);
    protected abstract Map<String, String> getEnvironmentVariables();
    protected abstract String getSandboxId();
    protected abstract LogAnalysisResult parseAnalysisResult(String output);
}
/**
 * Execution result
 */
@Data
@Builder
class ExecutionResult {
    private String sandboxId;
    private int exitCode;
    private String stdout;
    private String stderr;
    private long executionTime;
    private Map<String, String> metrics; // CPU, memory, etc.
}

/**
 * Log analysis result
 */
@Data
@Builder
class LogAnalysisResult {
    private List<LogEntry> errorLines;
    private List<LogEntry> warningLines;
    private Map<String, Integer> keywordCounts;
    private String summary;
    private LocalDateTime analysisTime;
}

@Data
@Builder
class LogEntry {
    private int lineNumber;
    private String timestamp;
    private String level;
    private String message;
    private String context;
}
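`invokeTool` above builds its command line with `String.join(" ", command)`, which breaks as soon as an argument contains a space and allows injection if an argument contains shell metacharacters. Assuming the sandbox hands the string to a POSIX shell (the text does not say how `sandboxClient.execute` interprets it), one mitigation is to single-quote every argument. `ShellQuoting` is a hypothetical helper, not an AgentScope class.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for building shell-safe command lines. */
final class ShellQuoting {
    private ShellQuoting() {}

    /**
     * Wraps one argument in single quotes. Inside single quotes a POSIX shell treats
     * every character literally, so only the quote itself needs handling: close the
     * quoted string, emit an escaped quote, and reopen it.
     */
    static String quote(String arg) {
        return "'" + arg.replace("'", "'\\''") + "'";
    }

    /** Quotes the tool path and every argument, then joins them with spaces. */
    static String buildCommandLine(String toolPath, List<String> args) {
        return Stream.concat(Stream.of(toolPath), args.stream())
                .map(ShellQuoting::quote)
                .collect(Collectors.joining(" "));
    }
}
```

With this helper, a hostile file name such as `app.log; rm -rf /` reaches `cat` as a single literal argument instead of being split into two commands.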
3.2 CPU diagnostic script
bash
#!/bin/bash
# scripts/cpu_diagnostics.sh - CPU diagnostic script
set -e

# Diagnostic duration in seconds (accepted as an input but not used below)
DURATION=${1:-30}
OUTPUT_FILE="/workspace/reports/cpu_diagnostics_$(date +%s).json"
mkdir -p "$(dirname "$OUTPUT_FILE")"

# Progress messages go to stderr so that stdout carries only the JSON report
echo "Starting CPU diagnostics (${DURATION}s)..." >&2

# Build the JSON report
{
    echo "{"
    echo "  \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\","
    echo "  \"diagnostics\": {"

    # 1. Number of CPU cores
    CPU_CORES=$(grep -c "^processor" /proc/cpuinfo)
    echo "    \"cpu_cores\": $CPU_CORES,"

    # 2. CPU model
    CPU_MODEL=$(grep "^model name" /proc/cpuinfo | head -1 | cut -d: -f2 | xargs)
    echo "    \"cpu_model\": \"$CPU_MODEL\","

    # 3. Current CPU usage
    echo "    \"current_usage\": {"

    # Read the load averages from /proc/loadavg. The trailing "_" absorbs the
    # remaining fields so that LA15 holds only the 15-minute value.
    read -r LA1 LA5 LA15 _ < /proc/loadavg
    echo "      \"load_average\": {"
    echo "        \"1m\": $LA1,"
    echo "        \"5m\": $LA5,"
    echo "        \"15m\": $LA15"
    echo "      },"

    # Per-core usage from the cumulative counters in /proc/stat
    # (an average since boot, not an instantaneous reading)
    echo "      \"per_cpu\": ["
    FIRST=true
    while IFS= read -r line; do
        if [[ $line == cpu[0-9]* ]]; then
            if [ "$FIRST" = false ]; then echo ","; fi
            FIRST=false
            # Parse the CPU counters
            FIELDS=($line)
            CPU_USER=${FIELDS[1]}
            CPU_NICE=${FIELDS[2]}
            CPU_SYSTEM=${FIELDS[3]}
            CPU_IDLE=${FIELDS[4]}
            TOTAL=$((CPU_USER + CPU_NICE + CPU_SYSTEM + CPU_IDLE))
            USAGE=$((100 * (TOTAL - CPU_IDLE) / TOTAL))
            echo "        {"
            echo "          \"cpu\": \"${FIELDS[0]}\","
            echo "          \"usage_percent\": $USAGE"
            echo "        }" | tr -d '\n'
        fi
    done < /proc/stat
    echo ""
    echo "      ]"
    echo "    },"

    # 4. Processes with the highest CPU usage
    echo "    \"top_processes\": ["
    FIRST=true
    ps aux --sort=-%cpu | tail -n +2 | head -5 | while read -r line; do
        if [ "$FIRST" = false ]; then echo ","; fi
        FIRST=false
        IFS=' ' read -r PUSER PID PCPU PMEM VSZ RSS TTY STAT START TIME CMD <<< "$line"
        echo "      {"
        echo "        \"pid\": $PID,"
        echo "        \"user\": \"$PUSER\","
        echo "        \"cpu_percent\": $PCPU,"
        echo "        \"memory_percent\": $PMEM,"
        echo "        \"command\": \"$(echo "$CMD" | cut -d' ' -f1)\""
        echo "      }" | tr -d '\n'
    done
    echo ""
    echo "    ]"
    echo "  },"

    # 5. Recommendations (awk does the floating-point comparisons)
    echo "  \"recommendations\": ["
    if awk -v load="$LA1" -v cores="$CPU_CORES" 'BEGIN { exit !(load > cores) }'; then
        echo "    \"Warning: the 1-minute load average exceeds the number of CPU cores; the system may be overloaded\","
    fi
    if awk -v load="$LA5" -v cores="$CPU_CORES" 'BEGIN { exit !(load > cores * 0.8) }'; then
        echo "    \"Note: the 5-minute load average is high; check background jobs\","
    fi
    echo "    \"Monitor CPU usage regularly\""
    echo "  ]"
    echo "}"
} > "$OUTPUT_FILE"

# Report where the result was saved (stderr), then print the JSON (stdout)
echo "Diagnostics complete; results saved to: $OUTPUT_FILE" >&2
cat "$OUTPUT_FILE"
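The script derives `usage_percent` from the cumulative counters in `/proc/stat`, which yields an average since boot. A "current" figure needs two samples and the difference between them. The sketch below shows that calculation for a pair of `cpuN ...` lines. `CpuUsage` is a hypothetical helper, and taking the two samples (for example, one second apart) is left to the caller.

```java
/** Hypothetical helper: CPU busy percentage between two /proc/stat samples. */
final class CpuUsage {
    private CpuUsage() {}

    /** Returns the busy percentage between an earlier and a later "cpuN ..." line. */
    static double percentBetween(String earlier, String later) {
        long[] a = parse(earlier);
        long[] b = parse(later);
        long totalDelta = b[0] - a[0];
        long idleDelta = b[1] - a[1];
        if (totalDelta <= 0) {
            return 0.0; // no time elapsed between the samples
        }
        return 100.0 * (totalDelta - idleDelta) / totalDelta;
    }

    /**
     * Returns {total, idle} in clock ticks. Fields after the label are
     * user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice.
     * Guest time is already included in user/nice, so only the first eight are summed.
     */
    private static long[] parse(String line) {
        String[] f = line.trim().split("\\s+");
        long total = 0;
        for (int i = 1; i < Math.min(f.length, 9); i++) {
            total += Long.parseLong(f[i]);
        }
        long idle = Long.parseLong(f[4]) + (f.length > 5 ? Long.parseLong(f[5]) : 0);
        return new long[] {total, idle};
    }
}
```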
3.3 Memory diagnostic Skill implementation
java
/**
 * Memory diagnostic Skill
 */
public class MemoryDiagnosticSkill extends AIOpsDiagnosticSkill {
    private static final Logger logger = LoggerFactory.getLogger(MemoryDiagnosticSkill.class);
    private final MetricAggregator metricAggregator;

    public MemoryDiagnosticSkill(
            SandboxClient sandboxClient,
            SkillFileSystem fileSystem,
            ToolRegistry toolRegistry) {
        super("memory_diagnostic", sandboxClient, fileSystem, toolRegistry);
        this.metricAggregator = new MetricAggregator();
    }

    /**
     * Runs the memory diagnostics
     */
    public MemoryDiagnosticReport diagnose() throws Exception {
        logger.info("Starting memory diagnostics...");
        // Lombok's @Builder names the generated builder class <Type>Builder
        MemoryDiagnosticReport.MemoryDiagnosticReportBuilder reportBuilder =
                MemoryDiagnosticReport.builder().timestamp(LocalDateTime.now());
        try {
            // Step 1: collect memory statistics
            MemoryStatistics stats = getMemoryStatistics();
            reportBuilder.statistics(stats);
            logger.info("Memory statistics: total={}MB, used={}MB, available={}MB",
                    stats.getTotalMemory() / 1024 / 1024,
                    stats.getUsedMemory() / 1024 / 1024,
                    stats.getAvailableMemory() / 1024 / 1024);
            // Step 2: analyze memory usage trends
            List<MemoryMetric> trends = analyzeMemoryTrends();
            reportBuilder.trends(trends);
            // Step 3: look for memory leaks
            MemoryLeakDetection leakDetection = detectMemoryLeaks();
            reportBuilder.leakDetection(leakDetection);
            // Step 4: find the processes using the most memory
            List<ProcessMemoryInfo> topProcesses = getTopMemoryConsumers(5);
            reportBuilder.topProcesses(topProcesses);
            // Step 5: generate recommendations
            List<String> recommendations = generateRecommendations(stats, leakDetection);
            reportBuilder.recommendations(recommendations);
            reportBuilder.status("SUCCESS");
        } catch (Exception e) {
            logger.error("Memory diagnostics failed", e);
            reportBuilder.status("FAILED").error(e.getMessage());
        }
        return reportBuilder.build();
    }
    /**
     * Collects memory statistics
     */
    private MemoryStatistics getMemoryStatistics() throws Exception {
        // Read the memory information through the system tool
        String meminfo = readFile("/proc/meminfo");
        // Parse /proc/meminfo
        Map<String, Long> memStats = new HashMap<>();
        for (String line : meminfo.split("\n")) {
            if (line.trim().isEmpty()) continue;
            // Format: MemTotal: 16334732 kB
            String[] parts = line.split(":");
            if (parts.length == 2) {
                String key = parts[0].trim();
                String value = parts[1].trim().split("\\s+")[0];
                memStats.put(key, Long.parseLong(value) * 1024); // kB to bytes
            }
        }
        return MemoryStatistics.builder()
                .totalMemory(memStats.getOrDefault("MemTotal", 0L))
                .usedMemory(memStats.getOrDefault("MemTotal", 0L) -
                        memStats.getOrDefault("MemAvailable", 0L))
                .availableMemory(memStats.getOrDefault("MemAvailable", 0L))
                .buffers(memStats.getOrDefault("Buffers", 0L))
                .cached(memStats.getOrDefault("Cached", 0L))
                .swapTotal(memStats.getOrDefault("SwapTotal", 0L))
                .swapUsed(memStats.getOrDefault("SwapTotal", 0L) -
                        memStats.getOrDefault("SwapFree", 0L))
                .timestamp(LocalDateTime.now())
                .build();
    }

    /**
     * Analyzes memory usage trends
     */
    private List<MemoryMetric> analyzeMemoryTrends() throws Exception {
        logger.info("Analyzing memory usage trends...");
        // Load historical data from the cache
        File cacheDir = fileSystem.getCacheDir();
        List<MemoryMetric> trends = new ArrayList<>();
        // Read historical metrics (JSON), oldest first
        File[] cacheFiles = cacheDir.listFiles(
                (dir, name) -> name.startsWith("memory_") && name.endsWith(".json"));
        if (cacheFiles != null) {
            Arrays.sort(cacheFiles, Comparator.comparingLong(File::lastModified));
            for (File cacheFile : cacheFiles) {
                MemoryMetric metric = parseMemoryMetric(readFile(cacheFile.getAbsolutePath()));
                trends.add(metric);
            }
        }
        // Save the current metric to the cache
        MemoryMetric currentMetric = MemoryMetric.builder()
                .timestamp(LocalDateTime.now())
                .memoryUsagePercent(getCurrentMemoryUsagePercent())
                .build();
        String cacheFile = cacheDir.getAbsolutePath() + "/memory_" + System.currentTimeMillis() + ".json";
        saveMetricToCache(cacheFile, currentMetric);
        trends.add(currentMetric);
        return trends;
    }
    /**
     * Detects memory leaks
     */
    private MemoryLeakDetection detectMemoryLeaks() throws Exception {
        logger.info("Checking for memory leaks...");
        // Run the custom leak-detection tool. invokeTool() resolves names through
        // ToolRegistry.getTool(), so "memory_leak_detector" must be registered there.
        String result = invokeTool("memory_leak_detector", List.of(
                "--duration", "60",
                "--threshold", "80"));
        // Parse the result
        return parseLeakDetectionResult(result);
    }

    /**
     * Returns the processes that use the most memory
     */
    private List<ProcessMemoryInfo> getTopMemoryConsumers(int topN) throws Exception {
        // Ask ps for all processes sorted by resident memory (RSS), largest first
        String psOutput = invokeTool("ps", Arrays.asList(
                "aux", "--sort=-rss"));
        List<ProcessMemoryInfo> processes = new ArrayList<>();
        String[] lines = psOutput.split("\n");
        // Line 0 is the header row
        for (int i = 1; i < Math.min(lines.length, topN + 1); i++) {
            String[] fields = lines[i].trim().split("\\s+");
            if (fields.length < 11) continue; // skip malformed rows
            ProcessMemoryInfo info = ProcessMemoryInfo.builder()
                    .pid(Integer.parseInt(fields[1]))
                    .user(fields[0])
                    .memoryPercent(Double.parseDouble(fields[3]))
                    .memoryRss(Long.parseLong(fields[5]) * 1024) // kB to bytes
                    .command(fields[10])
                    .build();
            processes.add(info);
        }
        return processes;
    }
    /**
     * Generates diagnostic recommendations
     */
    private List<String> generateRecommendations(
            MemoryStatistics stats,
            MemoryLeakDetection leakDetection) {
        List<String> recommendations = new ArrayList<>();
        double usagePercent = (double) stats.getUsedMemory() / stats.getTotalMemory() * 100;
        if (usagePercent > 90) {
            recommendations.add("⚠️ Critical: memory usage is above 90%; the system is at risk of OOM");
            recommendations.add("   Action: free unneeded memory immediately and consider adding physical memory");
        } else if (usagePercent > 80) {
            recommendations.add("⚠️ Warning: memory usage is above 80%; performance may degrade");
            recommendations.add("   Action: monitor memory usage and identify memory-heavy applications");
        } else if (usagePercent > 70) {
            recommendations.add("📌 Note: memory usage is above 70%; periodic cleanup is advised");
        }
        if (leakDetection.isDetected()) {
            recommendations.add("⚠️ Possible memory leak detected: " + leakDetection.getDescription());
            recommendations.add("   Action: analyze the logs and heap dumps of the affected application");
        }
        if (stats.getSwapUsed() > 0) {
            double swapPercent = (double) stats.getSwapUsed() / stats.getSwapTotal() * 100;
            if (swapPercent > 50) {
                recommendations.add("⚠️ Warning: swap usage is too high (" + String.format("%.1f", swapPercent) + "%); performance is severely degraded");
                recommendations.add("   Action: check for memory leaks or unreasonable memory consumption");
            }
        }
        recommendations.add("💡 Tip: run memory diagnostics regularly to catch problems early");
        return recommendations;
    }
    @Override
    protected String buildCommand(String scriptPath, Map<String, String> inputs) {
        // Runs the script as-is; the inputs map is not passed to the script here
        return scriptPath;
    }

    @Override
    protected Map<String, String> getEnvironmentVariables() {
        return Map.of(
                "JAVA_HOME", "/usr/lib/jvm/java-17",
                "LOG_LEVEL", "INFO");
    }

    @Override
    protected String getSandboxId() {
        return "aiops_sandbox_memory";
    }

    @Override
    protected LogAnalysisResult parseAnalysisResult(String output) {
        // Placeholder: parsing of the log analysis result is not implemented
        return null;
    }

    // Helper methods...
    private double getCurrentMemoryUsagePercent() throws Exception {
        MemoryStatistics stats = getMemoryStatistics();
        return (double) stats.getUsedMemory() / stats.getTotalMemory() * 100;
    }

    private MemoryMetric parseMemoryMetric(String json) {
        // Placeholder: parse a memory metric stored as JSON
        return MemoryMetric.builder().build();
    }

    private void saveMetricToCache(String path, MemoryMetric metric) throws Exception {
        // Placeholder: save the metric to the cache
    }

    private MemoryLeakDetection parseLeakDetectionResult(String result) {
        // Naive heuristic: any output containing "leak" counts as a detection,
        // including a message such as "no leak found"
        return MemoryLeakDetection.builder()
                .detected(result.contains("leak"))
                .description(result)
                .build();
    }
}
// Data models
@Data
@Builder
class MemoryDiagnosticReport {
private LocalDateTime timestamp;
private MemoryStatistics statistics;
private List<MemoryMetric> trends;
private MemoryLeakDetection leakDetection;
private List<ProcessMemoryInfo> topProcesses;
private List<String> recommendations;
private String status;
private String error;
}
@Data
@Builder
class MemoryStatistics {
private long totalMemory;
private long usedMemory;
private long availableMemory;
private long buffers;
private long cached;
private long swapTotal;
private long swapUsed;
private LocalDateTime timestamp;
}
@Data
@Builder
class MemoryMetric {
private LocalDateTime timestamp;
private double memoryUsagePercent;
}
@Data
@Builder
class MemoryLeakDetection {
private boolean detected;
private String description;
}
@Data
@Builder
class ProcessMemoryInfo {
private int pid;
private String user;
private double memoryPercent;
private long memoryRss;
private String command;
}
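The `trends` list collected by `analyzeMemoryTrends()` is only stored in the report; nothing above interprets it. One simple interpretation, offered here as an assumption and not as what `memory_leak_detector` does, is to fit a least-squares line through the usage percentages and treat a sustained positive slope as a hint worth investigating. `MemoryTrend` and its threshold parameter are hypothetical.

```java
/** Hypothetical helper: slope of memory usage over successive samples. */
final class MemoryTrend {
    private MemoryTrend() {}

    /** Least-squares slope of usage (percent) against the sample index 0..n-1. */
    static double slopePerSample(double[] usagePercent) {
        int n = usagePercent.length;
        if (n < 2) {
            return 0.0; // a single sample has no trend
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (double y : usagePercent) {
            meanY += y;
        }
        meanY /= n;
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (usagePercent[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /** Flags a steady rise of at least minSlope percentage points per sample. */
    static boolean looksLikeLeak(double[] usagePercent, double minSlope) {
        return usagePercent.length >= 3 && slopePerSample(usagePercent) >= minSlope;
    }
}
```

A rising slope alone does not prove a leak, because caches warming up produce the same shape, so a result like this is best reported as a prompt for heap-dump analysis.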
4. Developing and integrating custom tools
4.1 Custom tool framework
java
/**
 * Base class for custom tools
 */
public abstract class CustomTool {
    protected final Logger logger = LoggerFactory.getLogger(getClass());
    protected final ToolContext context;

    public CustomTool(ToolContext context) {
        this.context = context;
    }

    /**
     * Returns the tool's metadata
     */
    public abstract ToolMetadata getMetadata();

    /**
     * Runs the tool
     */
    public abstract ToolResult execute(Map<String, String> args) throws Exception;

    /**
     * Tool metadata
     */
    @Data
    @Builder
    public static class ToolMetadata {
        private String name;
        private String version;
        private String description;
        private List<ParameterSpec> parameters;
        private String outputFormat; // json / text / binary
    }

    @Data
    @Builder
    public static class ParameterSpec {
        private String name;
        private String type; // string / int / boolean / array
        private String description;
        private boolean required;
        private String defaultValue;
    }

    @Data
    @Builder
    public static class ToolResult {
        private int exitCode;
        private String output;
        private String error;
        private Map<String, Object> metrics; // execution metrics
    }
}

/**
 * Tool context
 */
@Data
public class ToolContext {
    private String toolId;
    private String sandboxId;
    private File workspaceRoot;
    private Map<String, String> environment;
    private long timeout;
}
/**
 * Log parsing tool
 */
public class LogParserTool extends CustomTool {
    // Matches lines such as: [2025-01-04 10:30:45] ERROR [com.example.Service] Message
    private static final Pattern LOG_LINE =
            Pattern.compile("^\\[([^\\]]+)\\]\\s+(\\w+)\\s+\\[([^\\]]+)\\]\\s*(.*)$");

    public LogParserTool(ToolContext context) {
        super(context);
    }

    @Override
    public ToolMetadata getMetadata() {
        return ToolMetadata.builder()
                .name("log_parser")
                .version("1.0.0")
                .description("Log parsing and analysis tool")
                .parameters(Arrays.asList(
                        ParameterSpec.builder()
                                .name("log_file")
                                .type("string")
                                .description("Path to the log file")
                                .required(true)
                                .build(),
                        ParameterSpec.builder()
                                .name("pattern")
                                .type("string")
                                .description("Match pattern (regular expression)")
                                .required(false)
                                .build(),
                        ParameterSpec.builder()
                                .name("level")
                                .type("string")
                                .description("Log level (ERROR/WARN/INFO)")
                                .required(false)
                                .defaultValue("*")
                                .build(),
                        ParameterSpec.builder()
                                .name("max_lines")
                                .type("int")
                                .description("Maximum number of lines to return")
                                .required(false)
                                .defaultValue("1000")
                                .build()))
                .outputFormat("json")
                .build();
    }

    @Override
    public ToolResult execute(Map<String, String> args) throws Exception {
        logger.info("Parsing log: file={}", args.get("log_file"));
        String logFile = args.get("log_file");
        String pattern = args.getOrDefault("pattern", ".*");
        String level = args.getOrDefault("level", "*");
        int maxLines = Integer.parseInt(args.getOrDefault("max_lines", "1000"));
        // Read the log file
        List<String> lines = readLogFile(logFile);
        // Parse the log lines
        List<LogEntry> entries = parseLogLines(lines, pattern, level, maxLines);
        // Compute statistics
        Map<String, Integer> statistics = computeStatistics(entries);
        // Generate the JSON output
        String output = generateJsonOutput(entries, statistics);
        return ToolResult.builder()
                .exitCode(0)
                .output(output)
                .metrics(Map.of("lines_processed", lines.size(), "entries_matched", entries.size()))
                .build();
    }

    private List<String> readLogFile(String filePath) throws Exception {
        Path path = Paths.get(filePath);
        return Files.readAllLines(path);
    }

    private List<LogEntry> parseLogLines(
            List<String> lines,
            String pattern,
            String level,
            int maxLines) throws Exception {
        List<LogEntry> entries = new ArrayList<>();
        Pattern regex = Pattern.compile(pattern);
        for (String line : lines) {
            if (entries.size() >= maxLines) break;
            Matcher matcher = regex.matcher(line);
            if (matcher.find()) {
                LogEntry entry = parseSingleLogEntry(line);
                if ("*".equals(level) || line.contains(level)) {
                    entries.add(entry);
                }
            }
        }
        return entries;
    }

    private LogEntry parseSingleLogEntry(String line) {
        // Simple parser for one format:
        // [2025-01-04 10:30:45] ERROR [com.example.Service] Message
        Matcher m = LOG_LINE.matcher(line);
        if (m.matches()) {
            return LogEntry.builder()
                    .timestamp(m.group(1))
                    .level(m.group(2))
                    .context(m.group(3))
                    .message(m.group(4))
                    .build();
        }
        // Unrecognized format (including empty lines): keep the raw line as the message
        return LogEntry.builder().message(line).build();
    }

    private Map<String, Integer> computeStatistics(List<LogEntry> entries) {
        Map<String, Integer> stats = new HashMap<>();
        stats.put("total_entries", entries.size());
        // Count ERROR-level entries and record the result
        long errorCount = entries.stream()
                .filter(e -> "ERROR".equals(e.getLevel()))
                .count();
        stats.put("error_entries", (int) errorCount);
        return stats;
    }

    // writeValueAsString throws a checked exception, so it must be declared here
    private String generateJsonOutput(List<LogEntry> entries, Map<String, Integer> stats) throws Exception {
        // Return the analysis result as JSON
        return new ObjectMapper().writeValueAsString(Map.of(
                "entries", entries,
                "statistics", stats));
    }
}
/**
 * Metric aggregation tool
 */
public class MetricAggregatorTool extends CustomTool {
    public MetricAggregatorTool(ToolContext context) {
        super(context);
    }

    @Override
    public ToolMetadata getMetadata() {
        return ToolMetadata.builder()
                .name("metric_aggregator")
                .version("1.0.0")
                .description("System metric aggregation and statistics tool")
                .parameters(Arrays.asList(
                        ParameterSpec.builder()
                                .name("metric_file")
                                .type("string")
                                .description("Path to the metric file")
                                .required(true)
                                .build(),
                        ParameterSpec.builder()
                                .name("aggregate_type")
                                .type("string")
                                .description("Aggregation type (sum/avg/min/max/percentile)")
                                .required(true)
                                .build()))
                .outputFormat("json")
                .build();
    }

    @Override
    public ToolResult execute(Map<String, String> args) throws Exception {
        String metricFile = args.get("metric_file");
        String aggregateType = args.get("aggregate_type");
        logger.info("Aggregating metrics: file={}, type={}", metricFile, aggregateType);
        // Read the metric file
        List<Double> metrics = readMetricFile(metricFile);
        // Aggregate
        Map<String, Double> result = aggregate(metrics, aggregateType);
        String output = new ObjectMapper().writeValueAsString(result);
        return ToolResult.builder()
                .exitCode(0)
                .output(output)
                .metrics(Map.of("metrics_count", metrics.size()))
                .build();
    }

    private List<Double> readMetricFile(String filePath) throws Exception {
        // Placeholder: read the metric file
        return new ArrayList<>();
    }

    private Map<String, Double> aggregate(List<Double> metrics, String type) {
        Map<String, Double> result = new HashMap<>();
        if ("sum".equals(type)) {
            result.put("sum", metrics.stream().mapToDouble(Double::doubleValue).sum());
        } else if ("avg".equals(type)) {
            result.put("average", metrics.stream().mapToDouble(Double::doubleValue).average().orElse(0));
        } else if ("min".equals(type)) {
            result.put("min", metrics.stream().mapToDouble(Double::doubleValue).min().orElse(0));
        } else if ("max".equals(type)) {
            result.put("max", metrics.stream().mapToDouble(Double::doubleValue).max().orElse(0));
        }
        // "percentile" is advertised in the metadata but not implemented here;
        // it and any unknown type currently produce an empty result
        return result;
    }
}
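The metadata for `metric_aggregator` lists `percentile` as an aggregation type, but `aggregate()` has no branch for it. A minimal sketch of one common definition, the nearest-rank percentile, is shown below. `Percentiles` is a hypothetical helper, and the choice of nearest-rank over an interpolating definition is an assumption, since the text does not say which one is intended.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: nearest-rank percentile over a list of samples. */
final class Percentiles {
    private Percentiles() {}

    /**
     * Returns the smallest sample such that at least p percent of all samples
     * are less than or equal to it (nearest-rank method), for 0 < p <= 100.
     */
    static double nearestRank(List<Double> values, double p) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]: " + p);
        }
        List<Double> sorted = new ArrayList<>(values); // leave the caller's list untouched
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size()); // 1-based rank
        return sorted.get(rank - 1);
    }
}
```

Inside `aggregate()`, a branch such as `result.put("p95", Percentiles.nearestRank(metrics, 95))` would cover the advertised type; how the percentile value itself is passed as an argument is left open here.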
4.2 Tool registration and management
java
/**
 * Tool registry and manager
 */
public class ToolRegistry {
    private static final Logger logger = LoggerFactory.getLogger(ToolRegistry.class);
    private final Map<String, ToolDefinition> systemTools = new ConcurrentHashMap<>();
    private final Map<String, CustomTool> customTools = new ConcurrentHashMap<>();

    public ToolRegistry() {
        initializeSystemTools();
    }

    /**
     * Registers the built-in system tools.
     * Paths and versions are examples; they depend on the sandbox image.
     */
    private void initializeSystemTools() {
        // Register grep
        registerSystemTool(ToolDefinition.builder()
                .name("grep")
                .path("/bin/grep")
                .type("system")
                .version("2.28")
                .description("Text search tool")
                .build());
        // Register cat
        registerSystemTool(ToolDefinition.builder()
                .name("cat")
                .path("/bin/cat")
                .type("system")
                .version("8.32")
                .description("File viewer")
                .build());
        // Register awk
        registerSystemTool(ToolDefinition.builder()
                .name("awk")
                .path("/usr/bin/awk")
                .type("system")
                .version("5.1.0")
                .description("Text processing tool")
                .build());
        // Register sed
        registerSystemTool(ToolDefinition.builder()
                .name("sed")
                .path("/bin/sed")
                .type("system")
                .version("4.7")
                .description("Stream editor")
                .build());
        // Register tail
        registerSystemTool(ToolDefinition.builder()
                .name("tail")
                .path("/usr/bin/tail")
                .type("system")
                .version("8.32")
                .description("Shows the end of a file")
                .build());
        // Register ps
        registerSystemTool(ToolDefinition.builder()
                .name("ps")
                .path("/bin/ps")
                .type("system")
                .version("3.3.17")
                .description("Process information viewer")
                .build());
    }

    /**
     * Registers a system tool
     */
    public void registerSystemTool(ToolDefinition toolDef) {
        systemTools.put(toolDef.getName(), toolDef);
        logger.info("Registered system tool: {} ({})", toolDef.getName(), toolDef.getPath());
    }

    /**
     * Registers a custom tool
     */
    public void registerCustomTool(String name, CustomTool tool) {
        customTools.put(name, tool);
        // ToolMetadata is nested inside CustomTool, so it is qualified here
        CustomTool.ToolMetadata metadata = tool.getMetadata();
        logger.info("Registered custom tool: {} v{}", metadata.getName(), metadata.getVersion());
    }

    /**
     * Returns a system tool, or null if none is registered under that name
     */
    public ToolDefinition getTool(String name) {
        return systemTools.get(name);
    }

    /**
     * Returns a custom tool, or null if none is registered under that name
     */
    public CustomTool getCustomTool(String name) {
        return customTools.get(name);
    }

    /**
     * Checks whether a tool exists
     */
    public boolean hasTool(String name) {
        return systemTools.containsKey(name) || customTools.containsKey(name);
    }

    /**
     * Lists all tools
     */
    public List<String> listAllTools() {
        List<String> tools = new ArrayList<>();
        tools.addAll(systemTools.keySet());
        tools.addAll(customTools.keySet());
        return tools;
    }
}
@Data
@Builder
class ToolDefinition {
private String name;
private String path;
private String type; // system / custom
private String version;
private String description;
}
5. Report generation Skill
5.1 Report template management
java
/**
 * Report template system
 */
public class ReportTemplateEngine {
    private static final Logger logger = LoggerFactory.getLogger(ReportTemplateEngine.class);
    private final SkillFileSystem fileSystem;
    private final TemplateResolver resolver;

    public ReportTemplateEngine(SkillFileSystem fileSystem) {
        this.fileSystem = fileSystem;
        this.resolver = new TemplateResolver();
    }

    /**
     * Generates a report
     */
    public String generateReport(
            String templateName,
            Map<String, Object> data,
            ReportFormat format) throws Exception {
        logger.info("Generating report: template={}, format={}", templateName, format);
        // Load the template
        String templateContent = loadTemplate(templateName);
        // Render the template
        String renderedContent = renderTemplate(templateContent, data);
        // Convert to the requested format
        return convertFormat(renderedContent, format);
    }

    /**
     * Loads a template
     */
    private String loadTemplate(String templateName) throws Exception {
        File templateFile = new File(
                fileSystem.getTemplateDir(),
                templateName + ".md");
        // Decode explicitly as UTF-8 instead of relying on the platform default
        return new String(Files.readAllBytes(templateFile.toPath()), StandardCharsets.UTF_8);
    }

    /**
     * Renders a template
     */
    private String renderTemplate(String template, Map<String, Object> data) {
        // A template engine such as FreeMarker or Velocity could be used here;
        // this simplified version does plain ${key} string replacement
        String result = template;
        for (Map.Entry<String, Object> entry : data.entrySet()) {
            String placeholder = "${" + entry.getKey() + "}";
            result = result.replace(placeholder, String.valueOf(entry.getValue()));
        }
        return result;
    }

    /**
     * Converts the rendered content to the requested format
     */
    private String convertFormat(String content, ReportFormat format) throws Exception {
        switch (format) {
            case MARKDOWN:
                return content;
            case HTML:
                return convertMarkdownToHtml(content);
            case PDF:
                return convertMarkdownToPdf(content);
            default:
                return content;
        }
    }

    private String convertMarkdownToHtml(String markdown) {
        // Placeholder: a real implementation would convert with a Markdown library
        // such as commonmark; this version only wraps the raw text
        return "<html><body>" + markdown + "</body></html>";
    }

    private String convertMarkdownToPdf(String markdown) throws Exception {
        // Placeholder: a real implementation would convert with iText or wkhtmltopdf
        return "PDF content";
    }
}

public enum ReportFormat {
    MARKDOWN,
    HTML,
    PDF
}
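The simplified `renderTemplate` replaces one key at a time, which has two side effects: a placeholder with no matching key is left in the report silently, and a value that itself contains `${...}` may be expanded by a later iteration. A single-pass variant avoids both. The sketch below is a hypothetical alternative (`PlaceholderRenderer` is not part of the original code), and it assumes placeholder names consist of letters, digits, and underscores, as in `${total_memory}`.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical single-pass renderer for ${key} placeholders. */
final class PlaceholderRenderer {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    private PlaceholderRenderer() {}

    /** Replaces every ${key} with data.get(key); fails fast if a key is missing. */
    static String render(String template, Map<String, ?> data) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String key = matcher.group(1);
            if (!data.containsKey(key)) {
                throw new IllegalArgumentException("No value for placeholder: " + key);
            }
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(data.get(key))));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For example, rendering `"Used: ${used_memory} of ${total_memory}"` with values `"6.2 GB"` and `"16 GB"` gives `"Used: 6.2 GB of 16 GB"`, while a template that mentions an unknown key raises an exception instead of shipping a half-rendered report.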
5.2 诊断报告生成Skill
java
/**
* 诊断报告生成Skill
*/
public class DiagnosticReportGenerationSkill extends AgentSkill {
private static final Logger logger = LoggerFactory.getLogger(DiagnosticReportGenerationSkill.class);
private final ReportTemplateEngine templateEngine;
private final MemoryDiagnosticSkill memoryDiagnosticSkill;
private final SkillFileSystem fileSystem;
public DiagnosticReportGenerationSkill(
ReportTemplateEngine templateEngine,
MemoryDiagnosticSkill memoryDiagnosticSkill,
SkillFileSystem fileSystem) {
super("diagnostic_report_generation");
this.templateEngine = templateEngine;
this.memoryDiagnosticSkill = memoryDiagnosticSkill;
this.fileSystem = fileSystem;
}
/**
* Generates the complete diagnostic report.
*/
public DiagnosticReport generateFullReport() throws Exception {
logger.info("Generating diagnostic report...");
DiagnosticReport.Builder reportBuilder = DiagnosticReport.builder()
.timestamp(LocalDateTime.now())
.version("1.0");
try {
// Step 1: run the diagnosis
MemoryDiagnosticReport memoryReport = memoryDiagnosticSkill.diagnose();
reportBuilder.memoryReport(memoryReport);
// Step 2: prepare the report data
Map<String, Object> reportData = prepareReportData(memoryReport);
// Step 3: generate the Markdown report
String markdownContent = templateEngine.generateReport(
"diagnostic_template",
reportData,
ReportFormat.MARKDOWN
);
reportBuilder.markdownContent(markdownContent);
// Step 4: render the same data as HTML
String htmlContent = templateEngine.generateReport(
"diagnostic_template",
reportData,
ReportFormat.HTML
);
reportBuilder.htmlContent(htmlContent);
// Step 5: save the report files
String reportPath = saveReport(markdownContent, htmlContent);
reportBuilder.reportPath(reportPath);
reportBuilder.status("SUCCESS");
} catch (Exception e) {
logger.error("Report generation failed", e);
reportBuilder.status("FAILED").error(e.getMessage());
}
return reportBuilder.build();
}
/**
* Prepares the data map used to fill the report template.
*/
private Map<String, Object> prepareReportData(MemoryDiagnosticReport memoryReport) {
Map<String, Object> data = new HashMap<>();
// Executive summary
data.put("report_title", "AIOps Diagnostic Report");
data.put("report_date", LocalDate.now().format(DateTimeFormatter.ISO_DATE));
data.put("report_time", LocalTime.now().format(DateTimeFormatter.ofPattern("HH:mm:ss")));
// Memory diagnosis results
MemoryStatistics stats = memoryReport.getStatistics();
data.put("total_memory", formatBytes(stats.getTotalMemory()));
data.put("used_memory", formatBytes(stats.getUsedMemory()));
data.put("memory_usage_percent", String.format("%.1f%%",
(double) stats.getUsedMemory() / stats.getTotalMemory() * 100));
// Recommendations
data.put("recommendations", memoryReport.getRecommendations());
// Per-process information
List<Map<String, Object>> processesData = new ArrayList<>();
for (ProcessMemoryInfo proc : memoryReport.getTopProcesses()) {
Map<String, Object> procData = new HashMap<>();
procData.put("pid", proc.getPid());
procData.put("user", proc.getUser());
procData.put("memory", formatBytes(proc.getMemoryRss()));
procData.put("command", proc.getCommand());
processesData.add(procData);
}
data.put("top_processes", processesData);
return data;
}
/**
* Saves the report files and returns their shared path prefix (without extension).
*/
private String saveReport(String markdownContent, String htmlContent) throws Exception {
String timestamp = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss"));
String filename = "diagnostic_report_" + timestamp;
File reportsDir = fileSystem.getOutputDir();
// Save the Markdown version
Files.write(
new File(reportsDir, filename + ".md").toPath(),
markdownContent.getBytes(StandardCharsets.UTF_8)
);
// Save the HTML version
Files.write(
new File(reportsDir, filename + ".html").toPath(),
htmlContent.getBytes(StandardCharsets.UTF_8)
);
logger.info("Report saved to: {}", reportsDir.getAbsolutePath());
// Build the path with File so the separator matches the platform
return new File(reportsDir, filename).getAbsolutePath();
}
/**
* Formats a byte count as a human-readable size.
*/
private String formatBytes(long bytes) {
if (bytes <= 0) return "0 B";
final String[] units = new String[]{"B", "KB", "MB", "GB", "TB"};
// Clamp to the largest unit so values of 1 PB and above do not index past the array
int digitGroups = Math.min((int) (Math.log10(bytes) / Math.log10(1024)), units.length - 1);
return String.format("%.1f %s", bytes / Math.pow(1024, digitGroups), units[digitGroups]);
}
}
@Data
@Builder
class DiagnosticReport {
private LocalDateTime timestamp;
private String version;
private MemoryDiagnosticReport memoryReport;
private String markdownContent;
private String htmlContent;
private String reportPath;
private String status;
private String error;
}
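One detail worth noting: `prepareReportData` stores `top_processes` as a list of maps, and plain string replacement would render it through `String.valueOf`, producing Java's default `[{pid=...}]` output. A possible workaround, sketched below under that assumption, is to flatten the rows into a Markdown table string before placing it in the data map. The class `ProcessTableSketch` and its method are hypothetical; only the map keys (`pid`, `user`, `memory`, `command`) come from the code above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
public class ProcessTableSketch {
    /** Renders the process rows as a Markdown table with one line per process. */
    public static String toMarkdownTable(List<Map<String, Object>> processes) {
        StringBuilder sb = new StringBuilder();
        sb.append("| PID | User | Memory | Command |\n");
        sb.append("| --- | --- | --- | --- |\n");
        for (Map<String, Object> proc : processes) {
            sb.append("| ").append(cell(proc.get("pid")))
              .append(" | ").append(cell(proc.get("user")))
              .append(" | ").append(cell(proc.get("memory")))
              .append(" | ").append(cell(proc.get("command")))
              .append(" |\n");
        }
        return sb.toString();
    }

    private static String cell(Object value) {
        // Escape pipes so a command line containing '|' does not split the row
        return String.valueOf(value).replace("|", "\\|");
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("pid", 4242);
        row.put("user", "app");
        row.put("memory", "1.5 GB");
        row.put("command", "java -jar service.jar");
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row);
        System.out.print(toMarkdownTable(rows));
    }
}
```

With this in place, the data map could carry the finished table under a key such as `top_processes_table` (a hypothetical name), while a real template engine would iterate over the list directly.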
6. Integration Example: AIOps Diagnosis and Report Generation
java
/**
* Integrates AIOps diagnosis with report generation.
*/
public class AIOpsDiagnosticSystem {
private static final Logger logger = LoggerFactory.getLogger(AIOpsDiagnosticSystem.class);
private final SandboxClient sandboxClient;
private final SkillFileSystem fileSystem;
private final ToolRegistry toolRegistry;
private final MemoryDiagnosticSkill memoryDiagnosticSkill;
private final DiagnosticReportGenerationSkill reportGenerationSkill;
public AIOpsDiagnosticSystem(String runtimeServerHost, int runtimeServerPort) throws Exception {
logger.info("Initializing AIOps diagnostic system...");
// Initialize the Sandbox client
this.sandboxClient = new SandboxClient(runtimeServerHost, runtimeServerPort);
// Initialize the file system
this.fileSystem = new SkillFileSystem("/tmp/skill_workspace");
fileSystem.initializeSkillDirs("aiops_diagnostics", "diagnostic_reporting");
// Initialize the tool registry
this.toolRegistry = new ToolRegistry();
registerCustomTools();
// Initialize the diagnostic Skill
this.memoryDiagnosticSkill = new MemoryDiagnosticSkill(
sandboxClient, fileSystem, toolRegistry
);
// Initialize the report generation Skill
ReportTemplateEngine templateEngine = new ReportTemplateEngine(fileSystem);
this.reportGenerationSkill = new DiagnosticReportGenerationSkill(
templateEngine, memoryDiagnosticSkill, fileSystem
);
logger.info("✓ AIOps diagnostic system initialized");
}
/**
* Registers the custom tools.
*/
private void registerCustomTools() throws Exception {
ToolContext context = new ToolContext();
context.setWorkspaceRoot(fileSystem.getWorkspaceRoot());
context.setTimeout(30000);
// Register the log parsing tool
toolRegistry.registerCustomTool("log_parser", new LogParserTool(context));
// Register the metric aggregation tool
toolRegistry.registerCustomTool("metric_aggregator", new MetricAggregatorTool(context));
logger.info("✓ Custom tools registered");
}
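/**
* Runs the full diagnosis and report generation flow.
*/
public DiagnosticReport runFullDiagnostics() throws Exception {
logger.info("\n" + "=".repeat(60));
logger.info("Starting AIOps diagnosis and report generation");
logger.info("=".repeat(60));
try {
// Step 1: run the diagnosis
logger.info("\n[Step 1] Running diagnosis...");
MemoryDiagnosticReport memoryReport = memoryDiagnosticSkill.diagnose();
logger.info("✓ Diagnosis complete, status: {}", memoryReport.getStatus());
// Step 2: generate the report
// Note: generateFullReport() runs its own diagnose() pass, so the diagnosis executes twice in this flow.
logger.info("\n[Step 2] Generating report...");
DiagnosticReport report = reportGenerationSkill.generateFullReport();
// generateFullReport() records failures in the status field instead of throwing
if (!"SUCCESS".equals(report.getStatus())) {
throw new IllegalStateException("Report generation failed: " + report.getError());
}
logger.info("✓ Report generated and saved to: {}", report.getReportPath());
// Step 3: print the summary
logger.info("\n[Step 3] Diagnosis summary");
logger.info("Total memory: {}", formatBytes(memoryReport.getStatistics().getTotalMemory()));
logger.info("Used memory: {}", formatBytes(memoryReport.getStatistics().getUsedMemory()));
// SLF4J placeholders do not accept format specifiers, so format the percentage first
logger.info("Memory usage: {}%", String.format("%.1f",
(double) memoryReport.getStatistics().getUsedMemory() /
memoryReport.getStatistics().getTotalMemory() * 100));
logger.info("\nRecommendations:");
for (String recommendation : memoryReport.getRecommendations()) {
logger.info(" {}", recommendation);
}
logger.info("\n" + "=".repeat(60));
logger.info("Diagnosis and report generation succeeded");
logger.info("=".repeat(60));
return report;
} catch (Exception e) {
logger.error("Diagnosis failed", e);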
throw e;
}
}
private String formatBytes(long bytes) {
if (bytes <= 0) return "0 B";
final String[] units = new String[]{"B", "KB", "MB", "GB", "TB"};
// Clamp to the largest unit so values of 1 PB and above do not index past the array
int digitGroups = Math.min((int) (Math.log10(bytes) / Math.log10(1024)), units.length - 1);
return String.format("%.1f %s", bytes / Math.pow(1024, digitGroups), units[digitGroups]);
}
public static void main(String[] args) throws Exception {
// Connect to the Runtime Server
AIOpsDiagnosticSystem system = new AIOpsDiagnosticSystem(
"localhost", // Runtime Server host
50051 // Runtime Server port
);
// Run the diagnostics
DiagnosticReport report = system.runFullDiagnostics();
// The report can be processed further from here...
}
}
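The `main` method above hard-codes the Runtime Server address. When the sandbox runs on another machine, the host and port could instead be read from the command line. The sketch below is one possible way to do that; `EndpointArgsSketch` is a hypothetical helper that is not part of AgentScope, and the defaults simply mirror the `localhost` and `50051` values used above.

```java
// Hypothetical helper, for illustration only.
public class EndpointArgsSketch {
    public final String host;
    public final int port;

    private EndpointArgsSketch(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Reads [host] [port] from args, falling back to localhost:50051 when they are absent. */
    public static EndpointArgsSketch fromArgs(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = 50051;
        if (args.length > 1) {
            // parseInt throws NumberFormatException for non-numeric input
            port = Integer.parseInt(args[1]);
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("port out of range: " + port);
            }
        }
        return new EndpointArgsSketch(host, port);
    }

    public static void main(String[] args) {
        EndpointArgsSketch endpoint = fromArgs(args);
        System.out.println(endpoint.host + ":" + endpoint.port);
    }
}
```

The parsed values would then be passed to the `AIOpsDiagnosticSystem` constructor in place of the literals.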