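SAST + AI Code Auditing: Architecture and Features in Detail

Java-Audit: Architecture and Features in Detail

A colleague took over my previous code-audit project and changed a lot of it. In theory, then, this throwaway project of mine no longer has anything to do with the product, so I'm simply open-sourcing it!

What has been improved over the previous version

Filtering Joern scan results: most of what Joern outputs is noise, and having the AI read all of it for analysis wastes a lot of context. So I wrote a filter that keeps only the parts the AI needs for auditing.
External storage for route-analysis results: the route-analysis results are written to a vector database, and other agents look them up there when they need them. (I think this can be optimized further; I'll look into it when I have time.)

Project: https://github.com/sss12365/SAST2AI

Everything below about the project details was generated by AI.

Project Overview

Java-Audit is a code-auditing tool for Java web applications that combines Joern CPG static analysis with Claude AI semantic analysis. It has a CLI interface, is implemented in Python, and calls Claude models through an OpenAI-compatible API.

Core idea: Joern handles precise code property graph analysis (data-flow tracing, sink discovery), and the AI handles the semantic understanding Joern cannot do (authentication logic, business-logic vulnerabilities, controllability reasoning). The two share results through a ChromaDB vector database, and agents retrieve what they need on demand to keep token consumption as low as possible.

Project Structure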

java-audit/ # 3,956 lines of code
├── main.py # CLI entry point (click)
├── config.yaml # configuration file
├── requirements.txt # Python dependencies
│
├── core/ # core engine layer
│ ├── config.py # YAML config loading + Pydantic validation
│ ├── models.py # data model definitions
│ ├── orchestrator.py # three-stage async orchestrator
│ ├── joern_runner.py # wrapper around the Joern CLI
│ ├── joern_parser.py # Joern raw-output compressor (99.4% reduction)
│ ├── java_compressor.py # Java source compressor (60-85% reduction)
│ └── audit_store.py # ChromaDB vector knowledge base
│
├── agents/ # AI agent layer
│ ├── base.py # BaseAgent (OpenAI SDK wrapper)
│ ├── route_analysis.py # route analysis agent
│ ├── route_param.py # parameter tracing agent
│ ├── auth_analysis.py # authentication analysis agent
│ ├── hardcoded_audit.py # hardcoded-secret audit agent
│ └── vuln_verification.py # vulnerability verification agent
│
├── prompts/ # agent system prompts
│ ├── route_analysis.md # route extraction instructions
│ ├── route_param.md # parameter tracing + controllability analysis instructions
│ ├── auth_analysis.md # authentication bypass detection instructions
│ ├── hardcoded_audit.md # hardcoded-secret filtering instructions
│ └── vuln_verification.md # vulnerability verification + PoC generation instructions
│
├── joern_scripts/ # Joern Scala query scripts
│ ├── find_routes.sc # route discovery (Spring/Servlet/JAX-RS/Struts2)
│ ├── find_sinks.sc # sink discovery (SQL/CMD/File/HTTP/XML/...)
│ ├── dataflow_analysis.sc # Source→Sink taint tracking
│ ├── hardcoded_secrets.sc # hardcoded-secret detection
│ └── find_auth.sc # authentication code location
│
└── report/ # report generation layer
    ├── generator.py # Markdown report generator
    └── templates/
        └── report_template.md # report template

Execution Pipeline

┌────────────────────────────────────────────────────────────────────┐
│ Stage 0: Preprocessing and indexing                                │
│ ┌──────────────────────────┐ ┌───────────────────────────────────┐ │
│ │ Java source compression  │ │ Joern raw-output compression      │ │
│ │ 565 files → 426 docs     │ │ 3.2MB → 18KB (99.4%)              │ │
│ │ aggressive compress      │ │ denoise + dedupe + key steps      │ │
│ └────────────┬─────────────┘ └─────────────────┬─────────────────┘ │
│              └────────────────┬────────────────┘                   │
│                               ▼                                    │
│                  ┌────────────────────────┐                        │
│                  │ ChromaDB vector store  │                        │
│                  │ source_code collection │                        │
│                  │ joern collection       │                        │
│                  └────────────────────────┘                        │
├────────────────────────────────────────────────────────────────────┤
│ Stage 1: Information gathering (parallel)                          │
│ ┌──────────────────────────┐ ┌───────────────────────────────────┐ │
│ │ Route Analysis Agent     │ │ Joern Full Scan                   │ │
│ │ AI extracts routes+params│ │ CPG build → routes/sinks/dataflows│ │
│ │ compressed src 60K chars │ │ or load existing joern-raw-output │ │
│ └────────────┬─────────────┘ └─────────────────┬─────────────────┘ │
│              └────────────────┬────────────────┘                   │
│                               ▼ write routes / joern collections   │
├────────────────────────────────────────────────────────────────────┤
│ Stage 2: Deep analysis (parallel)                                  │
│ ┌──────────────────────────┐ ┌───────────────────────────────────┐ │
│ │ Route Param Agent        │ │ Auth Analysis Agent               │ │
│ │ loads routes in batches  │ │ retrieves Filter/Config code      │ │
│ │ finds related ServiceImpl│ │ semantic match of auth findings   │ │
│ │ traces Source→Sink       │ │ URI bypass + CVE + architecture   │ │
│ └────────────┬─────────────┘ └─────────────────┬─────────────────┘ │
│              └────────────────┬────────────────┘                   │
│                               ▼ write dataflows / auth collections │
├────────────────────────────────────────────────────────────────────┤
│ Stage 3: Verification and reporting (parallel)                     │
│ ┌──────────────────────────┐ ┌───────────────────────────────────┐ │
│ │ Hardcoded Audit Agent    │ │ Vuln Verification Agent           │ │
│ │ filters Joern results    │ │ loads only HIGH+ findings         │ │
│ │ drops false positives    │ │ gets controllable dataflows       │ │
│ └────────────┬─────────────┘ │ verifies code evidence            │ │
│              │               │ three-dimension CVSS scoring      │ │
│              │               │ generates PoC + pseudocode        │ │
│              │               └─────────────────┬─────────────────┘ │
│              └────────────────┬────────────────┘                   │
│                               ▼                                    │
│                  ┌────────────────────────┐                        │
│                  │ Report Generator       │                        │
│                  │ Markdown audit report  │                        │
│                  └────────────────────────┘                        │
└────────────────────────────────────────────────────────────────────┘

Core Modules in Detail

CLI entry point ( `main.py` , 144 lines)

Commands:
scan <project_path>    run a full audit
  --config, -c         path to the config file (default: config.yaml)
  --output, -o         output directory
  --joern-output, -j   existing Joern output file (skips the Joern scan)

check                  check configuration and dependencies
parse-joern            compress Joern output on its own
  --output, -o         output file

Pseudocode:

def scan(project_path, config, output, joern_output):
    cfg = load_config(config) # load the YAML config
    orchestrator = AuditOrchestrator(cfg) # initialize the orchestrator
    report = asyncio.run( # run asynchronously
        orchestrator.run(project_path, output, joern_output)
    )
    print(f"Report: {report}")

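Configuration ( `core/config.py` , 52 lines)

config.yaml

model:
  base_url: "https://your-api.com" # OpenAI-compatible API endpoint
  api_key: "sk-xxx" # API key (or the ANTHROPIC_API_KEY environment variable)
  model_id: "claude-opus-4-6" # model ID
  max_tokens: 16000 # maximum output tokens

joern:
  home: "/path/to/joern-cli" # Joern installation directory

output:
  dir: "./audit_output" # output directory (used only when --output is given; the default is the project's report/)

Pydantic validation:

api_key: when empty, it is read automatically from the ANTHROPIC_API_KEY environment variable
joern.home: validated to be an existing directory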
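
The two validation rules above can be sketched without any third-party dependency. The project itself uses Pydantic; the dataclasses below are only a stand-in, and the class names `ModelConfig` and `JoernConfig` are assumptions made for this illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ModelConfig:
    """Hypothetical stand-in for the `model:` section of config.yaml."""
    base_url: str
    model_id: str
    api_key: str = ""
    max_tokens: int = 16000

    def __post_init__(self) -> None:
        # Rule 1: an empty api_key falls back to the environment variable.
        if not self.api_key:
            self.api_key = os.environ.get("ANTHROPIC_API_KEY", "")


@dataclass
class JoernConfig:
    """Hypothetical stand-in for the `joern:` section of config.yaml."""
    home: str

    def __post_init__(self) -> None:
        # Rule 2: joern.home must point at an existing directory.
        if not Path(self.home).is_dir():
            raise ValueError(f"joern.home is not a directory: {self.home}")
```

With Pydantic the same checks would normally live in field validators; the behaviour shown here is the part the text describes, nothing more.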

Data Models ( `core/models.py` , 141 lines)

Core data structures

RouteInfo # route: path, method, handler_class, handler_method, params, burp_template
RouteParam # parameter: name, java_type, http_location, required
CallChainNode # call-chain node: level, class_name, method_name, code_snippet, param_mapping
DataFlowChain # data flow: source_param → sink_type + chain[] + controllable + pseudocode
Vulnerability # vulnerability: vuln_id, severity_score, poc_request, dataflow_chain, remediation
AuthInfo # authentication: framework, version, route_auth_mapping, bypass_findings
HardcodedSecret # hardcoded secret: secret_type, value_preview (masked), file_path
JoernResult # Joern results: routes, sinks, dataflows, hardcoded_secrets
AuditReport # final report: all results combined

CVSS three-dimensional scoring

SeverityScore:
  score = R × 0.40 + I × 0.35 + C × 0.25 # R = reachability, I = impact, C = complexity (each 0-3)
  cvss = score / 3.0 × 10.0
  severity:
    C (Critical): score >= 2.70, CVSS 9.0-10.0
    H (High): score >= 2.10, CVSS 7.0-8.9
    M (Medium): score >= 1.20, CVSS 4.0-6.9
    L (Low): score >= 0.10, CVSS 0.1-3.9
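
The scoring rule is easy to check by hand with a small sketch. The class below follows the weights and thresholds listed above; the field names, the integer arithmetic (used to avoid floating-point trouble at the thresholds) and the `"NONE"` label for scores below 0.10 are assumptions of this example, not taken from `core/models.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityScore:
    reachability: int  # R, 0-3
    impact: int        # I, 0-3
    complexity: int    # C, 0-3

    @property
    def _score_x100(self) -> int:
        # Weighted sum scaled by 100 so threshold comparisons are exact.
        return self.reachability * 40 + self.impact * 35 + self.complexity * 25

    @property
    def score(self) -> float:
        return self._score_x100 / 100

    @property
    def cvss(self) -> float:
        # Map the 0-3 score onto the 0-10 CVSS range.
        return round(self.score / 3.0 * 10.0, 1)

    @property
    def severity(self) -> str:
        s = self._score_x100
        if s >= 270:
            return "C"
        if s >= 210:
            return "H"
        if s >= 120:
            return "M"
        if s >= 10:
            return "L"
        return "NONE"  # assumption: the source does not name this case


if __name__ == "__main__":
    finding = SeverityScore(reachability=2, impact=3, complexity=1)
    print(finding.score, finding.cvss, finding.severity)
```

For R=2, I=3, C=1 the weighted score is 0.80 + 1.05 + 0.25 = 2.10, which sits exactly on the High threshold and maps to CVSS 7.0.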

Sink type enum

SinkType: SQL | COMMAND | HTTP | FILE | XML | LDAP | EXPRESSION | DESERIALIZE | RESPONSE | PATH

Orchestrator ( `core/orchestrator.py` , 220 lines)

async def run(project_path, output_dir, joern_output_file):
    # Output goes to <project_path>/report/ by default; it is wiped and recreated on every run
    output = output_dir or Path(project_path) / "report"
    if output.exists():
        shutil.rmtree(output)
    output.mkdir(parents=True)

    store = AuditStore(output / ".audit_store")
    store.reset()

    # Stage 0: index the source code into the vector store
    index_source_code(project_path)
    # → read every .java file → aggressive compression → store in the source_code collection

    # Stage 1: parallel
    if joern_output_file:
        # load existing Joern output → compress → store in the joern collection
        parsed = parse_joern_output(raw_text)
        compact = to_compact_text(parsed)
        store.store_joern_findings(compact, parsed.findings)
        route_result = await route_agent.analyze(project_path)
    else:
        route_result, joern_result = await gather(
            route_agent.analyze(project_path),
            joern.full_scan(project_path), # CPG build + 5 query scripts
        )
    store.store_routes(route_result["routes"])

    # Stage 2: parallel, agents retrieve from the store on demand
    param_result, auth_result = await gather(
        param_agent.analyze_with_store(store, joern_result),
        auth_agent.analyze_with_store(store, joern_result),
    )
    store.store_dataflows(param_result)
    store.store_auth(auth_result)

    # Stage 3: parallel, VulnVerify loads only HIGH+ findings
    hardcoded_result, vuln_result = await gather(
        hardcoded_agent.audit(joern_result),
        vuln_agent.verify_with_store(store, joern_result),
    )

    # Generate the report
    report_gen.generate(all_results, output_path)
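
The staging pattern above (run two tasks in parallel, wait for both, then start the next stage) can be tried in isolation with `asyncio.gather`. This is a minimal sketch: `fake_task` is a hypothetical stand-in for the real agent and Joern calls, and the delays are arbitrary.

```python
import asyncio


async def fake_task(name: str, delay: float) -> str:
    """Hypothetical stand-in for an agent call or a Joern scan."""
    await asyncio.sleep(delay)
    return name


async def run_pipeline() -> list:
    finished = []

    # Stage 1: both coroutines run concurrently. gather() returns results in
    # argument order, regardless of which one finishes first.
    finished.extend(await asyncio.gather(
        fake_task("route_analysis", 0.02),
        fake_task("joern_full_scan", 0.01),
    ))

    # Stage 2 starts only after everything in stage 1 has completed.
    finished.extend(await asyncio.gather(
        fake_task("route_param", 0.01),
        fake_task("auth_analysis", 0.02),
    ))

    # Stage 3 likewise waits for stage 2.
    finished.extend(await asyncio.gather(
        fake_task("hardcoded_audit", 0.01),
        fake_task("vuln_verification", 0.01),
    ))
    return finished


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

Because each `await asyncio.gather(...)` is a barrier, a later stage can safely read whatever the earlier stage wrote to the shared store.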

Joern Integration

5.1 Joern Runner ( `core/joern_runner.py` , 121 lines)

class JoernRunner:
    async def build_cpg(project_path):
        # javasrc2cpg <project> -o workspace/cpg.bin
        # timeout: 900 seconds

    async def full_scan(project_path):
        cpg = await build_cpg(project_path)
        # run the 5 query scripts in parallel:
        routes, sinks, dataflows, hardcoded, auth = await gather(
            run_script("find_routes.sc", cpg),
            run_script("find_sinks.sc", cpg),
            run_script("dataflow_analysis.sc", cpg),
            run_script("hardcoded_secrets.sc", cpg),
            run_script("find_auth.sc", cpg),
        )

5.2 Joern query scripts ( joern_scripts/ , 550 lines of Scala)

**find_routes.sc** (174 lines) --- route discovery:
```scala
// Spring MVC: @RequestMapping/@GetMapping/@PostMapping/...
cpg.annotation.name("RequestMapping|GetMapping|PostMapping|...").method → extract path, httpMethod, params

// Servlet: HttpServlet.doGet/doPost
cpg.typeDecl.filter(_.inheritsFromTypeFullName(".*HttpServlet.*")).method.name("doGet|doPost")

// JAX-RS: @Path + @GET/@POST
cpg.annotation.name("GET|POST|PUT|DELETE").method → merge class-level @Path with method-level @Path

// Struts2: ActionSupport.execute()
cpg.typeDecl.filter(_.inheritsFromTypeFullName(".*ActionSupport.*")).method.name("execute")

// JSON output: RouteEntry(path, httpMethod, handlerClass, handlerMethod, filePath, lineNumber, framework, params[])
```

**find_sinks.sc** (80 lines) --- dangerous sink discovery:
```scala
// SQL: executeQuery, executeUpdate, prepareStatement, createQuery
// Command: Runtime.exec(), ProcessBuilder.start()
// File: FileInputStream, FileOutputStream, Files.read/write/copy
// HTTP: HttpClient.execute(), URL.openConnection()
// XML: DocumentBuilder.parse(), SAXParser
// Deserialize: ObjectInputStream.readObject(), JSON.parseObject()
// LDAP: DirContext.search()
// Expression: SpelExpression.getValue(), Ognl
// XSS: PrintWriter.write(), response.getWriter()
// JSON output: SinkEntry(sinkType, callName, className, filePath, lineNumber, code)
```

**dataflow_analysis.sc** (87 lines) --- taint tracking:
```scala
// Sources: getParameter(), @RequestParam, @PathVariable, @RequestBody, etc.
// Sinks: defined per category (SQL/COMMAND/FILE/HTTP/XML/DESERIALIZE/EXPRESSION)
// Uses Joern's reachableByFlows for Source→Sink reachability analysis
sinks.reachableByFlows(sources) → emit the full step list of every flow
// JSON output: DataFlowEntry(sourceParam, sinkType, steps[FlowStep(class, method, code, file, line)])
```

**hardcoded_secrets.sc** (106 lines) --- hardcoded-secret detection:
```scala
// Pattern 1: sensitive variable name assigned a string literal
// password, secret, apiKey, token, privateKey, encryptKey, jdbc, etc.
cpg.assignment.where(_.target.name("(?i).*password.*")).where(_.source.isLiteral)

// Pattern 2: sensitive method call with a string argument
// setPassword(), setSecret(), setApiKey(), etc.
cpg.call.name("setPassword|setSecret|...").where(_.argument(1).isLiteral)

// Pattern 3: JDBC URL with embedded credentials
cpg.literal.code("\"jdbc:.*\"").filter(_.code.contains("password="))

// Value preview: first 4 and last 2 characters + ****
// JSON output: HardcodedEntry(secretType, variableName, valuePreview, className, filePath, lineNumber)
```
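
The value-preview rule in the last script ("first 4 and last 2 characters + ****") is shown here as a small Python sketch. The exact output layout is not given in the source, so placing `****` between the head and the tail, and fully masking values too short to preview, are both assumptions of this example.

```python
def value_preview(secret: str, head: int = 4, tail: int = 2) -> str:
    """Mask a secret so that only its first `head` and last `tail` characters show."""
    # Assumption: values too short to mask meaningfully are hidden entirely.
    if len(secret) <= head + tail:
        return "****"
    return f"{secret[:head]}****{secret[-tail:]}"


if __name__ == "__main__":
    print(value_preview("sk-1234567890abcdef"))
```

Masking at extraction time means the raw secret never has to reach the model or the final report.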

**find_auth.sc** (103 lines) --- authentication code location:
```scala
// Shiro: classes extending org.apache.shiro.*, @RequiresAuthentication/Permissions/Roles
// Spring Security: @EnableWebSecurity, @PreAuthorize, SecurityFilterChain
// JWT: jwt-related parse/verify/decode/sign calls
// Filter: implements javax.servlet.Filter → doFilter()
// Interceptor: implements HandlerInterceptor → preHandle()
// URI parsing: call sites of getRequestURI/getServletPath/getPathInfo (potential bypass)
// JSON output: AuthEntry(authType, className, methodName, filePath, code, details)
```

5.3 Joern output compressor ( `core/joern_parser.py` , 415 lines)

Problem: the raw Joern output is 3.2MB (20,772 lines), and 92% of it is build-log noise.

Three layers of compression:

Layer 1: noise filtering (removes the 92% that is build log)
  - [INFO] Pass io.joern.x2cpg... completed (25 lines of pass logs)
  - [INFO] Calculating reaching definitions (8,852 lines)
  - [INFO] Number of definitions (8,852 lines)
  - [INFO] Could not create edge (1,405 lines)

Layer 2: key-step extraction for taint flows (each flow goes from 30 to 5 steps)
  - Dropped: $obj3, <empty>, RET, StringUtils.xxx(), StringBuilder
  - Kept: source entry point, jumps across file boundaries, dangerous operations (executeQuery/readObject/transferTo), sink end point

  Example:
    Raw (30 steps): FileController:82 → FileServiceImpl:43 → L47 → L90 → L99 →
      getOriginalFilename() → StringUtils.isBlank → concatPath → FileUtils:57 →
      L59 → L62 → <operator>.arrayInitializer → L31 → L34 → L35 → L38 →
      StringUtils.appendIfMissing → L40 → StringUtils.removeStart → ... → transferTo
    Compressed (5 steps): FileController:82:@RequestParam MultipartFile →
      FileServiceImpl:43:MultipartFile → FileUtils:57:String path →
      FileServiceImpl:100:withBasePath → L102:transferTo(new File(fullPath))

Layer 3: deduplicating and merging flows that share a sink
  - FILE_UPLOAD: 31 findings → 19 unique + 12 duplicates merged
  - After merging, only one representative flow is kept, with every entry point listed:
    entries(7): FileController.java:47, :55, :69, :82, FileServiceImpl:43, :90, :99

Result: 3,212,959 bytes → 18,308 chars (99.4% reduction)
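
Layers 1 and 3 can be sketched in a few lines of Python. The noise patterns come from the log prefixes listed above; the shape of a finding (a dict with `category`, `sink` and `entry` keys) is a hypothetical structure chosen for this illustration and is not taken from `joern_parser.py`.

```python
import re

# Layer 1: log lines matching any of these patterns are treated as build noise.
NOISE_PATTERNS = [
    re.compile(r"\[INFO\]\s+Pass io\.joern\.x2cpg"),
    re.compile(r"Calculating reaching definitions"),
    re.compile(r"Number of definitions"),
    re.compile(r"Could not create edge"),
]


def drop_noise(lines):
    """Return only the lines that match none of the noise patterns."""
    return [ln for ln in lines if not any(p.search(ln) for p in NOISE_PATTERNS)]


def merge_by_sink(findings):
    """Layer 3: keep one record per (category, sink) and collect its entry points."""
    merged = {}
    for f in findings:
        key = (f["category"], f["sink"])
        slot = merged.setdefault(
            key, {"category": f["category"], "sink": f["sink"], "entries": []}
        )
        if f["entry"] not in slot["entries"]:
            slot["entries"].append(f["entry"])
    return list(merged.values())


if __name__ == "__main__":
    raw = [
        "[INFO] Pass io.joern.x2cpg.passes.SomePass completed",
        "[INFO] Could not create edge",
        "FILE_UPLOAD FileController.java:82 -> transferTo",
    ]
    print(drop_noise(raw))
```

Filtering first keeps the later, more expensive steps (flow parsing and merging) working on a few hundred lines instead of tens of thousands.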

Standalone use:

python3 main.py parse-joern joern-raw-output.txt -o compressed.txt

Java source compressor ( `core/java_compressor.py` , 342 lines)

Two modes:

Normal mode (~25% reduction):

Removed: imports, single-line comments, javadoc, blank lines, logging statements, Lombok annotations
Kept: most of the code logic

Aggressive mode (~60-85% reduction):

Kept:

✅ Route annotations (@RequestMapping, @GetMapping, @PostMapping, @Path, @WebServlet...)
✅ Authentication annotations (@PreAuthorize, @RequiresAuthentication, @Secured...)
✅ Method signatures (public/protected/private method declarations)
✅ Dangerous sink calls (executeQuery, readObject, getConnection, transferTo, Runtime.exec...)
✅ Input acquisition (getParameter, @RequestParam, @RequestBody, getInputStream...)
✅ SQL string concatenation ("SELECT" + variable)
✅ Security-related fields (password, secret, token, filter, datasource...)
✅ Class/interface declarations

Removed:

❌ import statements
❌ Comments (single-line/multi-line/javadoc)
❌ Blank lines
❌ Logging (logger.debug/info/trace/warn)
❌ getter/setter method bodies
❌ Lombok annotations (@Data, @Getter, @Setter...)
❌ Pure business logic (computation/conversion unrelated to security)
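
A rough idea of how aggressive mode could work is a line-by-line filter: drop anything on the "removed" list, then keep only lines that match one of the "kept" patterns. The sketch below is an assumption-heavy simplification (regex on single lines, a subset of the listed keywords, no real Java parsing) and is not the logic of `java_compressor.py`.

```python
import re

# Lines that are always discarded (checked first).
DROP = re.compile(
    r"^\s*$"                                             # blank lines
    r"|^\s*import\s"                                     # import statements
    r"|^\s*(//|/\*|\*)"                                  # comments and javadoc
    r"|\b(log|logger)\.(debug|info|trace|warn)\s*\("     # logging calls
    r"|^\s*@(Data|Getter|Setter)\b"                      # Lombok annotations
)

# Lines that carry security-relevant structure.
KEEP = re.compile(
    r"@(RequestMapping|GetMapping|PostMapping|Path|WebServlet)\b"   # routes
    r"|@(PreAuthorize|RequiresAuthentication|Secured)\b"            # auth
    r"|\b(class|interface)\s+\w+"                                   # type declarations
    r"|^\s*(public|protected|private)\b[^;=]*\("                    # method signatures
    r"|\b(executeQuery|readObject|getConnection|transferTo)\s*\("   # sinks
    r"|\.exec\s*\("                                                 # Runtime.exec
    r"|\bgetParameter\s*\(|@RequestParam\b|@RequestBody\b"          # input
    r'|"\s*(SELECT|INSERT|UPDATE|DELETE)\b[^"]*"\s*\+'              # SQL concatenation
    r"|(?i:password|secret|token)"                                  # sensitive names
)


def compress_aggressive(source: str) -> str:
    """Keep only the security skeleton of a Java source file."""
    kept = []
    for line in source.splitlines():
        if DROP.search(line):
            continue
        if KEEP.search(line):
            kept.append(line.strip())
    return "\n".join(kept)
```

A line-based filter like this loses closing braces and method bodies on purpose: the model only needs to see which handlers exist and which dangerous calls they contain, and can ask the store for the full file when it needs more.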

Measured reduction:

VizServiceImpl.java:           24,944 → 3,668 chars (85%)
DataProviderServiceImpl.java:  19,830 → 2,588 chars (87%)
ShiroSecurityManager.java:     13,238 → 2,026 chars (85%)
DataProviderController.java:    4,981 → 2,636 chars (47%) # controllers are already compact

Vector Knowledge Base ( `core/audit_store.py` , 399 lines)

Persistent ChromaDB storage with 6 collections:

class AuditStore:
    # 6 collections
    collections = {
        "source_code": # one document per Java file (after aggressive compression)
            metadata: {path, filename, file_type, size, full_content}
            # file_type: controller/service/filter/interceptor/config/dao/entity/security

        "routes": # one document per HTTP route
            metadata: {path, method, handler_class, handler_method, framework, param_count, param_names, has_body, full_json}

        "joern": # one document per Joern finding
            metadata: {severity, category, location, full_json}

        "dataflows": # one document per Source→Sink chain
            metadata: {route, source_param, sink_type, controllable, severity, full_json}

        "auth": # authentication analysis results
            metadata: {type: "auth_analysis"}

        "vulns": # verified vulnerabilities
    }

    # Agent-specific query methods
    get_routes_for_param_analysis(batch_idx, batch_size) # load routes in batches
    get_auth_related_code() # semantic retrieval of authentication code
        → query("source_code", "authentication authorization filter...",
                where={"file_type": {"$in": ["filter","interceptor","config","security"]}})
    get_findings_by_severity("HIGH") # load only HIGH+
        → query("joern", "vulnerability injection...",
                where={"severity": {"$in": ["CRITICAL","HIGH"]}})
    get_controllable_dataflows() # load only controllable data flows
        → query("dataflows", "controllable SQL...",
                where={"controllable": True})
    get_dangerous_routes(n=20) # high-risk routes
    get_controller_code() # controller source code
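
The `where` arguments above are metadata filters: an exact match such as `{"controllable": True}`, or a membership test using the `$in` operator. To make their effect concrete without a running database, here is a pure-Python stand-in that evaluates just those two forms over a list of documents. It is an illustration of the filter semantics only; it does not call ChromaDB, and `matches_where` and `filter_docs` are hypothetical helper names.

```python
def matches_where(metadata: dict, where: dict) -> bool:
    """Evaluate a filter made of exact matches and {"$in": [...]} conditions."""
    for field, condition in where.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            # Membership test, e.g. {"severity": {"$in": ["CRITICAL", "HIGH"]}}
            if "$in" in condition and value not in condition["$in"]:
                return False
        elif value != condition:
            # Exact match, e.g. {"controllable": True}
            return False
    return True


def filter_docs(docs: list, where: dict) -> list:
    """Return the documents whose metadata satisfies the filter."""
    return [d for d in docs if matches_where(d["metadata"], where)]


if __name__ == "__main__":
    docs = [
        {"id": "f1", "metadata": {"severity": "CRITICAL", "controllable": True}},
        {"id": "f2", "metadata": {"severity": "HIGH", "controllable": False}},
        {"id": "f3", "metadata": {"severity": "LOW", "controllable": True}},
    ]
    high = filter_docs(docs, {"severity": {"$in": ["CRITICAL", "HIGH"]}})
    print([d["id"] for d in high])
```

In the real store the metadata filter narrows the candidate set and the query text ranks what remains, which is why each agent method pairs a `where` clause with a short keyword query.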

Progressive loading, illustrated:

Route Param Agent batch 3:
  1. store.get_routes_for_param_analysis(2, 10) → 10 routes (~3K chars)
  2. extract the handler_class name → "UserController"
  3. store.query("source_code", "UserController") → related ServiceImpl (~5K chars)
  4. store.get_findings_by_severity("HIGH") → high-severity Joern findings (~10K chars)
  5. assemble user_message: ~18K chars (vs. 80K per batch before)

AI Agent Layer

8.1 BaseAgent ( agents/base.py , 102 lines)

class BaseAgent:
    def __init__(config, agent_name):
        self.client = OpenAI(api_key=api_key, base_url=base_url + "/v1", timeout=300) # timeout in seconds
        self._system_prompt = load_prompt(f"prompts/{agent_name}.md")

    def _call_model(user_message, system_prompt=None):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ]
        response = client.chat.completions.create(model, max_tokens, messages)
        return response.choices[0].message.content

    async def _call_model_async(user_message):
        return await asyncio.to_thread(_call_model, user_message)

    def _read_source_files(project_path, extensions=[".java",".xml",".properties",...])
    def _read_specific_files(project_path, patterns=["*Controller.java",...])

8.2 Route Analysis Agent ( agents/route_analysis.py , 95 lines)

Input: path to the project source
Processing:
  1. compress_java_project(project_path, max_total_chars=60000) # compress the source
  2. call the AI to analyze all routes
Output: {routes: [{path, method, handler_class, handler_method, params[], burp_template}], framework_info}

Supported frameworks: Spring MVC, Servlet, JAX-RS, Struts 2, CXF Web Services
Parameter parsing: @RequestParam(Query), @PathVariable(Path), @RequestBody(Body), @RequestHeader(Header), @CookieValue(Cookie)

8.3 Route Param Agent ( agents/route_param.py , 209 lines)

Two modes:
  analyze_with_store(store, joern_result) # vector-store mode (recommended)
  analyze(route_result, joern_result) # direct-argument mode (fallback)

Vector-store mode:
  1. joern_context = store.get_findings_by_severity("HIGH") + joern dataflows # loaded once
  2. for batch in batches(10):
       routes = store.get_routes_for_param_analysis(batch_idx)
       code = store.query("source_code", handler_classes)
       result = AI(routes + code + joern_context) # each batch loads only the relevant code

Output (for every route with a dangerous sink):
  {route_path, dataflow_chains: [{
    source_param, sink_type, sink_location,
    chain: [{level, class_name, method_name, code_snippet, param_mapping}],
    controllable, controllability_note, pseudocode, poc_request
  }]}

Controllability decision:
  Does the parameter reach the sink? → Is it overwritten by a hardcoded value? → Is there a security check? → Can the check be bypassed?
  Verdict: ✅ fully controllable / ⚠️ conditionally controllable / ❌ not controllable
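
The four questions in the controllability decision form a small decision tree. In the project the model answers them from code context; the sketch below only shows one plausible way the answers could map to the three verdicts. The mapping itself is an assumption of this example (the source lists the questions and verdicts but not the exact rules), as are the names `Controllability` and `judge`.

```python
from enum import Enum


class Controllability(Enum):
    FULL = "fully controllable"
    CONDITIONAL = "conditionally controllable"
    NONE = "not controllable"


def judge(
    reaches_sink: bool,
    overwritten_by_constant: bool,
    has_security_check: bool,
    check_bypassable: bool,
) -> Controllability:
    # The parameter never arrives at the sink, or a constant replaces it on the way.
    if not reaches_sink or overwritten_by_constant:
        return Controllability.NONE
    # Nothing stands between the input and the sink.
    if not has_security_check:
        return Controllability.FULL
    # A check exists: the input is usable only if the check can be bypassed.
    return Controllability.CONDITIONAL if check_bypassable else Controllability.NONE


if __name__ == "__main__":
    print(judge(True, False, True, True).value)
```

Encoding the verdict as an enum keeps the `controllable` metadata field in the dataflows collection to a fixed set of values, which is what makes the later `where` filter on it reliable.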

8.4 Auth Analysis Agent ( `agents/auth_analysis.py` , 167 lines)

Two modes:
  analyze_with_store(store, joern_result) # vector-store mode
  analyze(route_result, joern_result) # direct-argument mode

Vector-store mode:
  1. auth_code = store.get_auth_related_code() # semantic retrieval of Filter/Interceptor/Config
  2. routes = store.query("routes", "all", n=100)
  3. joern_auth = store.query("joern", "authentication filter interceptor...")
  4. AI(auth_code + routes + joern_auth)

Detection scope:
  - Authentication framework identification: Shiro, Spring Security, JWT, Filter, Interceptor
  - Component-version CVEs: Shiro < 1.11.0, Spring Security < 5.7.5, etc.
  - URI-parsing bypass: getRequestURI() vs getServletPath(), semicolon injection, double slashes, URL encoding
  - Authentication architecture analysis: each layer of Filter → Interceptor → Action
  - Route-to-auth mapping: the authentication status of every route

Output:
  {auth_framework: {name, version, config_files, known_cves},
   auth_architecture: {layers[]},
   route_auth_mapping: {"/path": "status"},
   bypass_findings: [{vuln_id, title, severity, poc_request, remediation}]}

8.5 Hardcoded Audit Agent ( `agents/hardcoded_audit.py` , 69 lines)

Input: Joern hardcoded-secret scan results
Processing: AI semantic filtering to separate real secrets from false positives
False positives excluded:
  - Placeholders: ${password}, %s, {0}
  - Empty values/null
  - Test code (test/ directory)
  - Example values ("your-api-key-here")
  - Already-hashed passwords ($2a$10$...)
  - Configuration references

Output:
  {hardcoded_secrets: [{secret_type, risk_level, variable_name, value_preview (masked), file_path, line_number, remediation}],
   filtered_count: 12, filter_reasons: {placeholder: 5, test_code: 3, ...}}
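
Several of the exclusion rules above are mechanical enough to express as deterministic checks. The agent itself relies on the model for this filtering; the sketch below is only an illustration of what the rules mean, for example as a cheap pre-filter before the model call. The function name, the regexes and the reason labels other than `placeholder` and `test_code` are assumptions, and the "configuration reference" rule is left out because the source does not say how such references look.

```python
import re
from typing import Optional

PLACEHOLDER = re.compile(r"\$\{[^}]+\}|%s|\{\d+\}")           # ${password}, %s, {0}
BCRYPT_HASH = re.compile(r"\$2[aby]\$\d{2}\$")                # $2a$10$...
EXAMPLE_VALUE = re.compile(r"your[-_].*[-_]here", re.IGNORECASE)


def false_positive_reason(value: str, file_path: str) -> Optional[str]:
    """Return why a finding looks like a false positive, or None if it may be real."""
    v = value.strip()
    if v == "" or v.lower() == "null":
        return "empty"
    if PLACEHOLDER.fullmatch(v):
        return "placeholder"
    if BCRYPT_HASH.match(v):
        return "hashed"
    if EXAMPLE_VALUE.fullmatch(v):
        return "example"
    # Normalise separators so Windows-style paths are handled too.
    if "/test/" in "/" + file_path.replace("\\", "/"):
        return "test_code"
    return None


if __name__ == "__main__":
    print(false_positive_reason("${password}", "src/main/java/App.java"))
```

Counting the returned reasons gives a tally of the same shape as `filter_reasons` in the output above.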

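8.6 Vuln Verification Agent ( `agents/vuln_verification.py` , 174 lines)

Two modes:
  verify_with_store(store, joern_result) # vector-store mode
  verify(param_result, auth_result, ...) # direct-argument mode

Vector-store mode --- loads only what actually needs verifying:
  1. controllable_flows = store.get_controllable_dataflows() # controllable data flows
  2. high_findings = store.get_findings_by_severity("HIGH") # high-severity Joern findings
  3. auth = store.query("auth", "bypass vulnerability") # authentication bypasses
  4. code = store.query("source_code", "dangerous sink keywords") # related source code
  5. AI(controllable_flows + high_findings + auth + code)

Core principle: report only real vulnerabilities backed by code evidence
  ✅ The code really contains SQL concatenation / readObject() / unfiltered input reaching a sink
  ❌ Speculative ("if the server does not validate...")
  ❌ Already protected by the framework (MyBatis #{}, PreparedStatement)
  ❌ IDOR without sufficient evidence

Verification steps:
  1. Reachability (externally accessible? what privileges are required?)
  2. Data flow (does the Source→Sink path really exist? is there filtering in between?)
  3. Controllability (can user input reach and control the sink? is it overwritten by a hardcoded value?)
  4. Exploit feasibility (special conditions required? blocked by a WAF?)
  5. Impact assessment (what is the actual impact?)
  6. Code evidence (specific file names, line numbers, code snippets)

Output:
  {verified_vulnerabilities: [{
    vuln_id, title, severity_score: {R, I, C}, cvss, severity,
    exploitability: confirmed/pending/not_exploitable,
    location, description,
    dataflow_chain: {source, sink, chain_summary, pseudocode},
    poc_request, remediation
  }],
   dismissed_findings: [{original_finding, dismiss_reason, location}]}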

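Prompt Design ( prompts/ , 470 lines)

Each agent's system prompt contains:

Role definition (professional Java web security auditor)
Detection scope and boundaries
Output format (strict JSON schema)
Key rules (miss nothing / report only real vulnerabilities / code evidence)

Notable design choices:

`vuln_verification.md` : 6 "do not report" rules to avoid speculative vulnerabilities
`auth_analysis.md` : a complete table of URI bypass patterns + a CVE-to-version table
`route_param.md` : a controllability decision tree + the distinction between safe and dangerous MyBatis patterns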

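Report Generator ( `report/generator.py` , 286 lines)

Markdown report structure:

# {project} - Java Code Audit Report
## 1. Audit overview (framework identification, total routes, analysis method)
## 2. Risk statistics (four-level C/H/M/L table)
## 3. Vulnerability details
   ### [{vuln_id}] {title}
   | Severity | CVSS | Reachability | Impact | Complexity | Exploitability | Location |
   **Description**
   **Execution chain (Source → Sink)** --- pseudocode
   **Exploit request (PoC)** --- Burp Suite HTTP request
   **Remediation**
## 4. Data-flow tracing (the Source→Sink chain of every parameter)
## 5. Authentication analysis (framework + CVEs + route-to-auth mapping + bypasses)
## 6. Hardcoded-secret audit (type/variable/value preview/file/risk)
## 7. Audit conclusion (summary statistics)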

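Token Optimization Summary

| Layer | Technique | Reduction |
|---|---|---|
| Joern output | noise filtering + key taint-flow steps + sink deduplication | 99.4% (3.2MB → 18KB) |
| Java source | aggressive mode: keep only the security skeleton | 60-85% |
| Context passing | on-demand ChromaDB retrieval instead of sending everything | ~70% |
| Batch optimization | Joern context loaded once + batch_size 5→10 | ~50% fewer API calls |

Total: token consumption per audit drops from ~340K to ~95K (~72% reduction)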

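Supported Vulnerability Types

| Type | Joern detection | AI detection | Examples |
|---|---|---|---|
| SQL injection | ✅ Source→Sink data flow | ✅ controllability/bypass analysis | string-concatenated SQL, MyBatis ${} |
| Command injection | ✅ Runtime.exec() tracing | ✅ parameter-filtering analysis | ProcessBuilder, Runtime.exec |
| Deserialization | ✅ readObject() location | ✅ gadget-chain availability | ObjectInputStream, JSON.parseObject |
| File upload | ✅ transferTo() tracing | ✅ path traversal/type validation | MultipartFile, no path protection |
| File read | ✅ FileInputStream tracing | ✅ path-traversal analysis | path traversal (../) |
| SSRF | ✅ HttpClient/URL tracing | ✅ internal-network probing analysis | HttpClient.execute, URL.openConnection |
| XXE | ✅ XML parser location | ✅ external-entity configuration | DocumentBuilder, SAXParser |
| JDBC injection | ✅ getConnection() tracing | ✅ H2/MySQL deserialization | JDBC URL INIT=RUNSCRIPT |
| Authentication bypass | ✅ URI method location | ✅ path-matching discrepancy analysis | getRequestURI vs getServletPath |
| Broken access control | ❌ | ✅ missing permission checks | IDOR, user-controlled resource IDs |
| Hardcoded secrets | ✅ pattern matching | ✅ semantic false-positive filtering | database passwords, API keys, JWT secrets |
| Expression injection | ✅ SpEL/OGNL tracing | ✅ input controllability | SpelExpression.getValue() |
| LDAP injection | ✅ DirContext.search() tracing | ✅ filter-concatenation analysis | LDAP filter injection |
| XSS | ✅ response.write() tracing | ✅ output-encoding analysis | reflected/stored |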

Supported Java Frameworks

| MyBatis | - | #{} (safe) / ${} (dangerous) | - |

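Some Final Words

I rewrote this paragraph many, many times: edited it, deleted it, then edited it all over again.

Put simply: whatever passes through my hands, I take responsibility for doing well and polishing. But whenever I raise issues about the company's product, the colleagues around me now tell me that customers don't care about that, only about results. Yet even judged by results, I haven't seen particularly good ones, only a glossy front end and neatly formatted reports with fancy covers. Honestly, I don't know whether my insistence on getting the technology and the product right is correct or mistaken.

That's it, then. Wishing you all a peaceful Qingming.

Closing note:

The drive to dig deep and innovate that brother 一寸灰 shows is something we should keep learning from. This world has its problems, but we must never doubt our own sincerity of heart.

If ten years from now we still have today's passion, I think that will be the thing most worth being grateful for.

-- 千里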
