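🚀 Day 10: Facing the Deep End, Summarizing System Pain Points and Reshaping the Underlying Architecture
Today's goal: After the excitement of the past few days, in which the system successfully closed the AI loop, we need to take off the "security analyst" jacket and start thinking like an "infrastructure architect". In a real enterprise environment, massive data volumes and API limits can easily wreck the "perfect structure" we built on Day 9. Today we confront two of the most damaging pain points in LLM development: token quota cutoffs and truncation of overlong nested JSON. By introducing bulletproof fallback logic and a "flattening revolution", we move from a "toy script that happens to run" to a resilient enterprise-grade engine.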
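🚨 Architect's Retrospective: Why Is the Day 9 Code Bound to Crash in Production?
On Day 9 we stuffed the "initial blueprint", the "execution evidence", and the "final report" into one giant master_payload dictionary and wrote it to Splunk in a single shot at the very end. In software engineering this design is known as the "God Object" anti-pattern.
It leads to two serious failure modes: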
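- LLM token quota cutoff (Token Storm): If the evidence collected on Day 8 is very large, or the AI is long-winded in its Day 9 analysis, the generated JSON can easily exceed the API's per-response output limit (for example, 4000 tokens). Once the limit is hit, the API cuts the output off, leaving the returned JSON without its closing } brace. Python's json.loads() then raises a JSONDecodeError on the spot, and the whole 5-minute scheduled task dies with it.
- Splunk nested parsing breakdown (The Nested JSON Trap): Forcing a giant single-line JSON log, hundreds of lines long and nested four or five levels deep, into Splunk not only drags down the I/O performance of the underlying full-text search, it can also cause Splunk's automatic field extraction (Auto KV Extraction) to give up, so the deep fields you need on a dashboard are simply not extracted.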
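To make the first failure mode concrete, here is a minimal sketch. The truncated string is an invented stand-in for an LLM response that hit its output limit; it was not captured from a real API, and parse_strict is a hypothetical helper name.

```python
import json

# Hypothetical LLM output that was cut off before the closing brace
truncated_response = '{"executive_summary": "Confirmed Threat", "risk_score": 92'


def parse_strict(raw_text):
    """Parse with no safety net: any malformed input raises."""
    return json.loads(raw_text)


try:
    parse_strict(truncated_response)
    crashed = False
except json.JSONDecodeError as err:
    # Without this except block the exception would propagate
    # and end the whole scheduled run.
    crashed = True
    print(f"Parsing failed: {err}")
```

The try/except here exists only so the demonstration itself keeps running; in an unguarded pipeline the exception would surface at the json.loads() call.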
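🛠️ Code Optimization 1: The Bulletproof Vest (Bulletproof Fallback)
Optimization plan:
- When real network requests are wired in later, always set max_tokens explicitly in the API parameters.
- Write a robust try...except json.JSONDecodeError fallback in the code. The core principle: even if the JSON is broken, the program must not die. It must write the incomplete raw text to the log as an "error event" so the scene is preserved.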
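The fallback principle above can be sketched as a small helper. This is an illustration under assumptions: parse_llm_report and the parse_error field are hypothetical names, and the max_tokens side is not shown because its exact parameter placement depends on the LLM provider.

```python
import json


def parse_llm_report(raw_text):
    """Return a report dict; never raise on truncated or malformed JSON."""
    try:
        return json.loads(raw_text.strip())
    except json.JSONDecodeError as err:
        # Keep the broken text so the failure can be investigated later
        return {
            "executive_summary": "SYSTEM ERROR: LLM output truncated or malformed.",
            "threat_qualification": "Unknown",
            "risk_score": -1,
            "parse_error": str(err),
            "raw_unparsed_text": raw_text,
        }


good_report = parse_llm_report('{"risk_score": 92}')
broken_report = parse_llm_report('{"risk_score": 92')  # missing closing brace
```

Using a sentinel such as risk_score = -1 makes failed cycles easy to filter in a search, while raw_unparsed_text keeps the evidence of what the model actually returned.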
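🛠️ Code Optimization 2: The Flattening Revolution
Optimization plan: Introduce a universally unique identifier (UUID). At the start of each scheduling cycle we generate a session_id and then break the giant master_payload apart for good. The whole process is split into three independent, flat events written to Splunk, linked to one another like a chain through session_id:
- First log: event_type="PEAK_Plan" (written to disk as soon as the plan arrives, so a later crash cannot take the plan down with it).
- Second log: event_type="PEAK_Evidence" (written to disk as soon as evidence collection finishes).
- Third log: event_type="PEAK_Final_Report" (written to disk once the final score is in).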
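The three-event layout described above can be sketched independently of Splunk. In this minimal example build_flat_event is a hypothetical helper and the content values are placeholders invented for illustration; only the session_id and event_type fields come from the design above.

```python
import json
import uuid


def build_flat_event(session_id, event_type, content):
    """Serialize one small event that carries the shared session_id."""
    return json.dumps(
        {"session_id": session_id, "event_type": event_type, "content": content},
        ensure_ascii=False,
    )


# Generated once at the start of a scheduling cycle
session_id = str(uuid.uuid4())

# Placeholder contents; each event would be written as soon as it is ready
events = [
    build_flat_event(session_id, "PEAK_Plan", {"hypotheses": []}),
    build_flat_event(session_id, "PEAK_Evidence", [{"round_1_hit_count": 3}]),
    build_flat_event(session_id, "PEAK_Final_Report", {"risk_score": 92}),
]
```

Because each event is serialized on its own, a failure while producing the third one leaves the first two untouched.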
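💻 Final Hands-On: The Complete Day 10 Refactored Code
Open the Define & Test editor in Add-on Builder. Clear the existing code, then copy and paste the complete, architecture-optimized code below.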
python
import os
import sys
import time
import datetime
import json
import uuid
import splunklib.client as client
import splunklib.results as results

def execute_ai_spl(helper, service, spl_query):
    """
    Geek Helper Function: Execute SPL generated by AI and return the raw result data.
    """
    spl_query = spl_query.strip()
    # Skip empty queries instead of sending a bare "search" to Splunk
    if not spl_query:
        helper.log_error("[Agentic Engine] SKIPPED: Empty SPL query.")
        return []
    # Force the 'search' prefix on AI-generated SPL to prevent syntax errors
    if not spl_query.startswith("search") and not spl_query.startswith("|"):
        spl_query = "search " + spl_query
    kwargs_oneshot = {"output_mode": "json"}
    helper.log_info(f"[Agentic Engine] Running SPL: {spl_query}")
    try:
        # Execute oneshot synchronous search using Splunk Python SDK
        search_results = service.jobs.oneshot(spl_query, **kwargs_oneshot)
        reader = results.JSONResultsReader(search_results)
        result_data = []
        for result in reader:
            # Keep result rows only; the reader can also yield diagnostic messages
            if isinstance(result, dict):
                result_data.append(result)
        helper.log_info(f"[Agentic Engine] SUCCESS: Found {len(result_data)} events.")
        return result_data
    except Exception as e:
        helper.log_error(f"[Agentic Engine] FAILED: SPL execution error: {str(e)}")
        return []


def collect_events(helper, ew):
    """
    Day 10: Enterprise-Grade Flattened Architecture with Resilience.
    Handles Token Storms, JSON Truncation, and Splunk God-Object Anti-patterns.
    """
    helper.log_info("PEAK AI Hunter started the flattened execution cycle.")
    cycle_start_time = time.time()
    # Generate a unique Session ID to bind all flattened events together
    hunt_session_id = str(uuid.uuid4())
    helper.log_info(f"Initialized new Hunt Session ID: {hunt_session_id}")
    try:
        # ==========================================
        # STEP 1: Secure Configuration Setup
        # ==========================================
        session_key = getattr(helper, 'session_key', None)
        if not session_key and hasattr(helper, '_input_definition'):
            session_key = getattr(helper._input_definition, 'metadata', {}).get('session_key')
        if not session_key:
            helper.log_error("Failed to acquire session_key. Authentication failed.")
            return
        service = client.Service(token=session_key)
        target_index = helper.get_output_index() or "main"
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in recent Python versions)
        timestamp_now = datetime.datetime.now(datetime.timezone.utc).isoformat()
        # ==========================================
        # STEP 2: The Prepare Phase Blueprint & Immediate Ingestion
        # ==========================================
        mock_llm_json_string = """
        {
            "analysis": "Discovered rare MySQL 1045 errors indicating potential brute-force activity.",
            "hypotheses": [
                {
                    "hypothesis_id": 1,
                    "ABLE": {
                        "Actor": "External attacker",
                        "Behavior": "Database credential brute-forcing (T1110)",
                        "Location": "MySQL Database Server",
                        "Evidence": "High volume of status=1045 logs in a short timeframe"
                    },
                    "spl_round_1_validation": "search index=main status=1045 | stats count by src_ip | sort -count",
                    "spl_round_2_drilldown": "search index=main src_ip=192.168.1.10 | transaction maxspan=5m"
                }
            ]
        }
        """
        # Parse blueprint
        ai_hunting_plan = json.loads(mock_llm_json_string)
        hypotheses = ai_hunting_plan.get("hypotheses", [])
        # FLATTENING REVOLUTION 1: Write the plan to Splunk IMMEDIATELY
        plan_event = helper.new_event(
            source=helper.get_input_type(),
            index=target_index,
            sourcetype="_json",
            data=json.dumps({
                "session_id": hunt_session_id,
                "event_type": "PEAK_Plan",
                "timestamp": timestamp_now,
                "content": ai_hunting_plan
            }, ensure_ascii=False)
        )
        ew.write_event(plan_event)
        # ==========================================
        # STEP 3: The Execute Phase (Agentic Loop)
        # ==========================================
        all_hunt_evidence = []
        for i, hyp in enumerate(hypotheses):
            hyp_start_time = time.time()
            helper.log_info(f"=== Executing Hunt for Hypothesis [{i+1}] ===")
            spl_r1 = hyp.get("spl_round_1_validation", "").replace("{target_index}", target_index)
            spl_r2 = hyp.get("spl_round_2_drilldown", "").replace("{target_index}", target_index)
            r1_results = execute_ai_spl(helper, service, spl_r1)
            r2_results = execute_ai_spl(helper, service, spl_r2)
            evidence_package = {
                "hypothesis_id": hyp.get("hypothesis_id", i+1),
                # Use .get() so a hypothesis without an ABLE block does not abort the cycle
                "threat_behavior": hyp.get("ABLE", {}).get("Behavior"),
                "round_1_hit_count": len(r1_results),
                "round_2_hit_count": len(r2_results),
                "execution_duration_sec": round(time.time() - hyp_start_time, 2)
            }
            all_hunt_evidence.append(evidence_package)
        # FLATTENING REVOLUTION 2: Write the collected evidence to Splunk
        evidence_event = helper.new_event(
            source=helper.get_input_type(),
            index=target_index,
            sourcetype="_json",
            data=json.dumps({
                "session_id": hunt_session_id,
                "event_type": "PEAK_Evidence",
                "timestamp": timestamp_now,
                "content": all_hunt_evidence
            }, ensure_ascii=False)
        )
        ew.write_event(evidence_event)
        # ==========================================
        # STEP 4: The Act Phase & Bulletproof Fallback
        # ==========================================
        helper.log_info("Triggering LLM API for Final Assessment...")
        # Mocking an LLM response (Imagine this was generated via requests.post with max_tokens)
        mock_act_response = """
        {
            "executive_summary": "Confirmed Threat: High volume of credential brute-forcing detected.",
            "threat_qualification": "Confirmed Threat",
            "risk_score": 92
        }
        """
        final_report = {}
        # BULLETPROOF FALLBACK: Handle JSON Truncation gracefully
        try:
            final_report = json.loads(mock_act_response.strip())
        except json.JSONDecodeError as e:
            helper.log_error(f"Token Storm / JSON Truncation detected! Preserving raw text. Detail: {str(e)}")
            # Rescue the raw, broken string instead of crashing the pipeline
            final_report = {
                "executive_summary": "SYSTEM ERROR: LLM Token Limit Exceeded or Output Truncated.",
                "threat_qualification": "Unknown",
                "risk_score": -1,
                "raw_unparsed_text": mock_act_response
            }
        # FLATTENING REVOLUTION 3: Write the final report as a standalone event
        report_event = helper.new_event(
            source=helper.get_input_type(),
            index=target_index,
            sourcetype="_json",
            data=json.dumps({
                "session_id": hunt_session_id,
                "event_type": "PEAK_Final_Report",
                "timestamp": timestamp_now,
                "content": final_report
            }, ensure_ascii=False)
        )
        ew.write_event(report_event)
        total_cycle_time = round(time.time() - cycle_start_time, 2)
        helper.log_info(f"SUCCESS: Flattened architecture pipeline completed in {total_cycle_time} seconds. Session ID: {hunt_session_id}")
    except Exception as e:
        helper.log_error(f"Critical Failure in Enterprise Architecture Workflow: {str(e)}")
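🔍 Geek Verification: Experience the "Stitching Magic" of Flattened Data
Click the Test button in the upper-right corner of AOB, then click Save to save the code.
Previously you had to dig through one giant JSON to find your data. Now the data is split into three lightweight log events. When we build the dashboard on Day 16, how do you stitch them back together?
Go to Search & Reporting in Splunk and run this enterprise-style query: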
spl
index=main sourcetype="_json" (event_type="PEAK_Plan" OR event_type="PEAK_Evidence" OR event_type="PEAK_Final_Report")
| stats
    latest(content.risk_score) as Risk_Score,
    latest(content.executive_summary) as Summary,
    sum(content{}.round_1_hit_count) as Total_R1_Hits,
    sum(content{}.round_2_hit_count) as Total_R2_Hits
    by session_id
| sort - Risk_Score
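🎉 Sign that the re-architecture worked: Splunk should now return results quickly. Using stats by session_id, Splunk stitches the three independent log events back together into a single row per hunt session.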
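- If the LLM output breaks: Risk_Score shows -1 and the error details are visible, but the plan and the evidence are still safely stored on disk.
- Performance gain: each log event is much smaller, which eases the memory pressure on the Splunk indexers and makes it far less likely that complex JSON will hang the system.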
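For intuition about what the stats by session_id query computes, here is a rough pure-Python analogue. It is only an approximation for illustration: stitch_by_session and the sample events are hypothetical, and Splunk's latest() picks values by event time, which this sketch does not model.

```python
from collections import defaultdict


def stitch_by_session(events):
    """Group flat events by session_id and collect the summary fields."""
    sessions = defaultdict(lambda: {
        "Risk_Score": None, "Summary": None,
        "Total_R1_Hits": 0, "Total_R2_Hits": 0,
    })
    for event in events:
        row = sessions[event["session_id"]]
        content = event["content"]
        if event["event_type"] == "PEAK_Final_Report":
            row["Risk_Score"] = content.get("risk_score")
            row["Summary"] = content.get("executive_summary")
        elif event["event_type"] == "PEAK_Evidence":
            # Evidence content is a list with one entry per hypothesis
            for item in content:
                row["Total_R1_Hits"] += item.get("round_1_hit_count", 0)
                row["Total_R2_Hits"] += item.get("round_2_hit_count", 0)
    return dict(sessions)


# Invented sample data: one session whose final report hit the fallback path
sample_events = [
    {"session_id": "s1", "event_type": "PEAK_Plan", "content": {"hypotheses": []}},
    {"session_id": "s1", "event_type": "PEAK_Evidence",
     "content": [{"round_1_hit_count": 3, "round_2_hit_count": 1},
                 {"round_1_hit_count": 2, "round_2_hit_count": 0}]},
    {"session_id": "s1", "event_type": "PEAK_Final_Report",
     "content": {"risk_score": -1, "executive_summary": "SYSTEM ERROR"}},
]
stitched = stitch_by_session(sample_events)
```

Even though the final report in this sample carries the -1 sentinel, the hit counts from the evidence event are still recovered, which is the point of keeping the three events separate.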
Mission accomplished! Your underlying Python engine now does more than just run: it has the enterprise-grade traits of resilience and decoupling. We are ready for the next challenge.