Day10：直面深水区——总结系统痛点与底层架构重塑

🚀 Day 10：直面深水区 ------ 总结系统痛点与底层架构重塑

今日目标 ：在经历了前几天的狂欢，系统成功跑通了 AI 闭环后，我们必须立刻脱下"安全分析师"的外套，换上"底层架构师 "的思维。在真实的企业级环境中，海量数据与 API 限制会轻易摧毁我们在 Day 9 构建的"完美结构"。今天，我们将直面大模型开发中最致命的两大痛点：Token 额度熔断 与 JSON 超长嵌套截断。通过引入防弹兜底逻辑与"扁平化革命"，我们将完成从"能跑通的玩具脚本"到"坚不可摧的企业级引擎"的蜕变！

🚨 架构师复盘：为什么 Day 9 的代码在实战中必定会崩溃？

在 Day 9，我们把"初始蓝图"、"执行证据"和"最终战报"塞进了一个巨大的 master_payload 字典中，最后一次性写入 Splunk。这种设计在软件工程中被称为"上帝对象（God Object）反模式"。

它会带来两个致命的灾难：

大模型 Token 额度熔断（Token Storm） ：如果 Day 8 收集到的证据极为庞大，或者 AI 在 Day 9 分析时"长篇大论"，生成的 JSON 极易超过 API 的单次上下文输出上限（例如 4000 Tokens）。一旦触顶，API 会硬生生切断输出，导致返回的 JSON 缺失结尾的 } 括号。Python 在执行 json.loads() 时会当场抛出 JSONDecodeError 异常，导致整个 5 分钟的调度任务直接挂掉。
Splunk 嵌套解析崩溃（The Nested JSON Trap）：把数百行、嵌套了四五层的巨型 JSON 单行日志强行塞给 Splunk，不仅会严重拖垮底层全文检索（Text-search）的 I/O 性能，还会导致 Splunk 前端的字段自动提取器（Auto KV Extraction）罢工，让你在大屏上根本无法提取深层字段。

🛠️ 代码优化 1：防弹衣 (Bulletproof Fallback)

优化方案：

在未来接入真实网络请求时，必须在 API 参数中强制指定 max_tokens。
在代码中编写极其强壮的 try...except json.JSONDecodeError 兜底逻辑。核心原则：即便 JSON 断裂，程序也不能死，必须将残缺的原始文本（Raw Text）作为"错误事件"写入日志，保留现场！

🛠️ 代码优化 2：扁平化革命 (The Flattening Revolution)

优化方案 ：引入全局唯一标识符（UUID）。我们在单次调度周期开始时生成一个 session_id，然后彻底打碎那个巨型的 master_payload。把整个过程拆分为三个独立且扁平的事件 写入 Splunk，它们通过 session_id 像链条一样相互关联：

第一条日志：event_type="PEAK_Plan"（刚拿到计划就立刻落盘，防止后续崩溃导致计划也丢失）。
第二条日志：event_type="PEAK_Evidence"（收集完证据立刻落盘）。
第三条日志：event_type="PEAK_Final_Report"（拿到最终打分后落盘）。

💻 终极实战：Day 10 全量架构重构代码

请打开 Add-on Builder 的 Define & Test 编辑器。清空原有代码，复制并粘贴这套经过终极架构优化的全量代码。

python 复制代码

import os
import sys
import time
import datetime
import json
import uuid
import splunklib.client as client
import splunklib.results as results

def execute_ai_spl(helper, service, spl_query):
    """
    Geek Helper Function: Execute SPL generated by AI and return the raw result data.
    """
    spl_query = spl_query.strip()
    
    # Force the 'search' prefix on AI-generated SPL to prevent syntax errors
    if not spl_query.startswith("search") and not spl_query.startswith("|"):
        spl_query = "search " + spl_query
        
    kwargs_oneshot = {"output_mode": "json"}
    helper.log_info(f"[Agentic Engine] Running SPL: {spl_query}")
    
    try:
        # Execute oneshot synchronous search using Splunk Python SDK
        search_results = service.jobs.oneshot(spl_query, **kwargs_oneshot)
        reader = results.JSONResultsReader(search_results)
        
        result_data = []
        for result in reader:
            if isinstance(result, dict):
                result_data.append(result)
                
        helper.log_info(f"[Agentic Engine] SUCCESS: Found {len(result_data)} events.")
        return result_data
    except Exception as e:
        helper.log_error(f"[Agentic Engine] FAILED: SPL execution error: {str(e)}")
        return []

def collect_events(helper, ew):
    """
    Day 10: Enterprise-Grade Flattened Architecture with Resilience.
    Handles Token Storms, JSON Truncation, and Splunk God-Object Anti-patterns.
    """
    helper.log_info("PEAK AI Hunter started the flattened execution cycle.")
    cycle_start_time = time.time()

    # Generate a unique Session ID to bind all flattened events together
    hunt_session_id = str(uuid.uuid4())
    helper.log_info(f"Initialized new Hunt Session ID: {hunt_session_id}")

    try:
        # ==========================================
        # STEP 1: Secure Configuration Setup
        # ==========================================
        session_key = getattr(helper, 'session_key', None)
        if not session_key and hasattr(helper, '_input_definition'):
            session_key = getattr(helper._input_definition, 'metadata', {}).get('session_key')
            
        if not session_key:
            helper.log_error("Failed to acquire session_key. Authentication failed.")
            return

        service = client.Service(token=session_key)
        target_index = helper.get_output_index() or "main"
        timestamp_now = datetime.datetime.utcnow().isoformat()

        # ==========================================
        # STEP 2: The Prepare Phase Blueprint & Immediate Ingestion
        # ==========================================
        mock_llm_json_string = """
        {
            "analysis": "Discovered rare MySQL 1045 errors indicating potential brute-force activity.",
            "hypotheses": [
                {
                    "hypothesis_id": 1,
                    "ABLE": {
                        "Actor": "External attacker",
                        "Behavior": "Database credential brute-forcing (T1110)",
                        "Location": "MySQL Database Server",
                        "Evidence": "High volume of status=1045 logs in a short timeframe"
                    },
                    "spl_round_1_validation": "search index=main status=1045 | stats count by src_ip | sort -count",
                    "spl_round_2_drilldown": "search index=main src_ip=192.168.1.10 | transaction maxspan=5m"
                }
            ]
        }
        """

        # Parse blueprint
        ai_hunting_plan = json.loads(mock_llm_json_string)
        hypotheses = ai_hunting_plan.get("hypotheses", [])

        # FLATTENING REVOLUTION 1: Write the plan to Splunk IMMEDIATELY
        plan_event = helper.new_event(
            source=helper.get_input_type(),
            index=target_index,
            sourcetype="_json",
            data=json.dumps({
                "session_id": hunt_session_id,
                "event_type": "PEAK_Plan",
                "timestamp": timestamp_now,
                "content": ai_hunting_plan
            }, ensure_ascii=False)
        )
        ew.write_event(plan_event)
        
        # ==========================================
        # STEP 3: The Execute Phase (Agentic Loop)
        # ==========================================
        all_hunt_evidence = [] 
        
        for i, hyp in enumerate(hypotheses):
            hyp_start_time = time.time() 
            helper.log_info(f"=== Executing Hunt for Hypothesis [{i+1}] ===")
            
            spl_r1 = hyp.get("spl_round_1_validation", "").replace("{target_index}", target_index)
            spl_r2 = hyp.get("spl_round_2_drilldown", "").replace("{target_index}", target_index)
            
            r1_results = execute_ai_spl(helper, service, spl_r1)
            r2_results = execute_ai_spl(helper, service, spl_r2)
            
            evidence_package = {
                "hypothesis_id": hyp.get("hypothesis_id", i+1),
                "threat_behavior": hyp['ABLE'].get('Behavior'),
                "round_1_hit_count": len(r1_results),
                "round_2_hit_count": len(r2_results),
                "execution_duration_sec": round(time.time() - hyp_start_time, 2)
            }
            all_hunt_evidence.append(evidence_package)

        # FLATTENING REVOLUTION 2: Write the collected evidence to Splunk
        evidence_event = helper.new_event(
            source=helper.get_input_type(),
            index=target_index,
            sourcetype="_json",
            data=json.dumps({
                "session_id": hunt_session_id,
                "event_type": "PEAK_Evidence",
                "timestamp": timestamp_now,
                "content": all_hunt_evidence
            }, ensure_ascii=False)
        )
        ew.write_event(evidence_event)

        # ==========================================
        # STEP 4: The Act Phase & Bulletproof Fallback
        # ==========================================
        helper.log_info("Triggering LLM API for Final Assessment...")
        
        # Mocking an LLM response (Imagine this was generated via requests.post with max_tokens)
        mock_act_response = """
        {
            "executive_summary": "Confirmed Threat: High volume of credential brute-forcing detected.",
            "threat_qualification": "Confirmed Threat",
            "risk_score": 92
        }
        """
        
        final_report = {}
        # BULLETPROOF FALLBACK: Handle JSON Truncation gracefully
        try:
            final_report = json.loads(mock_act_response.strip())
        except json.JSONDecodeError as e:
            helper.log_error(f"Token Storm / JSON Truncation detected! Preserving raw text. Detail: {str(e)}")
            # Rescue the raw, broken string instead of crashing the pipeline
            final_report = {
                "executive_summary": "SYSTEM ERROR: LLM Token Limit Exceeded or Output Truncated.",
                "threat_qualification": "Unknown",
                "risk_score": -1,
                "raw_unparsed_text": mock_act_response
            }

        # FLATTENING REVOLUTION 3: Write the final report as a standalone event
        report_event = helper.new_event(
            source=helper.get_input_type(),
            index=target_index,
            sourcetype="_json",
            data=json.dumps({
                "session_id": hunt_session_id,
                "event_type": "PEAK_Final_Report",
                "timestamp": timestamp_now,
                "content": final_report
            }, ensure_ascii=False)
        )
        ew.write_event(report_event)
        
        total_cycle_time = round(time.time() - cycle_start_time, 2)
        helper.log_info(f"SUCCESS: Flattened architecture pipeline completed in {total_cycle_time} seconds. Session ID: {hunt_session_id}")

    except Exception as e:
        helper.log_error(f"Critical Failure in Enterprise Architecture Workflow: {str(e)}")

🔍 极客验证：体验扁平化数据的"拼接魔法"

点击 AOB 右上角的 Test 按钮，并且点击 Save 保存代码。

以前你是在一条巨型 JSON 里找数据，现在数据被拆成了三条轻巧的日志。未来到了 Day 16 做大屏时，你要怎么把它们拼回来呢？

请前往 Splunk 的 Search & Reporting，执行这段极具艺术感的企业级查询：

spl 复制代码

index=main sourcetype="_json" event_type="PEAK_Plan" OR event_type="PEAK_Evidence" OR event_type="PEAK_Final_Report"
| stats 
    latest(content.risk_score) as Risk_Score,
    latest(content.executive_summary) as Summary,
    sum(content{}.round_1_hit_count) as Total_R1_Hits,
    sum(content{}.round_2_hit_count) as Total_R2_Hits
  by session_id
| sort - Risk_Score

🎉 架构重塑成功标志： 此时，你会看到 Splunk 以极其丝滑的速度瞬间返回结果！利用强大的 stats by session_id 机制，Splunk 在毫秒间就将这三个独立的日志重新拼接在了一起。

如果大模型崩溃了 ：Risk_Score 会显示为 -1，并且你能看到错误详情，但计划和证据依然安全地保存在硬盘里！
性能飞跃：日志体积大幅减小，极大释放了 Splunk 索引器的内存压力，再也不用担心复杂 JSON 导致系统假死。

大功告成！ 你的底层 Python 引擎现在不仅能跑，而且具备了"容错（Resilience）"与"解耦（Decoupling）"的企业级血统。我们随时准备迎接下一个挑战！