AAAI 2026: Curated List of LLM Safety Papers
Master index — LLM Safety Research Paper Collections, 2026 edition: https://blog.csdn.net/WhiffeYF/article/details/159047894
Statistics: 27 papers collected in total, from AAAI 2026 (the 40th conference, January 2026, Singapore)
Main sources: the AI Alignment special track (Vol. 40 No. 44) and the main technical tracks (NLP / ML, etc.)
Category overview:
- Jailbreak attacks: 10 papers
- Defense & alignment: 10 papers
- Benchmarks & evaluation: 4 papers
- Privacy & data security: 2 papers
- Agent safety: 1 paper
1 Jailbreak Attacks
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs | AI Alignment | |
| 2 | Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment | AI Alignment | |
| 3 | HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor | AI Alignment | |
| 4 | StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak | AI Alignment | |
| 5 | STACK: Adversarial Attacks on LLM Safeguard Pipelines | AI Alignment | |
| 6 | Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment | AI Alignment | |
| 7 | Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models | AI Alignment | |
| 8 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Main Track (NLP) | |
| 9 | Multi-Turn Jailbreaking Large Language Models via Attention Shifting | Main Track (NLP) | |
| 10 | Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (CognitiveAttack) | Main Track (NLP) | |
2 Defense & Alignment
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | AlignTree: Efficient Defense Against LLM Jailbreak Attacks | AI Alignment | |
| 2 | EASE: Practical and Efficient Safety Alignment for Small Language Models | AI Alignment | |
| 3 | Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training | AI Alignment | |
| 4 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | AI Alignment | |
| 5 | CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing | AI Alignment | |
| 6 | WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety | Main Track (NLP) | |
| 7 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | AI Alignment | |
| 8 | Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks | AI Alignment | |
| 9 | MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting | Main Track (NLP) | dblp |
| 10 | AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs | Main Track (NLP) | dblp |
3 Benchmarks & Evaluation
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models | AI Alignment | |
| 2 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | AI Alignment | |
| 3 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Main Track | |
| 4 | Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding | AI Alignment | |
4 Privacy & Data Security
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense | AI Alignment | |
| 2 | Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models | AI Alignment | |
5 Agent Safety
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems | AI Alignment | |
6 Other Related Papers (Alignment Theory / Reasoning Safety / Hallucination Detection / Interpretability, etc.)
The following papers do not fall squarely under "attack/defense" but are closely related to LLM safety:
| id | Paper | Topic | Link |
|---|---|---|---|
| 1 | DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | Reasoning redundancy / overthinking | |
| 2 | Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning | Reasoning safety / CoT reliability | |
| 3 | Bolster Hallucination Detection via Prompt-Guided Data Augmentation | Hallucination detection | |
| 4 | Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models | Hallucination / reliability | |
| 5 | Silenced Biases: The Dark Side LLMs Learned to Refuse | Alignment side effects / over-refusal | |
| 6 | Beyond I'm Sorry, I Can't: Dissecting Large-Language-Model Refusal | Refusal mechanism analysis | |
| 7 | Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation | Misalignment from fine-tuning | |
| 8 | AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment | Backdoor attacks | |
| 9 | Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? | Machine unlearning | |
| 10 | Polarity-Aware Probing for Quantifying Latent Alignment in Language Models | Alignment interpretability | |
| 11 | FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight | Flawed-reasoning detection | |
| 12 | Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning | Multimodal backdoor attacks | |
| 13 | Security Attacks on LLM-based Code Completion Tools | Code-tool security | |
| 14 | MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control | Agent safety evaluation | |
| 15 | MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies | Jailbreak attacks | dblp |
| 16 | From Chaos to Cure: A Prefix Heuristics Guided Model-Agnostic Adaptive Detoxification Framework | Detoxification defense | dblp |
| 17 | An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains | Supply-chain backdoors | AAAI 2026 |
Notes
- The papers above were gathered primarily via a complete scan of the AI Alignment special track (Vol. 40 No. 44), supplemented by keyword searches for safety-related topics across the NLP / ML main technical tracks and the dblp index
- AAAI 2026 received roughly 29,000 submissions, and safety-related papers are scattered across many tracks (NLP I-VI, ML I-XI, Application Domains, etc.), so the list above may not cover every safety paper in every track
- Paper links all point to the official AAAI Press proceedings: https://ojs.aaai.org/index.php/AAAI/
- Full AI Alignment track table of contents: https://ojs.aaai.org/index.php/AAAI/issue/view/726
- Full AAAI 2026 proceedings index: https://aaai.org/proceeding/aaai-40-2026/