AAAI 2026: Curated List of LLM Safety Papers
Master index — LLM Safety Research Paper Collections, 2026 edition: https://blog.csdn.net/WhiffeYF/article/details/159047894
Statistics: 27 papers collected in total, from AAAI 2026 (the 40th conference, January 2026, Singapore)
Main sources: the AI Alignment special track (Vol. 40 No. 44) and the main technical tracks (NLP / ML, etc.)
Category overview:
- Jailbreak attacks: 10 papers
- Defense & alignment: 10 papers
- Benchmarks & evaluation: 4 papers
- Privacy & data security: 2 papers
- Agent safety: 1 paper
1 Jailbreak Attacks
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs | AI Alignment | |
| 2 | Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment | AI Alignment | |
| 3 | HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor | AI Alignment | |
| 4 | StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak | AI Alignment | |
| 5 | STACK: Adversarial Attacks on LLM Safeguard Pipelines | AI Alignment | |
| 6 | Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment | AI Alignment | |
| 7 | Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models | AI Alignment | |
| 8 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Main Track (NLP) | |
| 9 | Multi-Turn Jailbreaking Large Language Models via Attention Shifting | Main Track (NLP) | |
| 10 | Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (CognitiveAttack) | Main Track (NLP) | |
2 Defense & Alignment
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | AlignTree: Efficient Defense Against LLM Jailbreak Attacks | AI Alignment | |
| 2 | EASE: Practical and Efficient Safety Alignment for Small Language Models | AI Alignment | |
| 3 | Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training | AI Alignment | |
| 4 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | AI Alignment | |
| 5 | CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing | AI Alignment | |
| 6 | WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety | Main Track (NLP) | |
| 7 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | AI Alignment | |
| 8 | Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks | AI Alignment | |
| 9 | MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting | Main Track (NLP) | dblp |
| 10 | AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs | Main Track (NLP) | dblp |
3 Benchmarks & Evaluation
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models | AI Alignment | |
| 2 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | AI Alignment | |
| 3 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Main Track | |
| 4 | Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding | AI Alignment | |
4 Privacy & Data Security
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense | AI Alignment | |
| 2 | Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models | AI Alignment | |
5 Agent Safety
| id | Paper | Track | Link |
|---|---|---|---|
| 1 | Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems | AI Alignment | |
6 Other Related Papers (Alignment Theory / Reasoning Safety / Hallucination Detection / Interpretability, etc.)
The following papers do not fall squarely under "attack/defense" but are closely related to LLM safety:
| id | Paper | Topic | Link |
|---|---|---|---|
| 1 | DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | Reasoning redundancy / overthinking | |
| 2 | Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning | Reasoning safety / CoT reliability | |
| 3 | Bolster Hallucination Detection via Prompt-Guided Data Augmentation | Hallucination detection | |
| 4 | Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models | Hallucination / reliability | |
| 5 | Silenced Biases: The Dark Side LLMs Learned to Refuse | Alignment side effects / over-refusal | |
| 6 | Beyond I'm Sorry, I Can't: Dissecting Large-Language-Model Refusal | Refusal mechanism analysis | |
| 7 | Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation | Misalignment from fine-tuning | |
| 8 | AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment | Backdoor attacks | |
| 9 | Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? | Machine unlearning | |
| 10 | Polarity-Aware Probing for Quantifying Latent Alignment in Language Models | Alignment interpretability | |
| 11 | FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight | Flawed-reasoning detection | |
| 12 | Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning | Multimodal backdoor attacks | |
| 13 | Security Attacks on LLM-based Code Completion Tools | Code-tool security | |
| 14 | MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control | Agent safety evaluation | |
| 15 | MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies | Jailbreak attacks | dblp |
| 16 | From Chaos to Cure: A Prefix Heuristics Guided Model-Agnostic Adaptive Detoxification Framework | Detoxification defense | dblp |
| 17 | An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains | Supply-chain backdoors | AAAI 2026 |
Notes
- The papers above were gathered primarily via a complete scan of the AI Alignment special track (Vol. 40 No. 44), supplemented by keyword searches for safety-related topics across the NLP / ML main technical tracks and the dblp index
- AAAI 2026 received roughly 29,000 submissions, and safety-related papers are scattered across many tracks (NLP I-VI, ML I-XI, Application Domains, etc.), so the list above may not cover every safety paper in every track
- Paper links all point to the official AAAI Press proceedings: https://ojs.aaai.org/index.php/AAAI/
- Full AI Alignment track table of contents: https://ojs.aaai.org/index.php/AAAI/issue/view/726
- Full AAAI 2026 proceedings index: https://aaai.org/proceeding/aaai-40-2026/