# ICLR 2026 LLM Safety Papers: A Curated List
Compiled: 2026-04-11
## Overview
ICLR 2026 accepted 5300+ papers (223 of them Orals). This list curates roughly 50 papers directly related to large language model (LLM) safety, spanning both Oral and Poster presentations.
The papers are grouped into 9 thematic categories, plus an appendix of safety-related Orals:
| Category | Count | Notes |
|---|---|---|
| 1. Jailbreak Attacks | 5 | Prompt rewriting, gradient-based optimization, multi-armed bandits, Classical Chinese, and other jailbreak methods |
| 2. Reasoning-Model Safety | 5 | Chain-of-thought hijacking, reasoning-process alignment, robustness to CoT interventions |
| 3. Safety Alignment & Defense | 11 | RL-based safety alignment, reasoning-style defenses, multilingual consistency, in-decoding probing |
| 4. Fine-tuning / Backdoor Attacks | 5 | LoRA backdoors, steganographic malicious fine-tuning, harmful-gradient attenuation defenses |
| 5. Agent Safety | 6 | Decomposition-attack monitoring, control-flow hijacking, agent-to-agent security benchmarks |
| 6. Multimodal Safety | 6 | VLM jailbreak transfer, audio-model jailbreaks, visual backdoor attacks |
| 7. Safety Evaluation & Benchmarks | 2 | Multi-turn jailbreak benchmark, audio trustworthiness benchmark |
| 8. Code / Generation Safety | 5 | Secure code generation, watermarking, deepfake detection |
| 9. Other Related Work | 5 | Activation steering, honesty alignment, bias amplification, concept erasure |
| Appendix: Oral Safety Papers | 11 | Constitutional Classifiers++, ASIDE, UltraBreak, etc. |
Taken together, the safety papers at ICLR 2026 show several trends: ① reasoning-model safety has become a new hotspot, with several papers on chains of thought being hijacked or exploited; ② jailbreak attack and defense have entered a compositional, automated stage, with methods such as dictionary learning and meta-optimization appearing; ③ agent safety is growing fast as an emerging direction, with decomposition attacks and control-flow hijacking drawing attention; ④ safety alignment is moving from shallow to deep, with multiple papers exploring any-depth alignment and reasoning-based alignment.
## 1. Jailbreak Attacks
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges (AMIS) | poster/10008164 | Proposes AMIS, a meta-optimization framework that co-evolves jailbreak prompts and scoring templates, automating jailbreaks via bilevel optimization |
| 2 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (CC-BOS) | openreview | Generates jailbreak prompts with bio-inspired optimization over an 8-dimensional search space grounded in Classical Chinese; reports 100% ASR even against reasoning models |
| 3 | Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks | poster/10009061 | Proposes the "adversarial déjà vu" hypothesis that future jailbreaks are combinations of existing adversarial skill primitives, and uses dictionary learning to improve generalization to unseen attacks |
| 4 | One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | openreview | Studies the generation of robust jailbreak prompts that transfer across LLMs |
| 5 | Efficient Jailbreak Attack Sequences on LLMs via Multi-Armed Bandit-Based Context Switching | openreview | Builds efficient jailbreak attack sequences via multi-armed bandit-based context switching (a toy bandit sketch follows this table) |
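To make the multi-armed bandit framing in entry 5 concrete, here is a minimal sketch of the general idea, not the paper's actual algorithm: each arm is a conversational context the attacker can switch into, a standard UCB1 policy decides which context to try next, and the reward comes from a judge scoring attack progress. The `contexts` list and the `judge_score` stub are hypothetical placeholders.

```python
import math
import random

# Hypothetical arms: each "context" is a conversational framing the attacker can switch to.
contexts = ["roleplay", "translation", "code-review", "historical-fiction"]

counts = [0] * len(contexts)    # times each context was tried
values = [0.0] * len(contexts)  # running mean of observed rewards

def judge_score(context: str) -> float:
    """Placeholder for an attack-progress judge returning a score in [0, 1]."""
    return random.random()  # stand-in; a real loop would query the target model and a grader

def select_arm(t: int) -> int:
    """UCB1: play each arm once, then trade off mean reward against uncertainty."""
    for i, c in enumerate(counts):
        if c == 0:
            return i
    ucb = [values[i] + math.sqrt(2 * math.log(t) / counts[i]) for i in range(len(contexts))]
    return max(range(len(contexts)), key=lambda i: ucb[i])

for t in range(1, 51):  # 50 context switches
    arm = select_arm(t)
    reward = judge_score(contexts[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print(max(zip(values, contexts)))  # the most promising context under this bandit
```

In a real attack loop, `judge_score` would query the target model; the bandit simply concentrates queries on the contexts that have historically yielded the most progress.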
## 2. Reasoning & Chain-of-Thought Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention (IPO) | forum/2uTxLC4LmC | Proposes Intervened Preference Optimization (IPO), which replaces compliance steps with safety triggers to align the reasoning process itself |
| 2 | Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check | openreview | An answer-then-check approach to reasoning-based safety alignment |
| 3 | Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training | openreview | Shows that after benign reasoning training, models can reason themselves out of their safety alignment |
| 4 | AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models | poster/10007590 | Adversarial chain-of-thought tuning for robust safety alignment of large reasoning models |
| 5 | Are Reasoning LLMs Robust to Interventions on their Chain-of-Thought? | poster/10008704 | Studies how robust reasoning LLMs are to interventions on their chain-of-thought |
## 3. Safety Alignment & Defense
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning | poster/10011731 | Incentivizes intrinsic safety awareness via extremely simplified RL (binary safety labels and fewer than 200 RL steps), achieving reasoning-style safety alignment |
| 2 | ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning | poster/10009011 | Proposes a three-step reasoning defense pipeline (policy analysis → intent extraction → policy-based safety verification) that reduces ASR on OOD jailbreaks to 0.06 |
| 3 | AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint | poster/10011789 | Learns refusal steering under a principled null-space constraint (a null-space projection sketch follows this table) |
| 4 | A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space | poster/10011231 | A safety guardrail combining a safety-sensitive subspace with a harm-resistant null space |
| 5 | Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth | poster/10011912 | Unlocks LLMs' innate safety alignment from shallow prefixes to arbitrary generation depth |
| 6 | Alignment-Weighted DPO: A Principled Reasoning Approach to Improve Safety Alignment | poster/10009740 | An alignment-weighted DPO method grounded in principled reasoning to improve safety alignment |
| 7 | Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment | poster/10006879 | Align once, benefit multilingually: enforces cross-lingual consistency for LLM safety alignment |
| 8 | Aligning Deep Implicit Preferences by Learning to Reason Defensively | poster/10008837 | Aligns deep implicit preferences by learning to reason defensively |
| 9 | A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models | poster/10009223 | Any-order, any-step safety alignment for diffusion language models |
| 10 | SIRL: Self-Incentivized Reinforcement Learning for Safety (entropy-based safety RL) | openreview | Finds that response entropy is a reliable intrinsic signal of safety, and improves safety via entropy minimization without any external reward |
| 11 | From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | openreview | Turns refusal-aware injection attacks into a tool for safety alignment |
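Several entries above (AlphaSteer, the safety-guardrail paper, and the null-space policy-optimization Oral in the appendix) build on a null-space constraint. Here is a minimal sketch of that shared idea under assumptions of my own (placeholder activation data, an assumed subspace rank of 16), not any single paper's method: estimate the principal subspace of activations on benign prompts, then project a refusal-steering vector onto its orthogonal complement so that steering barely perturbs benign behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: hidden size 64, 500 benign-prompt activations,
# and a raw refusal-steering direction (e.g., mean harmful minus mean benign activation).
d = 64
benign_acts = rng.normal(size=(500, d))  # activations on benign prompts (placeholder data)
steer_raw = rng.normal(size=d)           # raw steering vector (placeholder)

# Principal subspace of benign activations via SVD of the centered matrix.
X = benign_acts - benign_acts.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 16                                   # assumed rank of the "benign" subspace
V = Vt[:k].T                             # (d, k) basis of the benign subspace

# Project the steering vector onto the orthogonal complement (an approximate null space),
# so adding it barely moves activations that lie in the benign subspace.
steer_null = steer_raw - V @ (V.T @ steer_raw)

def apply_steering(h: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the null-space-constrained refusal direction to a hidden state h."""
    return h + alpha * steer_null

# Sanity check: the constrained vector has (numerically) no component along benign directions.
print(np.abs(V.T @ steer_null).max())    # ~1e-15
```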
## 4. Fine-tuning & Backdoor Attacks
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | JailbreakLoRA: Your Downloaded LoRA from Sharing Platforms might be Unsafe | forum/4YgvVRoSnF | Reveals that LoRA adapters downloaded from sharing platforms may contain jailbreak backdoors |
| 2 | Invisible Safety Threat: Malicious Finetuning for LLM via Steganography | poster/10011363 | Malicious fine-tuning via steganography: the model looks safety-aligned on the surface while covertly generating harmful content |
| 3 | Revisiting Backdoor Attacks on LLMs | openreview | Revisits backdoor attacks on LLMs; proposes an implicit poisoning strategy that injects backdoors while preserving safety alignment |
| 4 | Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence | poster/10007199 | Strengthens defenses against harmful fine-tuning by attenuating the influence of harmful gradients (a gradient-attenuation sketch follows this table) |
| 5 | Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study | EPFL team | Safety subspaces are not linearly distinct: a fine-tuning case study |
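As a rough illustration of the "attenuate harmful gradient influence" idea in entry 4, the following is a simplified sketch rather than Antibody's actual method: estimate a per-tensor harmful direction from gradients on known-harmful data, then shrink the component of each fine-tuning gradient that points along it. The direction estimate and the `beta` factor are assumptions made for this example.

```python
import numpy as np

def attenuate_harmful(grad: np.ndarray, harmful_dir: np.ndarray, beta: float = 0.9) -> np.ndarray:
    """Shrink the component of `grad` that points along a known harmful direction.

    grad:        fine-tuning gradient for one parameter tensor (flattened).
    harmful_dir: direction estimated from gradients on known-harmful examples.
    beta:        fraction of the harmful-aligned component to remove
                 (0 = no-op, 1 = full projection out).
    """
    harmful_dir = harmful_dir / np.linalg.norm(harmful_dir)
    coef = grad @ harmful_dir          # alignment with the harmful direction
    if coef <= 0:
        return grad                    # only attenuate gradients pushing toward harm
    return grad - beta * coef * harmful_dir

# Toy usage: a gradient 45 degrees off the harmful direction has its harmful part damped.
g = np.array([1.0, 1.0])
h = np.array([1.0, 0.0])
print(attenuate_harmful(g, h))         # -> [0.1, 1.0] with beta = 0.9
```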
## 5. Agent Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems | openreview | Breaks and fixes defenses against control-flow hijacking in multi-agent systems |
| 2 | RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | openreview | Realistic adversarial testing of computer-use agents in hybrid web-OS environments |
| 3 | Monitoring Decomposition Attacks | openreview | Finds that a lightweight sequential monitor effectively defends against decomposition attacks; contributes a dataset of 4,634 harmful-benign task pairs (a sequential-monitor sketch follows this table) |
| 4 | Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols | poster/10006727 | Adaptive attacks on trusted monitors can subvert AI control protocols |
| 5 | A2ASecBench: A Protocol-Aware Security Benchmark for Agent-to-Agent Multi-Agent Systems | poster/10010017 | A protocol-aware security benchmark for agent-to-agent multi-agent systems |
| 6 | AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? | poster/10007726 | Traces which component induces failure in LLM agentic systems |
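Entry 3 concerns decomposition attacks, where a harmful task is split into individually innocuous-looking subtasks. As a minimal sketch of the general sequential-monitoring pattern (not the paper's monitor; `risk_score`, the toy keyword scorer, and the threshold are all placeholders): score each subtask in the context of its predecessors, and flag once cumulative risk crosses a threshold.

```python
from typing import Callable, Iterable

def sequential_monitor(
    subtasks: Iterable[str],
    risk_score: Callable[[str, list[str]], float],
    threshold: float = 1.0,
) -> bool:
    """Flag a decomposed task if cumulative, context-aware risk exceeds `threshold`.

    risk_score sees the current subtask *and* the history, since decomposition
    attacks hide intent in steps that only look harmful in combination.
    """
    history: list[str] = []
    total = 0.0
    for task in subtasks:
        total += risk_score(task, history)
        history.append(task)
        if total >= threshold:
            return True   # abort: the sequence as a whole looks harmful
    return False

# Toy scorer: a real system would use an LLM judge over (task, history).
def toy_score(task: str, history: list[str]) -> float:
    keywords = {"synthesize": 0.5, "acquire precursor": 0.5, "bypass filter": 0.6}
    return max((v for k, v in keywords.items() if k in task), default=0.05)

print(sequential_monitor(
    ["look up chemistry", "acquire precursor list", "synthesize compound"],
    toy_score,
))  # -> True: benign-looking steps add up
```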
## 6. Multimodal Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety | openreview | Maps the limits of joint multimodal understanding for AI safety |
| 2 | ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks | poster/10006730 | An adaptive red-teaming agent that performs comprehensive risk assessment of multimodal models via plug-and-play attacks |
| 3 | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | openreview | A benchmark of jailbreak vulnerabilities in audio language models: 11,316 text samples + 245,355 audio samples |
| 4 | GuardAlign: Safety Alignment for Vision-Language Models via Optimal Transport | openreview | Optimal-transport-based safety alignment for vision-language models |
| 5 | AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | poster/10010620 | Enhances the adversarial robustness of large vision-language models via preference optimization |
| 6 | BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning | ICLR 2026 Downloads list | Visual backdoor attacks on VLM-based embodied agents via contrastive trigger learning |
## 7. Safety Evaluation & Benchmarks
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | MultiBreak: Scalable Multi-Turn Jailbreak Benchmark | openreview | A scalable multi-turn jailbreak benchmark: 1,724 intents with multi-turn adversarial prompts, covering 9 coarse-grained and 26 fine-grained safety categories |
| 2 | AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models | poster | A benchmark of the multifaceted trustworthiness of audio large language models |
## 8. Code & Generation Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | SecCoderX: Secure Code Generation via Reasoning-Based Vulnerability Reward Model | openreview | Secure code generation via a reasoning-based vulnerability reward model; claimed to be the first to raise secure-code rates by 11-16% without degrading functionality |
| 2 | Analyzing and Evaluating Unbiased Language Model Watermark | poster/10011375 | Analyzes and evaluates unbiased language-model watermarks (a sketch of the generic unbiased-watermark idea follows this table) |
| 3 | An Ensemble Framework for Unbiased Language Model Watermarking | poster/10007956 | An ensemble framework for unbiased language-model watermarking |
| 4 | All Patches Matter: Enhance AI-Generated Image Detection via Panoptic Patch Learning | poster/10007395 | Enhances AI-generated image detection via panoptic patch learning |
| 5 | A Rich Knowledge Space for Scalable Deepfake Detection | poster/10008071 | A rich knowledge space for scalable deepfake detection |
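For background on the "unbiased watermark" entries 2-3: the best-known distortion-free construction is the Gumbel-trick watermark, which seeds a keyed PRNG with a hash of the recent tokens, draws one uniform variate per vocabulary item, and emits argmax r_i^(1/p_i); that choice is distributed exactly according to the model's probabilities, yet a detector holding the key can recompute the variates. The sketch below shows this generic scheme (with an assumed 4-token hash window and a toy key), not the specific constructions in these papers.

```python
import hashlib
import numpy as np

def seeded_uniforms(context: tuple[int, ...], vocab_size: int, key: bytes = b"wm-key") -> np.ndarray:
    """Derive per-token uniforms from a keyed hash of the recent context (assumed window = 4)."""
    h = hashlib.sha256(key + repr(context[-4:]).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "big"))
    return rng.random(vocab_size)

def watermarked_sample(probs: np.ndarray, context: tuple[int, ...]) -> int:
    """Gumbel-trick sampling: argmax_i r_i^(1/p_i) is distributed exactly as `probs`,
    so the watermark is unbiased, yet a keyed detector can recompute r."""
    r = seeded_uniforms(context, len(probs))
    with np.errstate(divide="ignore"):
        scores = np.log(r) / probs      # argmax r^(1/p) == argmax log(r)/p
    return int(np.argmax(scores))

def detect_score(token: int, vocab_size: int, context: tuple[int, ...]) -> float:
    """Detector statistic: watermarked tokens tend to have unusually large r_token."""
    r = seeded_uniforms(context, vocab_size)
    return float(-np.log(1.0 - r[token]))  # ~ Exp(1) under no watermark; large if watermarked

probs = np.array([0.5, 0.3, 0.2])
tok = watermarked_sample(probs, context=(101, 7, 42))
print(tok, detect_score(tok, len(probs), (101, 7, 42)))
```

Since the per-token statistic is Exp(1)-distributed under the null, summing it over a passage gives a simple hypothesis test for the watermark's presence.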
## 9. Other Related Work
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Activation Steering with a Feedback Controller | poster/10006765 | Activation steering with a feedback controller (representation engineering; a toy closed-loop sketch follows this table) |
| 2 | Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration | poster/10008495 | Annotation-efficient honesty alignment via confidence elicitation and calibration |
| 3 | Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models | poster/10008156 | A comprehensive framework for identifying and eliminating repetitive patterns in language models |
| 4 | Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems | poster/10007543 | Measures bias amplification in multi-agent systems |
| 5 | AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models | poster/10011590 | Retention-data-free robust concept erasure from diffusion models |
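Entry 1 pairs activation steering with a feedback controller. The paper's controller design is not described here, so the following is only a toy closed-loop sketch under assumptions of my own: a proportional controller adapts the steering coefficient so that the hidden state's projection onto the steering direction tracks a target value, instead of adding the vector with a fixed weight.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
v = rng.normal(size=d)
v /= np.linalg.norm(v)   # unit steering direction (e.g., a "refusal" direction)

target = 3.0             # desired projection of the hidden state onto v (assumed)
k_p = 0.5                # proportional gain (assumed)
alpha = 0.0              # steering coefficient, adapted online

for step in range(20):
    h = rng.normal(size=d)                  # stand-in for this step's hidden state
    h_steered = h + alpha * v
    error = target - float(h_steered @ v)   # gap between actual and target projection
    alpha += k_p * error                    # proportional feedback update

print(round(alpha, 2))   # alpha settles near `target` minus the mean natural projection
```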
## 10. Related ICLR 2026 Workshops
| # | Workshop | Link |
|---|---|---|
| 1 | Agents in the Wild: Safety, Security, and Beyond | workshop/10000781 |
| 2 | AI for Peace | workshop/10000804 |
| 3 | Algorithmic Fairness Across Alignment Procedures and Agentic Systems | workshop/10000786 |
## Appendix: Safety-Related Oral Papers
Of the 223 Oral papers at ICLR 2026, the following are directly related to LLM safety (most already appear in the sections above):
| # | Paper | Type | Notes |
|---|---|---|---|
| 1 | Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models (UltraBreak) | Oral | The first VLM jailbreak framework to achieve both cross-goal universality and cross-model transferability; forum/T5hD0as3jb |
| 2 | Defending LLMs Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing | Oral | In-decoding safety-awareness probing that exploits the model's internal latent safety signals for early detection |
| 3 | ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack | Oral | An activation-scaling guard that mitigates targeted jailbreak attacks |
| 4 | ASIDE: Architectural Separation of Instructions and Data in Language Models | Oral | Architecture-level separation of instructions and data in language models (against prompt injection) |
| 5 | GAVEL: Towards Rule-Based Safety through Activation Monitoring | Oral | Rule-based safety via activation monitoring |
| 6 | Constitutional Classifiers++: Production-Grade Defenses against Universal Jailbreaks | Oral | Anthropic's production-grade defense against universal jailbreaks (an evolution of Constitutional AI) |
| 7 | Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization | Oral | Mitigates the safety alignment tax via null-space-constrained policy optimization |
| 8 | Time-To-Inconsistency: A Survival Analysis of LLM Robustness to Adversarial Attacks | Oral | A survival analysis of LLM robustness to adversarial attacks |
| 9 | GhostEI-Bench: Do Mobile Agent Resilience to Environmental Injection in Dynamic On-Device Environments? | Oral | A benchmark of mobile agents' resilience to environmental injection in dynamic on-device environments |
| 10 | Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! | Oral | Shows that data used to fine-tune open-source LLMs can be secretly stolen |
| 11 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | Oral | Hallucination-inducing image generation for multimodal LLMs |
Note: ICLR 2026 accepted 5300+ papers (223 Orals). This list was filtered for papers directly related to LLM safety, jailbreak attacks, reasoning safety, alignment, agent safety, and multimodal safety. Some OpenReview links are approximate (found by title search); please defer to the official ICLR virtual site.
A complete list of all 223 Oral papers (with Chinese translations) is available on GitHub: https://github.com/XinyuLiuCs/iclr2026-oral-papers