# ICLR 2026 LLM Safety Papers: A Curated List
Compiled: 2026-04-11
## Overview
ICLR 2026 accepted 5300+ papers (223 of them Orals). This list curates roughly 50 papers directly related to large language model (LLM) safety, spanning both Oral and Poster presentations.
The papers are grouped into 9 thematic categories, plus an appendix of safety-related Orals:
| Category | Count | Notes |
|---|---|---|
| 1. Jailbreak Attacks | 5 | Prompt rewriting, gradient-based optimization, multi-armed bandits, Classical Chinese, and other jailbreak methods |
| 2. Reasoning-Model Safety | 5 | Chain-of-thought hijacking, reasoning-process alignment, robustness to CoT interventions |
| 3. Safety Alignment & Defense | 11 | RL-based safety alignment, reasoning-style defenses, multilingual consistency, in-decoding probing |
| 4. Fine-tuning / Backdoor Attacks | 5 | LoRA backdoors, steganographic malicious fine-tuning, harmful-gradient attenuation defenses |
| 5. Agent Safety | 6 | Decomposition-attack monitoring, control-flow hijacking, agent-to-agent security benchmarks |
| 6. Multimodal Safety | 6 | VLM jailbreak transfer, audio-model jailbreaks, visual backdoor attacks |
| 7. Safety Evaluation & Benchmarks | 2 | Multi-turn jailbreak benchmark, audio trustworthiness benchmark |
| 8. Code / Generation Safety | 5 | Secure code generation, watermarking, deepfake detection |
| 9. Other Related Work | 5 | Activation steering, honesty alignment, bias amplification, concept erasure |
| Appendix: Oral Safety Papers | 11 | Constitutional Classifiers++, ASIDE, UltraBreak, etc. |
Taken together, the safety papers at ICLR 2026 show several trends: ① reasoning-model safety has become a new hotspot, with several papers on chains of thought being hijacked or exploited; ② jailbreak attack and defense have entered a compositional, automated stage, with methods such as dictionary learning and meta-optimization appearing; ③ agent safety is growing fast as an emerging direction, with decomposition attacks and control-flow hijacking drawing attention; ④ safety alignment is moving from shallow to deep, with multiple papers exploring any-depth alignment and reasoning-based alignment.
## 1. Jailbreak Attacks
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges (AMIS) | poster/10008164 | Proposes AMIS, a meta-optimization framework that co-evolves jailbreak prompts and scoring templates, automating jailbreaks via bilevel optimization |
| 2 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (CC-BOS) | openreview | Generates jailbreak prompts with bio-inspired optimization over an 8-dimensional search space grounded in Classical Chinese; reports 100% ASR even against reasoning models |
| 3 | Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks | poster/10009061 | Proposes the "adversarial déjà vu" hypothesis that future jailbreaks are combinations of existing adversarial skill primitives, and uses dictionary learning to improve generalization to unseen attacks |
| 4 | One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | openreview | Studies the generation of robust jailbreak prompts that transfer across LLMs |
| 5 | Efficient Jailbreak Attack Sequences on LLMs via Multi-Armed Bandit-Based Context Switching | openreview | Builds efficient jailbreak attack sequences via multi-armed bandit-based context switching (a toy bandit sketch follows this table) |
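To make the multi-armed bandit framing in entry 5 concrete, here is a minimal sketch of the general idea, not the paper's actual algorithm: each arm is a conversational context the attacker can switch into, a standard UCB1 policy decides which context to try next, and the reward comes from a judge scoring attack progress. The `contexts` list and the `judge_score` stub are hypothetical placeholders.

```python
import math
import random

# Hypothetical arms: each "context" is a conversational framing the attacker can switch to.
contexts = ["roleplay", "translation", "code-review", "historical-fiction"]

counts = [0] * len(contexts)    # times each context was tried
values = [0.0] * len(contexts)  # running mean of observed rewards

def judge_score(context: str) -> float:
    """Placeholder for an attack-progress judge returning a score in [0, 1]."""
    return random.random()  # stand-in; a real loop would query the target model and a grader

def select_arm(t: int) -> int:
    """UCB1: play each arm once, then trade off mean reward against uncertainty."""
    for i, c in enumerate(counts):
        if c == 0:
            return i
    ucb = [values[i] + math.sqrt(2 * math.log(t) / counts[i]) for i in range(len(contexts))]
    return max(range(len(contexts)), key=lambda i: ucb[i])

for t in range(1, 51):  # 50 context switches
    arm = select_arm(t)
    reward = judge_score(contexts[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update

print(max(zip(values, contexts)))  # the most promising context under this bandit
```

In a real attack loop, `judge_score` would query the target model; the bandit simply concentrates queries on the contexts that have historically yielded the most progress.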
## 2. Reasoning & Chain-of-Thought Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention (IPO) | forum/2uTxLC4LmC | Proposes Intervened Preference Optimization (IPO), which replaces compliance steps with safety triggers to align the reasoning process itself |
| 2 | Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check | openreview | An answer-then-check approach to reasoning-based safety alignment |
| 3 | Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training | openreview | Shows that after benign reasoning training, models can reason themselves out of their safety alignment |
| 4 | AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models | poster/10007590 | Adversarial chain-of-thought tuning for robust safety alignment of large reasoning models |
| 5 | Are Reasoning LLMs Robust to Interventions on their Chain-of-Thought? | poster/10008704 | Studies how robust reasoning LLMs are to interventions on their chain-of-thought |
## 3. Safety Alignment & Defense
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning | poster/10011731 | Incentivizes intrinsic safety awareness via extremely simplified RL (binary safety labels and fewer than 200 RL steps), achieving reasoning-style safety alignment |
| 2 | ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning | poster/10009011 | Proposes a three-step reasoning defense pipeline (policy analysis → intent extraction → policy-based safety verification) that reduces ASR on OOD jailbreaks to 0.06 |
| 3 | AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint | poster/10011789 | Learns refusal steering under a principled null-space constraint (a null-space projection sketch follows this table) |
| 4 | A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space | poster/10011231 | A safety guardrail combining a safety-sensitive subspace with a harm-resistant null space |
| 5 | Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth | poster/10011912 | Unlocks LLMs' innate safety alignment from shallow prefixes to arbitrary generation depth |
| 6 | Alignment-Weighted DPO: A Principled Reasoning Approach to Improve Safety Alignment | poster/10009740 | An alignment-weighted DPO method grounded in principled reasoning to improve safety alignment |
| 7 | Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment | poster/10006879 | Align once, benefit multilingually: enforces cross-lingual consistency for LLM safety alignment |
| 8 | Aligning Deep Implicit Preferences by Learning to Reason Defensively | poster/10008837 | Aligns deep implicit preferences by learning to reason defensively |
| 9 | A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models | poster/10009223 | Any-order, any-step safety alignment for diffusion language models |
| 10 | SIRL: Self-Incentivized Reinforcement Learning for Safety (entropy-based safety RL) | openreview | Finds that response entropy is a reliable intrinsic signal of safety, and improves safety via entropy minimization without any external reward |
| 11 | From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | openreview | Turns refusal-aware injection attacks into a tool for safety alignment |
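Several entries above (AlphaSteer, the safety-guardrail paper, and the null-space policy-optimization Oral in the appendix) build on a null-space constraint. Here is a minimal sketch of that shared idea under assumptions of my own (placeholder activation data, an assumed subspace rank of 16), not any single paper's method: estimate the principal subspace of activations on benign prompts, then project a refusal-steering vector onto its orthogonal complement so that steering barely perturbs benign behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: hidden size 64, 500 benign-prompt activations,
# and a raw refusal-steering direction (e.g., mean harmful minus mean benign activation).
d = 64
benign_acts = rng.normal(size=(500, d))  # activations on benign prompts (placeholder data)
steer_raw = rng.normal(size=d)           # raw steering vector (placeholder)

# Principal subspace of benign activations via SVD of the centered matrix.
X = benign_acts - benign_acts.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 16                                   # assumed rank of the "benign" subspace
V = Vt[:k].T                             # (d, k) basis of the benign subspace

# Project the steering vector onto the orthogonal complement (an approximate null space),
# so adding it barely moves activations that lie in the benign subspace.
steer_null = steer_raw - V @ (V.T @ steer_raw)

def apply_steering(h: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the null-space-constrained refusal direction to a hidden state h."""
    return h + alpha * steer_null

# Sanity check: the constrained vector has (numerically) no component along benign directions.
print(np.abs(V.T @ steer_null).max())    # ~1e-15
```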
## 4. Fine-tuning & Backdoor Attacks
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | JailbreakLoRA: Your Downloaded LoRA from Sharing Platforms might be Unsafe | forum/4YgvVRoSnF | Reveals that LoRA adapters downloaded from sharing platforms may contain jailbreak backdoors |
| 2 | Invisible Safety Threat: Malicious Finetuning for LLM via Steganography | poster/10011363 | Malicious fine-tuning via steganography: the model looks safety-aligned on the surface while covertly generating harmful content |
| 3 | Revisiting Backdoor Attacks on LLMs | openreview | Revisits backdoor attacks on LLMs; proposes an implicit poisoning strategy that injects backdoors while preserving safety alignment |
| 4 | Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence | poster/10007199 | Strengthens defenses against harmful fine-tuning by attenuating the influence of harmful gradients (a gradient-attenuation sketch follows this table) |
| 5 | Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study | EPFL team | Safety subspaces are not linearly distinct: a fine-tuning case study |
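As a rough illustration of the "attenuate harmful gradient influence" idea in entry 4, the following is a simplified sketch rather than Antibody's actual method: estimate a per-tensor harmful direction from gradients on known-harmful data, then shrink the component of each fine-tuning gradient that points along it. The direction estimate and the `beta` factor are assumptions made for this example.

```python
import numpy as np

def attenuate_harmful(grad: np.ndarray, harmful_dir: np.ndarray, beta: float = 0.9) -> np.ndarray:
    """Shrink the component of `grad` that points along a known harmful direction.

    grad:        fine-tuning gradient for one parameter tensor (flattened).
    harmful_dir: direction estimated from gradients on known-harmful examples.
    beta:        fraction of the harmful-aligned component to remove
                 (0 = no-op, 1 = full projection out).
    """
    harmful_dir = harmful_dir / np.linalg.norm(harmful_dir)
    coef = grad @ harmful_dir          # alignment with the harmful direction
    if coef <= 0:
        return grad                    # only attenuate gradients pushing toward harm
    return grad - beta * coef * harmful_dir

# Toy usage: a gradient 45 degrees off the harmful direction has its harmful part damped.
g = np.array([1.0, 1.0])
h = np.array([1.0, 0.0])
print(attenuate_harmful(g, h))         # -> [0.1, 1.0] with beta = 0.9
```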
## 5. Agent Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems | openreview | Breaks and fixes defenses against control-flow hijacking in multi-agent systems |
| 2 | RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | openreview | Realistic adversarial testing of computer-use agents in hybrid web-OS environments |
| 3 | Monitoring Decomposition Attacks | openreview | Finds that a lightweight sequential monitor effectively defends against decomposition attacks; contributes a dataset of 4,634 harmful-benign task pairs (a sequential-monitor sketch follows this table) |
| 4 | Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols | poster/10006727 | Adaptive attacks on trusted monitors can subvert AI control protocols |
| 5 | A2ASecBench: A Protocol-Aware Security Benchmark for Agent-to-Agent Multi-Agent Systems | poster/10010017 | A protocol-aware security benchmark for agent-to-agent multi-agent systems |
| 6 | AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? | poster/10007726 | Traces which component induces failure in LLM agentic systems |
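Entry 3 concerns decomposition attacks, where a harmful task is split into individually innocuous-looking subtasks. As a minimal sketch of the general sequential-monitoring pattern (not the paper's monitor; `risk_score`, the toy keyword scorer, and the threshold are all placeholders): score each subtask in the context of its predecessors, and flag once cumulative risk crosses a threshold.

```python
from typing import Callable, Iterable

def sequential_monitor(
    subtasks: Iterable[str],
    risk_score: Callable[[str, list[str]], float],
    threshold: float = 1.0,
) -> bool:
    """Flag a decomposed task if cumulative, context-aware risk exceeds `threshold`.

    risk_score sees the current subtask *and* the history, since decomposition
    attacks hide intent in steps that only look harmful in combination.
    """
    history: list[str] = []
    total = 0.0
    for task in subtasks:
        total += risk_score(task, history)
        history.append(task)
        if total >= threshold:
            return True   # abort: the sequence as a whole looks harmful
    return False

# Toy scorer: a real system would use an LLM judge over (task, history).
def toy_score(task: str, history: list[str]) -> float:
    keywords = {"synthesize": 0.5, "acquire precursor": 0.5, "bypass filter": 0.6}
    return max((v for k, v in keywords.items() if k in task), default=0.05)

print(sequential_monitor(
    ["look up chemistry", "acquire precursor list", "synthesize compound"],
    toy_score,
))  # -> True: benign-looking steps add up
```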
## 6. Multimodal Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety | openreview | Maps the limits of joint multimodal understanding for AI safety |
| 2 | ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks | poster/10006730 | An adaptive red-teaming agent that performs comprehensive risk assessment of multimodal models via plug-and-play attacks |
| 3 | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | openreview | A benchmark of jailbreak vulnerabilities in audio language models: 11,316 text samples + 245,355 audio samples |
| 4 | GuardAlign: Safety Alignment for Vision-Language Models via Optimal Transport | openreview | Optimal-transport-based safety alignment for vision-language models |
| 5 | AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | poster/10010620 | Enhances the adversarial robustness of large vision-language models via preference optimization |
| 6 | BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning | ICLR 2026 Downloads list | Visual backdoor attacks on VLM-based embodied agents via contrastive trigger learning |
## 7. Safety Evaluation & Benchmarks
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | MultiBreak: Scalable Multi-Turn Jailbreak Benchmark | openreview | A scalable multi-turn jailbreak benchmark: 1,724 intents with multi-turn adversarial prompts, covering 9 coarse-grained and 26 fine-grained safety categories |
| 2 | AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models | poster | A benchmark of the multifaceted trustworthiness of audio large language models |
## 8. Code & Generation Safety
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | SecCoderX: Secure Code Generation via Reasoning-Based Vulnerability Reward Model | openreview | Secure code generation via a reasoning-based vulnerability reward model; claimed to be the first to raise secure-code rates by 11-16% without degrading functionality |
| 2 | Analyzing and Evaluating Unbiased Language Model Watermark | poster/10011375 | Analyzes and evaluates unbiased language-model watermarks (a sketch of the generic unbiased-watermark idea follows this table) |
| 3 | An Ensemble Framework for Unbiased Language Model Watermarking | poster/10007956 | An ensemble framework for unbiased language-model watermarking |
| 4 | All Patches Matter: Enhance AI-Generated Image Detection via Panoptic Patch Learning | poster/10007395 | Enhances AI-generated image detection via panoptic patch learning |
| 5 | A Rich Knowledge Space for Scalable Deepfake Detection | poster/10008071 | A rich knowledge space for scalable deepfake detection |
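For background on the "unbiased watermark" entries 2-3: the best-known distortion-free construction is the Gumbel-trick watermark, which seeds a keyed PRNG with a hash of the recent tokens, draws one uniform variate per vocabulary item, and emits argmax r_i^(1/p_i); that choice is distributed exactly according to the model's probabilities, yet a detector holding the key can recompute the variates. The sketch below shows this generic scheme (with an assumed 4-token hash window and a toy key), not the specific constructions in these papers.

```python
import hashlib
import numpy as np

def seeded_uniforms(context: tuple[int, ...], vocab_size: int, key: bytes = b"wm-key") -> np.ndarray:
    """Derive per-token uniforms from a keyed hash of the recent context (assumed window = 4)."""
    h = hashlib.sha256(key + repr(context[-4:]).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "big"))
    return rng.random(vocab_size)

def watermarked_sample(probs: np.ndarray, context: tuple[int, ...]) -> int:
    """Gumbel-trick sampling: argmax_i r_i^(1/p_i) is distributed exactly as `probs`,
    so the watermark is unbiased, yet a keyed detector can recompute r."""
    r = seeded_uniforms(context, len(probs))
    with np.errstate(divide="ignore"):
        scores = np.log(r) / probs      # argmax r^(1/p) == argmax log(r)/p
    return int(np.argmax(scores))

def detect_score(token: int, vocab_size: int, context: tuple[int, ...]) -> float:
    """Detector statistic: watermarked tokens tend to have unusually large r_token."""
    r = seeded_uniforms(context, vocab_size)
    return float(-np.log(1.0 - r[token]))  # ~ Exp(1) under no watermark; large if watermarked

probs = np.array([0.5, 0.3, 0.2])
tok = watermarked_sample(probs, context=(101, 7, 42))
print(tok, detect_score(tok, len(probs), (101, 7, 42)))
```

Since the per-token statistic is Exp(1)-distributed under the null, summing it over a passage gives a simple hypothesis test for the watermark's presence.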
## 9. Other Related Work
| # | Paper | OpenReview Link | Summary |
|---|---|---|---|
| 1 | Activation Steering with a Feedback Controller | poster/10006765 | Activation steering with a feedback controller (representation engineering; a toy closed-loop sketch follows this table) |
| 2 | Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration | poster/10008495 | Annotation-efficient honesty alignment via confidence elicitation and calibration |
| 3 | Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models | poster/10008156 | A comprehensive framework for identifying and eliminating repetitive patterns in language models |
| 4 | Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems | poster/10007543 | Measures bias amplification in multi-agent systems |
| 5 | AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models | poster/10011590 | Retention-data-free robust concept erasure from diffusion models |
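Entry 1 pairs activation steering with a feedback controller. The paper's controller design is not described here, so the following is only a toy closed-loop sketch under assumptions of my own: a proportional controller adapts the steering coefficient so that the hidden state's projection onto the steering direction tracks a target value, instead of adding the vector with a fixed weight.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
v = rng.normal(size=d)
v /= np.linalg.norm(v)   # unit steering direction (e.g., a "refusal" direction)

target = 3.0             # desired projection of the hidden state onto v (assumed)
k_p = 0.5                # proportional gain (assumed)
alpha = 0.0              # steering coefficient, adapted online

for step in range(20):
    h = rng.normal(size=d)                  # stand-in for this step's hidden state
    h_steered = h + alpha * v
    error = target - float(h_steered @ v)   # gap between actual and target projection
    alpha += k_p * error                    # proportional feedback update

print(round(alpha, 2))   # alpha settles near `target` minus the mean natural projection
```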
## 10. Related ICLR 2026 Workshops
| # | Workshop | Link |
|---|---|---|
| 1 | Agents in the Wild: Safety, Security, and Beyond | workshop/10000781 |
| 2 | AI for Peace | workshop/10000804 |
| 3 | Algorithmic Fairness Across Alignment Procedures and Agentic Systems | workshop/10000786 |
## Appendix: Safety-Related Oral Papers
Of the 223 Oral papers at ICLR 2026, the following are directly related to LLM safety (most already appear in the sections above):
| # | Paper | Type | Notes |
|---|---|---|---|
| 1 | Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models (UltraBreak) | Oral | The first VLM jailbreak framework to achieve both cross-goal universality and cross-model transferability; forum/T5hD0as3jb |
| 2 | Defending LLMs Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing | Oral | In-decoding safety-awareness probing that exploits the model's internal latent safety signals for early detection |
| 3 | ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack | Oral | An activation-scaling guard that mitigates targeted jailbreak attacks |
| 4 | ASIDE: Architectural Separation of Instructions and Data in Language Models | Oral | Architecture-level separation of instructions and data in language models (against prompt injection) |
| 5 | GAVEL: Towards Rule-Based Safety through Activation Monitoring | Oral | Rule-based safety via activation monitoring |
| 6 | Constitutional Classifiers++: Production-Grade Defenses against Universal Jailbreaks | Oral | Anthropic's production-grade defense against universal jailbreaks (an evolution of Constitutional AI) |
| 7 | Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization | Oral | Mitigates the safety alignment tax via null-space-constrained policy optimization |
| 8 | Time-To-Inconsistency: A Survival Analysis of LLM Robustness to Adversarial Attacks | Oral | A survival analysis of LLM robustness to adversarial attacks |
| 9 | GhostEI-Bench: Do Mobile Agent Resilience to Environmental Injection in Dynamic On-Device Environments? | Oral | A benchmark of mobile agents' resilience to environmental injection in dynamic on-device environments |
| 10 | Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! | Oral | Shows that data used to fine-tune open-source LLMs can be secretly stolen |
| 11 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | Oral | Hallucination-inducing image generation for multimodal LLMs |
Note: ICLR 2026 accepted 5300+ papers (223 Orals). This list was filtered for papers directly related to LLM safety, jailbreak attacks, reasoning safety, alignment, agent safety, and multimodal safety. Some OpenReview links are approximate (found by title search); please defer to the official ICLR virtual site.
A complete list of all 223 Oral papers (with Chinese translations) is available on GitHub: https://github.com/XinyuLiuCs/iclr2026-oral-papers