ICLR 2026 LLM Safety Papers: A Curated List

Compiled: 2026-04-11

Overview

ICLR 2026 accepted 5,300+ papers (223 of them Orals). From these, this post curates roughly 50 papers directly related to large language model (LLM) safety, spanning both Oral and Poster presentations.

They are grouped into nine thematic categories:

| Category | Count | Notes |
|----------|-------|-------|
| 1. Jailbreak attacks | 5 | Jailbreak methods based on prompt rewriting, gradient optimization, multi-armed bandits, Classical Chinese, etc. |
| 2. Reasoning-model safety | 5 | Chain-of-thought hijacking, reasoning-process alignment, robustness to CoT interventions |
| 3. Safety alignment & defense | 11 | RL-based safety alignment, reasoning-based defense, multilingual consistency, in-decoding probing |
| 4. Fine-tuning / backdoor attacks | 5 | LoRA backdoors, steganographic malicious fine-tuning, harmful-gradient attenuation defense |
| 5. Agent safety | 6 | Decomposition-attack monitoring, control-flow hijacking, agent-to-agent security benchmarks |
| 6. Multimodal safety | 6 | VLM jailbreak transfer, audio-model jailbreaks, visual backdoor attacks |
| 7. Safety evaluation & benchmarks | 2 | Multi-turn jailbreak benchmark, audio trustworthiness benchmark |
| 8. Code / generation safety | 5 | Secure code generation, watermarking, deepfake detection |
| 9. Other related | 5 | Activation steering, honesty alignment, bias amplification, concept erasure |
| Appendix: Oral safety papers | 11 | Constitutional Classifiers++, ASIDE, UltraBreak, etc. |

Overall, the safety papers at ICLR 2026 show several trends: ① reasoning-model safety has become a new hotspot, with multiple papers on chains of thought being hijacked or exploited; ② jailbreak attack and defense have entered a compositional, automated phase, with methods based on dictionary learning and meta-optimization emerging; ③ agent safety is growing rapidly as a new direction, with decomposition attacks and control-flow hijacking drawing attention; ④ safety alignment is moving from shallow to deep, with several papers exploring any-depth alignment and reasoning-based alignment.


1 Jailbreak Attacks

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges (AMIS) | poster/10008164 | Proposes the AMIS meta-optimization framework, which jointly evolves jailbreak prompts and scoring templates through bilevel optimization to automate jailbreaks |
| 2 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search (CC-BOS) | openreview | Searches an 8-dimensional space of Classical Chinese phrasings with bio-inspired optimization to generate jailbreak prompts, reaching 100% ASR even against reasoning models |
| 3 | Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks | poster/10009061 | Proposes the "adversarial déjà vu" hypothesis that future jailbreaks are compositions of existing adversarial skill primitives, and uses dictionary learning to improve generalization to unseen attacks |
| 4 | One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | openreview | Studies the generation of robust jailbreak prompts that transfer across LLMs |
| 5 | Efficient Jailbreak Attack Sequences on LLMs via Multi-Armed Bandit-Based Context Switching | openreview | Builds efficient jailbreak attack sequences via multi-armed-bandit-based context switching (see the sketch after this table) |
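
To make the bandit framing in entry 5 concrete, here is a minimal sketch of how a UCB1 bandit could allocate a fixed attack budget across candidate conversation contexts. Everything here is an illustrative assumption: the context names, the binary success reward, and the toy success probabilities are made up, and the paper's actual arm and reward design is not described in this list.

```python
import math
import random

# Hypothetical context templates the bandit chooses among.
CONTEXTS = ["roleplay", "translation", "historical", "debug-mode"]

class UCB1:
    """UCB1 bandit over attack contexts."""
    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}
        self.t = 0

    def select(self):
        self.t += 1
        # Play each arm once before applying the UCB rule.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        def ucb(a):
            bonus = math.sqrt(2 * math.log(self.t) / self.counts[a])
            return self.values[a] + bonus
        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean of the rewards observed for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def attack_success(context):
    # Stand-in for querying the target model; toy success rates.
    return random.random() < {"roleplay": 0.1, "translation": 0.3,
                              "historical": 0.2, "debug-mode": 0.05}[context]

bandit = UCB1(CONTEXTS)
for _ in range(200):
    ctx = bandit.select()
    bandit.update(ctx, 1.0 if attack_success(ctx) else 0.0)
print(max(bandit.values, key=bandit.values.get))  # most promising context
```

The bandit concentrates its remaining budget on whichever context has been paying off, which is what makes a sequence of attacks cheaper than exhaustively retrying every context.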

2 Reasoning & Chain-of-Thought Safety

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention (IPO) | forum/2uTxLC4LmC | Proposes Intervened Preference Optimization (IPO), which aligns the reasoning process itself by replacing compliant steps with safety triggers |
| 2 | Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check | openreview | An answer-then-check reasoning approach to safety alignment and jailbreak defense |
| 3 | Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training | openreview | Shows that after benign reasoning training, models can reason themselves out of their safety alignment |
| 4 | AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models | poster/10007590 | Adversarial chain-of-thought tuning to make the safety alignment of large reasoning models more robust |
| 5 | Are Reasoning LLMs Robust to Interventions on their Chain-of-Thought? | poster/10008704 | Studies how robust reasoning LLMs are to interventions on their chain-of-thought |

3 Safety Alignment & Defense

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning | poster/10011731 | Incentivizes intrinsic safety awareness with extremely simplified RL (only binary safety labels and fewer than 200 RL steps), yielding reasoning-style safety alignment |
| 2 | ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning | poster/10009011 | A three-step reasoning defense pipeline (policy analysis → intent extraction → policy-grounded safety verification) that cuts ASR on OOD jailbreak attacks to 0.06 |
| 3 | AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint | poster/10011789 | Learns refusal steering under a principled null-space constraint |
| 4 | A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space | poster/10011231 | A safety guardrail combining a safety-sensitive subspace with a harmful-resistant null space |
| 5 | Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth | poster/10011912 | Unlocks LLMs' innate safety alignment from shallow prefixes to arbitrary depth |
| 6 | Alignment-Weighted DPO: A Principled Reasoning Approach to Improve Safety Alignment | poster/10009740 | Improves safety alignment with a principled, reasoning-based weighting of DPO |
| 7 | Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment | poster/10006879 | Enforces cross-lingual consistency so that a single alignment pass benefits all languages |
| 8 | Aligning Deep Implicit Preferences by Learning to Reason Defensively | poster/10008837 | Aligns deep implicit preferences by learning to reason defensively |
| 9 | A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models | poster/10009223 | Any-order, any-step safety alignment for diffusion language models |
| 10 | SIRL: Self-Incentivized Reinforcement Learning for Safety (entropy-based safety RL) | openreview | Finds that response entropy is a reliable intrinsic safety signal and improves safety by minimizing it, without any external reward (see the sketch after this table) |
| 11 | From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | openreview | Turns refusal-aware injection attacks into a tool for safety alignment |
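
To illustrate the intrinsic signal behind entry 10, here is a minimal sketch of measuring the mean token-level entropy of a response from its logits. The SIRL training loop is not reproduced; this only shows how such an entropy signal could be computed, and the random logits below are a stand-in for real model outputs.

```python
import numpy as np

def response_entropy(logits: np.ndarray) -> float:
    """Mean per-token entropy (in nats) over a response.

    logits: array of shape (seq_len, vocab_size), one row per generated
    token. SIRL's claim is that lower response entropy correlates with
    safer behavior, making it usable as an intrinsic reward.
    """
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    token_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(token_entropy.mean())

# Toy stand-in for the logits of a 10-token response over a 50k vocab.
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(10, 50_000))
print(f"mean response entropy: {response_entropy(fake_logits):.3f} nats")
```

An entropy-minimization objective would then use the negated entropy as the reward in an RL loop, with no external safety judge in sight.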

4 Fine-Tuning / Backdoor Attacks

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | JailbreakLoRA: Your Downloaded LoRA from Sharing Platforms might be Unsafe | forum/4YgvVRoSnF | Shows that LoRA adapters downloaded from sharing platforms may carry jailbreak backdoors |
| 2 | Invisible Safety Threat: Malicious Finetuning for LLM via Steganography | poster/10011363 | Malicious fine-tuning via steganography: the model looks safety-aligned on the surface while covertly generating harmful content |
| 3 | Revisiting Backdoor Attacks on LLMs | openreview | Revisits backdoor attacks on LLMs and proposes an implicit poisoning strategy that injects backdoors while preserving safety alignment |
| 4 | Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence | poster/10007199 | Strengthens defense against harmful fine-tuning by attenuating the influence of harmful gradients (see the sketch after this table) |
| 5 | Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study | EPFL team | A fine-tuning case study showing that safety subspaces are not linearly distinct |
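
The gradient-attenuation idea behind entry 4 can be illustrated with a toy projection: estimate a "harmful" gradient direction on an anchor set, then dampen the component of each fine-tuning update that aligns with it. This is a generic sketch of that family of defenses, not Antibody's actual algorithm; the direction estimate and attenuation factor here are invented for illustration.

```python
import numpy as np

def attenuate_harmful(update: np.ndarray,
                      harmful_grad: np.ndarray,
                      alpha: float = 0.9) -> np.ndarray:
    """Dampen the component of `update` along the harmful direction.

    alpha=1.0 removes the harmful component entirely (orthogonal
    projection); alpha=0.0 leaves the update unchanged.
    """
    h = harmful_grad / (np.linalg.norm(harmful_grad) + 1e-12)
    harmful_component = np.dot(update, h) * h
    return update - alpha * harmful_component

rng = np.random.default_rng(1)
g_task = rng.normal(size=512)   # fine-tuning gradient (toy)
g_harm = rng.normal(size=512)   # gradient on a harmful anchor set (toy)
g_safe = attenuate_harmful(g_task, g_harm)
# The harmful-aligned component shrinks by exactly the factor alpha.
print(np.dot(g_safe, g_harm) / np.dot(g_task, g_harm))  # ≈ 1 - alpha
```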

5 Agent Safety

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems | openreview | Attacks on, and fixes for, defenses against control-flow hijacking in multi-agent systems |
| 2 | RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | openreview | Realistic adversarial testing of computer-use agents in hybrid web-OS environments |
| 3 | Monitoring Decomposition Attacks | openreview | Finds that a lightweight sequential monitor defends effectively against decomposition attacks; releases a dataset of 4,634 harmful-benign task pairs (see the sketch after this table) |
| 4 | Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols | poster/10006727 | Shows that adaptive attacks on trusted monitors can subvert AI control protocols |
| 5 | A2ASecBench: A Protocol-Aware Security Benchmark for Agent-to-Agent Multi-Agent Systems | poster/10010017 | A protocol-aware security benchmark for agent-to-agent multi-agent systems |
| 6 | AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? | poster/10007726 | Traces which component is inducing failures in LLM agentic systems |
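
A decomposition attack splits one harmful goal into subtasks that each look benign in isolation. Below is a minimal sketch of the sequential-monitoring idea from entry 3, under loud assumptions: the per-history risk scorer, keyword list, and threshold are all hypothetical stand-ins, not the paper's monitor. The point is only that scoring the accumulated history, rather than each step alone, lets intent that emerges across steps be caught.

```python
from typing import Callable, List

def sequential_monitor(steps: List[str],
                       risk_fn: Callable[[List[str]], float],
                       threshold: float = 0.7) -> int:
    """Flag the first step at which the accumulated context looks harmful.

    risk_fn scores the *entire history so far* in [0, 1]. Returns the
    1-based index of the flagged step, or -1 if the task passes.
    """
    for i in range(len(steps)):
        if risk_fn(steps[: i + 1]) >= threshold:
            return i + 1
    return -1

# Toy risk scorer: counts suspicious keywords across the whole history.
SUSPICIOUS = {"synthesize", "precursor", "bypass", "detonate"}
def toy_risk(history: List[str]) -> float:
    hits = sum(w in s.lower() for s in history for w in SUSPICIOUS)
    return min(1.0, hits / 3)

task = ["list common lab glassware",
        "explain how to synthesize compound X",
        "how to bypass purchase checks for the precursor"]
print(sequential_monitor(task, toy_risk))  # flags step 3
```

No single step here crosses the threshold on its own; only the running history does, which is exactly the failure mode a per-step monitor misses.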

6 Multimodal Safety

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety | openreview | Maps the limits of joint multimodal understanding for AI safety |
| 2 | ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks | poster/10006730 | An adaptive red-teaming agent that runs comprehensive risk assessments of multimodal models with plug-and-play attacks |
| 3 | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | openreview | A benchmark of jailbreak vulnerabilities in audio language models: 11,316 text samples plus 245,355 audio samples |
| 4 | GuardAlign: Safety Alignment for Vision-Language Models via Optimal Transport | openreview | Safety alignment for vision-language models via optimal transport |
| 5 | AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | poster/10010620 | Enhances the adversarial robustness of large vision-language models with preference optimization |
| 6 | BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning | ICLR 2026 Downloads list | Visual backdoor attacks on VLM-based embodied agents via contrastive trigger learning |

7 Safety Evaluation & Benchmarks

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | MultiBreak: Scalable Multi-Turn Jailbreak Benchmark | openreview | A scalable multi-turn jailbreak benchmark: 1,724 intents with multi-turn adversarial prompts, covering 9 coarse and 26 fine-grained safety categories |
| 2 | AudioTrust: Benchmarking The Multifaceted Trustworthiness of Audio Large Language Models | poster | A benchmark of the multifaceted trustworthiness of audio LLMs |

8 Code & Generation Safety

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | SecCoderX: Secure Code Generation via Reasoning-Based Vulnerability Reward Model | openreview | Secure code generation via a reasoning-based vulnerability reward model; the first to raise security rates by 11-16% without hurting functionality |
| 2 | Analyzing and Evaluating Unbiased Language Model Watermark | poster/10011375 | Analyzes and evaluates unbiased language-model watermarks (see the sketch after this table) |
| 3 | An Ensemble Framework for Unbiased Language Model Watermarking | poster/10007956 | An ensemble framework for unbiased language-model watermarking |
| 4 | All Patches Matter: Enhance AI-Generated Image Detection via Panoptic Patch Learning | poster/10007395 | Improves AI-generated image detection via panoptic patch learning |
| 5 | A Rich Knowledge Space for Scalable Deepfake Detection | poster/10008071 | A rich knowledge space for scalable deepfake detection |
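
As background for entries 2 and 3: one well-known construction of an unbiased (distortion-free) watermark is the Gumbel-trick sampler, where each token is chosen as argmax_i u_i^(1/p_i) with uniforms u_i drawn from a PRNG keyed on the secret and the recent context. Averaged over keys this leaves the sampling distribution unchanged, yet the key holder can detect the statistical bias in the chosen u values. The sketch below illustrates that textbook construction, not the specific schemes analyzed in either paper; the hash-based PRNG is a toy stand-in (a real scheme would use a cryptographic PRF).

```python
import numpy as np

KEY = 1234  # secret shared by generator and detector

def keyed_uniforms(context: tuple, vocab: int) -> np.ndarray:
    # Toy keyed PRNG over (key, recent token ids); a real scheme
    # would use a cryptographic PRF here.
    seed = hash((KEY, context)) % (2**32)
    return np.random.default_rng(seed).random(vocab)

def watermarked_sample(probs: np.ndarray, context: tuple) -> int:
    # Gumbel-trick sampling: argmax_i u_i^(1/p_i). Marginalized over
    # the key, the chosen token follows `probs` exactly (unbiased).
    u = keyed_uniforms(context, len(probs))
    return int(np.argmax(u ** (1.0 / np.maximum(probs, 1e-12))))

def detection_score(tokens, contexts, vocab: int) -> float:
    # Under no watermark, u at each chosen token is Uniform(0,1), so
    # -log(1-u) averages to 1. Watermarked text pushes u toward 1,
    # so the mean score drifts above 1 as the text gets longer.
    scores = [-np.log(1.0 - keyed_uniforms(c, vocab)[t] + 1e-12)
              for t, c in zip(tokens, contexts)]
    return float(np.mean(scores))

rng = np.random.default_rng(7)
probs = rng.dirichlet(np.ones(1000))   # toy next-token distribution
ctx = (42, 17, 3)                      # last few token ids as context
token = watermarked_sample(probs, ctx)
print(detection_score([token], [ctx], vocab=1000))
```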

9 Other Related Papers

| # | Paper | OpenReview Link | Summary |
|---|-------|-----------------|---------|
| 1 | Activation Steering with a Feedback Controller | poster/10006765 | Activation steering driven by a feedback controller (representation engineering) |
| 2 | Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration | poster/10008495 | Annotation-efficient honesty alignment via confidence elicitation and calibration |
| 3 | Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models | poster/10008156 | A framework for identifying and eliminating repetitive patterns in language models |
| 4 | Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems | poster/10007543 | Measures bias amplification in multi-agent systems |
| 5 | AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models | poster/10011590 | Adversarial target-guided, retention-data-free robust concept erasure from diffusion models |

10 Related ICLR 2026 Workshops

| # | Workshop | Link |
|---|----------|------|
| 1 | Agents in the Wild: Safety, Security, and Beyond | workshop/10000781 |
| 2 | AI for Peace | workshop/10000804 |
| 3 | Algorithmic Fairness Across Alignment Procedures and Agentic Systems | workshop/10000786 |


Appendix: Safety-Related Oral Papers

Of the 223 Oral papers at ICLR 2026, the following are the ones directly related to LLM safety (most already appear in the categories above):

| # | Paper | Type | Notes |
|---|-------|------|-------|
| 1 | Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models (UltraBreak) | Oral | The first VLM jailbreak framework to achieve both cross-target universality and cross-model transferability (forum/T5hD0as3jb) |
| 2 | Defending LLMs Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing | Oral | An in-decoding safety-awareness probing defense that uses the model's internal latent safety signals for early detection |
| 3 | ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack | Oral | An activation-scaling guard that mitigates targeted jailbreak attacks |
| 4 | ASIDE: Architectural Separation of Instructions and Data in Language Models | Oral | Architecture-level separation of instructions and data in language models (against prompt injection) |
| 5 | GAVEL: Towards Rule-Based Safety through Activation Monitoring | Oral | Rule-based safety via activation monitoring |
| 6 | Constitutional Classifiers++: Production-Grade Defenses against Universal Jailbreaks | Oral | Anthropic's production-grade defense against universal jailbreaks (a successor to Constitutional Classifiers) |
| 7 | Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization | Oral | Mitigates the safety alignment tax via null-space constrained policy optimization |
| 8 | Time-To-Inconsistency: A Survival Analysis of LLM Robustness to Adversarial Attacks | Oral | A survival analysis of LLM robustness to adversarial attacks |
| 9 | GhostEI-Bench: Do Mobile Agent Resilience to Environmental Injection in Dynamic On-Device Environments? | Oral | A benchmark for mobile agents' resilience to environmental injection in dynamic on-device environments |
| 10 | Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! | Oral | Shows that data used to fine-tune open-source LLMs can be secretly stolen |
| 11 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | Oral | Hallucination-inducing image generation for multimodal LLMs |

Note: ICLR 2026 accepted 5,300+ papers (223 Orals). This list curates the ones directly related to LLM safety, jailbreak attacks, reasoning safety, alignment, agent safety, multimodal safety, and so on. Some OpenReview links are approximate (found by title search); refer to the official ICLR virtual site for the authoritative versions.

A complete list of all 223 Oral papers (with Chinese translations) is available on GitHub: https://github.com/XinyuLiuCs/iclr2026-oral-papers

人工智能·算法·ai·知识图谱