AAAI 2026: A Collection of Papers on Large Language Model Safety

Master index — Large Language Model Safety Research Papers, 2026 edition: https://blog.csdn.net/WhiffeYF/article/details/159047894

Statistics: 27 papers collected in total, all from AAAI 2026 (the 40th AAAI conference, January 2026, Singapore).

Main sources: the AI Alignment special track (Vol. 40, No. 44) and the main technical tracks (NLP, ML, etc.).

Category overview:

  • Jailbreak attack methods: 10 papers
  • Safety defense & alignment: 10 papers
  • Safety evaluation & benchmarks: 4 papers
  • Privacy & data security: 2 papers
  • Agent safety: 1 paper

1 Jailbreak Attack Methods

| id | Paper | Track | Link |
|---|---|---|---|
| 1 | MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs | AI Alignment | PDF |
| 2 | Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment | AI Alignment | PDF |
| 3 | HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor | AI Alignment | PDF |
| 4 | StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak | AI Alignment | PDF |
| 5 | STACK: Adversarial Attacks on LLM Safeguard Pipelines | AI Alignment | PDF |
| 6 | Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment | AI Alignment | PDF |
| 7 | Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models | AI Alignment | PDF |
| 8 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Main Track (NLP) | PDF |
| 9 | Multi-Turn Jailbreaking Large Language Models via Attention Shifting | Main Track (NLP) | PDF |
| 10 | Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (CognitiveAttack) | Main Track (NLP) | PDF |

2 Safety Defense & Alignment

| id | Paper | Track | Link |
|---|---|---|---|
| 1 | AlignTree: Efficient Defense Against LLM Jailbreak Attacks | AI Alignment | PDF |
| 2 | EASE: Practical and Efficient Safety Alignment for Small Language Models | AI Alignment | PDF |
| 3 | Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training | AI Alignment | PDF |
| 4 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | AI Alignment | PDF |
| 5 | CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing | AI Alignment | PDF |
| 6 | WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety | Main Track (NLP) | PDF |
| 7 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | AI Alignment | PDF |
| 8 | Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks | AI Alignment | PDF |
| 9 | MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting | Main Track (NLP) | dblp |
| 10 | AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs | Main Track (NLP) | dblp |

3 Safety Evaluation & Benchmarks

| id | Paper | Track | Link |
|---|---|---|---|
| 1 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models | AI Alignment | PDF |
| 2 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | AI Alignment | PDF |
| 3 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Main Track | PDF |
| 4 | Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding | AI Alignment | PDF |

4 Privacy & Data Security

| id | Paper | Track | Link |
|---|---|---|---|
| 1 | CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense | AI Alignment | PDF |
| 2 | Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models | AI Alignment | PDF |

5 Agent Safety

| id | Paper | Track | Link |
|---|---|---|---|
| 1 | Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems | AI Alignment | PDF |

6 Other Related Papers (Alignment Theory / Reasoning Safety / Hallucination Detection / Interpretability, etc.)

The following papers do not fall squarely into the "attack/defense" categories above, but are closely related to large language model safety:

| id | Paper | Topic | Link |
|---|---|---|---|
| 1 | DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | Reasoning redundancy / overthinking | PDF |
| 2 | Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning | Reasoning safety / CoT reliability | PDF |
| 3 | Bolster Hallucination Detection via Prompt-Guided Data Augmentation | Hallucination detection | PDF |
| 4 | Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models | Hallucination / reliability | PDF |
| 5 | Silenced Biases: The Dark Side LLMs Learned to Refuse | Alignment side effects / over-refusal | PDF |
| 6 | Beyond I'm Sorry, I Can't: Dissecting Large-Language-Model Refusal | Refusal mechanism analysis | PDF |
| 7 | Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation | Misalignment from fine-tuning | PDF |
| 8 | AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment | Backdoor attacks | PDF |
| 9 | Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? | Machine unlearning | PDF |
| 10 | Polarity-Aware Probing for Quantifying Latent Alignment in Language Models | Alignment interpretability | PDF |
| 11 | FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight | Flawed-reasoning detection | PDF |
| 12 | Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning | Multimodal backdoor attacks | PDF |
| 13 | Security Attacks on LLM-based Code Completion Tools | Code-tool security | PDF |
| 14 | MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control | Agent safety evaluation | PDF |
| 15 | MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies | Jailbreak attacks | dblp |
| 16 | From Chaos to Cure: A Prefix Heuristics Guided Model-Agnostic Adaptive Detoxification Framework | Detoxification defense | dblp |
| 17 | An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains | Supply-chain backdoors | AAAI 2026 |
