AAAI 2026: A Roundup of Papers on Large Language Model Safety

Master index, LLM Safety Research Papers (2026 edition): https://blog.csdn.net/WhiffeYF/article/details/159047894

Statistics: 27 papers collected from AAAI 2026 (the 40th conference, January 2026, Singapore)

Main sources: the AI Alignment special track (Vol. 40 No. 44) and the main technical track (NLP / ML, etc.)

Category overview:

  • Jailbreak attack methods: 10 papers
  • Safety defense and alignment: 10 papers
  • Safety evaluation and benchmarks: 4 papers
  • Privacy and data security: 2 papers
  • Agent safety: 1 paper

1 Jailbreak Attack Methods

ID | Paper | Track | Link
1 | MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs | AI Alignment | PDF
2 | Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment | AI Alignment | PDF
3 | HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor | AI Alignment | PDF
4 | StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak | AI Alignment | PDF
5 | STACK: Adversarial Attacks on LLM Safeguard Pipelines | AI Alignment | PDF
6 | Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment | AI Alignment | PDF
7 | Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models | AI Alignment | PDF
8 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Main Track (NLP) | PDF
9 | Multi-Turn Jailbreaking Large Language Models via Attention Shifting | Main Track (NLP) | PDF
10 | Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (CognitiveAttack) | Main Track (NLP) | PDF

2 Safety Defense and Alignment

ID | Paper | Track | Link
1 | AlignTree: Efficient Defense Against LLM Jailbreak Attacks | AI Alignment | PDF
2 | EASE: Practical and Efficient Safety Alignment for Small Language Models | AI Alignment | PDF
3 | Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training | AI Alignment | PDF
4 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | AI Alignment | PDF
5 | CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing | AI Alignment | PDF
6 | WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety | Main Track (NLP) | PDF
7 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | AI Alignment | PDF
8 | Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks | AI Alignment | PDF
9 | MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting | Main Track (NLP) | dblp
10 | AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs | Main Track (NLP) | dblp

3 Safety Evaluation and Benchmarks

ID | Paper | Track | Link
1 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models | AI Alignment | PDF
2 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | AI Alignment | PDF
3 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Main Track | PDF
4 | Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding | AI Alignment | PDF

4 Privacy and Data Security

ID | Paper | Track | Link
1 | CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense | AI Alignment | PDF
2 | Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models | AI Alignment | PDF

5 Agent Safety

ID | Paper | Track | Link
1 | Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems | AI Alignment | PDF

6 Other Related Papers (Alignment Theory / Reasoning Safety / Hallucination Detection / Interpretability, etc.)

The following papers do not fall directly under "attack/defense" but are closely related to LLM safety:

ID | Paper | Topic | Link
1 | DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | Reasoning redundancy / overthinking | PDF
2 | Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning | Reasoning safety / CoT reliability | PDF
3 | Bolster Hallucination Detection via Prompt-Guided Data Augmentation | Hallucination detection | PDF
4 | Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models | Hallucination / reliability | PDF
5 | Silenced Biases: The Dark Side LLMs Learned to Refuse | Alignment side effects / over-refusal | PDF
6 | Beyond I'm Sorry, I Can't: Dissecting Large-Language-Model Refusal | Refusal mechanism analysis | PDF
7 | Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation | Alignment failure caused by fine-tuning | PDF
8 | AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment | Backdoor attack | PDF
9 | Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? | Machine unlearning | PDF
10 | Polarity-Aware Probing for Quantifying Latent Alignment in Language Models | Alignment interpretability | PDF
11 | FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight | Flawed-reasoning detection | PDF
12 | Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning | Multimodal backdoor attack | PDF
13 | Security Attacks on LLM-based Code Completion Tools | Code tool security | PDF
14 | MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control | Agent safety evaluation | PDF
15 | MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies | Jailbreak attack | dblp
16 | From Chaos to Cure: A Prefix Heuristics Guided Model-Agnostic Adaptive Detoxification Framework | Detoxification defense | dblp
17 | An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains | Supply-chain backdoor | AAAI 2026
