AAAI 2026 LLM-Safety Papers: A Curated List

Master index — LLM Safety Research Paper Collections, 2026 Edition: https://blog.csdn.net/WhiffeYF/article/details/159047894

Overview: 27 papers collected from AAAI 2026 (the 40th AAAI Conference on Artificial Intelligence, January 2026, Singapore).

Main sources: the AI Alignment Special Track (Vol. 40, No. 44) and the main technical tracks (NLP, ML, etc.).

Breakdown by category:

  • Jailbreak attacks: 10 papers
  • Defense & alignment: 10 papers
  • Benchmarks & evaluation: 4 papers
  • Privacy & data security: 2 papers
  • Agent safety: 1 paper
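The per-category counts above should sum to the headline total of 27; a quick sanity check in plain Python, with the numbers taken directly from the breakdown:

```python
# Per-category paper counts, copied from the breakdown above.
counts = {
    "Jailbreak attacks": 10,
    "Defense & alignment": 10,
    "Benchmarks & evaluation": 4,
    "Privacy & data security": 2,
    "Agent safety": 1,
}

total = sum(counts.values())
print(total)  # -> 27, matching the headline count
```

Note that the 17 "other related" papers in Section 6 are listed separately and are not part of this total.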

1 Jailbreak Attacks

| ID | Paper | Track | Link |
|----|-------|-------|------|
| 1 | MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs | AI Alignment | PDF |
| 2 | Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment | AI Alignment | PDF |
| 3 | HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor | AI Alignment | PDF |
| 4 | StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak | AI Alignment | PDF |
| 5 | STACK: Adversarial Attacks on LLM Safeguard Pipelines | AI Alignment | PDF |
| 6 | Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment | AI Alignment | PDF |
| 7 | Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models | AI Alignment | PDF |
| 8 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Main Track (NLP) | PDF |
| 9 | Multi-Turn Jailbreaking Large Language Models via Attention Shifting | Main Track (NLP) | PDF |
| 10 | Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (CognitiveAttack) | Main Track (NLP) | PDF |

2 Defense & Alignment

| ID | Paper | Track | Link |
|----|-------|-------|------|
| 1 | AlignTree: Efficient Defense Against LLM Jailbreak Attacks | AI Alignment | PDF |
| 2 | EASE: Practical and Efficient Safety Alignment for Small Language Models | AI Alignment | PDF |
| 3 | Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training | AI Alignment | PDF |
| 4 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | AI Alignment | PDF |
| 5 | CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing | AI Alignment | PDF |
| 6 | WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety | Main Track (NLP) | PDF |
| 7 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | AI Alignment | PDF |
| 8 | Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks | AI Alignment | PDF |
| 9 | MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting | Main Track (NLP) | dblp |
| 10 | AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs | Main Track (NLP) | dblp |

3 Benchmarks & Evaluation

| ID | Paper | Track | Link |
|----|-------|-------|------|
| 1 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models | AI Alignment | PDF |
| 2 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | AI Alignment | PDF |
| 3 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Main Track | PDF |
| 4 | Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding | AI Alignment | PDF |

4 Privacy & Data Security

| ID | Paper | Track | Link |
|----|-------|-------|------|
| 1 | CoSPED: Consistent Soft Prompt Targeted Data Extraction and Defense | AI Alignment | PDF |
| 2 | Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models | AI Alignment | PDF |

5 Agent Safety

| ID | Paper | Track | Link |
|----|-------|-------|------|
| 1 | Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems | AI Alignment | PDF |

6 Other Related Papers (Alignment Theory / Reasoning Safety / Hallucination Detection / Interpretability, etc.)

The following papers do not fall squarely under "attack/defense," but are closely related to LLM safety:

| ID | Paper | Topic | Link |
|----|-------|-------|------|
| 1 | DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | Over-reasoning / overthinking | PDF |
| 2 | Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning | Reasoning safety / CoT reliability | PDF |
| 3 | Bolster Hallucination Detection via Prompt-Guided Data Augmentation | Hallucination detection | PDF |
| 4 | Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models | Hallucination / reliability | PDF |
| 5 | Silenced Biases: The Dark Side LLMs Learned to Refuse | Alignment side effects / over-refusal | PDF |
| 6 | Beyond I'm Sorry, I Can't: Dissecting Large-Language-Model Refusal | Refusal-mechanism analysis | PDF |
| 7 | Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation | Misalignment from fine-tuning | PDF |
| 8 | AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment | Backdoor attacks | PDF |
| 9 | Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? | Machine unlearning | PDF |
| 10 | Polarity-Aware Probing for Quantifying Latent Alignment in Language Models | Alignment interpretability | PDF |
| 11 | FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight | Flawed-reasoning detection | PDF |
| 12 | Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning | Multimodal backdoor attacks | PDF |
| 13 | Security Attacks on LLM-based Code Completion Tools | Code-tool security | PDF |
| 14 | MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control | Agent safety evaluation | PDF |
| 15 | MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies | Jailbreak attacks | dblp |
| 16 | From Chaos to Cure: A Prefix Heuristics Guided Model-Agnostic Adaptive Detoxification Framework | Detoxification defense | dblp |
| 17 | An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains | Supply-chain backdoors | AAAI 2026 |
