EACL 2026: A Roundup of Papers on Large Language Model Safety

Conference: EACL 2026 (19th Conference of the European Chapter of the Association for Computational Linguistics)
Dates: March 24-29, 2026
Location: Rabat, Morocco
Proceedings: ACL Anthology - EACL 2026
Compiled on: April 16, 2026


1. Jailbreak Attacks

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Sarah Ball, Frauke Kreuter, Nina Panickssery | Long Papers |
| 2 | Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Wei Zhao, Zhe Li, Yige Li, Jun Sun | Findings |
| 3 | When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models | Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg | SRW |

2. Adversarial Attacks & Vulnerabilities

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | Teams of LLM Agents can Exploit Zero-Day Vulnerabilities | Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, Daniel Kang | Long Papers |
| 2 | VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy | Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo | Findings |
| 3 | Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents | Daud Waqas, Aaryamaan Golthi, Erika Hayashida, Huanzhi Mao | Industry |
| 4 | Hacking Neural Evaluation Metrics with Single Hub Text | Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai | Short Papers |

3. Safety Defense & Alignment

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | Safety of Large Language Models Beyond English: A Systematic Literature Review of Risks, Biases, and Safeguards | Aleksandra Krasnodębska, Katarzyna Dziewulska, Karolina Seweryn, Maciej Chrabaszcz, Wojciech Kusa | Long Papers |
| 2 | The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs | Omar Mahmoud, Ali Khalil, Thommen George Karimpanal, Buddhika Laknath Semage, Santu Rana | Findings |
| 3 | CodeGuard: Improving LLM Guardrails in CS Education | Nishat Raihan, Noah Erdachew, Jayoti Devi, Joanna C. S. Santos, Marcos Zampieri | Findings |
| 4 | Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety | Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Paweł Walkowiak, Bartosz Żuk, Arkadiusz Janz | SRW |
| 5 | The Clinical Fingerprint: Comparing the Rhetorical Integrity and Epistemic Safety of Human Physicians and Large Language Models | Bayram Ayadi | SRW |
| 6 | Enhancing User Safety: Context-Aware Detection of Offensive Query-Ad Pairs in Multimodal Search Advertising | Gaurav Kumar, Qiangjian Xi, Tanmaya Shekhar Dabral, Hooshang Ghasemi, Abishek Krishnamoorthy, Danqing Fu, Rui Min, Emilio R. Antunez, Zhongli Ding, Pradyumna Narayana | Industry |

4. Harmful Content Detection & Moderation

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Risky Health Behavior Content | Yunze Xiao, Tingyu He, Lionel Z. Wang, Yiming Ma, Xingyu Song, Xiaohang Xu, Mona T. Diab, Irene Li, Ka Chung Ng | Long Papers |
| 2 | Harmful Factuality: LLMs Correcting What They Shouldn't | Mingchen Li, Hanzhi Zhang, Heng Fan, Junhua Ding, Yunhe Feng | Findings |
| 3 | Being Kind Isn't Always Being Safe: Diagnosing Affective Hallucination in LLMs | Sewon Kim, Jiwon Kim, SeungWoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, Hyunsoo Yoon | Findings |
| 4 | When Words Wear Masks: Detecting Malicious Intents and Hostile Impacts of Online Hate Speech | Priyansh Singhal, Piyush Joshi | Short Papers |
| 5 | To Paraphrase or Not: Efficient Comment Detoxification with Unsupervised Detoxifiability Discrimination | Jing Ke, Zheyong Xie, Shaosheng Cao, Tong Xu, Enhong Chen | Short Papers |
| 6 | BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok | Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross M. Sonnenblick, Magdalayna Curry, Laura D'Adamo, Lindsay Young, Stuart Murray, Kristina Lerman | Long Papers |

5. Privacy & Data Security

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | Auditing Language Model Unlearning via Information Decomposition | Anmol Goel, Alan Ritter, Iryna Gurevych | Long Papers |
| 2 | Detecting Training Data of Large Language Models via Expectation Maximization | Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, William Yang Wang | Long Papers |
| 3 | Personal Information Parroting in Language Models | Nishant Subramani, Kshitish Ghate, Mona T. Diab | Findings |
| 4 | The Model's Language Matters: A Comparative Privacy Analysis of LLMs | Abhishek Kumar Mishra, Antoine Boutet, Lucas Magnana | Findings |
| 5 | Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs | Honghao Liu, Xuhui Jiang, Chengjin Xu, Cehao Yang, Yiran Cheng, Lionel Ni, Jian Guo | Findings |
| 6 | OD-Stega: LLM-Based Relatively Secure Steganography via Optimized Distributions | Yu-Shin Huang, Peter Just, Hanyun Yin, Krishna Narayanan, Ruihong Huang, Chao Tian | Long Papers |

6. Bias & Fairness

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | How Quantization Shapes Bias in Large Language Models | Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych | Long Papers |
| 2 | Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs | Zara Siddique, Irtaza Khalid, Liam Turner, Luis Espinosa-Anke | Findings |
| 3 | Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models | David Guzman Piedrahita, Irene Strauss, Rada Mihalcea, Zhijing Jin | Long Papers |
| 4 | Do Political Opinions Transfer Between Western Languages? | Franziska Weeber, Tanise Ceron, Sebastian Padó | Long Papers |
| 5 | Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models | Sumanth Manduru, Carlotta Domeniconi | SRW |
| 6 | Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors | Adnan Al Ali, Jindřich Helcl, Jindřich Libovický | SRW |
| 7 | SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context | Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran, Sunipa Dev | Short Papers |
| 8 | MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment | Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Jolie Nguyen, Shivank Garg, Kevin Zhu, Vasu Sharma | Short Papers |
| 9 | Common Sense or Ableism? Rethinking Commonsense Reasoning Through the Lens of Disability | Karina H Halevy, Kimi Wenzel, Seyun Kim, Kyle Dean Bauer, Bruno Neira, Mona T. Diab, Maarten Sap | Short Papers |
| 10 | On the Interplay between Human Label Variation and Model Fairness | Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau | Findings |

7. Misinformation & Manipulation

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media | Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa Cotovio, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo | Long Papers |
| 2 | Entity-aware Cross-lingual Claim Detection for Automated Fact-checking | Rrubaa Panchendrarajan, Arkaitz Zubiaga | Findings |
| 3 | ART: Adaptive Reasoning Trees for Explainable Claim Verification | Sahil Wadhwa, Himanshu Kumar, Guanqun Yang, Abbaas Alif Mohamed Nishar, Pranab Mohanty, Swapnil Shinde, Yue Wu | Findings |
| 4 | Fake News Detection Strategies under Dataset Bias: Using Large-scale Coarse-grained Labels | Yuki Kishi, Yuji Arima, Hitoshi Iyatomi | SRW |
| 5 | Tailoring Rumor Debunking to You: Diversifying Chinese Rumor-Debunking Passages with an LLM-Driven Simulated Feedback-Enhanced Framework | Xinle Pang, Danding Wang, Qiang Sheng, Yifan Sun, Beizhe Hu, Juan Cao | Industry |

8. Agent & Multi-Agent Safety

| # | Title | Authors | Venue |
|---|---|---|---|
| 1 | MAPS: A Multilingual Benchmark for Agent Performance and Security | Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, Roman Vainshtein | Findings |
| 2 | The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems | Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi | Industry |
| 3 | Don't Trust Generative Agents to Mimic Communication on Social Networks | (see ACL Anthology) | Long Papers |

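Summary Statistics

| Category | Number of Papers |
|---|---|
| Jailbreak Attacks | 3 |
| Adversarial Attacks & Vulnerabilities | 4 |
| Safety Defense & Alignment | 6 |
| Harmful Content Detection & Moderation | 6 |
| Privacy & Data Security | 6 |
| Bias & Fairness | 10 |
| Misinformation & Manipulation | 5 |
| Agent & Multi-Agent Safety | 3 |
| Total | 43 |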

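Note: Some papers may span multiple categories; each is listed here under its primary research direction. Venue labels: Long Papers = main conference long papers (Vol. 1), Short Papers = main conference short papers (Vol. 2), Findings = Findings of ACL, SRW = Student Research Workshop (Vol. 4), Industry = Industry Track (Vol. 5).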
