EACL 2026: A Roundup of Papers on LLM Safety

Conference: EACL 2026 (19th Conference of the European Chapter of the Association for Computational Linguistics)
Dates: March 24-29, 2026
Location: Rabat, Morocco
Proceedings: ACL Anthology - EACL 2026
Compiled: April 16, 2026


1. Jailbreak Attacks

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Sarah Ball, Frauke Kreuter, Nina Panickssery | Long Papers |
| 2 | Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Wei Zhao, Zhe Li, Yige Li, Jun Sun | Findings |
| 3 | When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models | Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg | SRW |

2. Adversarial Attacks & Vulnerabilities

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | Teams of LLM Agents can Exploit Zero-Day Vulnerabilities | Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, Daniel Kang | Long Papers |
| 2 | VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy | Yu Cui, Sicheng Pan, Yifei Liu, Haibin Zhang, Cong Zuo | Findings |
| 3 | Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents | Daud Waqas, Aaryamaan Golthi, Erika Hayashida, Huanzhi Mao | Industry |
| 4 | Hacking Neural Evaluation Metrics with Single Hub Text | Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai | Short Papers |

3. Safety Defense & Alignment

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | Safety of Large Language Models Beyond English: A Systematic Literature Review of Risks, Biases, and Safeguards | Aleksandra Krasnodębska, Katarzyna Dziewulska, Karolina Seweryn, Maciej Chrabaszcz, Wojciech Kusa | Long Papers |
| 2 | The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs | Omar Mahmoud, Ali Khalil, Thommen George Karimpanal, Buddhika Laknath Semage, Santu Rana | Findings |
| 3 | CodeGuard: Improving LLM Guardrails in CS Education | Nishat Raihan, Noah Erdachew, Jayoti Devi, Joanna C. S. Santos, Marcos Zampieri | Findings |
| 4 | Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety | Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Paweł Walkowiak, Bartosz Żuk, Arkadiusz Janz | SRW |
| 5 | The Clinical Fingerprint: Comparing the Rhetorical Integrity and Epistemic Safety of Human Physicians and Large Language Models | Bayram Ayadi | SRW |
| 6 | Enhancing User Safety: Context-Aware Detection of Offensive Query-Ad Pairs in Multimodal Search Advertising | Gaurav Kumar, Qiangjian Xi, Tanmaya Shekhar Dabral, Hooshang Ghasemi, Abishek Krishnamoorthy, Danqing Fu, Rui Min, Emilio R. Antunez, Zhongli Ding, Pradyumna Narayana | Industry |

4. Harmful Content Detection & Moderation

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Risky Health Behavior Content | Yunze Xiao, Tingyu He, Lionel Z. Wang, Yiming Ma, Xingyu Song, Xiaohang Xu, Mona T. Diab, Irene Li, Ka Chung Ng | Long Papers |
| 2 | Harmful Factuality: LLMs Correcting What They Shouldn't | Mingchen Li, Hanzhi Zhang, Heng Fan, Junhua Ding, Yunhe Feng | Findings |
| 3 | Being Kind Isn't Always Being Safe: Diagnosing Affective Hallucination in LLMs | Sewon Kim, Jiwon Kim, SeungWoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, Hyunsoo Yoon | Findings |
| 4 | When Words Wear Masks: Detecting Malicious Intents and Hostile Impacts of Online Hate Speech | Priyansh Singhal, Piyush Joshi | Short Papers |
| 5 | To Paraphrase or Not: Efficient Comment Detoxification with Unsupervised Detoxifiability Discrimination | Jing Ke, Zheyong Xie, Shaosheng Cao, Tong Xu, Enhong Chen | Short Papers |
| 6 | BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok | Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross M. Sonnenblick, Magdalayna Curry, Laura DAdamo, Lindsay Young, Stuart Murray, Kristina Lerman | Long Papers |

5. Privacy & Data Security

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | Auditing Language Model Unlearning via Information Decomposition | Anmol Goel, Alan Ritter, Iryna Gurevych | Long Papers |
| 2 | Detecting Training Data of Large Language Models via Expectation Maximization | Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, William Yang Wang | Long Papers |
| 3 | Personal Information Parroting in Language Models | Nishant Subramani, Kshitish Ghate, Mona T. Diab | Findings |
| 4 | The Model's Language Matters: A Comparative Privacy Analysis of LLMs | Abhishek Kumar Mishra, Antoine Boutet, Lucas Magnana | Findings |
| 5 | Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs | Honghao Liu, Xuhui Jiang, Chengjin Xu, Cehao Yang, Yiran Cheng, Lionel Ni, Jian Guo | Findings |
| 6 | OD-Stega: LLM-Based Relatively Secure Steganography via Optimized Distributions | Yu-Shin Huang, Peter Just, Hanyun Yin, Krishna Narayanan, Ruihong Huang, Chao Tian | Long Papers |

6. Bias & Fairness

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | How Quantization Shapes Bias in Large Language Models | Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych | Long Papers |
| 2 | Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs | Zara Siddique, Irtaza Khalid, Liam Turner, Luis Espinosa-Anke | Findings |
| 3 | Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models | David Guzman Piedrahita, Irene Strauss, Rada Mihalcea, Zhijing Jin | Long Papers |
| 4 | Do Political Opinions Transfer Between Western Languages? | Franziska Weeber, Tanise Ceron, Sebastian Padó | Long Papers |
| 5 | Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models | Sumanth Manduru, Carlotta Domeniconi | SRW |
| 6 | Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors | Adnan Al Ali, Jindřich Helcl, Jindřich Libovický | SRW |
| 7 | SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context | Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran, Sunipa Dev | Short Papers |
| 8 | MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment | Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Jolie Nguyen, Shivank Garg, Kevin Zhu, Vasu Sharma | Short Papers |
| 9 | Common Sense or Ableism? Rethinking Commonsense Reasoning Through the Lens of Disability | Karina H Halevy, Kimi Wenzel, Seyun Kim, Kyle Dean Bauer, Bruno Neira, Mona T. Diab, Maarten Sap | Short Papers |
| 10 | On the Interplay between Human Label Variation and Model Fairness | Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau | Findings |

7. Misinformation & Manipulation

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media | Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa Cotovio, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo | Long Papers |
| 2 | Entity-aware Cross-lingual Claim Detection for Automated Fact-checking | Rrubaa Panchendrarajan, Arkaitz Zubiaga | Findings |
| 3 | ART: Adaptive Reasoning Trees for Explainable Claim Verification | Sahil Wadhwa, Himanshu Kumar, Guanqun Yang, Abbaas Alif Mohamed Nishar, Pranab Mohanty, Swapnil Shinde, Yue Wu | Findings |
| 4 | Fake News Detection Strategies under Dataset Bias: Using Large-scale Coarse-grained Labels | Yuki Kishi, Yuji Arima, Hitoshi Iyatomi | SRW |
| 5 | Tailoring Rumor Debunking to You: Diversifying Chinese Rumor-Debunking Passages with an LLM-Driven Simulated Feedback-Enhanced Framework | Xinle Pang, Danding Wang, Qiang Sheng, Yifan Sun, Beizhe Hu, Juan Cao | Industry |

8. Agent & Multi-Agent Safety

| # | Title | Authors | Source |
|---|-------|---------|--------|
| 1 | MAPS: A Multilingual Benchmark for Agent Performance and Security | Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, Roman Vainshtein | Findings |
| 2 | The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems | Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi | Industry |
| 3 | Don't Trust Generative Agents to Mimic Communication on Social Networks | (see ACL Anthology) | Long Papers |

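Summary Statistics

| Category | Papers |
|----------|--------|
| Jailbreak Attacks | 3 |
| Adversarial Attacks & Vulnerabilities | 4 |
| Safety Defense & Alignment | 6 |
| Harmful Content Detection & Moderation | 6 |
| Privacy & Data Security | 6 |
| Bias & Fairness | 10 |
| Misinformation & Manipulation | 5 |
| Agent & Multi-Agent Safety | 3 |
| Total | 43 |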

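Note: Some papers may span multiple categories; each is listed here under its primary research direction. Source labels: Long Papers = main conference long papers (Vol. 1), Short Papers = main conference short papers (Vol. 2), Findings = Findings of the ACL: EACL 2026, SRW = Student Research Workshop (Vol. 4), Industry = Industry Track (Vol. 5).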
