Master index of LLM-related research: https://blog.csdn.net/WhiffeYF/article/details/142132328
| id | Paper | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Larger and more instructable language models become less reliable | 2024 | 1 | Nature |
| 2 | CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | 2024 | A | ACL |
| 3 | Advancing LLM Safe Alignment with Safety Representation Ranking | 2025 | N/A | ICML Workshop |
| 4 | Improved Generation of Adversarial Examples Against Safety-aligned LLMs | 2024 | A | NeurIPS |
| 5 | From text to multimodal: a survey of adversarial example generation in question answering systems | 2024 | 4 | Knowledge and Information Systems |
| 6 | Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? | 2025 | B | NAACL |
| 7 | Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback | 2025 | B | Information Processing & Management |
| 8 | Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | 2025 | N/A | arXiv |
https://www.doubao.com/chat/19972227671365378
https://chatgpt.com/c/68c1da62-7fc0-832e-9934-b6a7ef381fd9
Question-generation safety research is important (several directly usable research ideas):
- **Attack surface (how to make a model generate "wrong / biased / harmful" questions)**
  Use adversarial generation methods (discrete token substitution, gradient-guided heuristic search, prompt jailbreaking, many-shot induction) to construct questions whose outputs contain factual errors, implicit bias, or encouragement of dangerous behavior; a search sketch follows this item. References: the NeurIPS 2024 paper and the adversarial-example survey:
  - Improved Generation of Adversarial Examples Against Safety-aligned LLMs
  - From text to multimodal: a survey of adversarial example generation in question answering systems
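A minimal hill-climbing sketch of the discrete-substitution idea above. This is a naive illustration, not the gradient-based method of the NeurIPS 2024 paper; `generate` (a question-generating LLM call) and `harm_score` (a safety classifier returning a number in [0, 1]) are assumed callables supplied by the caller:

```python
import random

def mutate(tokens: list[str], vocab: list[str]) -> list[str]:
    """Replace one randomly chosen token with a candidate from the vocabulary."""
    out = tokens[:]
    out[random.randrange(len(out))] = random.choice(vocab)
    return out

def substitution_attack(prompt: str, generate, harm_score,
                        vocab: list[str], steps: int = 200) -> str:
    """Greedy search: keep a substitution whenever it raises the harm score."""
    best = prompt.split()
    best_score = harm_score(generate(" ".join(best)))
    for _ in range(steps):
        cand = mutate(best, vocab)
        score = harm_score(generate(" ".join(cand)))
        if score > best_score:  # accept only improving substitutions
            best, best_score = cand, score
    return " ".join(best)
```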
- **Generation-verification inconsistency (a model can generate questions it cannot answer itself)**
  Design a generate-then-verify pipeline (generate a question → have the same or a different model check whether it is answerable and whether it admits multiple correct answers), and use this pipeline to flag "erroneous" questions; a minimal sketch follows this item.
  - Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
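A minimal generate-then-verify sketch, assuming two chat-style callables `generator` and `verifier` (hypothetical wrappers around LLM APIs, not a method from the cited paper). A question is flagged as suspect when an independent answer disagrees with the generator's own reference answer:

```python
def generate_and_verify(topic: str, generator, verifier) -> dict:
    """Generate a question, then cross-check its answerability with a second model."""
    question = generator(f"Write one exam question about {topic}.")
    reference = generator(f"Answer this question concisely:\n{question}")
    attempt = verifier(f"Answer this question concisely:\n{question}")
    verdict = verifier(
        "Do these two answers agree? Reply YES or NO.\n"
        f"A: {reference}\nB: {attempt}"
    )
    return {"question": question,
            "answerable": verdict.strip().upper().startswith("YES")}
```

Running the loop with `verifier = generator` tests self-consistency; passing a different model tests cross-model answerability.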
- **Constructing and detecting biased questions**
  Build bias-question templates (gender / race / class / culturally sensitive topics), expand them with syntactic and semantic transformations (borrowing JADE-style methods), and evaluate the systematic bias different models expose when generating questions; a template sketch follows this item.
  - Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback
  - Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
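A minimal sketch of template-based probe construction with one counterfactual contrast pair. The templates and term lists are toy placeholders, and this is not the JADE transformation method itself:

```python
from itertools import product

# Toy placeholders; real probes would cover many demographic axes and be
# expanded further with syntactic/semantic transformations.
TEMPLATES = ["Write a quiz question about a {group} {role}."]
GROUPS = ("male", "female")   # one counterfactual contrast pair
ROLES = ("nurse", "engineer")

def bias_probes():
    """Yield (group, role, prompt) triples for counterfactual comparison."""
    for tpl, group, role in product(TEMPLATES, GROUPS, ROLES):
        yield group, role, tpl.format(group=group, role=role)
```

Each probe pair differs only in the group term, so any systematic difference in the generated questions (measured by a classifier or human labels) is attributable to the swapped attribute.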
- **Evaluation metrics and datasets**
  Usable metrics: error rate (factual / logical), unanswerable rate, harmfulness score (automatic plus human labels), bias strength (differential statistics across groups), option / answer-position bias, and interpretability measures; an aggregation sketch follows this item.
  - RobustQA: A Framework for Adversarial Text Generation Analysis on Question Answering Systems
  - From text to multimodal: a survey of adversarial example generation in question answering systems
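A sketch of how these metrics might be aggregated from per-question labels. The record schema below (keys such as `has_error` and `gold_position`) is an illustrative assumption, not a format defined by RobustQA or the survey:

```python
def summarize(records: list[dict]) -> dict:
    """Aggregate per-question labels into the metrics listed above."""
    n = len(records)
    position_freq = [
        sum(r["gold_position"] == p for r in records) / n
        for p in ("A", "B", "C", "D")
    ]
    return {
        "error_rate":        sum(r["has_error"] for r in records) / n,
        "unanswerable_rate": sum(r["unanswerable"] for r in records) / n,
        "harmfulness_mean":  sum(r["harm_score"] for r in records) / n,
        # deviation of the most frequent gold-option slot from uniform (0.25)
        "position_bias":     max(position_freq) - 0.25,
    }
```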
- **Defense ideas**
  Add an automatic verifier (cross-validation with a QA model), content filters (toxicity / safety classifiers), controllable generation (constrained prompts / planning), and adversarial training to the generation pipeline to improve robustness; a pipeline sketch follows this item.
  - Adversarial and Safely Scaled Question Generation
  - Planning First, Question Second: An LLM-Guided Method for Controllable Question Generation
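A sketch of a defended pipeline combining the filter and verifier ideas. `generator`, `toxicity`, and `qa_model` are hypothetical callables (an LLM wrapper and a safety classifier), not components named in the cited papers; the 0.5 threshold and retry count are arbitrary:

```python
def safe_generate(topic: str, generator, toxicity, qa_model,
                  tox_threshold: float = 0.5, max_tries: int = 3):
    """Generate a question, reject toxic ones, and keep only answerable ones."""
    for _ in range(max_tries):
        question = generator(f"Write one exam question about {topic}.")
        if toxicity(question) >= tox_threshold:
            continue                    # content filter: discard and retry
        if qa_model(question).strip():  # cross-check: a QA model can answer it
            return question
    return None                         # nothing safe and answerable was found
```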
https://www.kimi.com/chat/d30tvahdjjpv13ulon10
https://chat.deepseek.com/a/chat/s/09a1c365-03c7-4296-8546-99f27789331a
https://www.doubao.com/chat/19971271987401474
https://www.kimi.com/chat/d30u32le09n7a07jghi0
https://chatgpt.com/c/68c1e19a-20a8-8331-9cc6-b852094af4a3
https://www.doubao.com/chat/20004266728986114
https://www.kimi.com/chat/d316mldm2cimqrokh2e0
https://chatgpt.com/c/68c262d0-ffbc-832a-baf2-08647ad9b0cd
A Survey on Neural Question Generation: Methods, Applications, and Prospects