Safety of LLM (Question) Generation

Master index of LLM-related research: https://blog.csdn.net/WhiffeYF/article/details/142132328

id	Paper	Year	Rank	Journal/Conference
1	Larger and more instructable language models become less reliable	2024	1	Nature
2	CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion	2024	A	ACL
3	Advancing LLM Safe Alignment with Safety Representation Ranking	2025	None	ICML Workshop
4	Improved Generation of Adversarial Examples Against Safety-aligned LLMs	2024	A	NeurIPS
5	From text to multimodal: a survey of adversarial example generation in question answering systems	2024	4	Knowledge and Information Systems
6	Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?	2025	B	NAACL
7	Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback	2025	B	Information Processing & Management
8	Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations	2025	None	arXiv

https://www.doubao.com/chat/19972227671365378
https://chatgpt.com/c/68c1da62-7fc0-832e-9934-b6a7ef381fd9

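Why question-generation safety research matters (a few directly usable research ideas)

Attack surface (how to make a model generate "wrong/biased/harmful" questions)

Use adversarial generation methods (discrete token substitution, gradient-guided heuristic search, prompt jailbreaks, multi-example induction) to construct questions so that the generated output contains factual errors, carries implicit bias, or encourages dangerous behavior. References: the NeurIPS 2024 paper and the adversarial example survey.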
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
From text to multimodal: a survey of adversarial example generation in question answering systems
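Generation-verification inconsistency (a model may generate questions that it cannot answer itself)

Design a "generate-then-verify" pipeline (generate a question → have the same model or a different model check whether it is answerable and whether it has more than one correct answer), and use that pipeline to flag "erroneous questions".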
Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
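
The generate-then-verify idea above can be sketched roughly as follows. This is only a minimal illustration, not the method of the cited paper: the Verifier interface, the label names, and the two toy verifiers are all assumptions made here, and in practice each verifier would wrap a real QA model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: a verifier maps a question to an answer string,
# or returns None when it cannot (or declines to) answer.
Verifier = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    question: str
    answers: List[Optional[str]]
    label: str  # "ok", "unanswerable", or "ambiguous"


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def verify_question(question: str, verifiers: List[Verifier]) -> Verdict:
    """Ask every verifier and label the question from their agreement."""
    answers = [verifier(question) for verifier in verifiers]
    distinct = {normalize(a) for a in answers if a is not None}
    if not distinct:
        label = "unanswerable"   # no verifier produced an answer
    elif len(distinct) > 1:
        label = "ambiguous"      # verifiers disagree: possibly several valid answers
    else:
        label = "ok"             # every verifier that answered gave the same answer
    return Verdict(question, answers, label)


# Toy stand-ins for real models (assumption: lookup tables instead of LLM calls).
def toy_verifier_a(question: str) -> Optional[str]:
    return {"What is 2 + 2?": "4", "Name a prime number.": "2"}.get(question)


def toy_verifier_b(question: str) -> Optional[str]:
    return {"What is 2 + 2?": "4", "Name a prime number.": "7"}.get(question)


questions = [
    "What is 2 + 2?",
    "Name a prime number.",
    "What colour is the number seven?",
]
verdicts = [verify_question(q, [toy_verifier_a, toy_verifier_b]) for q in questions]
for verdict in verdicts:
    print(verdict.label, "|", verdict.question)
```

Note that disagreement between verifiers may also mean that one verifier is simply wrong, so an "ambiguous" label is a signal for human review rather than proof that the question has several correct answers.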
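Constructing and detecting biased questions

Build biased-question templates (gender, race, class, culturally sensitive topics), expand them through syntactic and semantic transformations (borrowing from JADE-style methods), and evaluate the systematic bias that different models reveal when generating questions.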
Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
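
A rough sketch of the template-expansion step and a per-group comparison. Everything named here is a placeholder chosen for illustration: the templates, the slot values, the generate and score callables, and the max-minus-min gap are assumptions, and this does not reproduce JADE or either cited benchmark. A real setup would plug in an LLM for generate and a bias or toxicity classifier for score.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, List

# Illustrative templates and slot values (assumptions, not from any benchmark).
TEMPLATES = [
    "Write an exam question about a {group} {role}.",
    "Write a word problem in which a {group} {role} makes a decision.",
]
SLOTS = {
    "group": ["male", "female"],
    "role": ["nurse", "engineer"],
}


def expand(templates: List[str], slots: Dict[str, List[str]]) -> List[dict]:
    """Fill every template with every combination of slot values."""
    keys = sorted(slots)
    rows = []
    for template in templates:
        for values in product(*(slots[k] for k in keys)):
            filling = dict(zip(keys, values))
            rows.append({"prompt": template.format(**filling), **filling})
    return rows


def group_means(rows: List[dict],
                generate: Callable[[str], str],
                score: Callable[[str], float],
                attribute: str = "group") -> Dict[str, float]:
    """Average the score of generated questions per value of `attribute`."""
    buckets = defaultdict(list)
    for row in rows:
        question = generate(row["prompt"])
        buckets[row[attribute]].append(score(question))
    return {value: sum(s) / len(s) for value, s in buckets.items()}


# Toy stand-ins: echo the prompt, and score by a keyword (assumptions).
def toy_generate(prompt: str) -> str:
    return "Q: " + prompt


def toy_score(question: str) -> float:
    return 1.0 if "decision" in question else 0.0


rows = expand(TEMPLATES, SLOTS)
means = group_means(rows, toy_generate, toy_score)
gap = max(means.values()) - min(means.values())
print(len(rows), means, gap)
```

Because every prompt differs from its counterpart only in the group slot, a non-zero gap between group means points at the group term itself rather than at the topic of the template.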
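Evaluation metrics and datasets

Candidate metrics: error rate (factual/logical), unanswerable rate, harmfulness score (automatic plus human labels), bias strength (differential statistics across groups), option/answer position bias, interpretability measures, and so on.
RobustQA: A Framework for Adversarial Text Generation Analysis on Question Answering Systems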
From text to multimodal: a survey of adversarial example generation in question answering systems
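
A small sketch of how a few of these metrics could be computed from labelled records. The record fields (wrong, unanswerable, answer_pos) are hypothetical, and measuring position bias as the total variation distance from a uniform distribution is one possible choice made here, not a definition taken from the papers above.

```python
from collections import Counter
from typing import List


def rate(flags: List[bool]) -> float:
    """Fraction of True flags; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def position_bias(correct_positions: List[str], options: str = "ABCD") -> float:
    """Total variation distance between the observed distribution of
    correct-answer positions and the uniform distribution over `options`.
    0.0 means perfectly balanced; larger values mean stronger position bias."""
    n = len(correct_positions)
    if n == 0:
        return 0.0
    counts = Counter(correct_positions)
    uniform = 1 / len(options)
    return 0.5 * sum(abs(counts.get(o, 0) / n - uniform) for o in options)


# Hypothetical labelled records for four generated multiple-choice questions.
records = [
    {"wrong": False, "unanswerable": False, "answer_pos": "A"},
    {"wrong": True,  "unanswerable": False, "answer_pos": "A"},
    {"wrong": False, "unanswerable": True,  "answer_pos": "B"},
    {"wrong": False, "unanswerable": False, "answer_pos": "A"},
]

report = {
    "error_rate": rate([r["wrong"] for r in records]),
    "unanswerable_rate": rate([r["unanswerable"] for r in records]),
    "position_bias": position_bias([r["answer_pos"] for r in records]),
}
print(report)
```

Harmfulness and bias-strength scores are left out because they depend on an external classifier or on human labels.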
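Defense ideas

Add an "automatic verifier" to the generation pipeline (cross-checking with QA models), content filters (toxicity / safety classifiers), controllable generation (constrained prompts / planning), and adversarial training to improve robustness.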
Adversarial and Safely Scaled Question Generation
Planning First, Question Second: An LLM-Guided Method for Controllable Question Generation
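
One way to wire the filter and the verifier into a generation loop, as a sketch only. The Check interface, the blocklist filter (a crude stand-in for a real toxicity or safety classifier), and the retry policy are assumptions made for illustration and are not taken from the papers listed above.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a check returns a rejection reason, or None to accept.
Check = Callable[[str], Optional[str]]


def blocklist_filter(blocked_terms: List[str]) -> Check:
    """Crude placeholder for a safety classifier: reject on any blocked term."""
    def check(question: str) -> Optional[str]:
        lowered = question.lower()
        for term in blocked_terms:
            if term in lowered:
                return f"blocked term: {term}"
        return None
    return check


def answerability_check(answer_fn: Callable[[str], Optional[str]]) -> Check:
    """Reject questions that the verifier model cannot answer."""
    def check(question: str) -> Optional[str]:
        if answer_fn(question) is None:
            return "verifier could not answer"
        return None
    return check


def guarded_generate(generate: Callable[[str, int], str],
                     checks: List[Check],
                     prompt: str,
                     max_attempts: int = 3
                     ) -> Tuple[Optional[str], List[Tuple[str, List[str]]]]:
    """Regenerate until a candidate passes every check or attempts run out.
    Returns the accepted question (or None) and the log of rejections."""
    rejections = []
    for attempt in range(max_attempts):
        candidate = generate(prompt, attempt)
        reasons = [r for r in (check(candidate) for check in checks) if r is not None]
        if not reasons:
            return candidate, rejections
        rejections.append((candidate, reasons))
    return None, rejections


# Toy stand-ins for the generator and the verifier (assumptions).
CANDIDATES = [
    "Describe how to cheat on this exam.",
    "What colour is the number seven?",
    "What is 2 + 2?",
]


def toy_generate(prompt: str, attempt: int) -> str:
    return CANDIDATES[attempt % len(CANDIDATES)]


def toy_answer(question: str) -> Optional[str]:
    return {"What is 2 + 2?": "4"}.get(question)


checks = [blocklist_filter(["cheat"]), answerability_check(toy_answer)]
accepted, rejections = guarded_generate(toy_generate, checks, "arithmetic quiz")
print(accepted)
for candidate, reasons in rejections:
    print("rejected:", candidate, reasons)
```

Keeping the rejection log is deliberate: the rejected candidates and their reasons are the kind of data that could later feed adversarial training.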

https://www.kimi.com/chat/d30tvahdjjpv13ulon10

https://chat.deepseek.com/a/chat/s/09a1c365-03c7-4296-8546-99f27789331a

https://www.doubao.com/chat/19971271987401474

https://www.kimi.com/chat/d30u32le09n7a07jghi0

https://chatgpt.com/c/68c1e19a-20a8-8331-9cc6-b852094af4a3

https://www.doubao.com/chat/20004266728986114

https://www.kimi.com/chat/d316mldm2cimqrokh2e0

https://chatgpt.com/c/68c262d0-ffbc-832a-baf2-08647ad9b0cd

A Survey on Neural Question Generation: Methods, Applications,