Safety of LLM (Question) Generation

Master index, LLM-related research: https://blog.csdn.net/WhiffeYF/article/details/142132328

ID | Paper | Year | Rank | Venue
1 | Larger and more instructable language models become less reliable | 2024 | 1 | Nature
2 | CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | 2024 | A | ACL
3 | Advancing LLM Safe Alignment with Safety Representation Ranking | 2025 | - | ICML Workshop
4 | Improved Generation of Adversarial Examples Against Safety-aligned LLMs | 2024 | A | NeurIPS
5 | From text to multimodal: a survey of adversarial example generation in question answering systems | 2024 | 4 | Knowledge and Information Systems
6 | Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? | 2025 | B | NAACL
7 | Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback | 2025 | B | Information Processing & Management
8 | Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | 2025 | - | arXiv

https://www.doubao.com/chat/19972227671365378
https://chatgpt.com/c/68c1da62-7fc0-832e-9934-b6a7ef381fd9

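Why question-generation safety research matters (a few research ideas that can be used directly)

  1. Attack surface (how to make a model generate "wrong / biased / harmful" questions)

    Use adversarial generation methods (discrete token substitution, gradient-guided heuristic search, prompt jailbreaks, multi-example induction) to construct questions so that the generated output contains factual errors, implicit bias, or encouragement of dangerous behavior. References: the NeurIPS 2024 paper and the adversarial-example survey.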
    Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    From text to multimodal: a survey of adversarial example generation in question answering systems
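
As a rough illustration of the discrete token substitution idea, the sketch below enumerates single-token variants of a prompt and greedily keeps the variant that a scoring function rates highest. It is a toy red-teaming probe, not the method of either paper above: `substitution_variants`, `greedy_search`, the substitution table, and `toy_score` are all hypothetical names, and in a real study the score would come from generating a question with the target model and judging that output.

```python
from typing import Callable, Dict, List, Tuple


def substitution_variants(prompt: str, substitutions: Dict[str, List[str]]) -> List[str]:
    """Return every prompt obtained by swapping exactly one token."""
    tokens = prompt.split()
    variants = []
    for i, tok in enumerate(tokens):
        for alt in substitutions.get(tok.lower(), []):
            variants.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return variants


def greedy_search(prompt: str,
                  substitutions: Dict[str, List[str]],
                  score: Callable[[str], float],
                  steps: int = 3) -> Tuple[str, float]:
    """Greedy hill climbing: at each step keep the best-scoring variant."""
    best, best_score = prompt, score(prompt)
    for _ in range(steps):
        candidates = substitution_variants(best, substitutions)
        if not candidates:
            break
        top = max(candidates, key=score)
        if score(top) <= best_score:
            break  # no variant improves the score, stop early
        best, best_score = top, score(top)
    return best, best_score


# Assumption: a placeholder score that counts absolutist wording.
# A real evaluation would judge the question the model generates.
def toy_score(prompt: str) -> float:
    return float(sum(w in {"all", "always"} for w in prompt.lower().split()))


table = {"some": ["all"], "often": ["always"]}
seed = "Write a question about why some drivers often speed"
adv_prompt, adv_score = greedy_search(seed, table, toy_score)
print(adv_prompt, adv_score)
```

The greedy loop is only one of many possible search strategies; gradient-guided methods replace the exhaustive enumeration with candidate tokens ranked by a gradient signal.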

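  2. Generation-verification inconsistency (a model can generate questions that it cannot answer itself)

    Design a generate-then-verify pipeline (generate a question → have the same model or a different model check whether it is answerable and whether it has more than one correct answer), and use that pipeline to identify faulty questions.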
    Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
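
The generate-then-verify flow can be sketched as follows, assuming each verifier model is wrapped as a plain callable that returns an answer string or `None` when it cannot answer. The names `Answerer` and `verify_question`, the three labels, and the `min_agreement` threshold are illustrative assumptions, not part of any cited paper.

```python
from collections import Counter
from typing import Callable, List, Optional

# Assumption: a verifier model is wrapped as a function that returns
# an answer string, or None when it declines or cannot answer.
Answerer = Callable[[str], Optional[str]]


def verify_question(question: str,
                    answerers: List[Answerer],
                    min_agreement: float = 0.6) -> str:
    """Label a generated question as 'ok', 'ambiguous', or 'unanswerable'."""
    answers = [a(question) for a in answerers]
    given = [x.strip().lower() for x in answers if x]
    if not given:
        return "unanswerable"  # no verifier produced an answer
    _, count = Counter(given).most_common(1)[0]
    if count / len(answerers) >= min_agreement:
        return "ok"  # enough verifiers agree on one answer
    return "ambiguous"  # verifiers disagree, possibly several valid answers


# Stub verifiers standing in for real model calls.
agree = [lambda q: "Paris", lambda q: " paris", lambda q: None]
split = [lambda q: "A", lambda q: "B", lambda q: "C"]
silent = [lambda q: None, lambda q: None]

print(verify_question("What is the capital of France?", agree))
print(verify_question("Which option is best?", split))
print(verify_question("What did I eat yesterday?", silent))
```

Disagreement between verifiers is only a proxy: it flags candidates for the "multiple correct answers" and "faulty question" cases, and a human review step would still be needed to tell them apart.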

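  3. Constructing and detecting biased questions

    Build biased-question templates (gender, race, class, culturally sensitive topics), expand them through syntactic and semantic transformations (borrowing from JADE-style methods), and evaluate the systematic bias that different models reveal when generating questions.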
    Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback
    Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
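
A minimal sketch of template expansion and a between-group comparison is given below. It only covers slot filling, not the syntactic or semantic transformations mentioned above, and the slot names, the example scores, and the helpers `expand_templates` and `group_gap` are hypothetical. The per-question scores are assumed to come from some external judge (a classifier or human raters).

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Tuple


def expand_templates(templates: List[str],
                     slots: Dict[str, List[str]]) -> List[Tuple[Dict[str, str], str]]:
    """Fill every template with every combination of its slot values."""
    out = []
    for t in templates:
        names = [n for n in slots if "{" + n + "}" in t]
        for combo in product(*(slots[n] for n in names)):
            mapping = dict(zip(names, combo))
            out.append((mapping, t.format(**mapping)))
    return out


def group_gap(scores_by_group: Dict[str, List[float]]) -> float:
    """Largest difference between per-group mean scores (0 means no measured gap)."""
    means = [mean(v) for v in scores_by_group.values()]
    return max(means) - min(means)


templates = ["Write an exam question about a {role} who is a {group}."]
slots = {"role": ["nurse", "engineer"], "group": ["woman", "man"]}
prompts = expand_templates(templates, slots)
for mapping, text in prompts:
    print(mapping["group"], "|", text)

# Assumption: made-up bias scores for questions generated from each group's prompts.
example_scores = {"woman": [0.2, 0.4], "man": [0.1, 0.1]}
print(group_gap(example_scores))
```

Keeping the slot mapping next to each prompt makes it easy to group the generated questions by demographic value afterwards.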

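  4. Evaluation metrics and datasets

    Usable metrics: error rate (factual/logical), unanswerable rate, harmfulness score (automatic plus human labels), bias strength (statistics on between-group differences), option/answer position bias, interpretability measures, and so on.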
    RobustQA: A Framework for Adversarial Text Generation Analysis on Question Answering Systems

    From text to multimodal: a survey of adversarial example generation in question answering systems
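
Three of the metrics listed above (error rate, unanswerable rate, and answer-position bias) can be computed from labeled records as in this sketch. The `QuestionRecord` fields and the definition of position bias as the largest deviation from a uniform share are assumptions made for illustration; other definitions (for example a chi-square statistic) are equally reasonable.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class QuestionRecord:
    has_error: bool                       # factual or logical error (human or automatic label)
    answerable: bool                      # whether a verifier could answer it
    correct_option: Optional[str] = None  # e.g. "A".."D" for multiple-choice items


def error_rate(records: List[QuestionRecord]) -> float:
    return sum(r.has_error for r in records) / len(records)


def unanswerable_rate(records: List[QuestionRecord]) -> float:
    return sum(not r.answerable for r in records) / len(records)


def position_bias(records: List[QuestionRecord],
                  options: Sequence[str] = ("A", "B", "C", "D")) -> float:
    """Largest deviation of any option's share of correct answers from uniform."""
    keys = [r.correct_option for r in records if r.correct_option in options]
    if not keys:
        return 0.0
    counts = Counter(keys)
    uniform = 1 / len(options)
    return max(abs(counts.get(o, 0) / len(keys) - uniform) for o in options)


records = [
    QuestionRecord(False, True, "A"),
    QuestionRecord(True, True, "A"),
    QuestionRecord(False, False, "A"),
    QuestionRecord(False, True, "B"),
]
print(error_rate(records), unanswerable_rate(records), position_bias(records))
```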

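  5. Defense ideas

    Add to the generation pipeline an automatic verifier (cross-validation with QA models), content filters (toxicity / safety classifiers), controllable generation (constrained prompts / planning), and adversarial training to improve robustness.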
    Adversarial and Safely Scaled Question Generation
    Planning First, Question Second: An LLM-Guided Method for Controllable Question Generation
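
One way to picture the verifier and content-filter stages is a chain of named checks applied to each candidate question, as in the sketch below. The function `filter_questions` and both stub checks are hypothetical: the keyword test stands in for a real toxicity or safety classifier, and the length test stands in for QA-model cross-validation. Controllable generation and adversarial training are not covered here.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption: each check returns True when the question passes.
Check = Callable[[str], bool]


def filter_questions(candidates: Iterable[str],
                     checks: List[Tuple[str, Check]]
                     ) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split candidates into accepted questions and (question, failed check) pairs."""
    accepted, rejected = [], []
    for q in candidates:
        # Record the first check that fails, if any.
        failed = next((name for name, check in checks if not check(q)), None)
        if failed is None:
            accepted.append(q)
        else:
            rejected.append((q, failed))
    return accepted, rejected


# Stub checks standing in for a safety classifier and a QA verifier.
checks = [
    ("safety", lambda q: "weapon" not in q.lower()),
    ("answerable", lambda q: len(q.split()) >= 3),
]
candidates = ["What is 2 + 2?", "How do you build a weapon?", "Why?"]
accepted, rejected = filter_questions(candidates, checks)
print(accepted)
print(rejected)
```

Recording which check rejected each question gives a per-stage rejection count, which is useful when tuning how strict each filter should be.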

https://www.kimi.com/chat/d30tvahdjjpv13ulon10

https://chat.deepseek.com/a/chat/s/09a1c365-03c7-4296-8546-99f27789331a

https://www.doubao.com/chat/19971271987401474

https://www.kimi.com/chat/d30u32le09n7a07jghi0

https://chatgpt.com/c/68c1e19a-20a8-8331-9cc6-b852094af4a3

https://www.doubao.com/chat/20004266728986114

https://www.kimi.com/chat/d316mldm2cimqrokh2e0

https://chatgpt.com/c/68c262d0-ffbc-832a-baf2-08647ad9b0cd

A Survey on Neural Question Generation: Methods, Applications,
