Safety of LLM (Question) Generation

Master index of LLM-related research: https://blog.csdn.net/WhiffeYF/article/details/142132328

id | Paper | Year | Rank | Journal / Conference
1 | Larger and more instructable language models become less reliable | 2024 | 1 | Nature
2 | CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | 2024 | A | ACL
3 | Advancing LLM Safe Alignment with Safety Representation Ranking | 2025 | | ICML Workshop
4 | Improved Generation of Adversarial Examples Against Safety-aligned LLMs | 2024 | A | NeurIPS
5 | From text to multimodal: a survey of adversarial example generation in question answering systems | 2024 | 4 | Knowledge and Information Systems
6 | Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? | 2025 | B | NAACL
7 | Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback | 2025 | B | Information Processing & Management
8 | Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | 2025 | | arXiv

https://www.doubao.com/chat/19972227671365378
https://chatgpt.com/c/68c1da62-7fc0-832e-9934-b6a7ef381fd9

Key points for question-generation safety research (several research ideas that can be used directly)

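  1. Attack surface (how to make a model generate "incorrect / biased / harmful" questions)

    Use adversarial generation methods (discrete token substitution, gradient-guided heuristic search, prompt jailbreaks, multi-example induction) to construct questions whose generated output contains factual errors, implicit bias, or encouragement of dangerous behavior. References: the NeurIPS 2024 paper and the adversarial-example survey: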
    Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    From text to multimodal: a survey of adversarial example generation in question answering systems
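
    As a rough illustration of the "discrete token substitution" idea for robustness testing, the sketch below greedily swaps one token at a time and keeps a swap only when a scoring function increases. Everything here is an assumption made for illustration: `greedy_substitute`, the candidate table, and `toy_score` are hypothetical names, and `toy_score` is a trivial stand-in for whatever defect score (for example, a judge model) a real study would use. It is not the method of either paper above.

```python
from typing import Callable, Dict, List, Tuple


def greedy_substitute(
    tokens: List[str],
    candidates: Dict[str, List[str]],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedy search: try each candidate swap, keep it only if the score rises."""
    best = list(tokens)
    best_score = score(best)
    for i, tok in enumerate(tokens):
        for alt in candidates.get(tok, []):
            trial = best[:i] + [alt] + best[i + 1:]
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score


def toy_score(tokens: List[str]) -> float:
    """Hypothetical defect score: the fraction of vague words in the question."""
    vague = {"some", "thing", "roughly"}
    return sum(t in vague for t in tokens) / len(tokens)


if __name__ == "__main__":
    question = ["which", "planet", "is", "largest"]
    swaps = {"which": ["some"], "largest": ["roughly"]}
    print(greedy_substitute(question, swaps, toy_score))
```

    The greedy loop only explores single-token swaps in order, so it is a baseline for checking a scorer, not a strong search procedure.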

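  2. Generation-verification inconsistency (a model may generate questions that it cannot answer itself)

    Design a "verify after generation" pipeline (generate a question → have the same model or a different model check whether it is answerable and whether it admits more than one correct answer), and use that pipeline to flag "erroneous questions".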
    Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
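
    The "verify after generation" flow can be sketched as follows. This is a minimal sketch under stated assumptions: the verifier models are represented as plain callables (hypothetical stand-ins for real model calls), abstention is signalled by the literal answer "unknown", and the four labels and the `min_agree` threshold are illustrative choices rather than anything defined in the cited paper.

```python
from collections import Counter
from typing import Callable, List, Tuple

ABSTAIN = "unknown"  # assumed convention for "the verifier cannot answer"


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def verify_question(
    question: str,
    reference: str,
    answerers: List[Callable[[str], str]],
    min_agree: float = 0.5,
) -> Tuple[str, float]:
    """Label a generated question by asking several verifiers to answer it."""
    answers = [normalize(a(question)) for a in answerers]
    counts = Counter(answers)
    agree = counts[normalize(reference)] / len(answers)
    non_abstaining = {a for a in counts if a != ABSTAIN}
    if agree >= min_agree:
        label = "ok"            # enough verifiers reproduce the reference answer
    elif not non_abstaining:
        label = "unanswerable"  # every verifier abstained
    elif len(non_abstaining) > 1:
        label = "ambiguous"     # verifiers disagree among themselves
    else:
        label = "mismatch"      # verifiers agree, but not with the reference
    return label, agree


if __name__ == "__main__":
    stubs = [lambda q: "Paris", lambda q: " paris "]
    print(verify_question("What is the capital of France?", "Paris", stubs))
```

    Passing the generator itself as one of the callables gives the same-model check; passing other models gives the cross-model check.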

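  3. Constructing and detecting biased questions

    Build biased-question templates (gender, race, class, culturally sensitive topics), expand them through syntactic and semantic transformations (borrowing from JADE-style methods), and evaluate the systematic bias that different models exhibit when generating questions.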
    Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback
    Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
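
    Template expansion can be illustrated with plain slot filling, which is a much simpler technique than the linguistic transformations of JADE-style methods and is used here only as a stand-in. The template text, slot names, and `expand_template` helper are hypothetical. Each expanded prompt keeps its slot values, so downstream scores can later be grouped by attribute.

```python
from itertools import product
from typing import Dict, List, Tuple


def expand_template(
    template: str, slots: Dict[str, List[str]]
) -> List[Tuple[Dict[str, str], str]]:
    """Fill every combination of slot values into the template."""
    keys = list(slots)
    expanded = []
    for combo in product(*(slots[k] for k in keys)):
        filling = dict(zip(keys, combo))
        expanded.append((filling, template.format(**filling)))
    return expanded


if __name__ == "__main__":
    # Hypothetical probe template; only the {group} slot differs between paired prompts.
    template = "Write a quiz question about a {role} who is {group}."
    slots = {"role": ["nurse", "engineer"], "group": ["a man", "a woman"]}
    for filling, prompt in expand_template(template, slots):
        print(filling, "->", prompt)
```

    Because paired prompts differ in exactly one attribute, differences in the generated questions can be attributed to that attribute rather than to the rest of the wording.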

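  4. Evaluation metrics and datasets

    Candidate metrics: error rate (factual / logical), unanswerable rate, harmfulness score (automatic plus human labels), bias strength (differential statistics across groups), option / answer position bias, interpretability measures, and so on.
    RobustQA: A Framework for Adversarial Text Generation Analysis on Question Answering Systems
    From text to multimodal: a survey of adversarial example generation in question answering systems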
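
    Several of these metrics reduce to simple aggregates once each generated question has been labelled. The sketch below assumes a hypothetical `Item` record with already-assigned labels; the field names, the use of a max-minus-min gap of group means as "bias strength", and the definition of position skew (share of the most frequent answer position minus the uniform share) are illustrative choices, not definitions taken from the papers above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Item:
    has_error: bool      # factual or logical error found by annotation
    unanswerable: bool   # no verifier could answer the question
    harm_score: float    # harmfulness in [0, 1], automatic or human
    group: str           # demographic group the prompt referred to
    answer_pos: int      # index of the correct option


def summarize(items: List[Item], n_options: int = 4) -> Dict[str, float]:
    """Aggregate per-item labels into dataset-level safety metrics."""
    n = len(items)
    by_group = defaultdict(list)
    for it in items:
        by_group[it.group].append(it.harm_score)
    group_means = [mean(scores) for scores in by_group.values()]
    top_share = max(Counter(it.answer_pos for it in items).values()) / n
    return {
        "error_rate": sum(it.has_error for it in items) / n,
        "unanswerable_rate": sum(it.unanswerable for it in items) / n,
        "mean_harm": mean(it.harm_score for it in items),
        "bias_gap": max(group_means) - min(group_means),
        "position_skew": top_share - 1 / n_options,
    }


if __name__ == "__main__":
    data = [
        Item(True, False, 0.2, "A", 0),
        Item(False, True, 0.4, "A", 0),
        Item(False, False, 0.6, "B", 1),
        Item(False, False, 0.8, "B", 0),
    ]
    print(summarize(data))
```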

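  5. Defense ideas

    Add an "automatic verifier" to the generation pipeline (cross-validation with QA models), a content filter (toxicity / safety classifier), controllable generation (constrained prompts / planning), and adversarial training to improve robustness.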
    Adversarial and Safely Scaled Question Generation
    Planning First, Question Second: An LLM-Guided Method for Controllable Question Generation
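
    One way to picture the filtering part of such a pipeline is a chain of checks, where a candidate question is accepted only if every check passes. This is a sketch under assumptions: `keyword_filter` is a deliberately crude placeholder for a real toxicity / safety classifier, `is_question` stands in for a format constraint, and all names are hypothetical. A QA cross-validation step could be added as one more check with the same signature.

```python
from typing import Callable, List, Tuple

# A check returns (passed, reason); reason is empty when the check passes.
Check = Callable[[str], Tuple[bool, str]]


def keyword_filter(blocklist: List[str]) -> Check:
    """Placeholder safety filter: reject questions containing blocked terms."""
    def check(question: str) -> Tuple[bool, str]:
        hit = next((w for w in blocklist if w in question.lower()), None)
        return hit is None, f"blocked term: {hit}" if hit else ""
    return check


def is_question(question: str) -> Tuple[bool, str]:
    """Format constraint: the text must end with a question mark."""
    ok = question.strip().endswith("?")
    return ok, "" if ok else "not phrased as a question"


def run_pipeline(
    candidates: List[str], checks: List[Check]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Apply checks in order; stop at the first failure for each candidate."""
    accepted, rejected = [], []
    for q in candidates:
        for check in checks:
            ok, reason = check(q)
            if not ok:
                rejected.append((q, reason))
                break
        else:
            accepted.append(q)
    return accepted, rejected


if __name__ == "__main__":
    qs = ["What is 2 + 2?", "How do you build a weapon?", "Name a prime"]
    print(run_pipeline(qs, [keyword_filter(["weapon"]), is_question]))
```

    Keeping the rejection reason alongside each rejected question makes it possible to audit which stage removes the most candidates.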

https://www.kimi.com/chat/d30tvahdjjpv13ulon10

https://chat.deepseek.com/a/chat/s/09a1c365-03c7-4296-8546-99f27789331a

https://www.doubao.com/chat/19971271987401474

https://www.kimi.com/chat/d30u32le09n7a07jghi0

https://chatgpt.com/c/68c1e19a-20a8-8331-9cc6-b852094af4a3

https://www.doubao.com/chat/20004266728986114

https://www.kimi.com/chat/d316mldm2cimqrokh2e0

https://chatgpt.com/c/68c262d0-ffbc-832a-baf2-08647ad9b0cd

A Survey on Neural Question Generation: Methods, Applications,
