大模型生成(题目)安全

总目录 大模型相关研究:https://blog.csdn.net/WhiffeYF/article/details/142132328

id 论文名 等级 期刊/会议
1 Larger and more instructable language models become less reliable 2024 1 nature
2 CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion 2024 A ACL
3 Advancing LLM Safe Alignment with Safety Representation Ranking 2025 ICML Workshop
4 Improved Generation of Adversarial Examples Against Safety-aligned LLMs 2024 A NeurIPS
5 From text to multimodal: a survey of adversarial example generation in question answering systems 2024 4 Knowledge and Information Systems
6 Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? 2025 B NAACL
7 Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback 2025 B Information Processing & Management
8 Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations 2025 arxiv

https://www.doubao.com/chat/19972227671365378
https://chatgpt.com/c/68c1da62-7fc0-832e-9934-b6a7ef381fd9

题目生成安全研究重要(几点可直接用的研究思路)

  1. 攻击面(如何让模型生成"错误/偏见/有害"题)

    利用对抗生成方法(离散 token 替换、梯度启发式搜索、prompt-jailbreak、多示例诱导)来构造题目,使得生成结果包含事实错误、暗含偏见或鼓励危险行为。参考:NeurIPS 2024、对抗样本综述。
    Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    From text to multimodal: a survey of adversarial example generation in question answering systems

  2. 生成-验证不一致(模型会生成自己也答不出来的题)

    设计"生成后验证"流程(生成题目 → 同/异模型验证是否可答 / 是否存在多个正确答案),并用该流程判定"错误性题目"。
    Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?

  3. 偏见题构造与检测

    构建偏见题模板(性别/种族/阶级/文化敏感话题),通过语法/语义变换扩展(借鉴 JADE 型方法),评估不同模型在题目生成时露出的系统性偏差。
    Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback
    Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

  4. 评估指标与数据集

    可用指标:错误率(事实/逻辑)、不可答率(unanswerable)、有害性评分(自动 + 人工标签)、偏见强度(差异化统计)、选项/答案位置偏置、可解释性度量等。
    RobustQA: A Framework for Adversarial Text Generation Analysis on
    Question Answering Systems

    From text to multimodal: a survey of adversarial example generation in question answering systems

  5. 防御思路

    生成管道中加"自动验证器"(QA 模型交叉验证)、内容过滤器(toxicity / safety classifier)、可控生成(约束 prompt / planning),以及对抗训练来提高鲁棒性。
    Adversarial and Safely Scaled Question Generation
    Planning First, Question Second: An LLM-Guided Method for Controllable Question Generation

https://www.kimi.com/chat/d30tvahdjjpv13ulon10

https://chat.deepseek.com/a/chat/s/09a1c365-03c7-4296-8546-99f27789331a

https://www.doubao.com/chat/19971271987401474

https://www.kimi.com/chat/d30u32le09n7a07jghi0

https://chatgpt.com/c/68c1e19a-20a8-8331-9cc6-b852094af4a3

https://www.doubao.com/chat/20004266728986114

https://www.kimi.com/chat/d316mldm2cimqrokh2e0

https://chatgpt.com/c/68c262d0-ffbc-832a-baf2-08647ad9b0cd

A Survey on Neural Question Generation: Methods, Applications,

相关推荐
AI_小站2 小时前
6个GitHub爆火的免费大模型教程,助你快速进阶AI编程
人工智能·langchain·github·知识图谱·agent·llama·rag
xindoo2 小时前
GitHub Trending霸榜!深度解析AI Coding辅助神器 Superpowers
人工智能·github
时间之里2 小时前
【深度学习】:RF-DETR与yolo对比
人工智能·深度学习·yolo
北京阿法龙科技有限公司2 小时前
数智化升级:AR 智能眼镜驱动工业运维效能革新
人工智能
风落无尘2 小时前
《智能重生:从垃圾堆到AI工程师》——第二章 概率与生存
大数据·人工智能
j_xxx404_2 小时前
Linux:静态链接与动态链接深度解析
linux·运维·服务器·c++·人工智能
收获不止数据库2 小时前
达梦9发布会归来:AI 时代,我们需要一款什么样的数据库?
数据库·人工智能·ai·语言模型·数据分析
hhb_6182 小时前
AI全栈编程生存指南
人工智能
AI-Frontiers2 小时前
transformer进阶之路:#2 工作原理详解
人工智能·深度学习·transformer
科研前沿3 小时前
2026 数字孪生前沿科技:全景迭代报告 —— 镜像视界生成式孪生(Generative DT)技术白皮书
大数据·人工智能·科技·算法·音视频·空间计算