LLM (Question) Generation Safety

Main index, LLM-related research: https://blog.csdn.net/WhiffeYF/article/details/142132328

| ID | Paper | Year | Tier | Journal/Conference |
|---|---|---|---|---|
| 1 | Larger and more instructable language models become less reliable | 2024 | 1 | Nature |
| 2 | CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | 2024 | A | ACL |
| 3 | Advancing LLM Safe Alignment with Safety Representation Ranking | 2025 | | ICML Workshop |
| 4 | Improved Generation of Adversarial Examples Against Safety-aligned LLMs | 2024 | A | NeurIPS |
| 5 | From text to multimodal: a survey of adversarial example generation in question answering systems | 2024 | 4 | Knowledge and Information Systems |
| 6 | Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? | 2025 | B | NAACL |
| 7 | Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback | 2025 | B | Information Processing & Management |
| 8 | Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | 2025 | | arXiv |

https://www.doubao.com/chat/19972227671365378
https://chatgpt.com/c/68c1da62-7fc0-832e-9934-b6a7ef381fd9

Why question-generation safety research matters (several directly usable research ideas)

  1. Attack surface (how to make a model generate erroneous, biased, or harmful questions)

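    Use adversarial generation methods (discrete token substitution, gradient-guided heuristic search, prompt jailbreaks, multi-example induction) to construct questions so that the generated output contains factual errors, implicit bias, or encouragement of dangerous behavior. References: the NeurIPS 2024 paper and the adversarial-example survey.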
    Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    From text to multimodal: a survey of adversarial example generation in question answering systems

  2. Generation-verification inconsistency (a model may generate questions that it cannot answer itself)

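    Design a generate-then-verify pipeline (generate a question → have the same or a different model check whether it is answerable and whether it admits multiple correct answers), and use that pipeline to identify erroneous questions.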
    Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
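
The generate-then-verify flow in item 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the cited paper: the `Answerer` callables are hypothetical stand-ins for real model calls, and the four labels and the exact-match comparison are simplifications chosen for brevity.

```python
from typing import Callable, List, Optional

# Hypothetical interface: a verifier takes a question and returns an answer,
# or None when it declines or cannot answer. Real model calls would go here.
Answerer = Callable[[str], Optional[str]]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def verify_question(question: str, intended_answer: str,
                    answerers: List[Answerer]) -> str:
    """Classify a generated question using one or more verifier models.

    Returns one of (the labels are assumptions made for this sketch):
      'unanswerable' - no verifier produced an answer
      'ambiguous'    - verifiers produced more than one distinct answer
      'mismatch'     - verifiers agree, but not with the intended answer
      'ok'           - verifiers agree with the intended answer
    """
    raw = [answerer(question) for answerer in answerers]
    given = {normalize(a) for a in raw if a is not None and a.strip()}
    if not given:
        return "unanswerable"
    if len(given) > 1:
        return "ambiguous"
    return "ok" if given == {normalize(intended_answer)} else "mismatch"


if __name__ == "__main__":
    # Stub verifiers standing in for "same model" and "different model".
    same_model = lambda q: "Paris"
    other_model = lambda q: " paris "
    print(verify_question("What is the capital of France?", "Paris",
                          [same_model, other_model]))
```

Exact string matching is the weakest part of this sketch; free-form answers would probably need a semantic comparison or a judge model instead.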

  3. Constructing and detecting biased questions

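    Build biased-question templates (gender, race, class, culturally sensitive topics), expand them through syntactic and semantic transformations (borrowing from JADE-style methods), and evaluate the systematic bias that different models exhibit when generating questions.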
    Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback
    Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
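
One way to picture the template expansion and the per-group comparison in item 3 is the sketch below. It is illustrative only: the `{group}` slot convention, the `generate` and `is_flagged` callables, and the max-minus-min gap are all assumptions, not taken from JADE or the cited benchmarks. Neutral placeholder group names are used in the demo.

```python
from typing import Callable, Dict, List, Tuple

Prompt = Tuple[str, str, str]  # (attribute, group_value, prompt_text)


def expand_templates(templates: List[str],
                     groups: Dict[str, List[str]]) -> List[Prompt]:
    """Fill the '{group}' slot of every template with every group value."""
    prompts = []
    for template in templates:
        for attribute, values in groups.items():
            for value in values:
                prompts.append((attribute, value, template.format(group=value)))
    return prompts


def flag_rates(prompts: List[Prompt],
               generate: Callable[[str], str],
               is_flagged: Callable[[str], bool]) -> Dict[str, float]:
    """Fraction of generated questions flagged as biased, per group value."""
    flagged: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for _attribute, value, prompt in prompts:
        question = generate(prompt)  # hypothetical model call
        total[value] = total.get(value, 0) + 1
        flagged[value] = flagged.get(value, 0) + int(is_flagged(question))
    return {value: flagged[value] / total[value] for value in total}


def bias_gap(rates: Dict[str, float]) -> float:
    """Largest difference in flag rate between any two groups."""
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    templates = ["Write an exam question about an engineer from {group}.",
                 "Write an exam question about a nurse from {group}."]
    groups = {"demo_attribute": ["group A", "group B"]}

    # Toy generator that treats one group differently, so the gap is visible.
    def toy_generate(prompt: str) -> str:
        return prompt + (" [stereotyped]" if "group A" in prompt else "")

    rates = flag_rates(expand_templates(templates, groups),
                       toy_generate, lambda text: "[stereotyped]" in text)
    print(rates, bias_gap(rates))
```

In practice `is_flagged` would be a bias classifier or human annotation, and the gap would be computed over many samples per group rather than one.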

  4. Evaluation metrics and datasets

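    Candidate metrics: error rate (factual/logical), unanswerable rate, harmfulness score (automatic plus human labels), bias strength (differential statistics across groups), option/answer position bias, and interpretability measures.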
    RobustQA: A Framework for Adversarial Text Generation Analysis on Question Answering Systems

    From text to multimodal: a survey of adversarial example generation in question answering systems
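
A few of the metrics in item 4 are simple enough to compute directly. The sketch below assumes each generated multiple-choice question has already been labeled; the `JudgedQuestion` fields and the definition of position bias (largest deviation from a uniform answer-slot distribution) are assumptions made for illustration, and all questions are assumed to have the same number of options.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class JudgedQuestion(NamedTuple):
    """One generated multiple-choice question plus its (assumed) labels."""
    has_error: bool       # factual or logical error, from a human or a checker
    unanswerable: bool    # no verifier could answer it
    answer_index: int     # position of the correct option, 0-based
    num_options: int = 4


def summarize(items: List[JudgedQuestion]) -> Dict[str, float]:
    """Compute error rate, unanswerable rate, and answer-position bias."""
    n = len(items)
    if n == 0:
        raise ValueError("no questions to summarize")
    num_options = items[0].num_options  # assumed identical for all items
    counts = Counter(q.answer_index for q in items)
    uniform = 1.0 / num_options
    # Position bias: largest deviation of any slot's frequency from uniform.
    position_bias = max(abs(counts.get(i, 0) / n - uniform)
                        for i in range(num_options))
    return {
        "error_rate": sum(q.has_error for q in items) / n,
        "unanswerable_rate": sum(q.unanswerable for q in items) / n,
        "position_bias": position_bias,
    }


if __name__ == "__main__":
    sample = [
        JudgedQuestion(True, False, 0),
        JudgedQuestion(False, True, 0),
        JudgedQuestion(False, True, 0),
        JudgedQuestion(False, False, 1),
    ]
    print(summarize(sample))
```

Harmfulness and bias strength are left out here because they depend on an external classifier or human labels rather than on counting.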

  5. Defense ideas

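    Add an automatic verifier (cross-validation with QA models) and content filters (toxicity / safety classifiers) to the generation pipeline, use controllable generation (constrained prompts / planning), and apply adversarial training to improve robustness.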
    Adversarial and Safely Scaled Question Generation
    Planning First, Question Second: An LLM-Guided Method for Controllable Question Generation
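
The verifier and content-filter stages from item 5 could be chained along the lines below. This is a sketch under assumptions, not the approach of the papers listed: `keyword_filter` is a toy stand-in for a real toxicity or safety classifier, the `answerer` callable is a hypothetical QA model, and controllable generation and adversarial training are not covered.

```python
from typing import Callable, List, NamedTuple, Optional

# A check returns None to accept, or a short rejection reason.
Check = Callable[[str], Optional[str]]


class Decision(NamedTuple):
    question: str
    accepted: bool
    reason: Optional[str]


def keyword_filter(blocklist: List[str]) -> Check:
    """Toy content filter; a stand-in for a real toxicity/safety classifier."""
    lowered = [word.lower() for word in blocklist]

    def check(question: str) -> Optional[str]:
        text = question.lower()
        for word in lowered:
            if word in text:
                return f"blocked term: {word}"
        return None
    return check


def answerable_check(answerer: Callable[[str], Optional[str]]) -> Check:
    """Reject questions that the (hypothetical) QA verifier cannot answer."""
    def check(question: str) -> Optional[str]:
        return None if answerer(question) else "verifier could not answer"
    return check


def run_pipeline(questions: List[str], checks: List[Check]) -> List[Decision]:
    """Apply checks in order; the first failing check rejects the question."""
    decisions = []
    for question in questions:
        reason = None
        for check in checks:
            reason = check(question)
            if reason is not None:
                break
        decisions.append(Decision(question, reason is None, reason))
    return decisions


if __name__ == "__main__":
    toy_answerer = lambda q: "4" if "2 + 2" in q else None
    checks = [keyword_filter(["forbidden"]), answerable_check(toy_answerer)]
    for decision in run_pipeline(
            ["What is 2 + 2?", "Describe the FORBIDDEN topic.", "What is glorp?"],
            checks):
        print(decision)
```

Putting the cheap filter first means the more expensive verifier call only runs on questions that already passed content screening.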

https://www.kimi.com/chat/d30tvahdjjpv13ulon10

https://chat.deepseek.com/a/chat/s/09a1c365-03c7-4296-8546-99f27789331a

https://www.doubao.com/chat/19971271987401474

https://www.kimi.com/chat/d30u32le09n7a07jghi0

https://chatgpt.com/c/68c1e19a-20a8-8331-9cc6-b852094af4a3

https://www.doubao.com/chat/20004266728986114

https://www.kimi.com/chat/d316mldm2cimqrokh2e0

https://chatgpt.com/c/68c262d0-ffbc-832a-baf2-08647ad9b0cd

A Survey on Neural Question Generation: Methods, Applications,
