大模型生成(题目)安全

总目录 大模型相关研究:https://blog.csdn.net/WhiffeYF/article/details/142132328

id 论文名 等级 期刊/会议
1 Larger and more instructable language models become less reliable 2024 1 nature
2 CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion 2024 A ACL
3 Advancing LLM Safe Alignment with Safety Representation Ranking 2025 ICML Workshop
4 Improved Generation of Adversarial Examples Against Safety-aligned LLMs 2024 A NeurIPS
5 From text to multimodal: a survey of adversarial example generation in question answering systems 2024 4 Knowledge and Information Systems
6 Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? 2025 B NAACL
7 Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback 2025 B Information Processing & Management
8 Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations 2025 arxiv

https://www.doubao.com/chat/19972227671365378
https://chatgpt.com/c/68c1da62-7fc0-832e-9934-b6a7ef381fd9

题目生成安全研究重要(几点可直接用的研究思路)

  1. 攻击面(如何让模型生成"错误/偏见/有害"题)

    利用对抗生成方法(离散 token 替换、梯度启发式搜索、prompt-jailbreak、多示例诱导)来构造题目,使得生成结果包含事实错误、暗含偏见或鼓励危险行为。参考:NeurIPS 2024、对抗样本综述。
    Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    From text to multimodal: a survey of adversarial example generation in question answering systems

  2. 生成-验证不一致(模型会生成自己也答不出来的题)

    设计"生成后验证"流程(生成题目 → 同/异模型验证是否可答 / 是否存在多个正确答案),并用该流程判定"错误性题目"。
    Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?

  3. 偏见题构造与检测

    构建偏见题模板(性别/种族/阶级/文化敏感话题),通过语法/语义变换扩展(借鉴 JADE 型方法),评估不同模型在题目生成时露出的系统性偏差。
    Towards human-like questioning: Knowledge base question generation with bias-corrected reinforcement learning from human feedback
    Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

  4. 评估指标与数据集

    可用指标:错误率(事实/逻辑)、不可答率(unanswerable)、有害性评分(自动 + 人工标签)、偏见强度(差异化统计)、选项/答案位置偏置、可解释性度量等。
    RobustQA: A Framework for Adversarial Text Generation Analysis on
    Question Answering Systems

    From text to multimodal: a survey of adversarial example generation in question answering systems

  5. 防御思路

    生成管道中加"自动验证器"(QA 模型交叉验证)、内容过滤器(toxicity / safety classifier)、可控生成(约束 prompt / planning),以及对抗训练来提高鲁棒性。
    Adversarial and Safely Scaled Question Generation
    Planning First, Question Second: An LLM-Guided Method for Controllable Question Generation

https://www.kimi.com/chat/d30tvahdjjpv13ulon10

https://chat.deepseek.com/a/chat/s/09a1c365-03c7-4296-8546-99f27789331a

https://www.doubao.com/chat/19971271987401474

https://www.kimi.com/chat/d30u32le09n7a07jghi0

https://chatgpt.com/c/68c1e19a-20a8-8331-9cc6-b852094af4a3

https://www.doubao.com/chat/20004266728986114

https://www.kimi.com/chat/d316mldm2cimqrokh2e0

https://chatgpt.com/c/68c262d0-ffbc-832a-baf2-08647ad9b0cd

A Survey on Neural Question Generation: Methods, Applications,

相关推荐
robot_learner几秒前
OpenClaw, 突然走红的智能体
人工智能
ujainu小几秒前
CANN仓库内容深度解读:昇腾AI生态的基石与AIGC发展的引擎
人工智能·aigc
rcc86282 分钟前
AI应用核心技能:从入门到精通的实战指南
人工智能·机器学习
霖大侠7 分钟前
【无标题】
人工智能·深度学习·机器学习
callJJ16 分钟前
Spring AI 文本聊天模型完全指南:ChatModel 与 ChatClient
java·大数据·人工智能·spring·spring ai·聊天模型
是店小二呀31 分钟前
CANN 异构计算的极限扩展:从算子融合到多卡通信的统一优化策略
人工智能·深度学习·transformer
冻感糕人~35 分钟前
收藏备用|小白&程序员必看!AI Agent入门详解(附工业落地实操关联)
大数据·人工智能·架构·大模型·agent·ai大模型·大模型学习
予枫的编程笔记38 分钟前
【Linux入门篇】Ubuntu和CentOS包管理不一样?apt与yum对比实操,看完再也不混淆
linux·人工智能·ubuntu·centos·linux包管理·linux新手教程·rpm离线安装
陈西子在网上冲浪38 分钟前
当全国人民用 AI 点奶茶时,你的企业官网还在“人工建站”吗?
人工智能
victory043141 分钟前
hello_agent第九章总结
人工智能·agent