[Datawhale LLM Fundamentals] Chapter 3: Harmfulness (Harms) of Large Language Models

Chapter 3: Harmfulness (Harms) of Large Language Models

As noted earlier, LLMs have distinctive abilities that appear only when the model has a very large number of parameters. However, LLMs also carry harms.

When considering any technology, we must carefully weigh its benefits and harms. This is a complex task for three reasons:

  1. Benefits and harms are difficult to quantify;
  2. Even if they could be quantified, the distribution of these benefits and harms among the population is not uniform (marginalized groups often bear more harm), making the balancing act between them a thorny ethical issue;
  3. Even if you can make meaningful trade-offs, what authority do decision-makers have to make decisions?

Preventing the harms of LLMs is still a very new research direction. Current work focuses mainly on the following two points:

  1. Harm related to performance differences: For a specific task (such as question answering), a performance difference means that the model performs better for some groups and worse for others.
  2. Harm related to social biases and stereotypes: Social bias is the systematic association of a concept (such as science) with certain groups (such as men) over others (such as women). Stereotypes are a specific form of social bias in which the associations are widely held, oversimplified, and generally fixed.
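A performance difference of the kind described in the first point can be made concrete by computing a task metric separately for each group and comparing the results. The following is a minimal sketch under stated assumptions: the group names, the evaluation records, and the helper functions `accuracy_by_group` and `max_accuracy_gap` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} from an iterable of (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def max_accuracy_gap(per_group):
    """Return the largest accuracy difference between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Hypothetical question-answering results: (group, was the answer correct?)
records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

per_group = accuracy_by_group(records)
gap = max_accuracy_gap(per_group)
print(per_group)  # accuracy for each group
print(gap)        # a large gap signals a performance disparity
```

A single aggregate accuracy would hide this disparity, which is why the metric is broken down by group before comparison.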

Because the pre-training datasets of LLMs are opaque and include web-crawled data, they likely contain online discussions of political topics (e.g., climate change, abortion, gun control), hate speech, discrimination, and other forms of media bias. Some researchers have identified misogyny, pornography, and other harmful stereotypes within these pre-training datasets. Similarly, researchers have observed that LLMs exhibit political biases that reinforce the polarization already present in the pre-training corpora, and that these biases carry over into tasks such as hate speech prediction and misinformation detection.

Recent studies have examined the potential sources of bias in LLMs (such as training data or model specifications), the ethical concerns of deploying biased LLMs in diverse applications, and the current methods for mitigating these biases. An interesting finding is that all the models examined show a systematic preference for stereotypical data, which points to an urgent need for high-quality pre-training datasets.
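One common way to quantify a "systematic preference for stereotypical data" is to score paired sentences, one stereotypical and one anti-stereotypical, and count how often the model assigns the higher likelihood to the stereotypical one. The sketch below only illustrates that general idea and is not the procedure of any particular study. The log-probabilities are made-up numbers, and `stereotype_preference_rate` is a hypothetical helper.

```python
def stereotype_preference_rate(pairs):
    """Fraction of pairs where the stereotypical sentence scores strictly higher.

    pairs: list of (logprob_stereotypical, logprob_anti_stereotypical).
    A rate near 0.5 suggests no systematic preference; a rate well above 0.5
    suggests the model favors stereotypical sentences.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return preferred / len(pairs)


# Hypothetical log-probabilities that a language model assigned to each sentence
pairs = [
    (-12.1, -13.4),  # stereotypical sentence preferred
    (-8.7, -8.2),    # anti-stereotypical sentence preferred
    (-20.0, -22.5),  # stereotypical sentence preferred
    (-15.3, -15.9),  # stereotypical sentence preferred
]

rate = stereotype_preference_rate(pairs)
print(rate)
```

In practice the scores would come from the model under evaluation, and the pairs would need to be matched in length and wording so that the comparison isolates the stereotype.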

Toxicity and disinformation are two key harms that concern researchers. With respect to both, LLMs can serve two purposes:

  1. They can be used to generate toxic content, which malicious actors can exploit to amplify the reach of their messaging;
  2. They can be used to detect disinformation, thereby aiding in content moderation.

The challenge of identifying toxicity lies in the ambiguity of labeling: an output may be toxic in one context but not in another, and different individuals may perceive toxicity differently. Jigsaw, a division of Google, focuses on using technology to address societal issues such as extremism. In 2017, it developed a widely used proprietary service called Perspective, a machine learning model that assigns a toxicity score between 0 and 1 to each input. The model was trained on Wikipedia discussion pages (where volunteer moderators discuss editing decisions) that were labeled by crowdworkers. The service is available at https://perspectiveapi.com/.
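To show what "a toxicity score between 0 and 1" looks like in use, here is a minimal sketch of calling Perspective and reading the score. The endpoint, request body, and response fields follow my understanding of the public Perspective API documentation and should be checked against the current docs. The API key is a placeholder, the 0.5 threshold is an arbitrary assumption, and `sample_response` is a hand-written stand-in rather than a real response.

```python
import json
import urllib.request

# Endpoint as described in the public docs; verify before relying on it.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text):
    """Build the JSON body that asks for the TOXICITY attribute."""
    return {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}


def extract_toxicity(response):
    """Pull the summary toxicity score (0 to 1) out of a parsed response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def score_text(text, api_key):
    """Send one comment to the service and return its toxicity score.

    Requires network access and a valid API key; not called in this sketch.
    """
    body = json.dumps(build_request(text)).encode("utf-8")
    request = urllib.request.Request(
        f"{ANALYZE_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_toxicity(json.load(resp))


def is_flagged(score, threshold=0.5):
    """Apply a threshold; the right value depends on context and is an assumption."""
    return score >= threshold


# Hand-written stand-in with the assumed response shape (not real API output)
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.82, "type": "PROBABILITY"}}
    }
}
sample_score = extract_toxicity(sample_response)
print(sample_score, is_flagged(sample_score))
```

Because the same text can be toxic in one context and harmless in another, any fixed threshold is a judgment call, which is exactly the labeling ambiguity described above.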

Disinformation is the deliberate presentation of false or misleading information to deceive a specific audience, often with adversarial intent. A related term is misinformation (which can be thought of as "hallucinations" when an LLM produces it): information that is false or misleading but is presented as true, regardless of intent. It is important to note that misleading and false information is not always verifiable; at times, it merely raises doubts or shifts the burden of proof onto the audience.

Hallucination is a recent research hotspot. To distinguish between types of hallucination, one can compare the output against the source content given to the model, such as the prompt, which may contain examples or retrieved context. There are two types: intrinsic and extrinsic hallucinations. In an intrinsic hallucination, the generated text logically contradicts the source content. In an extrinsic hallucination, the output cannot be verified from the provided source: the source lacks the information needed to evaluate the output, so its accuracy is undetermined. An extrinsic hallucination is not necessarily wrong; it simply means the model produced an output that the source content can neither support nor refute. It is still somewhat undesirable, however, because the information cannot be verified.
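The distinction can be summarized as a mapping from "how the source relates to the output" to a hallucination type. The sketch below is illustrative only: the `Support` labels are assigned by hand here, whereas in practice they would have to come from a human annotator or an entailment model, and the names `Support` and `classify_output` are hypothetical.

```python
from enum import Enum


class Support(Enum):
    """How the source content relates to a generated statement."""
    ENTAILED = "entailed"            # the source supports the statement
    CONTRADICTED = "contradicted"    # the source logically contradicts it
    UNDETERMINED = "undetermined"    # the source neither supports nor refutes it


def classify_output(support):
    """Map a source-support label to the hallucination type described in the text."""
    if support is Support.CONTRADICTED:
        return "intrinsic hallucination"
    if support is Support.UNDETERMINED:
        return "extrinsic hallucination"
    return "supported"


# Source content: "The meeting is on Tuesday at the Berlin office."
# Labels below are hand-assigned for illustration.
examples = [
    ("The meeting is on Tuesday.", Support.ENTAILED),
    ("The meeting is on Friday.", Support.CONTRADICTED),
    ("The meeting will last two hours.", Support.UNDETERMINED),
]

for statement, support in examples:
    print(f"{statement!r}: {classify_output(support)}")
```

Note that the third statement may well be true; it is extrinsic only because the source says nothing about the meeting's duration.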

To make the difference between the two clearer, I cite a figure from a survey:

P.S. I recently found some interesting papers that discuss the abilities of LLMs; I may write notes on them in Chinese after finishing the Datawhale study.

END
