[LangChain Series 13] Example Selectors


Quick overview:

  • Custom example selector

  • Length-based example selector

  • MMR example selector

  • Similarity example selector

  • N-gram overlap example selector

In an earlier post in this series, [LangChain Series 8] Prompt Templates: Few-Shot Prompt Templates (II), we introduced dynamic few-shot prompt templates: based on the user's input, an example selector dynamically picks, from all available examples, the ones most relevant to that input. With more relevant examples in the prompt, the language model understands it better and gives better answers.

This post covers how to use five kinds of example selectors:

  • Custom example selector

  • Length-based example selector

  • MMR example selector

  • Similarity example selector

  • N-gram overlap example selector

01 Custom Example Selector

Writing a custom example selector is common: business logic varies endlessly, and a custom selector gives you full control over how examples are chosen.

An example selector must implement at least two methods:

  1. add_example: accepts an example and adds it to the selector's store.

  2. select_examples: accepts the user's input variables and returns a list of examples for the few-shot prompt template to use.

Let's implement a selector that randomly picks two examples.

from langchain.prompts.example_selector.base import BaseExampleSelector
from typing import Dict, List
import numpy as np


class CustomExampleSelector(BaseExampleSelector):
    
    def __init__(self, examples: List[Dict[str, str]]):
        self.examples = examples
    
    def add_example(self, example: Dict[str, str]) -> None:
        """Add new example to store for a key."""
        self.examples.append(example)

    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:
        """Select which examples to use based on the inputs."""
        return np.random.choice(self.examples, size=2, replace=False)

Once the selector is defined, we can use it.

examples = [
    {"foo": "1"},
    {"foo": "2"},
    {"foo": "3"},
]

# Initialize example selector.
example_selector = CustomExampleSelector(examples)

# Select examples
example_selector.select_examples({"foo": "foo"})
# -> array([{'foo': '2'}, {'foo': '3'}], dtype=object)

# Add new example to the set of examples
example_selector.add_example({"foo": "4"})
example_selector.examples
# -> [{'foo': '1'}, {'foo': '2'}, {'foo': '3'}, {'foo': '4'}]

# Select examples
example_selector.select_examples({"foo": "foo"})
# -> array([{'foo': '1'}, {'foo': '4'}], dtype=object)

02 Length-Based Example Selector

As the name suggests, the length-based example selector chooses examples according to their length, which is useful when the prompt would otherwise exceed the context window. If the user input is long, fewer examples are selected; if it is short, more examples are selected.

LengthBasedExampleSelector is the length-based selector LangChain provides.

from langchain.prompts import PromptTemplate
from langchain.prompts import FewShotPromptTemplate
from langchain.prompts.example_selector import LengthBasedExampleSelector


# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)
example_selector = LengthBasedExampleSelector(
    # The examples it has available to choose from.
    examples=examples, 
    # The PromptTemplate being used to format the examples.
    example_prompt=example_prompt, 
    # The maximum length that the formatted examples should be.
    # Length is measured by the get_text_length function below.
    max_length=25,
    # The function used to get the length of a string, which is used
    # to determine which examples to include. It is commented out because
    # it is provided as a default value if none is specified.
    # get_text_length: Callable[[str], int] = lambda x: len(re.split("\n| ", x))
)
dynamic_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:", 
    input_variables=["adjective"],
)

When the user input is short, all the examples are selected.

# An example with small input, so it selects all examples.
print(dynamic_prompt.format(adjective="big"))

    Give the antonym of every input
    
    Input: happy
    Output: sad
    
    Input: tall
    Output: short
    
    Input: energetic
    Output: lethargic
    
    Input: sunny
    Output: gloomy
    
    Input: windy
    Output: calm
    
    Input: big
    Output:

When the user input is long, only one example is selected.

# An example with long input, so it selects only one example.
long_string = "big and huge and massive and large and gigantic and tall and much much much much much bigger than everything else"
print(dynamic_prompt.format(adjective=long_string))

    Give the antonym of every input
    
    Input: happy
    Output: sad
    
    Input: big and huge and massive and large and gigantic and tall and much much much much much bigger than everything else
    Output:
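
Why only one example? By default the selector measures length in words: the text is split on spaces and newlines (this is the get_text_length default shown, commented out, in the selector above). A quick check of the counts against max_length=25:

import re

# Default length function: count words by splitting on spaces and newlines.
get_text_length = lambda x: len(re.split("\n| ", x))

print(get_text_length("Input: happy\nOutput: sad"))  # 4 -> each formatted example is 4 words
print(get_text_length(long_string))                  # 21 -> the long input alone uses 21 of the 25-word budget
# Only 25 - 21 = 4 words remain, so exactly one 4-word example still fits.
# With the short input "big" (1 word), all five examples (20 words) fit easily.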

You can also add examples dynamically.

# You can add an example to an example selector as well.
new_example = {"input": "big", "output": "small"}
dynamic_prompt.example_selector.add_example(new_example)
print(dynamic_prompt.format(adjective="enthusiastic"))

    Give the antonym of every input
    
    Input: happy
    Output: sad
    
    Input: tall
    Output: short
    
    Input: energetic
    Output: lethargic
    
    Input: sunny
    Output: gloomy
    
    Input: windy
    Output: calm
    
    Input: big
    Output: small
    
    Input: enthusiastic
    Output:

03 MMR Example Selector

The MMR (maximal marginal relevance) example selector picks a set of examples that are similar to the user input while also keeping the chosen set diverse. How does it achieve both? Through two mechanisms (sketched in code right after this list):

  1. Similarity: the examples and the user input are embedded, and cosine similarity is used to prefer the examples closest to the input.

  2. Diversity: when a new example is about to be added, it is penalized if it is very similar to the examples already selected, which keeps the final set diverse.
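
A minimal sketch of the MMR scoring rule in plain numpy (this is only an illustration; the function name mmr_select, the lambda_mult weight, and the cosine helper are invented here and are not LangChain's implementation):

import numpy as np

def mmr_select(query_vec, example_vecs, k=2, lambda_mult=0.5):
    """Illustrative MMR: trade off similarity to the query against
    similarity to the examples that are already selected."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, candidates = [], list(range(len(example_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, example_vecs[i])
            # Penalty: similarity to whatever has already been chosen.
            redundancy = max((cos(example_vecs[i], example_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chosen examples

LangChain packages this logic, together with an embedding model and a vector store, in MaxMarginalRelevanceExampleSelector: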

from langchain.prompts.example_selector import (
    MaxMarginalRelevanceExampleSelector,
    SemanticSimilarityExampleSelector,
)
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import FewShotPromptTemplate, PromptTemplate

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    # The list of examples available to select from.
    examples,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # The number of examples to produce.
    k=2,
)
mmr_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)
# Input is a feeling, so should select the happy/sad example as the first one
print(mmr_prompt.format(adjective="worried"))

Running the code produces:

    Give the antonym of every input
    
    Input: happy
    Output: sad
    
    Input: windy
    Output: calm
    
    Input: worried
    Output:

04 Similarity Example Selector

Unlike the MMR selector, the similarity selector chooses examples based on semantic similarity alone; LangChain provides SemanticSimilarityExampleSelector for this. Continuing with the antonym examples from the MMR section, let's see how the selected examples differ.
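
SemanticSimilarityExampleSelector exposes the same from_examples interface as the MMR selector. A minimal sketch, reusing the examples and example_prompt defined in the MMR section (the variable name similar_prompt is just for illustration):

from langchain.prompts import FewShotPromptTemplate
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# Reuses `examples` and `example_prompt` from the MMR section above.
example_selector = SemanticSimilarityExampleSelector.from_examples(
    # The list of examples available to select from.
    examples,
    # The embedding class used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class used to store the embeddings and run the search.
    FAISS,
    # The number of examples to produce.
    k=2,
)
similar_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)
print(similar_prompt.format(adjective="worried"))

Because this selector optimizes for similarity alone, both selected examples may end up close to the feeling-related input (happy/sad plus another emotionally similar pair), whereas MMR traded some similarity for diversity and chose windy/calm as its second example.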

05 N-gram Overlap Example Selector

The NGramOverlapExampleSelector selects and orders examples by how much n-gram overlap they have with the input; the overlap score is a float between 0.0 and 1.0. A threshold decides which examples are kept: with a negative threshold (the default, -1.0) examples are only reordered, never excluded; with a threshold greater than 1.0 all examples are excluded; with a threshold of 0.0, examples that share no n-grams with the input are dropped. The idea is sketched below, followed by the LangChain usage.
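
A rough illustration of the scoring and the threshold (a simplified sketch only, not LangChain's exact scoring; the helper names here are invented for the example):

def word_ngrams(text, n=2):
    # Lowercase word n-grams of a sentence.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(example_input, user_input, n=2):
    # Fraction of the example's n-grams that also appear in the user input.
    example_grams = word_ngrams(example_input, n)
    if not example_grams:
        return 0.0
    return len(example_grams & word_ngrams(user_input, n)) / len(example_grams)

def select_by_overlap(examples, user_input, threshold=-1.0):
    # Sort by overlap score, then keep only examples scoring above the threshold:
    # a negative threshold keeps everything, a threshold above 1.0 keeps nothing,
    # and 0.0 drops examples with no overlap at all.
    scored = sorted(examples, key=lambda e: overlap_score(e["input"], user_input), reverse=True)
    return [e for e in scored if overlap_score(e["input"], user_input) > threshold]

LangChain's NGramOverlapExampleSelector exposes this cutoff as its threshold parameter: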
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.prompts.example_selector.ngram_overlap import NGramOverlapExampleSelector

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

# Examples of a fictional English-to-Spanish translation task.
examples = [
    {"input": "See Spot run.", "output": "Ver correr a Spot."},
    {"input": "My dog barks.", "output": "Mi perro ladra."},
    {"input": "Spot can run.", "output": "Spot puede correr."},
]
example_selector = NGramOverlapExampleSelector(
    # The examples it has available to choose from.
    examples=examples,
    # The PromptTemplate being used to format the examples.
    example_prompt=example_prompt,
    # The threshold, at which selector stops.
    # It is set to -1.0 by default.
    threshold=-1.0,
    # For negative threshold:
    # Selector sorts examples by ngram overlap score, and excludes none.
    # For threshold greater than 1.0:
    # Selector excludes all examples, and returns an empty list.
    # For threshold equal to 0.0:
    # Selector sorts examples by ngram overlap score,
    # and excludes those with no ngram overlap with input.
)
dynamic_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the Spanish translation of every input",
    suffix="Input: {sentence}\nOutput:",
    input_variables=["sentence"],
)


# An example input with large ngram overlap with "Spot can run."
# and no overlap with "My dog barks."
print(dynamic_prompt.format(sentence="Spot can run fast."))

Running the code produces:

    Give the Spanish translation of every input
    
    Input: Spot can run.
    Output: Spot puede correr.
    
    Input: See Spot run.
    Output: Ver correr a Spot.
    
    Input: My dog barks.
    Output: Mi perro ladra.
    
    Input: Spot can run fast.
    Output:

Because the threshold defaults to -1.0, all the examples are selected and sorted by overlap score. You can also add examples dynamically:

# You can add examples to NGramOverlapExampleSelector as well.
new_example = {"input": "Spot plays fetch.", "output": "Spot juega a buscar."}

example_selector.add_example(new_example)
print(dynamic_prompt.format(sentence="Spot can run fast."))

Setting the threshold to 0.0:

# You can set a threshold at which examples are excluded.
# For example, setting threshold equal to 0.0
# excludes examples with no ngram overlaps with input.
# Since "My dog barks." has no ngram overlaps with "Spot can run fast."
# it is excluded.
example_selector.threshold = 0.0
print(dynamic_prompt.format(sentence="Spot can run fast."))

    Give the Spanish translation of every input
    
    Input: Spot can run.
    Output: Spot puede correr.
    
    Input: See Spot run.
    Output: Ver correr a Spot.
    
    Input: Spot plays fetch.
    Output: Spot juega a buscar.
    
    Input: Spot can run fast.
    Output:

Setting the threshold to 0.09:

# Setting small nonzero threshold
example_selector.threshold = 0.09
print(dynamic_prompt.format(sentence="Spot can play fetch."))

    Give the Spanish translation of every input
    
    Input: Spot can run.
    Output: Spot puede correr.
    
    Input: Spot plays fetch.
    Output: Spot juega a buscar.
    
    Input: Spot can play fetch.
    Output:

Setting the threshold above 1.0:

# Setting threshold greater than 1.0
example_selector.threshold = 1.0 + 1e-9
print(dynamic_prompt.format(sentence="Spot can play fetch."))

    Give the Spanish translation of every input
    
    Input: Spot can play fetch.
    Output:

In that case all examples are excluded.

Summary

This post covered several example selectors, how to use them, and how they differ. Choosing the right selector for the scenario at hand can noticeably improve the quality of few-shot prompts.
