LangChain4j with Elasticsearch as the embedding store

By: David Pilato, Elastic

LangChain4j (LangChain for Java) supports Elasticsearch as an embedding store. Learn how to use it to build a RAG application in plain Java.

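In the previous post, we discovered what LangChain4j is and how to:

Have a discussion with LLMs by implementing x using y and z
Retain chat history in memory to recall the context of a previous discussion with an LLM

This blog post covers how to:

Create vector embeddings from text examples
Store vector embeddings in the Elasticsearch embedding store
Search for similar vectors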

Create embeddings

To create embeddings, we need to define the EmbeddingModel to use. For example, we can use the same mistral model we used in the previous post. It runs with ollama:

EmbeddingModel model = OllamaEmbeddingModel.builder()
  .baseUrl(ollama.getEndpoint())
  .modelName(MODEL_NAME)
  .build();

The model can generate vectors from text. Here we can check the number of dimensions the model generates:

Logger.info("Embedding model has {} dimensions.", model.dimension());
// This gives: Embedding model has 4096 dimensions.

To generate a vector from a text, we can use:

Response<Embedding> response = model.embed("A text here");

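Or, if we also want to provide metadata so that we can filter on things like text, price, or release date, we can use Metadata.from(). For example, here we add the game name as a metadata field: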

TextSegment game1 = TextSegment.from("""
    The game starts off with the main character Guybrush Threepwood stating "I want to be a pirate!"
    To do so, he must prove himself to three old pirate captains. During the perilous pirate trials, 
    he meets the beautiful governor Elaine Marley, with whom he falls in love, unaware that the ghost pirate 
    LeChuck also has his eyes on her. When Elaine is kidnapped, Guybrush procures crew and ship to track 
    LeChuck down, defeat him and rescue his love.
""", Metadata.from("gameName", "The Secret of Monkey Island"));
Response<Embedding> response1 = model.embed(game1);
TextSegment game2 = TextSegment.from("""
    Out Run is a pseudo-3D driving video game in which the player controls a Ferrari Testarossa 
    convertible from a third-person rear perspective. The camera is placed near the ground, simulating 
    a Ferrari driver's position and limiting the player's view into the distance. The road curves, 
    crests, and dips, which increases the challenge by obscuring upcoming obstacles such as traffic 
    that the player must avoid. The object of the game is to reach the finish line against a timer.
    The game world is divided into multiple stages that each end in a checkpoint, and reaching the end 
    of a stage provides more time. Near the end of each stage, the track forks to give the player a 
    choice of routes leading to five final destinations. The destinations represent different 
    difficulty levels and each conclude with their own ending scene, among them the Ferrari breaking 
    down or being presented a trophy.
""", Metadata.from("gameName", "Out Run"));
Response<Embedding> response2 = model.embed(game2);

If you'd like to run this code, check out the Step5EmbedddingsTest.java class.
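
To make the purpose of the metadata more concrete, here is a minimal plain-Java sketch of filtering on a metadata field before any vector comparison takes place. It does not use the LangChain4j API: the Segment class and the filterByMetadata helper are hypothetical names introduced only for illustration, and how a real store applies such a filter is up to that store.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MetadataFilterSketch {

    // Hypothetical stand-in for a text segment: a text plus key/value metadata.
    static final class Segment {
        final String text;
        final Map<String, String> metadata;

        Segment(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Keep only the segments whose metadata contains the given key/value pair.
    static List<Segment> filterByMetadata(List<Segment> segments, String key, String value) {
        return segments.stream()
                .filter(s -> value.equals(s.metadata.get(key)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Segment> games = List.of(
                new Segment("Guybrush Threepwood wants to be a pirate...",
                        Map.of("gameName", "The Secret of Monkey Island")),
                new Segment("Out Run is a pseudo-3D driving video game...",
                        Map.of("gameName", "Out Run")));

        // Only the segment tagged with gameName = "Out Run" is kept.
        for (Segment s : filterByMetadata(games, "gameName", "Out Run")) {
            System.out.println(s.metadata.get("gameName") + ": " + s.text);
        }
    }
}
```

Filtering on metadata first narrows the candidate set, so the more expensive vector comparison only has to consider the segments that can match at all.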

Add Elasticsearch to store our vectors

LangChain4j provides an in-memory embedding store. This is useful for running simple tests:

EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
embeddingStore.add(response1.content(), game1);
embeddingStore.add(response2.content(), game2);
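
A back-of-the-envelope estimate helps to see how quickly such an in-memory store grows. The sketch below assumes that each dimension is stored as a 4-byte float and ignores the text, the metadata and any JVM object overhead; the class name and the document counts are arbitrary examples, not measurements.

```java
class EmbeddingMemoryEstimate {

    // Raw size of the vectors only: documents * dimensions * 4 bytes per float.
    // Assumption: text, metadata and JVM object overhead are not counted.
    static long vectorBytes(long documents, int dimensions) {
        return documents * dimensions * Float.BYTES;
    }

    public static void main(String[] args) {
        int dimensions = 4096; // dimension reported by the model earlier in this post
        for (long docs : new long[] {1_000L, 1_000_000L, 100_000_000L}) {
            double gib = vectorBytes(docs, dimensions) / (1024.0 * 1024.0 * 1024.0);
            System.out.printf("%,d documents -> %.2f GiB of raw vectors%n", docs, gib);
        }
    }
}
```

Under these assumptions, one million documents with 4096 dimensions already need roughly 15 GiB for the vectors alone.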

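But obviously, this does not work for larger datasets, because this datastore keeps everything in memory and we don't have infinite memory on our servers. Instead, we can store our embeddings in Elasticsearch, which is by definition "elastic" and can scale up and out with your data. For that, let's add Elasticsearch to our project: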

<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-elasticsearch</artifactId>
  <version>${langchain4j.version}</version>
</dependency>

<dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>1.20.1</version>
  <scope>test</scope>
</dependency>

As you may have noticed, we also added the Elasticsearch TestContainers module to the project, so that we can start an Elasticsearch instance from our tests:

// Create the elasticsearch container
ElasticsearchContainer container =
  new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:8.15.0")
    .withPassword("changeme");

// Start the container. This step might take some time...
container.start();

// As we don't want to make our TestContainers code more complex than
// needed, we will use login / password for authentication.
// But note that you can also use API keys which is preferred.
final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("elastic", "changeme"));

// Create a low level Rest client which connects to the elasticsearch container.
RestClient client = RestClient.builder(HttpHost.create("https://" + container.getHttpHostAddress()))
  .setHttpClientConfigCallback(httpClientBuilder -> {
    httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
    httpClientBuilder.setSSLContext(container.createSslContextFromCa());
    return httpClientBuilder;
  })
  .build();

// Check the cluster is running
client.performRequest(new Request("GET", "/"));

To use Elasticsearch as an embedding store, you "just" have to switch from the LangChain4j in-memory datastore to the Elasticsearch datastore:

EmbeddingStore<TextSegment> embeddingStore =
  ElasticsearchEmbeddingStore.builder()
    .restClient(client)
    .build();
embeddingStore.add(response1.content(), game1);
embeddingStore.add(response2.content(), game2);

This stores your vectors in Elasticsearch in a default index. You can also change the index name to something more meaningful:

EmbeddingStore<TextSegment> embeddingStore =
  ElasticsearchEmbeddingStore.builder()
    .indexName("games")
    .restClient(client)
    .build();
embeddingStore.add(response1.content(), game1);
embeddingStore.add(response2.content(), game2);

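If you'd like to run this code, check out the Step6ElasticsearchEmbedddingsTest.java class.

Search for similar vectors

To search for similar vectors, we first need to transform our question into a vector representation, using the same model we used before. We already did that, so it's not hard to do it again. Note that we don't need metadata in this case: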

String question = "I want to pilot a car";
Embedding questionAsVector = model.embed(question).content();

We can build a search request with this representation of the question and ask the embedding store to find the closest vectors:

EmbeddingSearchResult<TextSegment> result = embeddingStore.search(
  EmbeddingSearchRequest.builder()
    .queryEmbedding(questionAsVector)
    .build());

We can now iterate over the results and print some information, such as the game name from the metadata and the score:

result.matches().forEach(m -> Logger.info("{} - score [{}]",
  m.embedded().metadata().getString("gameName"), m.score()));

As we could expect, the first hit is "Out Run":

Out Run - score [0.86672974]
The Secret of Monkey Island - score [0.85569763]

If you'd like to run this code, check out the Step7SearchForVectorsTest.java class.
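
Conceptually, a similarity search compares the question vector with the stored vectors and sorts the results by score. The following plain-Java sketch illustrates that idea with tiny, made-up 3-dimensional vectors. It is not how LangChain4j or Elasticsearch implement the search, and the class name, method names and vector values are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SimilaritySearchSketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Return the stored names ordered from most to least similar to the query.
    static List<String> rank(Map<String, float[]> store, float[] query) {
        List<String> names = new ArrayList<>(store.keySet());
        names.sort(Comparator
                .comparingDouble((String name) -> cosine(store.get(name), query))
                .reversed());
        return names;
    }

    public static void main(String[] args) {
        // Made-up vectors; real embeddings have thousands of dimensions.
        Map<String, float[]> store = new LinkedHashMap<>();
        store.put("The Secret of Monkey Island", new float[] {0.9f, 0.1f, 0.2f});
        store.put("Out Run", new float[] {0.1f, 0.9f, 0.3f});

        float[] question = {0.2f, 0.8f, 0.1f};
        rank(store, question).forEach(System.out::println);
    }
}
```

Comparing the query against every stored vector like this is exact but costs time proportional to the number of vectors, which is the trade-off that approximate search techniques try to avoid.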

Behind the scenes

The default configuration of the Elasticsearch embedding store uses the approximate kNN query behind the scenes.

POST games/_search
{
  "query" : {
    "knn": {
      "field": "vector",
      "query_vector": [-0.019137882, /* ... */, -0.0148779955]
    }
  }
}

But this can be changed by providing the embedding store with a configuration (ElasticsearchConfigurationScript) other than the default one (ElasticsearchConfigurationKnn):

EmbeddingStore<TextSegment> embeddingStore =
  ElasticsearchEmbeddingStore.builder()
    .configuration(ElasticsearchConfigurationScript.builder().build())
    .indexName("games")
    .restClient(client)
    .build();

Behind the scenes, the ElasticsearchConfigurationScript implementation runs a script_score query using the cosineSimilarity function.

Basically, when calling:

EmbeddingSearchResult<TextSegment> result = embeddingStore.search(
  EmbeddingSearchRequest.builder()
    .queryEmbedding(questionAsVector)
    .build());

This now calls:

POST games/_search
{
  "query": {
    "script_score": {
      "script": {
        "source": "(cosineSimilarity(params.query_vector, 'vector') + 1.0) / 2",
        "params": {
          "query_vector": [-0.019137882, /* ... */, -0.0148779955]
        }
      }
    }
  }
}
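
The script source above rescales the cosine similarity, which lies in [-1, 1], into a score in [0, 1]. Here is a small plain-Java sketch of that arithmetic; the class and method names are hypothetical, and only the formula is taken from the query above.

```java
class ScriptScoreSketch {

    // Same arithmetic as the script source: (cosineSimilarity + 1.0) / 2.
    static double toScore(double cosineSimilarity) {
        return (cosineSimilarity + 1.0) / 2;
    }

    public static void main(String[] args) {
        // Opposite, orthogonal and identical directions.
        for (double cosine : new double[] {-1.0, 0.0, 1.0}) {
            System.out.println(cosine + " -> " + toScore(cosine));
        }
    }
}
```

Because this mapping is strictly increasing, it changes the score values but never the relative order of two vectors.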

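In this case, the results do not change in terms of "order"; only the scores are adjusted, because the cosineSimilarity call does not use any approximation and instead computes the cosine for each of the matching vectors: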

Out Run - score [0.871952]
The Secret of Monkey Island - score [0.86380446]

If you'd like to run this code, check out the Step7SearchForVectorsTest.java class.

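Conclusion

We have covered how easily you can generate embeddings from your text, and how you can store them in Elasticsearch and search for the nearest neighbours using two different approaches:

The approximate and fast knn query, with the default ElasticsearchConfigurationKnn option
The exact but slower script_score query, with the ElasticsearchConfigurationScript option

The next step will be to build a full RAG application based on what we learned here.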

Original article: LangChain4j with Elasticsearch as the embedding store (Search Labs)