Elasticsearch: How to use an LLM to extract the information you need at ingest time

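In many scenarios, we can use an LLM to extract the structured data we need. This structured data can be a classification, a list of synonyms, and so on. In my earlier article "How to automate synonyms and upload them using our Synonyms API", we showed how to generate synonyms with an LLM and upload them to Elasticsearch. In today's example, we move the LLM extraction step into an ingest pipeline, so that the required information is extracted automatically as documents are ingested.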

Create an LLM chat completion endpoint

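For background, see the earlier article "Elasticsearch: A demo of inference endpoints and semantic search". We can create a chat completion endpoint (task type completion) as follows: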

PUT _inference/completion/azure_openai_completion
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "${AZURE_API_KEY}",
        "resource_name": "${AZURE_RESOURCE_NAME}",
        "deployment_id": "${AZURE_DEPLOYMENT_ID}",
        "api_version": "${AZURE_API_VERSION}"
    }
}

Create an ingest pipeline

Before creating the pipeline, we can test it with the _simulate API. The test request uses an EXTRACTION_PROMPT variable, which we define as follows:

Extract audio product information from this description.  Return raw JSON only. Do NOT use markdown, backticks, or code blocks. Fields: category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories),  features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound),  use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio).  Description:

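If you are not yet familiar with defining such variables, see my earlier article "Kibana: How to set variables and apply them". With the variable in place, we can simulate the pipeline: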

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Use an LLM to extract the category, features, and use case from the product description",
    "processors": [
      {
        "script": {
          "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
          "params": {
            "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
          }
        }
      },
      {
        "inference": {
          "model_id": "azure_openai_completion",
          "input_output": {
            "input_field": "prompt",
            "output_field": "ai_response"
          }
        }
      },
      {
        "json": {
          "field": "ai_response",
          "add_to_root": true
        }
      },
      {
        "remove": {
          "field": [
            "prompt",
            "ai_response"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
      }
    }
  ]
}

Tip: You can use any LLM you like to create the endpoint above.
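To make the data flow through the four processor types easier to follow, here is a minimal Python sketch that mimics them on a single document. It is not Elasticsearch code: `run_pipeline` and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in that returns a canned reply instead of calling the completion endpoint.

```python
import json
from typing import Callable


def run_pipeline(doc: dict, extraction_prompt: str, complete: Callable[[str], str]) -> dict:
    """Mimic the processors of the ingest pipeline on a single document."""
    ctx = dict(doc)  # work on a copy, similar to the per-document ctx

    # script processor: build the prompt from the instructions plus the description
    ctx["prompt"] = extraction_prompt + ctx["description"]

    # inference processor: send `prompt` to the model, store the reply in `ai_response`
    # (the real processor also adds a `model_id` field, omitted here)
    ctx["ai_response"] = complete(ctx["prompt"])

    # json processor with add_to_root: parse the reply and merge its keys into the document
    ctx.update(json.loads(ctx["ai_response"]))

    # remove processor: drop the temporary fields
    for field in ("prompt", "ai_response"):
        del ctx[field]
    return ctx


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the completion endpoint; returns a canned reply."""
    return '{"category": "Headphones", "features": ["wireless"], "use_case": "Travel"}'


doc = {"name": "Wireless Noise-Canceling Headphones",
       "description": "Premium wireless Bluetooth headphones.",
       "price": 299.99}
enriched = run_pipeline(doc, "Extract audio product information. Description: ", fake_llm)
print(sorted(enriched))
```

The sketch also shows why the prompt insists on raw JSON: the parsing step fails as soon as the reply is not a plain JSON object.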

The _simulate request returns the following result:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        },
        "_ingest": {
          "timestamp": "2026-01-22T13:56:11.926494Z"
        }
      }
    }
  ]
}
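The prompt restricts each field to a fixed list of values, but a model may still drift from them. As a spot check outside Elasticsearch, a small Python sketch like the one below can compare a raw model reply with those lists. The function name `validate_extraction` is hypothetical; the allowed values are copied from the prompt.

```python
import json

# Allowed values copied from the extraction prompt.
ALLOWED_CATEGORIES = ("Headphones", "Earbuds", "Speakers", "Microphones", "Accessories")
ALLOWED_FEATURES = ("wireless", "noise_cancellation", "long_battery", "waterproof",
                    "voice_assistant", "fast_charging", "portable", "surround_sound")
ALLOWED_USE_CASES = ("Travel", "Office", "Home", "Fitness", "Gaming", "Studio")


def validate_extraction(raw: str) -> list[str]:
    """Return a list of problems found in the model's raw reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if data.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {data.get('category')!r}")
    features = data.get("features")
    if not isinstance(features, list):
        problems.append("features is not an array")
    else:
        unknown = [f for f in features if f not in ALLOWED_FEATURES]
        if unknown:
            problems.append(f"unexpected features: {unknown}")
    if data.get("use_case") not in ALLOWED_USE_CASES:
        problems.append(f"unexpected use_case: {data.get('use_case')!r}")
    return problems


reply = '{"category": "Headphones", "features": ["wireless", "long_battery"], "use_case": "Travel"}'
print(validate_extraction(reply))  # prints []
```

A reply wrapped in markdown code fences is reported as invalid JSON, which is the case the prompt tries to prevent.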

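The test succeeded: the category, features, and use_case fields were extracted as expected. We can now go on to create the pipeline: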

PUT _ingest/pipeline/product-enrichment-pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [
          "prompt",
          "ai_response"
        ]
      }
    }
  ]
}

Create the index and write data

Next, we create an index called products:

PUT products
{
  "settings": {
    "default_pipeline": "product-enrichment-pipeline"
  }
}

As shown above, we set default_pipeline, that is, the index's default pipeline, to product-enrichment-pipeline. As a result, the pipeline is invoked automatically whenever we index data in the usual way:

POST _bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.", "price": 299.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.", "price": 149.99 }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Studio Condenser Microphone", "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.", "price": 199.99 }

Note: Depending on the speed of the LLM, the call above may take a little while to complete!
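When the products come from a program rather than from hand-written requests, the same _bulk body can be generated. The following Python sketch builds the NDJSON text only and does not send it; `build_bulk_body` is a hypothetical helper, and the sequential IDs starting at 1 are an assumption that mirrors the example above.

```python
import json


def build_bulk_body(index: str, products: list[dict]) -> str:
    """Build an NDJSON body for the _bulk API: one action line plus one source line per product."""
    lines = []
    for doc_id, product in enumerate(products, start=1):
        # action line: index the document into `index` under a sequential ID
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # source line: the document itself; the default pipeline adds the extracted fields
        lines.append(json.dumps(product, ensure_ascii=False))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


products = [
    {"name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker.", "price": 149.99},
    {"name": "Studio Condenser Microphone", "description": "Professional USB microphone.", "price": 199.99},
]
print(build_bulk_body("products", products), end="")
```

Because the index has a default pipeline, the body needs no pipeline parameter.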

With the data written, we can inspect the documents with the following request:

GET products/_search?filter_path=**.hits

The response looks like this:

{
  "hits": {
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        }
      },
      {
        "_index": "products",
        "_id": "2",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "waterproof",
            "surround_sound"
          ],
          "price": 149.99,
          "name": "Portable Bluetooth Speaker",
          "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
          "model_id": "azure_openai_completion",
          "category": "Speakers"
        }
      },
      {
        "_index": "products",
        "_id": "3",
        "_score": 1,
        "_source": {
          "use_case": "Studio",
          "features": [
            "noise_cancellation",
            "voice_assistant"
          ],
          "price": 199.99,
          "name": "Studio Condenser Microphone",
          "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
          "model_id": "azure_openai_completion",
          "category": "Microphones"
        }
      }
    ]
  }
}

With structured data like this, we can now search and run aggregations on our data.
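As one possible example, the Python sketch below builds a search body that filters by an extracted feature and counts the matches per category and use case. It only constructs the body, which could then be sent with GET products/_search. The function name `build_feature_query` and the aggregation names are hypothetical, and the `.keyword` sub-fields assume the default dynamic mapping, since the index was created without explicit mappings.

```python
import json


def build_feature_query(feature: str, size: int = 10) -> dict:
    """Search body: filter products by one extracted feature, then count them per category and use case."""
    return {
        "size": size,
        # exact match on the extracted feature value
        "query": {"bool": {"filter": [{"term": {"features.keyword": feature}}]}},
        "aggs": {
            "by_category": {"terms": {"field": "category.keyword"}},
            "by_use_case": {"terms": {"field": "use_case.keyword"}},
        },
    }


print(json.dumps(build_feature_query("noise_cancellation"), indent=2))
```

With the three sample documents, a filter on noise_cancellation would be expected to match the headphones and the microphone, since both list that feature.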

Happy learning!
