Elasticsearch: How to use an LLM to extract the information you need at ingest time

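In many scenarios, an LLM can help us extract the structured data we need, such as classifications or synonyms. In my previous article "How to automate synonyms and upload them using our Synonyms API", we showed how to use an LLM to generate synonyms and upload them to Elasticsearch. In today's example, we move the LLM extraction step into an ingest pipeline, so that the required information is extracted automatically as documents are ingested.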

Creating an LLM chat completion endpoint

We can refer to the earlier article "Elasticsearch: Inference endpoints and a semantic search demo" and create an endpoint with the completion task type as follows:

PUT _inference/completion/azure_openai_completion
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "${AZURE_API_KEY}",
        "resource_name": "${AZURE_RESOURCE_NAME}",
        "deployment_id": "${AZURE_DEPLOYMENT_ID}",
        "api_version": "${AZURE_API_VERSION}"
    }
}

Creating an ingest pipeline

Before creating the pipeline, we can test it with the simulate API. The request uses an EXTRACTION_PROMPT variable, which holds the following prompt:

Extract audio product information from this description.  Return raw JSON only. Do NOT use markdown, backticks, or code blocks. Fields: category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories),  features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound),  use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio).  Description:

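If you are not sure how to define this variable, see my earlier article "Kibana: How to set variables and use them". With the variable defined, we can simulate the pipeline: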

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Use an LLM to extract the category, features, and use case from a product description",
    "processors": [
      {
        "script": {
          "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
          "params": {
            "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
          }
        }
      },
      {
        "inference": {
          "model_id": "azure_openai_completion",
          "input_output": {
            "input_field": "prompt",
            "output_field": "ai_response"
          }
        }
      },
      {
        "json": {
          "field": "ai_response",
          "add_to_root": true
        }
      },
      {
        "remove": {
          "field": [
            "prompt",
            "ai_response"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
      }
    }
  ]
}

Tip: you can use any LLM you like to create the endpoint above.

Running the simulate request above returns:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        },
        "_ingest": {
          "timestamp": "2026-01-22T13:56:11.926494Z"
        }
      }
    }
  ]
}
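
To make the data flow easier to follow, here is a minimal Python sketch of what the processors do to a single document. It is only an illustration under stated assumptions, not how Elasticsearch implements them: fake_llm is a hypothetical stub that stands in for the completion endpoint, the prompt text is shortened, and the model_id field that the real inference processor adds is left out.

```python
import json

# Hypothetical, shortened stand-in for the prompt stored in EXTRACTION_PROMPT.
EXTRACTION_PROMPT = "Extract audio product information from this description. Description: "


def fake_llm(prompt: str) -> str:
    """Hypothetical stub for the completion endpoint: returns a fixed JSON string."""
    return json.dumps({
        "category": "Headphones",
        "features": ["wireless", "noise_cancellation", "long_battery"],
        "use_case": "Travel",
    })


def enrich(doc: dict, llm=fake_llm) -> dict:
    """Apply the same steps as the pipeline to a copy of the document."""
    ctx = dict(doc)
    # script processor: build the prompt from the description
    ctx["prompt"] = EXTRACTION_PROMPT + ctx["description"]
    # inference processor: store the raw model reply as a string
    ctx["ai_response"] = llm(ctx["prompt"])
    # json processor with add_to_root: merge the parsed keys into the document root
    ctx.update(json.loads(ctx["ai_response"]))
    # remove processor: drop the temporary fields
    for field in ("prompt", "ai_response"):
        del ctx[field]
    return ctx


doc = {
    "name": "Wireless Noise-Canceling Headphones",
    "description": "Wireless headphones with active noise cancellation.",
    "price": 299.99,
}
enriched = enrich(doc)
print(json.dumps(enriched, indent=2))
```

The original fields survive untouched, the extracted fields land next to them at the top level, and the two temporary fields are gone, which matches the shape of the simulate response.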

The simulation works as expected, so we can go on to create the pipeline:

PUT _ingest/pipeline/product-enrichment-pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [
          "prompt",
          "ai_response"
        ]
      }
    }
  ]
}
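
The pipeline writes whatever JSON the model returns into the document root, and the prompt alone cannot guarantee that the reply is valid JSON or stays within the listed values. The following Python sketch shows one way such a reply could be checked outside Elasticsearch, for example while tuning the prompt. The function name validate_extraction and the error messages are my own assumptions; the allowed values are copied from the prompt above.

```python
import json

# Allowed values, copied from the extraction prompt.
ALLOWED_CATEGORIES = {"Headphones", "Earbuds", "Speakers", "Microphones", "Accessories"}
ALLOWED_FEATURES = {
    "wireless", "noise_cancellation", "long_battery", "waterproof",
    "voice_assistant", "fast_charging", "portable", "surround_sound",
}
ALLOWED_USE_CASES = {"Travel", "Office", "Home", "Fitness", "Gaming", "Studio"}


def validate_extraction(raw: str) -> list:
    """Return a list of problems found in a raw model reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {category!r}")

    features = data.get("features")
    if not isinstance(features, list):
        problems.append("features is not an array")
    else:
        unknown = [f for f in features
                   if not isinstance(f, str) or f not in ALLOWED_FEATURES]
        if unknown:
            problems.append(f"unexpected features: {unknown!r}")

    use_case = data.get("use_case")
    if not isinstance(use_case, str) or use_case not in ALLOWED_USE_CASES:
        problems.append(f"unexpected use_case: {use_case!r}")
    return problems


good = '{"category": "Headphones", "features": ["wireless"], "use_case": "Travel"}'
bad = '{"category": "Laptops", "features": ["wireless", "rgb"], "use_case": "Travel"}'
fence = "`" * 3  # a reply wrapped in a markdown code fence, which the prompt forbids
fenced = fence + "json\n" + good + "\n" + fence

for reply in (good, bad, fenced):
    print(validate_extraction(reply))
```

A reply wrapped in a markdown code fence fails at the JSON parsing step, which is why the prompt explicitly asks for raw JSON only.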

Creating the index and writing data

Next, we create an index called products:

PUT products
{
  "settings": {
    "default_pipeline": "product-enrichment-pipeline"
  }
}

As shown above, we set default_pipeline, the index's default pipeline, to product-enrichment-pipeline. The pipeline then runs automatically whenever we index documents in the usual way:

POST _bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.", "price": 299.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.", "price": 149.99 }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Studio Condenser Microphone", "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.", "price": 199.99 }

Note: depending on how fast the model responds, the request above may take a while to complete.
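
If the products come from a program rather than from Kibana, the bulk body has to be assembled as newline-delimited JSON: one action line, then one source line per document, and a newline at the very end. Here is a minimal Python sketch of that step. build_bulk_body is a hypothetical helper of my own, and it only builds the string; sending it to the _bulk endpoint is left out.

```python
import json


def build_bulk_body(index: str, docs: dict) -> str:
    """Build an NDJSON bulk body from a mapping of document id to source."""
    lines = []
    for doc_id, source in docs.items():
        # action line: which index and id the next line belongs to
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # source line: the document itself, on a single line
        lines.append(json.dumps(source))
    # the bulk API expects the body to end with a newline
    return "\n".join(lines) + "\n"


products = {
    "1": {"name": "Wireless Noise-Canceling Headphones", "price": 299.99},
    "2": {"name": "Portable Bluetooth Speaker", "price": 149.99},
}
print(build_bulk_body("products", products))
```

Because default_pipeline is set on the index, no pipeline parameter is needed in the action lines.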

With the data written, we can inspect it with the following command:

GET products/_search?filter_path=**.hits

{
  "hits": {
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        }
      },
      {
        "_index": "products",
        "_id": "2",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "waterproof",
            "surround_sound"
          ],
          "price": 149.99,
          "name": "Portable Bluetooth Speaker",
          "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
          "model_id": "azure_openai_completion",
          "category": "Speakers"
        }
      },
      {
        "_index": "products",
        "_id": "3",
        "_score": 1,
        "_source": {
          "use_case": "Studio",
          "features": [
            "noise_cancellation",
            "voice_assistant"
          ],
          "price": 199.99,
          "name": "Studio Condenser Microphone",
          "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
          "model_id": "azure_openai_completion",
          "category": "Microphones"
        }
      }
    ]
  }
}

With structured data like this, we can now search and aggregate over our data.
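
As one possible example, the Python sketch below builds a search body that keeps only products with a given feature, counts them per category, and averages their price. It assumes the index relies on default dynamic mapping, as it does here because products was created without explicit mappings, so the extracted strings get a .keyword subfield. build_feature_report_query and the aggregation names are my own.

```python
import json


def build_feature_report_query(feature: str) -> dict:
    """Build a search body that filters by one feature and aggregates the matches."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "query": {"term": {"features.keyword": feature}},
        "aggs": {
            # number of matching products per extracted category
            "by_category": {"terms": {"field": "category.keyword"}},
            # average price of the matching products
            "avg_price": {"avg": {"field": "price"}},
        },
    }


query = build_feature_report_query("noise_cancellation")
print(json.dumps(query, indent=2))
```

The printed JSON can be pasted as the body of a GET products/_search request in Kibana.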

Happy learning!
