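Notes on Using Elasticsearch Attachment Pipelines

Background: I wanted to build a full-text search feature, mainly so that the contents of file attachments could be searched.

These notes summarize what I learned along the way, so that I do not have to relearn it later.

The starting point is the official documentation, Elasticsearch: The Official Distributed Search & Analytics Engine.

1. Get familiar with the basic APIs

Start with the basics: creating an index, configuring mappings, indexing documents, and running simple searches.

2. Understand ingest pipelines and use them to process attachments

Single-attachment pipeline
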
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content",
        "ignore_missing": true
      }
    }
  ]
}

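This one is easy to follow: the attachment processor decodes the Base64-encoded content field, extracts the file's text and metadata, and writes the result to the attachment field (its default target). The remove processor then drops the source content field so the raw Base64 is not stored.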
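
The attachment processor expects the content field to hold the file's raw bytes encoded as Base64. A minimal Java sketch of that encoding step follows. The class and method names (AttachmentContent, encode, decodeAsText) are made up for illustration, and only the JDK's java.util.Base64 is used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class AttachmentContent {
    // Encode raw file bytes into the Base64 string that goes into the `content` field.
    static String encode(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    // Decode a Base64 `content` value back to UTF-8 text.
    // Only meaningful when the original file was plain text.
    static String decodeAsText(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }
}
```

For a file on disk, the bytes could come from Files.readAllBytes(path). Decoding the sample value aGFoYeS9oOWlvWhlbGxvd29ybGQ= used in the example request in this article yields the plain text haha你好helloworld.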

Multi-attachment pipeline

PUT _ingest/pipeline/test_attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.content",
            "target_field": "_ingest._value.attachment"
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": {
            "field": [
              "_ingest._value.content"
            ]
          }
        }
      }
    }
  ]
}

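After indexing a document that carries several attachments, the document looks as shown in the figure.

The first foreach in this pipeline runs the attachment processor over every element of attachments: each element's content is parsed, and the extracted file information is written to that element's own attachment object (attachments -> attachment). The second foreach then removes the Base64 content field from each element.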
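
To make the _ingest._value references concrete, here is a plain-Java simulation of how the two foreach passes reshape a document. It never talks to Elasticsearch: ForeachSimulation and the extract function are hypothetical stand-ins, and the real attachment processor also emits metadata such as content_type and language.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ForeachSimulation {
    // Mimics the two foreach passes on an in-memory copy of the document.
    // `extract` is a hypothetical stand-in for the attachment processor's text extraction.
    // Sketch only: assumes the document has an `attachments` list of objects.
    static void apply(Map<String, Object> doc, Function<String, String> extract) {
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> items = (List<Map<String, Object>>) doc.get("attachments");

        // Pass 1: each list element plays the role of `_ingest._value`;
        // the extraction result is stored under that element's `attachment` key.
        for (Map<String, Object> value : items) {
            Map<String, Object> attachment = new HashMap<>();
            attachment.put("content", extract.apply((String) value.get("content")));
            value.put("attachment", attachment);
        }

        // Pass 2: drop the Base64 source from each element.
        for (Map<String, Object> value : items) {
            value.remove("content");
        }
    }
}
```

After apply returns, every element of attachments holds only an attachment object, which matches the shape of the stored document shown later in this article.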

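Handling single and multiple attachments together

Merging the two pipelines lets a single pipeline handle both the single attachment and the attachment list. The merged pipeline looks like this:
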
PUT _ingest/pipeline/test_attachment
{
  "description": "Extract attachment information",
  "processors": [
    
    {
      "attachment": {
        "field": "content",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content",
        "ignore_missing": true
      }
    },
    
    {
      "foreach": {
        "field": "attachments",
        "ignore_missing": true,
        "processor": {
          "attachment": {
            "field": "_ingest._value.content",
            "target_field": "_ingest._value.attachment"
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "ignore_missing": true,
        "processor": {
          "remove": {
            "field": [
              "_ingest._value.content"
            ]
          }
        }
      }
    }
  ]
}

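Example:

When indexing a document, remember to specify the pipeline in the request URL:

<index name>/_doc?pipeline=<pipeline name>

Send three attachments: content carries attachment 1, and attachments carries the list of remaining attachments (attachment 2 and attachment 3).
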
POST test_create_index/_doc?pipeline=test_attachment
{
  "content": "eyJzdWNjZXNzIjpmYWxzZSwiY29kZSI6MCwibXNnIjoi5LiL6L295aSx6LSlIn0=",
  "attachments": [
    {
      "content": "aGFoYeS9oOWlvWhlbGxvd29ybGQ="
    },
    {
      "content": "aGFoYeS9oOWlvWhlbGxvd29ybGQ="
    }
  ]
}
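
From Java, the same request can be built with the JDK 11+ java.net.http API. This is only a sketch: the class name PipelineIndexRequest is made up, and the base URL in the usage note (http://localhost:9200, no authentication) is an assumption about the cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class PipelineIndexRequest {
    // Builds (but does not send) the equivalent of
    //   POST <index>/_doc?pipeline=<pipeline>
    // with the given JSON document as the request body.
    static HttpRequest build(String baseUrl, String index, String pipeline, String jsonBody) {
        URI uri = URI.create(baseUrl + "/" + index + "/_doc?pipeline=" + pipeline);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

A request built with build("http://localhost:9200", "test_create_index", "test_attachment", json) could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).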

The document as finally stored:

{
  "_index" : "test_create_index",
  "_type" : "_doc",
  "_id" : "-H0eqowB6o1OPMJBsiNg",
  "_version" : 1,
  "_seq_no" : 46,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "attachments" : [
      {
        "attachment" : {
          "content_type" : "text/plain; charset=UTF-8",
          "language" : "hu",
          "content" : "Document 2",
          "content_length" : 17
        }
      },
      {
        "attachment" : {
          "content_type" : "text/plain; charset=UTF-8",
          "language" : "et",
          "content" : "Document 3",
          "content_length" : 607
        }
      }
    ],
    "attachment" : {
      "content_type" : "text/plain; charset=UTF-8",
      "language" : "gl",
      "content" : "Document 1",
      "content_length" : 40
    }
  }
}

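3. Summary

When processing data, each index can be given its own dedicated pipeline to handle its attachment information.

This also means the data has to be prepared before a document is built.

Suppose we have an article object, Article:

import java.io.File;
import java.util.List;

class Article {
    String id;
    String name;
    File img;               // cover image
    List<File> attachments; // attachments
}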

The data we receive looks like this:

 [
     {
         "id":"1",
         "name":"Today's Headlines",
         "img": "<file>",
         "attachments": [
             {
                 "fileName":"Headline 1",
                 "InputStream":""
             },
             {
                 "fileName":"Headline 2",
                 "InputStream":""
             }
         ]
     },
     {
         "id":"2",
         "name":"Next Issue Preview",
         "img": "<file>",
         "attachments": [
             {
                 "fileName":"Preview 1",
                 "InputStream":""
             },
             {
                 "fileName":"Preview 2",
                 "InputStream":""
             }
         ]
     }
 ]
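
A sketch of that preparation step: turning one article's files into the request body the merged pipeline expects. Several things here are assumptions rather than the original design: the cover image is sent as the top-level content field, files are passed in as byte arrays (for a File they could be read with Files.readAllBytes(file.toPath())), and ArticleDocs, quote, and toIndexBody are hypothetical names. JSON is assembled by hand only to keep the example dependency-free; a JSON library would be the safer choice in real code.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

class ArticleDocs {
    // Minimal JSON string quoting: escapes backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the JSON body for one article:
    //   content      -> Base64 of the single file (assumed here to be the cover)
    //   attachments  -> list of { "content": <Base64> } objects
    static String toIndexBody(String id, String name, byte[] cover, List<byte[]> attachments) {
        Base64.Encoder b64 = Base64.getEncoder();
        List<String> items = new ArrayList<>();
        for (byte[] bytes : attachments) {
            items.add("{\"content\":" + quote(b64.encodeToString(bytes)) + "}");
        }
        return "{\"id\":" + quote(id)
                + ",\"name\":" + quote(name)
                + ",\"content\":" + quote(b64.encodeToString(cover))
                + ",\"attachments\":[" + String.join(",", items) + "]}";
    }
}
```

The resulting string has the same shape as the example request earlier in this article, with id and name added alongside the attachment fields.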

The mapping we build:

 "mappings": {
     "properties": {
         "attachments": { // mapping for multiple attachments
             "properties": {
                 "attachment": {
                     "properties": {
                         "content_type": {
                             "type": "text",
                             "fields": {
                                 "keyword": {
                                     "ignore_above": 256,
                                     "type": "keyword"
                                 }
                             }
                         },
                         "language": {
                             "type": "text",
                             "fields": {
                                 "keyword": {
                                     "ignore_above": 256,
                                     "type": "keyword"
                                 }
                             }
                         },
                         "content": {
                             "type": "text",
                             "fields": {
                                 "keyword": {
                                     "ignore_above": 256,
                                     "type": "keyword"
                                 }
                             }
                         },
                         "content_length": {
                             "type": "long"
                         }
                     }
                 }
             }
         },
         "attachment": { // mapping for a single attachment
             "properties": {
                 "date": {
                     "type": "date"
                 },
                 "content_type": {
                     "type": "text",
                     "fields": {
                         "keyword": {
                             "ignore_above": 256,
                             "type": "keyword"
                         }
                     }
                 },
                 "language": {
                     "type": "text",
                     "fields": {
                         "keyword": {
                             "ignore_above": 256,
                             "type": "keyword"
                         }
                     }
                 },
                 "content": {
                     "type": "text"
                 },
                 "content_length": {
                     "type": "long"
                 }
             }
         },
         "createTime": {
             "type": "text"
         },
         "name": {
             "type": "text"
         },
         "id": {
             "type": "keyword"
         },
         "type": {
             "type": "keyword"
         },
         "content": {
             "type": "text",
             "fields": {
                 "keyword": {
                     "ignore_above": 256,
                     "type": "keyword"
                 }
             }
         }
     }
 }
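
Since the original goal was full-text search over attachment text, here is a sketch of a search body that queries both extracted text fields from the mapping above (attachment.content and attachments.attachment.content) with a multi_match query. The class and method names are hypothetical, and the body would be sent as a POST to <index name>/_search.

```java
class AttachmentSearch {
    // Builds a multi_match query over the single-attachment text
    // and the text of every element in the attachment list.
    static String queryBody(String keywords) {
        // Escape backslashes and double quotes so the keywords stay a valid JSON string.
        String escaped = keywords.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\":{\"multi_match\":{\"query\":\"" + escaped + "\","
                + "\"fields\":[\"attachment.content\",\"attachments.attachment.content\"]}}}";
    }
}
```

Because attachments is mapped as a plain object rather than a nested type, a hit only says that some attachment in the document matched, not which one.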