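Background: I wanted to build a full-text search feature, mainly so that the contents of attachments can be searched.
This post summarizes what I learned along the way so that I do not have to relearn it later.
I started by reading the documentation: Elasticsearch: The Official Distributed Search & Analytics Engine
1. Get familiar with the basic APIs
Start with the basics: creating an index, configuring mappings, writing documents, and running simple searches.
2. Learn about ingest pipelines and use them to process attachments
Single-attachment pipeline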
http
PUT _ingest/pipeline/attachment
{
"description": "Extract attachment information",
"processors": [
{
"attachment": {
"field": "content",
"ignore_missing": true
}
},
{
"remove": {
"field": "content",
"ignore_missing": true
}
}
]
}
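This is straightforward: the attachment processor parses the base64-encoded content field and writes the extracted information to the attachment field (the default target_field); the remove processor then deletes the original content field.
Multi-attachment pipeline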
http
PUT _ingest/pipeline/test_attachment
{
"description": "Extract attachment information",
"processors": [
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"field": "_ingest._value.content",
"target_field": "_ingest._value.attachment"
}
}
}
},
{
"foreach": {
"field": "attachments",
"processor": {
"remove": {
"field": [
"_ingest._value.content"
]
}
}
}
}
]
}
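After adding multiple attachments, the document looks as shown in the figure.
The first foreach in this pipeline runs the attachment processor on every element of attachments: it parses that element's content and writes the extracted file information to the element's own attachment field (that is, attachments[].attachment). The second foreach then removes the original content from every element.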
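To make the role of `_ingest._value` more concrete, here is a plain-Java model of the per-element data flow. It is not Elasticsearch code: the class name `ForeachModel` is hypothetical, and decoding the bytes as UTF-8 text is only a stand-in for the real text extraction done by the attachment processor.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ForeachModel {

    // Models both foreach steps: for each element of "attachments",
    // write extracted info under "attachment", then drop the raw "content".
    @SuppressWarnings("unchecked")
    static void apply(Map<String, Object> doc) {
        List<Map<String, Object>> items = (List<Map<String, Object>>) doc.get("attachments");
        for (Map<String, Object> value : items) { // "value" plays the role of _ingest._value
            byte[] raw = Base64.getDecoder().decode((String) value.get("content"));
            Map<String, Object> extracted = new HashMap<>();
            // Stand-in for real extraction: treat the bytes as UTF-8 plain text.
            extracted.put("content", new String(raw, StandardCharsets.UTF_8));
            value.put("attachment", extracted); // target_field: _ingest._value.attachment
            value.remove("content");            // second foreach: remove _ingest._value.content
        }
    }

    public static void main(String[] args) {
        Map<String, Object> item = new HashMap<>();
        item.put("content", "aGFoYeS9oOWlvWhlbGxvd29ybGQ=");
        List<Map<String, Object>> items = new ArrayList<>();
        items.add(item);
        Map<String, Object> doc = new HashMap<>();
        doc.put("attachments", items);

        apply(doc);
        System.out.println(doc);
    }
}
```

Each element keeps its position in the array; only its own fields change, which is why the extracted data ends up under attachments[].attachment rather than at the top level.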
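Handling single and multiple attachments together
Merging the two pipelines gives one pipeline that handles a single attachment and a list of attachments in the same document. The merged pipeline looks like this: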
http
PUT _ingest/pipeline/test_attachment
{
"description": "Extract attachment information",
"processors": [
{
"attachment": {
"field": "content",
"ignore_missing": true
}
},
{
"remove": {
"field": "content",
"ignore_missing": true
}
},
{
"foreach": {
"field": "attachments",
"ignore_missing": true,
"processor": {
"attachment": {
"field": "_ingest._value.content",
"target_field": "_ingest._value.attachment"
}
}
}
},
{
"foreach": {
"field": "attachments",
"ignore_missing": true,
"processor": {
"remove": {
"field": [
"_ingest._value.content"
]
}
}
}
}
]
}
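Example:
When writing a document, remember to specify the pipeline on the index request:
ini
<index name>/_doc?pipeline=<pipeline name>
The request below passes three attachments: content holds attachment 1, and attachments is the multi-attachment list (attachments 2 and 3).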
http
POST test_create_index/_doc?pipeline=test_attachment
{
"content": "eyJzdWNjZXNzIjpmYWxzZSwiY29kZSI6MCwibXNnIjoi5LiL6L295aSx6LSlIn0=",
"attachments": [
{
"content": "aGFoYeS9oOWlvWhlbGxvd29ybGQ="
},
{
"content": "aGFoYeS9oOWlvWhlbGxvd29ybGQ="
}
]
}
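The content values in the request above are base64-encoded file bytes, which is what the attachment processor reads. As a minimal sketch of preparing the same request from Java, the class below encodes the bytes and builds (but does not send) the HTTP request. The class and method names are hypothetical, and `http://localhost:9200` is an assumed cluster address.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PipelineIndexRequest {

    // Base64-encode raw file bytes for the "content" field.
    static String encode(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    // Build (without sending) POST {baseUrl}/{index}/_doc?pipeline={pipeline}.
    static HttpRequest build(String baseUrl, String index, String pipeline, String jsonBody) {
        URI uri = URI.create(baseUrl + "/" + index + "/_doc?pipeline=" + pipeline);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Same text as the second and third attachments in the request above.
        byte[] fileBytes = "haha\u4f60\u597dhelloworld".getBytes(StandardCharsets.UTF_8);
        String body = "{\"attachments\":[{\"content\":\"" + encode(fileBytes) + "\"}]}";

        // Assumed address; replace with the real cluster URL.
        HttpRequest request = build("http://localhost:9200", "test_create_index", "test_attachment", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

Sending the request would be a matter of passing it to a `java.net.http.HttpClient`; that step is left out here so the sketch runs without a cluster.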
The document as finally stored
json
{
"_index" : "test_create_index",
"_type" : "_doc",
"_id" : "-H0eqowB6o1OPMJBsiNg",
"_version" : 1,
"_seq_no" : 46,
"_primary_term" : 1,
"found" : true,
"_source" : {
"attachments" : [
{
"attachment" : {
"content_type" : "text/plain; charset=UTF-8",
"language" : "hu",
"content" : "document 2",
"content_length" : 17
}
},
{
"attachment" : {
"content_type" : "text/plain; charset=UTF-8",
"language" : "et",
"content" : "document 3",
"content_length" : 607
}
}
],
"attachment" : {
"content_type" : "text/plain; charset=UTF-8",
"language" : "gl",
"content" : "document 1",
"content_length" : 40
}
}
}
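3. Summary
When processing data, we can build a dedicated pipeline for each index to handle that index's attachment information.
This also means the data has to be prepared before the document is built: every file must be turned into a base64 string in the field the pipeline reads.
Suppose we have an article object, Article: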
java
import java.io.File;
import java.util.List;

class Article {
    String id;
    String name;
    File img; // cover image
    List<File> attachments; // attachments
}
Now suppose the data we receive looks like this:
json
[
{
"id":"1",
"name":"Today's Headlines",
"img": "<file>",
"attachments": [
{
"fileName":"Headline 1",
"InputStream":""
},
{
"fileName":"Headline 2",
"InputStream":""
}
]
},
{
"id":"2",
"name":"Next Issue Preview",
"img": "<file>",
"attachments": [
{
"fileName":"Preview 1",
"InputStream":""
},
{
"fileName":"Preview 2",
"InputStream":""
}
]
}
]
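One possible shape for that preprocessing step is sketched below. It assumes the files have already been read into byte arrays, and it maps the cover image to the top-level content field and each attachment to an element of attachments, matching the fields the merged pipeline reads. That mapping, the class name `ArticleDocumentBuilder`, and the method `toSource` are illustrative assumptions, not part of any Elasticsearch API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ArticleDocumentBuilder {

    // Turn article data into the document source expected by the pipeline:
    // "content" holds the base64 cover image, "attachments" holds one
    // {"content": base64} object per attached file.
    static Map<String, Object> toSource(String id, String name, byte[] img, List<byte[]> attachments) {
        Base64.Encoder encoder = Base64.getEncoder();
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("id", id);
        source.put("name", name);
        if (img != null) {
            source.put("content", encoder.encodeToString(img));
        }
        List<Map<String, Object>> items = new ArrayList<>();
        for (byte[] fileBytes : attachments) {
            Map<String, Object> item = new LinkedHashMap<>();
            item.put("content", encoder.encodeToString(fileBytes));
            items.add(item);
        }
        source.put("attachments", items);
        return source;
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for real file contents.
        byte[] cover = "cover".getBytes(StandardCharsets.UTF_8);
        List<byte[]> files = Arrays.asList(
                "file one".getBytes(StandardCharsets.UTF_8),
                "file two".getBytes(StandardCharsets.UTF_8));

        Map<String, Object> source = toSource("1", "Today's Headlines", cover, files);
        System.out.println(source);
    }
}
```

The resulting map can be serialized to JSON with whatever library the project already uses and sent with `?pipeline=test_attachment`, so that the pipeline replaces each base64 string with the extracted attachment object.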
The mapping we construct for the index:
json
"mappings": {
"properties": {
"attachments": { // mapping for multiple attachments
"properties": {
"attachment": {
"properties": {
"content_type": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"language": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"content_length": {
"type": "long"
}
}
}
}
},
"attachment": { // mapping for a single attachment
"properties": {
"date": {
"type": "date"
},
"content_type": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"language": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"content": {
"type": "text"
},
"content_length": {
"type": "long"
}
}
},
"createTime": {
"type": "text"
},
"name": {
"type": "text"
},
"id": {
"type": "keyword"
},
"type": {
"type": "keyword"
},
"content": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}