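Elasticsearch Field Data Types

1. Introduction

Every field in an Elasticsearch (ES) document has at least one data type, which determines how the field's values are stored and searched. For example, a string can be mapped as text or keyword: text is meant for full-text search and is analyzed (tokenized) before indexing, while keyword is meant for exact matching and is indexed as-is.

ES groups field types into families. Types in the same family have exactly the same search behavior but may differ in space usage or performance characteristics.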

2. Basic data types

2.1 binary

The binary type accepts binary data encoded as a Base64 string. By default the field is not indexed and cannot be searched.

// Create the index
PUT files
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text"
      },
      "blob":{
        "type": "binary"
      }
    }
  }
}
// Index a document
POST files/_doc
{
  "title":"hello.txt",
  "blob":"aGVsbG8gd29ybGQ="
}
// Retrieve the document
GET files/_search

{
  "took": 30,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "files",
        "_id": "7-tinY4BODFb3LbQRHQD",
        "_score": 1,
        "_source": {
          "title": "hello.txt",
          "blob": "aGVsbG8gd29ybGQ="
        }
      }
    ]
  }
}

Because a binary field is not indexed, it cannot be searched; a query against it fails with the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "failed to create query: Binary fields do not support searching",
        "index_uuid": "IM89dUaXTuqYCgnmMlb9VQ",
        "index": "files"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "files",
        "node": "X6yldsn_RMuokM0m_Q5Z1g",
        "reason": {
          "type": "query_shard_exception",
          "reason": "failed to create query: Binary fields do not support searching",
          "index_uuid": "IM89dUaXTuqYCgnmMlb9VQ",
          "index": "files",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Binary fields do not support searching"
          }
        }
      }
    ]
  },
  "status": 400
}
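
The blob value used in the index request is plain Base64. As a minimal sketch in Python (standard library only, not an Elasticsearch client), this is how a client might prepare and decode such a value; the helper names encode_blob and decode_blob are illustrative.

```python
import base64


def encode_blob(raw: bytes) -> str:
    """Encode raw bytes as the Base64 string a binary field expects."""
    return base64.b64encode(raw).decode("ascii")


def decode_blob(value: str) -> bytes:
    """Decode the Base64 string found in _source back into raw bytes."""
    return base64.b64decode(value)


blob = encode_blob(b"hello world")
print(blob)  # aGVsbG8gd29ybGQ=
print(decode_blob(blob))  # b'hello world'
```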

2.2 boolean

The boolean type accepts the JSON values true and false, as well as their string forms "true" and "false".

PUT users
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword"
      },
      "deleted":{
        "type": "boolean"
      }
    }
  }
}

POST users/_doc
{
  "name":"Lisa",
  "deleted":false
}
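
The accepted inputs can be modeled in a few lines of Python. This is only a sketch of the parsing rule, not Elasticsearch code: parse_boolean is a hypothetical helper, and treating the empty string as false follows the behavior described in the Elasticsearch reference rather than anything stated above.

```python
def parse_boolean(value):
    """Model of the values a boolean field accepts (illustrative only)."""
    if value is True or value == "true":
        return True
    # Assumption: the empty string counts as false, as the ES reference states.
    if value is False or value in ("false", ""):
        return False
    raise ValueError(f"cannot parse {value!r} as a boolean")


print(parse_boolean(False))    # False
print(parse_boolean("true"))   # True
```

Anything else, such as "yes" or 1, is rejected in this model.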

2.3 Keywords

The keyword family includes keyword, constant_keyword, and wildcard.

2.3.1 keyword

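The keyword type stores structured content such as IDs, email addresses, hostnames, status codes, zip codes, or tags. It is used for exact matching, aggregations, and sorting.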

PUT users
{
  "mappings": {
    "properties": {
      "user_id":{
        "type": "keyword"
      }
    }
  }
}

2.3.2 constant_keyword

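The constant_keyword type is for fields that hold the same value in every document of an index.

For example, create a logs index whose level field is mapped as constant_keyword with the value "debug":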

PUT logs
{
  "mappings": {
    "properties": {
      "content":{
        "type": "text"
      },
      "level":{
        "type": "constant_keyword",
        "value":"debug"
      }
    }
  }
}

The following two index requests are equivalent; level ends up as "debug" in both:

POST logs/_doc
{
  "content":"haha",
  "level":"debug"
}

POST logs/_doc
{
  "content":"haha"
}

However, indexing a document with a different level value raises an exception:

POST logs/_doc
{
  "content":"haha",
  "level":"info"
}

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[3:11] failed to parse field [level] of type [constant_keyword] in document with id '9ut5nY4BODFb3LbQtHRG'. Preview of field's value: 'info'"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[3:11] failed to parse field [level] of type [constant_keyword] in document with id '9ut5nY4BODFb3LbQtHRG'. Preview of field's value: 'info'",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "[constant_keyword] field [level] only accepts values that are equal to the value defined in the mappings [debug], but got [info]"
    }
  },
  "status": 400
}

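To summarize: constant_keyword makes a field carry the same value in every document of the index. If the mapping does not define a value, the first non-null value that gets indexed becomes the field's value; indexing a different value afterwards makes ES throw an exception.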
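
The rule can be modeled with a small Python class. This is a sketch of the behavior described here, not how Elasticsearch implements it; ConstantKeywordField and its index method are hypothetical names.

```python
class ConstantKeywordField:
    """Toy model of a constant_keyword field (illustrative only)."""

    def __init__(self, value=None):
        # Value from the mapping, or None when the mapping defines none.
        self.value = value

    def index(self, doc_value=None):
        """Return the value the document ends up with, or raise on a mismatch."""
        if doc_value is None:
            return self.value            # field omitted: the constant applies
        if self.value is None:
            self.value = doc_value       # first non-null value becomes the constant
        elif doc_value != self.value:
            raise ValueError(
                f"only accepts [{self.value}], but got [{doc_value}]"
            )
        return self.value


level = ConstantKeywordField("debug")
print(level.index("debug"))  # debug
print(level.index())         # debug
```

Calling level.index("info") on this object raises ValueError, mirroring the exception shown in the response.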

2.3.3 wildcard

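The wildcard type supports pattern matching with wildcards inside string values.

In the following example, create a books index and index a document: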

PUT books
{
  "mappings": {
    "properties": {
      "title":{
        "type": "wildcard"
      }
    }
  }
}

POST books/_doc
{
  "title":"C语言程序设计"
}

Search for the book title with a wildcard pattern:

GET books/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "*语言*设计"
      }
    }
  }
}

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "books",
        "_id": "9-uInY4BODFb3LbQW3Sn",
        "_score": 1,
        "_source": {
          "title": "C语言程序设计"
        }
      }
    ]
  }
}
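
To see why the pattern matches the whole title, the wildcard semantics (* for any run of characters, ? for exactly one character) can be sketched in Python by translating the pattern to a regular expression. This only models the matching rule; it says nothing about how the wildcard field is indexed, and the helper names are illustrative.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern (* and ?) into an anchored regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence of characters, including none
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern: str, value: str) -> bool:
    # The pattern has to cover the entire value, not just a substring.
    return wildcard_to_regex(pattern).fullmatch(value) is not None


print(wildcard_match("*语言*设计", "C语言程序设计"))      # True
print(wildcard_match("*语言*设计", "C语言程序设计入门"))  # False
```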

2.4 Numbers

The numeric family includes long, integer, short, byte, double, float, half_float, scaled_float, and unsigned_long.

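Apart from the split between integer and floating-point types, the types differ in width and therefore in the range of values they can represent. Pick the smallest type that covers your values, which keeps indexing and searching efficient. For the integer types, ES optimizes storage based on the values actually stored, so the gain is mainly in efficiency rather than in disk space.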
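
As a rough aid for choosing among the integer types, the sketch below picks the narrowest type whose range covers a given minimum and maximum. The ranges are the usual signed 8/16/32/64-bit and unsigned 64-bit bounds; smallest_integer_type is a hypothetical helper, not part of any Elasticsearch API.

```python
# (type name, minimum, maximum), ordered from narrowest to widest
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
    ("unsigned_long", 0, 2**64 - 1),
]


def smallest_integer_type(lowest: int, highest: int) -> str:
    """Return the first (narrowest) type whose range covers [lowest, highest]."""
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("no integer type covers this range")


print(smallest_integer_type(0, 100))            # byte
print(smallest_integer_type(0, 200))            # short
print(smallest_integer_type(0, 3_000_000_000))  # long
```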

2.5 Dates

The date family includes date and date_nanos. JSON has no date type, so ES accepts dates either as formatted date strings or as timestamps counted from the epoch.

2.5.1 date

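The date type lets you declare the accepted formats with format. The mapping below accepts either a formatted date string or an epoch-millisecond timestamp: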

PUT dates
{
  "mappings": {
    "properties": {
      "date":{
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      }
    }
  }
}

Both of the following index requests are accepted:

POST dates/_doc
{
 "date":"2024-01-01 00:00:00" 
}

POST dates/_doc
{
 "date":1618321898000
}
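
Both inputs describe a point in time as milliseconds since the epoch. The Python sketch below converts between the two forms used in the mapping, assuming that a date string without a time zone is interpreted as UTC; the helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
PATTERN = "%Y-%m-%d %H:%M:%S"  # Python equivalent of yyyy-MM-dd HH:mm:ss


def to_epoch_millis(text: str) -> int:
    """Parse a formatted date string (assumed UTC) into epoch milliseconds."""
    moment = datetime.strptime(text, PATTERN).replace(tzinfo=timezone.utc)
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> str:
    """Render epoch milliseconds back into the formatted string."""
    return (EPOCH + timedelta(milliseconds=millis)).strftime(PATTERN)


print(to_epoch_millis("2024-01-01 00:00:00"))  # 1704067200000
```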

2.5.2 date_nanos

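date_nanos is a nanosecond-resolution date type that complements date. The difference is that date stores dates with millisecond resolution while date_nanos stores them with nanosecond resolution, as the number of nanoseconds since the epoch. This restricts date_nanos to dates from roughly 1970 to 2262.

The mapping below accepts a formatted date string or a timestamp as input: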

PUT date-nanos
{
  "mappings": {
    "properties": {
      "date_nanos":{
        "type": "date_nanos",
        "format": "strict_date_optional_time_nanos||epoch_millis"
      }
    }
  }
}

Index a document:

POST date-nanos/_doc
{
 "date_nanos":"2024-01-01T12:00:00.123456789Z" 
}
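
The upper bound of about 2262 follows from the storage format, assuming the nanosecond count is held in a signed 64-bit integer. A short Python check (Python's datetime only resolves microseconds, so the last three digits are dropped):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_NANOS = 2**63 - 1  # largest value of a signed 64-bit integer


def latest_date_nanos() -> datetime:
    """Latest instant representable as signed 64-bit nanoseconds since the epoch."""
    return EPOCH + timedelta(microseconds=MAX_NANOS // 1000)


print(latest_date_nanos().year)  # 2262
```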

2.6 alias

The alias type defines an alternate name for a field in the index.

In the following example, field_a_alias is defined as an alias of field_a:

PUT alias-index
{
  "mappings": {
    "properties": {
      "field_a":{
        "type": "keyword"
      },
      "field_a_alias":{
        "type": "alias",
        "path":"field_a"
      }
    }
  }
}

With this mapping, the following two search requests are equivalent:

GET alias-index/_search
{
  "query": {
    "term": {
      "field_a": {
        "value": "haha"
      }
    }
  }
}

GET alias-index/_search
{
  "query": {
    "term": {
      "field_a_alias": {
        "value": "haha"
      }
    }
  }
}

3. Object and relational types

3.1 object

JSON documents are hierarchical, so a document may contain inner objects, as in the following index request:

POST users/_doc
{
  "name":"张三",
  "address":{
    "province":"浙江省",
    "city":"杭州市"
  }
}

Internally, ES indexes this document as a flat list of key-value pairs:

{
  "name":"张三",
  "address.province":"浙江省",
  "address.city":"杭州市"
}
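
The flattening step can be sketched in Python as a recursive walk that joins nested keys with dots. It illustrates the resulting key-value list only and is not Elasticsearch's implementation; flatten is a hypothetical helper.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested objects into dotted-path keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # descend into the inner object
        else:
            flat[path] = value
    return flat


doc = {"name": "张三", "address": {"province": "浙江省", "city": "杭州市"}}
print(flatten(doc))
```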

The index mapping looks like the following. address contains two subfields, province and city; it does not declare a type explicitly because object is the default.

{
  "users": {
    "mappings": {
      "properties": {
        "address": {
          "properties": {
            "city": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "province": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

3.2 flattened

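By default, ES uses dynamic: true, which lets dynamic mapping add new fields automatically. Left unchecked, this can make the number of mapped fields grow rapidly until it exceeds the per-index limit index.mapping.total_fields.limit (1000 by default). This unintended growth in field count is known as "mapping explosion".

It is most likely with documents that have a complex structure; indexing a document like the following leaves the index with a messy field structure: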

POST logs/_doc
{
  "project":"user-server",
  "content":{
    "field_a":{
      "a":{
        "a1":{
          "a1_1":{},
          "a1_2":{},
          "a1_3":{}
        },
        "a2":{},
        "a3":{}
      },
      "b":{
        
      },
      "c":{
        
      }
      ...
    },
    "field_b":{
      ......
    }
  }
}

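When you have to index documents with a large number of unpredictable fields, you can map the field as flattened to avoid mapping explosion. A flattened field maps the entire JSON object as a single field and indexes its leaf values as keywords, which keeps the total field count down.

In the following example, the user's address is mapped as flattened; however complex the address object becomes, the number of mapped fields stays the same.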

PUT users
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword"
      },
      "address":{
        "type": "flattened"
      }
    }
  }
}

Index the following document:

POST users/_doc
{
  "name":"张三",
  "address":{
    "province":"浙江省",
    "city":"杭州市",
    "region":"西湖区",
    "street":"文一西路",
    "detail":"某某小区1幢1号"
  }
}

Check the mapping: it still contains only two fields.

GET users/_mapping

{
  "users": {
    "mappings": {
      "properties": {
        "address": {
          "type": "flattened"
        },
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}

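Keeping the field count low matters because mapping updates carry extra cost: ES has to update the cluster state for every new field, and cluster state is propagated across nodes in a single-threaded operation. The more fields need updating, the longer it takes, and in the worst case the whole cluster can go down.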

3.3 nested

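The nested type is similar to object and also stores JSON objects or arrays, but it indexes arrays of objects in a way that lets each object be queried independently of the others.

Consider an example: create a users index and index a document in which lisa has two friends, Jack (male, 男) and Ruth (female, 女).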

PUT users

POST users/_doc
{
  "name":"lisa",
  "friends":[{
    "name":"Jack",
    "gender":"男"
  },{
    "name":"Ruth",
    "gender":"女"
  }]
}

Next, run a bool query for users who have a female friend named Jack. By the business logic no document should match, yet the query unexpectedly returns lisa:

GET users/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "friends.name": "Jack"
          }
        },
        {
          "match": {
            "friends.gender": "女"
          }
        }
      ]
    }
  }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "users",
        "_id": "COvanY4BODFb3LbQgXW-",
        "_score": 0.5753642,
        "_source": {
          "name": "lisa",
          "friends": [
            {
              "name": "Jack",
              "gender": "男"
            },
            {
              "name": "Ruth",
              "gender": "女"
            }
          ]
        }
      }
    ]
  }
}

The reason is that ES has no concept of inner objects. It flattens the object hierarchy into a simple list of key-value pairs whose values are arrays, like this:

{
    "name":"lisa",
    "friends.name":["Jack","Ruth"],
    "friends.gender":["男","女"]
}

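The association between the fields of each individual friend is lost. The search condition in the example effectively becomes ("Jack" in friends.name) and ("女" in friends.gender), which explains why lisa is returned.

Solving this requires indexing the array of objects while keeping each object in the array independent. ES provides the nested data type for this purpose: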

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "friends": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "keyword"
          },
          "gender": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Change the search to a nested query, as shown below, and it no longer returns any documents:

GET users/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "friends",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "friends.name": "Jack"
                    }
                  },
                  {
                    "match": {
                      "friends.gender": "女"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

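The nested type indexes each nested object as a separate hidden document, so each object can be queried independently of the others. In the example above, lisa's friends are stored as follows, with each friend held in its own hidden document: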

[
  {
    "friends.name":["Jack"],
    "friends.gender":["男"]
  },
  {
    "friends.name":["Ruth"],
    "friends.gender":["女"]
  }
]
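
The difference between the two mappings can be reproduced with plain Python. In this sketch, matches_flattened checks each condition against the pooled values of a field across all friends (the object behavior), while matches_nested requires a single friend to satisfy every condition (the nested behavior). Both function names are illustrative, and exact equality stands in for the match query.

```python
friends = [
    {"name": "Jack", "gender": "男"},
    {"name": "Ruth", "gender": "女"},
]


def matches_flattened(objects, conditions):
    """object mapping: every condition may be met by a different object."""
    return all(
        any(obj.get(field) == value for obj in objects)
        for field, value in conditions.items()
    )


def matches_nested(objects, conditions):
    """nested mapping: one object has to meet all conditions at once."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in objects
    )


query = {"name": "Jack", "gender": "女"}
print(matches_flattened(friends, query))  # True
print(matches_nested(friends, query))     # False
```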

3.4 join

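The join type creates parent-child relationships between documents in the same index, somewhat like a table join in a relational database.

In the following example, create a question-and-answer index for searching questions and answers: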

PUT question-answer-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "join_field": {
        "type": "join",
        "relations": {
          "question": "answer"
        }
      }
    }
  }
}

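Index a parent document and a child document. The routing value is required when indexing the child, because ES must place parent and child documents on the same shard.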

// Parent document
POST question-answer-index/_doc/1
{
  "content": "Elasticsearch是什么?",
  "join_field": {
    "name": "question"
  }
}

// Child document
POST question-answer-index/_doc/2?routing=1
{
  "content": "Elasticsearch是位于Elastic Stack 核心的分布式搜索和分析引擎。",
  "join_field": {
    "name": "answer",
    "parent": "1"
  }
}

Search for child documents by their parent:

GET question-answer-index/_search
{
  "query": {
    "has_parent": {
      "parent_type": "question",
      "query": {
        "match": {
          "content": "Elasticsearch"
        }
      }
    }
  }
}

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "question-answer-index",
        "_id": "2",
        "_score": 1,
        "_routing": "1",
        "_source": {
          "content": "Elasticsearch是位于Elastic Stack 核心的分布式搜索和分析引擎。",
          "join_field": {
            "name": "answer",
            "parent": "1"
          }
        }
      }
    ]
  }
}

Search for parent documents by their children:

GET question-answer-index/_search
{
  "query": {
    "has_child": {
      "type": "answer",
      "query": {
        "match": {
          "content": "Elasticsearch"
        }
      }
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "question-answer-index",
        "_id": "1",
        "_score": 1,
        "_source": {
          "content": "Elasticsearch是什么?",
          "join_field": {
            "name": "question"
          }
        }
      }
    ]
  }
}

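The example links one question to one answer, but join also supports one-to-many relationships. It supports multiple levels as well: beyond parent and child you can define grandchild relationships, provided that all related documents are indexed on the same shard.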

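The join type has some restrictions to keep in mind:

  • Only one join field is allowed per index
  • Parent and child documents must be indexed on the same shard
  • A parent document can have multiple children, but a child document can have only one parent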
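
Why the child needs routing=1 can be seen with a toy model of shard selection. The sketch assumes that a document's shard is derived from a hash of its routing value and that the routing value defaults to the document's own _id. zlib.crc32 is only a stand-in for the hash Elasticsearch actually uses, and target_shard is a hypothetical helper.

```python
import zlib


def target_shard(doc_id: str, routing: str = None, number_of_shards: int = 5) -> int:
    """Pick a shard from the routing value (defaults to the document id)."""
    key = routing if routing is not None else doc_id
    # Stand-in hash: real clusters use a different hash function.
    return zlib.crc32(key.encode("utf-8")) % number_of_shards


parent_shard = target_shard("1")              # parent: routed by its own id
child_shard = target_shard("2", routing="1")  # child: routed by the parent id
print(parent_shard == child_shard)            # True
```

Without the routing parameter the child would be routed by its own id "2" and could land on a different shard than its parent.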

4. Structured data types

4.1 Range

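Range field types represent a continuous range of values between a lower and an upper bound. ES currently supports integer ranges, floating-point ranges, date ranges, and IP ranges.

In the following example, create a meetings index whose duration field, the time span of the meeting, is a date range: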

PUT meetings
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text"
      },
      "duration":{
        "type": "date_range",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

Next, index a document:

POST meetings/_doc
{
  "title":"需求评审会",
  "duration":{
    "gte":"2024-01-01 12:00:00",
    "lte":"2024-01-01 12:30:00"
  }
}

A term query is enough: the document is returned whenever the queried value falls inside the stored range.

GET meetings/_search
{
  "query": {
    "term": {
      "duration": {
        "value": "2024-01-01 12:10:00"
      }
    }
  }
}

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "meetings",
        "_id": "CuuRoY4BODFb3LbQk3WN",
        "_score": 1,
        "_source": {
          "title": "需求评审会",
          "duration": {
            "gte": "2024-01-01 12:00:00",
            "lte": "2024-01-01 12:30:00"
          }
        }
      }
    ]
  }
}
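
The containment check behind this query can be sketched in Python. range_contains is a hypothetical helper that honors the gte, gt, lte, and lt bounds of a stored range and tests whether a single point falls inside it.

```python
from datetime import datetime

PATTERN = "%Y-%m-%d %H:%M:%S"  # Python equivalent of yyyy-MM-dd HH:mm:ss


def parse(text: str) -> datetime:
    return datetime.strptime(text, PATTERN)


def range_contains(stored: dict, point: str) -> bool:
    """Return True when the point lies inside the stored range."""
    p = parse(point)
    if "gte" in stored and p < parse(stored["gte"]):
        return False
    if "gt" in stored and p <= parse(stored["gt"]):
        return False
    if "lte" in stored and p > parse(stored["lte"]):
        return False
    if "lt" in stored and p >= parse(stored["lt"]):
        return False
    return True


duration = {"gte": "2024-01-01 12:00:00", "lte": "2024-01-01 12:30:00"}
print(range_contains(duration, "2024-01-01 12:10:00"))  # True
print(range_contains(duration, "2024-01-01 12:31:00"))  # False
```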

4.2 ip

The ip type stores IPv4 and IPv6 addresses.

In the following example, create a computers index:

PUT computers
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword"
      },
      "ip":{
        "type": "ip"
      }
    }
  }
}

Index documents:

POST computers/_doc
{
  "name":"C01",
  "ip":"192.168.0.1"
}
POST computers/_doc
{
  "name":"C02",
  "ip":"192.168.0.2"
}

You can match an IP address exactly:

GET computers/_search
{
  "query": {
    "term": {
      "ip": {
        "value": "192.168.0.1"
      }
    }
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "computers",
        "_id": "DeuloY4BODFb3LbQV3UG",
        "_score": 1,
        "_source": {
          "name": "C01",
          "ip": "192.168.0.1"
        }
      }
    ]
  }
}

You can also query by prefix using CIDR notation:

GET computers/_search
{
  "query": {
    "term": {
      "ip": {
        "value": "192.168.0.0/24"
      }
    }
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "computers",
        "_id": "DeuloY4BODFb3LbQV3UG",
        "_score": 1,
        "_source": {
          "name": "C01",
          "ip": "192.168.0.1"
        }
      },
      {
        "_index": "computers",
        "_id": "DuuloY4BODFb3LbQW3XF",
        "_score": 1,
        "_source": {
          "name": "C02",
          "ip": "192.168.0.2"
        }
      }
    ]
  }
}
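
The two query forms, exact address and CIDR block, can be mirrored with Python's standard ipaddress module. ip_term_matches is a hypothetical helper that models which stored addresses a term value would match.

```python
import ipaddress


def ip_term_matches(stored_ip: str, query: str) -> bool:
    """Match a stored address against an exact address or a CIDR block."""
    address = ipaddress.ip_address(stored_ip)
    if "/" in query:
        # CIDR notation: the address has to fall inside the network.
        return address in ipaddress.ip_network(query, strict=False)
    return address == ipaddress.ip_address(query)


for ip in ("192.168.0.1", "192.168.0.2", "192.168.1.1"):
    print(ip, ip_term_matches(ip, "192.168.0.0/24"))
```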

5. Text search types

5.1 Text

The text family includes text and match_only_text, which hold analyzed, unstructured text.

5.1.1 text

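The text type is for fields that hold full-text values, such as news content or product descriptions. Before indexing, ES runs the string through an analyzer that turns it into a list of individual terms, and it indexes those terms rather than the original value as a whole. A text field therefore cannot be matched exactly against its full value, and by default it cannot be used for sorting or aggregations. text is suited to unstructured but human-readable content.

In the following example, create a news index for news articles: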

PUT news
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text"
      },
      "content":{
        "type": "text"
      }
    }
  }
}

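Index two news documents, one about Apple (the company) releasing a new iPhone and one about apples (the fruit) reaching the market late: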

POST news/_doc
{
  "title":"美国苹果公司发布新款iPhone",
  "content":"新iPhone发布!美国苹果公司震撼发布最新款iPhone。精致设计,强劲性能,和先进功能的完美结合。期待它带来的创新体验!"
}

POST news/_doc
{
  "title":"今年苹果预计将推迟上市",
  "content":"据报道,今年水果市场上备受期待的苹果预计将推迟上市。这可能是由于天气等因素导致苹果的生长和成熟过程延迟。消费者需稍作等待,相信在不久的将来将能品尝到新鲜甜蜜的苹果。"
}

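Finally, search with multi_match over title and content for 苹果 ("apple"). title is given a higher weight (title^2), so matches in title contribute more to a document's score.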

GET news/_search
{
  "query": {
    "multi_match": {
      "query": "苹果",
      "fields": ["title^2","content"]
    }
  }
}

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.72928625,
    "hits": [
      {
        "_index": "news",
        "_id": "meQwuI4BPIYet_3flP8X",
        "_score": 0.72928625,
        "_source": {
          "title": "美国苹果公司发布新款iPhone",
          "content": "新iPhone发布!美国苹果公司震撼发布最新款iPhone。精致设计,强劲性能,和先进功能的完美结合。期待它带来的创新体验!"
        }
      },
      {
        "_index": "news",
        "_id": "muQwuI4BPIYet_3fl_9V",
        "_score": 0.72928625,
        "_source": {
          "title": "今年苹果预计将推迟上市",
          "content": "据报道,今年水果市场上备受期待的苹果预计将推迟上市。这可能是由于天气等因素导致苹果的生长和成熟过程延迟。消费者需稍作等待,相信在不久的将来将能品尝到新鲜甜蜜的苹果。"
        }
      }
    ]
  }
}

5.1.2 match_only_text

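match_only_text is a text type introduced in Elasticsearch 7.14. Unlike text, it does not store length normalization factors, term frequencies, or positions, so it does not support relevance scoring. In return it uses less storage than text, which makes it a good fit for logs.

In the following example, create a logs index for log data whose text field is mapped as match_only_text: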

PUT logs
{
  "mappings": {
    "properties": {
      "level":{
        "type": "keyword"
      },
      "text":{
        "type": "match_only_text"
      }
    }
  }
}

Index some log entries:

POST logs/_doc
{
  "level":"info",
  "text":"This is the first log"
}

POST logs/_doc
{
  "level":"info",
  "text":"This is the second log log"
}

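Finally, search the logs with a match query. The second entry contains "log" twice, so in theory it should score higher, but match_only_text does not record term frequencies and does not take part in scoring, so both entries get a score of 1.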

GET logs/_search
{
  "query": {
    "match": {
      "text": "log"
    }
  }
}

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "logs",
        "_id": "m-Q5uI4BPIYet_3f4P8H",
        "_score": 1,
        "_source": {
          "level": "info",
          "text": "This is the first log"
        }
      },
      {
        "_index": "logs",
        "_id": "nOQ5uI4BPIYet_3f5f8j",
        "_score": 1,
        "_source": {
          "level": "info",
          "text": "This is the second log log"
        }
      }
    ]
  }
}

5.2 completion

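The completion type is mainly used for search suggestions and autocomplete. If you want to build suggest-as-you-type like Baidu search, the completion type is what to use. Completion suggesters are very fast: ES uses a special data structure for prefix lookups and keeps the data cached in memory.

In the following example, create a webpages index for web pages. The suggest field uses the completion type and is populated from title: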

PUT webpages
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "copy_to": "suggest"
      },
      "suggest": {
        "type": "completion"
      },
      "url": {
        "type": "keyword"
      }
    }
  }
}

Next, index a few documents:

POST webpages/_doc
{
  "title":"Java-百度百科",
  "url":"https://baike.baidu.com/item/Java/85979"
}

POST webpages/_doc
{
  "title":"java能做什么",
  "url":"https://baijiahao.baidu.com/s?id=1765494775400937848"
}

POST webpages/_doc
{
  "title":"python编程从入门到精通",
  "url":"https://baijiahao.baidu.com/s?id=1764847625088362688"
}

Finally, fetch the autocomplete results with a suggest request:

GET webpages/_search
{
  "suggest": {
    "title-suggestion": {
      "text": "java",
      "completion": {
        "field": "suggest"
      }
    }
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "suggest": {
    "title-suggestion": [
      {
        "text": "java",
        "offset": 0,
        "length": 4,
        "options": [
          {
            "text": "Java-百度百科",
            "_index": "webpages",
            "_id": "neRCuI4BPIYet_3f8v8w",
            "_score": 1,
            "_source": {
              "title": "Java-百度百科",
              "url": "https://baike.baidu.com/item/Java/85979"
            }
          },
          {
            "text": "java能做什么",
            "_index": "webpages",
            "_id": "nuRCuI4BPIYet_3f9f9k",
            "_score": 1,
            "_source": {
              "title": "java能做什么",
              "url": "https://baijiahao.baidu.com/s?id=1765494775400937848"
            }
          }
        ]
      }
    ]
  }
}
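
The behavior seen in the response, where the prefix java matches both "Java-百度百科" and "java能做什么", amounts to a case-insensitive prefix lookup. The Python sketch below models only that outcome with a linear scan; it assumes inputs are lowercased at analysis time and does not reproduce the in-memory structure ES uses. suggest is a hypothetical helper.

```python
def suggest(prefix: str, inputs: list) -> list:
    """Return the inputs that start with the prefix, ignoring case."""
    needle = prefix.lower()
    return [text for text in inputs if text.lower().startswith(needle)]


titles = ["Java-百度百科", "java能做什么", "python编程从入门到精通"]
print(suggest("java", titles))  # ['Java-百度百科', 'java能做什么']
```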

6. Spatial data types

ES supports a rich set of spatial data types, including geo_point, geo_shape, point, and shape.

6.1 geo_point

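The geo_point type stores the latitude and longitude of a geographic location. It supports location-based search and aggregations, for features such as "people nearby" or "shops nearby".

In the following example, create a hotels index for hotel information whose location field uses the geo_point type: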

PUT hotels
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "location":{
        "type": "geo_point"
      }
    }
  }
}

Then index two documents:

POST hotels/_doc
{
  "name":"如家酒店",
  "location":{
    "lat":10.1,
    "lon":10.1
  }
}

POST hotels/_doc
{
  "name":"亚朵酒店",
  "location":{
    "lat":10.2,
    "lon":10.2
  }
}

Finally, given a center point, search for hotels within 10 km:

GET hotels/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "10km",
            "location": {
              "lat": 10.15,
              "lon": 10.15
            }
          }
        }
      ]
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0,
    "hits": [
      {
        "_index": "hotels",
        "_id": "puRxuI4BPIYet_3fT__u",
        "_score": 0,
        "_source": {
          "name": "如家酒店",
          "location": {
            "lat": 10.1,
            "lon": 10.1
          }
        }
      },
      {
        "_index": "hotels",
        "_id": "p-RxuI4BPIYet_3fUv_f",
        "_score": 0,
        "_source": {
          "name": "亚朵酒店",
          "location": {
            "lat": 10.2,
            "lon": 10.2
          }
        }
      }
    ]
  }
}
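
Both hotels are returned because each lies less than 10 km from the center. A quick way to check this is the haversine great-circle formula, sketched below in Python. The spherical Earth radius of 6371 km is an assumption of this sketch, and ES may compute distances slightly differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two lat/lon points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


center = (10.15, 10.15)
hotels = {"如家酒店": (10.1, 10.1), "亚朵酒店": (10.2, 10.2)}
for name, (lat, lon) in hotels.items():
    distance = haversine_km(center[0], center[1], lat, lon)
    print(name, round(distance, 2), distance <= 10)
```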

You can also search within a rectangle using geo_bounding_box, given the rectangle's top-left and bottom-right corners:

GET hotels/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_bounding_box": {
            "location": {
              "top_left": {
                "lat": 10.175,
                "lon": 10
              },
              "bottom_right": {
                "lat": 10,
                "lon": 10.175
              }
            }
          }
        }
      ]
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0,
    "hits": [
      {
        "_index": "hotels",
        "_id": "puRxuI4BPIYet_3fT__u",
        "_score": 0,
        "_source": {
          "name": "如家酒店",
          "location": {
            "lat": 10.1,
            "lon": 10.1
          }
        }
      }
    ]
  }
}
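
Only 如家酒店 is returned because 亚朵酒店 (10.2, 10.2) lies outside the rectangle. The check reduces to two interval comparisons, sketched here in Python. in_bounding_box is a hypothetical helper and ignores rectangles that cross the antimeridian.

```python
def in_bounding_box(point: dict, top_left: dict, bottom_right: dict) -> bool:
    """True when the point lies inside the rectangle (no antimeridian handling)."""
    lat_ok = bottom_right["lat"] <= point["lat"] <= top_left["lat"]
    lon_ok = top_left["lon"] <= point["lon"] <= bottom_right["lon"]
    return lat_ok and lon_ok


top_left = {"lat": 10.175, "lon": 10}
bottom_right = {"lat": 10, "lon": 10.175}
print(in_bounding_box({"lat": 10.1, "lon": 10.1}, top_left, bottom_right))  # True
print(in_bounding_box({"lat": 10.2, "lon": 10.2}, top_left, bottom_right))  # False
```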

6.2 geo_shape

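Where geo_point defines a single point in space, geo_shape defines a shape: points, lines, rectangles, polygons, and circles (a circle has to be approximated as a polygon before indexing, for example with the circle ingest processor). For example, geo_shape can represent the area of a scenic spot or the extent of a parking lot.

In the following example, create a scenic-spots index for scenic spots whose location_shape field uses the geo_shape type: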

PUT scenic-spots
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "location_shape":{
        "type": "geo_shape"
      }
    }
  }
}

Use polygons to index two scenic spots, 杭州西湖 (West Lake, Hangzhou) and 杭州西溪湿地 (Xixi Wetland, Hangzhou):

POST scenic-spots/_doc
{
  "name":"杭州西湖",
  "location_shape":{
    "type":"polygon",
    "coordinates":[[
      [120.13,30.25],
      [120.16,30.26],
      [120.16,30.25],
      [120.15,30.23],
      [120.14,30.23],
      [120.13,30.25]
    ]]
  }
}

POST scenic-spots/_doc
{
  "name":"杭州西溪湿地",
  "location_shape":{
    "type":"polygon",
    "coordinates":[[
      [120.05,30.28],
      [120.09,30.28],
      [120.09,30.26],
      [120.04,30.25],
      [120.05,30.28]
    ]]
  }
}

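Finally, search by shape. The query below uses the contains relation, which matches documents whose indexed shape contains the query shape; relation is specified next to shape, not inside it. Searching with a small area around 三潭印月 (Three Pools Mirroring the Moon) inside West Lake returns West Lake:

GET scenic-spots/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_shape": {
            "location_shape": {
              "shape": {
                "type": "polygon",
                "coordinates": [[
                  [120.14,30.24],
                  [120.145,30.235],
                  [120.145,30.245],
                  [120.14,30.24]
                ]]
              },
              "relation": "contains"
            }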
          }
        }
      ]
    }
  }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0,
    "hits": [
      {
        "_index": "scenic-spots",
        "_id": "xeQ7vI4BPIYet_3fHP81",
        "_score": 0,
        "_source": {
          "name": "杭州西湖",
          "location_shape": {
            "type": "polygon",
            "coordinates": [
              [
                [
                  120.13,
                  30.25
                ],
                [
                  120.16,
                  30.26
                ],
                [
                  120.16,
                  30.25
                ],
                [
                  120.15,
                  30.23
                ],
                [
                  120.14,
                  30.23
                ],
                [
                  120.13,
                  30.25
                ]
              ]
            ]
          }
        }
      }
    ]
  }
}

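The geo_shape query supports four relations: intersects (the default; the indexed shape overlaps the query shape), disjoint (the two shapes do not overlap), within (the indexed shape lies inside the query shape), and contains (the indexed shape contains the query shape).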
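
For intuition about how a search area relates to the two indexed polygons, the Python sketch below runs a ray-casting point-in-polygon test on a sample point near the search area. It is a planar approximation for illustration only; ES uses its own geometry engine, and point_in_polygon is a hypothetical helper.

```python
def point_in_polygon(point, ring):
    """Ray casting: count how many ring edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Rings as [lon, lat] pairs, copied from the indexed documents.
west_lake = [(120.13, 30.25), (120.16, 30.26), (120.16, 30.25),
             (120.15, 30.23), (120.14, 30.23), (120.13, 30.25)]
xixi_wetland = [(120.05, 30.28), (120.09, 30.28), (120.09, 30.26),
                (120.04, 30.25), (120.05, 30.28)]

sample = (120.14, 30.24)
print(point_in_polygon(sample, west_lake))     # True
print(point_in_polygon(sample, xixi_wetland))  # False
```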
