Building Short-Video E-commerce from Scratch: Data Types in OpenSearch/Elasticsearch

Contents

- Query and Sorting Issues
- Data Types
  - Text
  - Keyword
  - Wildcard
  - Long, Integer, Short, Byte (integer types)
  - Double, Float, Half_Float, Scaled_Float (floating-point types)
  - Date, Date_Nanos (date types)
  - Date_Range (date ranges)
  - Boolean
  - Binary
  - Integer_Range, Float_Range, Long_Range, Double_Range (range types)
  - Ip_Range (IP address ranges)
  - Object
  - Nested
  - Flattened
  - Ip (IP addresses)
  - TokenCount (token counts)
  - Percolator (stored queries)
  - Search_As_You_Type (as-you-type suggestions)
  - Rank_Feature, Rank_Features (ranking features)
  - Dense_Vector (dense vectors)

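Query and Sorting Issues

In Elasticsearch, a search clause runs in one of two contexts: query context or filter context. They differ in what they compute:

1. Query context:
- Matching: Elasticsearch runs the query you define to find the documents that match it. The query can be a full-text query, an exact match, a range query, and so on.
- Scoring: for every matching document, Elasticsearch computes a relevance score (_score) that is used to sort the results. The score reflects how well the document matches the query, along with other relevance factors.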

{
  "query": {
    "match": {
      "field": "value"
    }
  }
}

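2. Filter context:
- Narrowing the matches: a filter keeps only the documents that satisfy an additional condition. Filters do not contribute to the score; they only answer whether a document satisfies the condition. A filter is written inside a bool query, because the search request body has no top-level "filter" key next to "query".

{
  "query": {
    "bool": {
      "must": {
        "match": {
          "field": "value"
        }
      },
      "filter": {
        "range": {
          "date": {
            "gte": "2023-01-01"
          }
        }
      }
    }
  }
}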

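In this example, the match clause finds and scores the documents whose field contains the given value, and the range condition keeps only the documents dated 2023-01-01 or later.

Performance: a filter is usually cheaper than a scoring query because no score has to be computed. Filter conditions can be ranges, exact matches, boolean combinations, and so on. If you only need to know whether documents match and do not care about the score, express the condition as a filter.
Caching: Elasticsearch can cache the results of frequently used filters, which speeds up later searches that reuse the same condition.

Example: an index stores student records, and you want the students older than 30 without any scoring.

Because only the age condition matters here, a filter is the natural choice: it skips score computation, and its result can be cached.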
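
The student example could be expressed as a request body along the lines of the sketch below. It only builds and prints the JSON. The field name age and the helper filter_only_search are assumptions made for illustration, not names taken from a real index.

```python
import json

def filter_only_search(field, op, value):
    """Build a search body whose only condition runs in filter context."""
    # A bool query with just a filter clause computes no relevance score
    # for the range condition; documents either pass or they do not.
    return {"query": {"bool": {"filter": [{"range": {field: {op: value}}}]}}}

# Hypothetical field name "age"; "gt" means strictly greater than.
body = filter_only_search("age", "gt", 30)
print(json.dumps(body, indent=2))
```

The printed body can be sent to the _search endpoint of whichever index holds the student documents.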

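Data Types

Choose a field type in Elasticsearch according to the nature of the data and how it will be queried. The sections below give typical use cases and an example for each type.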

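Text

Use case: storing full text that needs full-text search and analysis.
Examples: article bodies, comments, descriptions.

text is for full-text search and analysis, while keyword is for exact matching and aggregations. A text value goes through analysis (tokenization and so on), whereas a keyword value is kept as the original string.

A text field is analyzed into terms, and an inverted index is built over those terms to support search.

A value stored in a keyword field is not analyzed; the whole string is indexed as a single term. It is mainly used for exact matching and aggregations.

The wildcard type is built for wildcard (pattern) searches; its values are not analyzed either.

A wildcard query on a text field is matched against the individual terms produced by analysis, whereas a wildcard query on a keyword field must match the entire stored value.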

PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "content": "Elasticsearch is a distributed search engine."
}

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "search engine"
    }
  }
}

Keyword

Use case: exact matching, typically for fields that do not need full-text search.
Examples: categories, tags, keywords.

PUT /my_index
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "standard"
      },
      "keyword_field": {
        "type": "keyword"
      }
    }
  }
}


POST /my_index/_doc/1
{
  "text_field": "Elasticsearch is powerful",
  "keyword_field": "Elasticsearch is powerful"
}


GET /my_index/_search
{
  "query": {
    "term": {
      "keyword_field": "Elasticsearch"
    }
  }
}
Result:
   no hits (a term query on a keyword field has to equal the whole stored value)
   
GET /my_index/_search
{
  "query": {
    "wildcard": {
      "keyword_field": "power*"
    }
  }
}
Result:
   no hits (the pattern must match the whole value, which starts with "Elasticsearch")

GET /my_index/_search
{
  "query": {
    "match": {
      "text_field": "Elasticsearch"
    }
  }
}
Result:
   the document is returned

GET /my_index/_search
{
  "query": {
    "wildcard": {
      "text_field": "power*"
    }
  }
}
Result:
   the document is returned (the pattern matches the indexed term "powerful")

When you search for Elasticsearch, text_field matches because its value was split into separate terms. keyword_field matches only when you search for the exact, complete string "Elasticsearch is powerful".
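
The four results above can be reproduced with a small simulation. This is only a sketch: analyze is a rough stand-in for the standard analyzer (split on non-alphanumeric characters, lowercase), and fnmatchcase stands in for wildcard matching (it also accepts [..] character sets, which wildcard queries do not). All helper names are made up for illustration.

```python
import re
from fnmatch import fnmatchcase

def analyze(text):
    """Rough approximation of the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

value = "Elasticsearch is powerful"
text_terms = analyze(value)   # terms indexed for a text field
keyword_terms = [value]       # a keyword field indexes one unmodified term

def term_hit(terms, query):
    # term query: the query string is compared as-is with indexed terms
    return query in terms

def match_hit(terms, query):
    # match query: the query text is analyzed first, any term may match
    return any(t in terms for t in analyze(query))

def wildcard_hit(terms, pattern):
    # wildcard query: the pattern must cover a whole indexed term
    return any(fnmatchcase(t, pattern) for t in terms)

print(term_hit(keyword_terms, "Elasticsearch"))   # False
print(wildcard_hit(keyword_terms, "power*"))      # False
print(match_hit(text_terms, "Elasticsearch"))     # True
print(wildcard_hit(text_terms, "power*"))         # True
```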

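Wildcard

Use case: wildcard (pattern) searches.
Example: loose matching with wildcard patterns.

Wildcard queries can run against wildcard, keyword, and text fields.

The wildcard type is dedicated to wildcard searches, while text is a general-purpose type that supports full-text search, analysis, and other text processing.

A wildcard query uses the wildcard characters * and ? to match values in documents.

* matches zero or more characters; ? matches exactly one character.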

PUT /my_index
{
  "mappings": {
    "properties": {
      "product_name": {
        "type": "wildcard"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "product_name": "Elasticsearch"
}

GET /my_index/_search
{
  "query": {
    "wildcard": {
      "product_name": "Elast*"
    }
  }
}

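Long, Integer, Short, Byte (integer types)

Use case: storing whole numbers.
Examples: ages, quantities, counters.

All of these types store integers; they differ in the range of values they can hold.

integer, long, short, and byte are signed 32-bit, 64-bit, 16-bit, and 8-bit integers respectively. Pick the smallest type that covers your data range.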

PUT /my_index
{
  "mappings": {
    "properties": {
      "age": {
        "type": "long"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "age": 25
}

GET /my_index/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  }
}
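
To make the ranges of the four integer types concrete, here is a minimal sketch that derives them from the bit widths given above and picks the smallest type that fits a value range. The helper names are invented for this illustration.

```python
# Bit widths of the signed integer field types described above.
BITS = {"byte": 8, "short": 16, "integer": 32, "long": 64}

def value_range(type_name):
    """Return (min, max) of a signed two's-complement integer type."""
    bits = BITS[type_name]
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def smallest_type(lo, hi):
    """Pick the narrowest type whose range covers [lo, hi]."""
    for name in ("byte", "short", "integer", "long"):
        tmin, tmax = value_range(name)
        if tmin <= lo and hi <= tmax:
            return name
    raise ValueError("range does not fit in a 64-bit signed integer")

print(value_range("byte"))     # (-128, 127)
print(smallest_type(0, 150))   # short
```

For an age field, for example, byte already covers 0 to 127 and short covers anything a human age could reach.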

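Double, Float, Half_Float, Scaled_Float (floating-point types)

Use case: storing floating-point numbers.
Examples: prices, coordinates, percentages.

float and double are 32-bit and 64-bit floating-point numbers, and half_float is a 16-bit one. scaled_float stores a floating-point number as a long by multiplying it by a fixed scaling factor that you specify.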

PUT /my_index
{
  "mappings": {
    "properties": {
      "price": {
        "type": "double"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "price": 49.99
}

GET /my_index/_search
{
  "query": {
    "range": {
      "price": {
        "lte": 50.0
      }
    }
  }
}
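
The idea behind scaled_float can be sketched in a few lines: multiply by the scaling factor, round to a whole number, and divide again when reading. The rounding rule used here (round half up) is an assumption about the behavior, and encode and decode are hypothetical helper names.

```python
import math

def encode(value, scaling_factor):
    """Scale and round to the nearest whole number (half rounds up)."""
    return math.floor(value * scaling_factor + 0.5)

def decode(stored, scaling_factor):
    """Recover the approximate original value."""
    return stored / scaling_factor

stored = encode(49.99, 100)
print(stored, decode(stored, 100))        # 4999 49.99

# With a factor of 100 only two decimal places survive.
print(decode(encode(0.125, 100), 100))    # 0.13
```

A price with two decimals and a scaling factor of 100 round-trips exactly, while values with more decimals lose precision, which is the trade-off for the more compact storage.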

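Date, Date_Nanos (date types)

Use case: storing dates and times.
Examples: publication dates, event timestamps.

date stores dates with millisecond resolution and date_nanos with nanosecond resolution; date_range represents a range of dates.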

PUT /my_index
{
  "mappings": {
    "properties": {
      "publish_date": {
        "type": "date"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "publish_date": "2023-01-01T12:00:00"
}

GET /my_index/_search
{
  "query": {
    "range": {
      "publish_date": {
        "gte": "2023-01-01"
      }
    }
  }
}
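
Internally a date is a number of milliseconds since the epoch, and a date string without a time zone is interpreted as UTC. The sketch below shows that conversion for the two strings used in the example; to_epoch_millis is a helper written for this illustration only.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(iso_text):
    """Convert an ISO 8601 date or datetime string to epoch milliseconds."""
    dt = datetime.fromisoformat(iso_text)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # no zone given: treat as UTC
    delta = dt - EPOCH
    # Integer arithmetic avoids floating-point rounding of the timestamp.
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000

doc_value = to_epoch_millis("2023-01-01T12:00:00")
lower_bound = to_epoch_millis("2023-01-01")
print(doc_value, lower_bound)
print(doc_value >= lower_bound)   # the range query's gte condition
```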

Date_Range (date ranges)

Use case: representing a range of dates.
Examples: the dates of a campaign, planned time slots.

PUT /my_index
{
  "mappings": {
    "properties": {
      "date_range": {
        "type": "date_range"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "date_range": {
    "gte": "2023-01-01",
    "lte": "2023-12-31"
  }
}


GET /my_index/_search
{
  "query": {
    "range": {
      "date_range": {
        "gte": "2023-06-01",
        "lte": "2023-09-01"
      }
    }
  }
}
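
A range query against a range field compares two intervals, and its relation parameter (intersects by default, or within, or contains) decides what counts as a match. The sketch below spells out those three relations for the dates used above; relation_holds is a hypothetical helper and both bounds are treated as inclusive.

```python
from datetime import date

def relation_holds(doc, query, relation="intersects"):
    """Compare a stored range with a query range, bounds inclusive."""
    d_lo, d_hi = doc
    q_lo, q_hi = query
    if relation == "intersects":   # the two ranges overlap somewhere
        return d_lo <= q_hi and q_lo <= d_hi
    if relation == "within":       # stored range lies inside the query range
        return q_lo <= d_lo and d_hi <= q_hi
    if relation == "contains":     # stored range covers the query range
        return d_lo <= q_lo and q_hi <= d_hi
    raise ValueError(f"unknown relation: {relation}")

stored = (date(2023, 1, 1), date(2023, 12, 31))
asked = (date(2023, 6, 1), date(2023, 9, 1))

print(relation_holds(stored, asked))              # True
print(relation_holds(stored, asked, "within"))    # False
print(relation_holds(stored, asked, "contains"))  # True
```

The same comparison applies to the numeric and IP range types, since only the ordering of the bounds matters.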

Boolean

Use case: storing true/false values.
Examples: whether a task is completed, whether a feature is enabled.

PUT /my_index
{
  "mappings": {
    "properties": {
      "is_completed": {
        "type": "boolean"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "is_completed": true
}

GET /my_index/_search
{
  "query": {
    "term": {
      "is_completed": true
    }
  }
}

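Binary

Use case: storing binary data such as images or documents as a Base64-encoded string. The value cannot be searched by content.
Examples: images, PDF documents.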

PUT /my_index
{
  "mappings": {
    "properties": {
      "image_data": {
        "type": "binary"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "image_data": "U29tZSBiaW5hcnkgYmxvYg=="
}

GET /my_index/_search
{
  "query": {
    "exists": {
      "field": "image_data"
    }
  }
}
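
A binary field expects its value as a Base64 string, so raw bytes have to be encoded before indexing and decoded after retrieval. A minimal sketch with the standard library follows; the sample bytes are a stand-in for real image data.

```python
import base64

def to_field_value(raw_bytes):
    """Encode raw bytes as the Base64 text a binary field accepts."""
    return base64.b64encode(raw_bytes).decode("ascii")

def from_field_value(text):
    """Decode the Base64 text back into the original bytes."""
    return base64.b64decode(text)

raw = b"Some binary blob"            # placeholder for real file contents
encoded = to_field_value(raw)
print(encoded)                        # U29tZSBiaW5hcnkgYmxvYg==
print(from_field_value(encoded) == raw)
```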

Integer_Range, Float_Range, Long_Range, Double_Range (range types)

Use case: representing a range of numeric values.
Examples: price ranges, rating ranges.

PUT /my_index
{
  "mappings": {
    "properties": {
      "age_range": {
        "type": "integer_range"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "age_range": {
    "gte": 18,
    "lte": 35
  }
}

GET /my_index/_search
{
  "query": {
    "range": {
      "age_range": {
        "gte": 25,
        "lte": 40
      }
    }
  }
}

Ip_Range (IP address ranges)

Use case: storing ranges of IP addresses.
Examples: IP-based filtering, address blocks.

PUT /my_index
{
  "mappings": {
    "properties": {
      "ip_address_range": {
        "type": "ip_range"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "ip_address_range": {
    "gte": "192.168.0.1",
    "lte": "192.168.0.255"
  }
}

GET /my_index/_search
{
  "query": {
    "range": {
      "ip_address_range": {
        "gte": "192.168.0.100",
        "lte": "192.168.0.200"
      }
    }
  }
}
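
IP ranges are ordered intervals of addresses, and an ip_range value can also be given in CIDR notation instead of explicit bounds. The sketch below uses the standard ipaddress module to show both views; the helper names are invented for this example.

```python
import ipaddress

def ip_in_range(ip, lo, hi):
    """True if ip lies between lo and hi, bounds inclusive."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(lo) <= addr <= ipaddress.ip_address(hi)

def cidr_bounds(cidr):
    """First and last address covered by a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net[0]), str(net[-1])

print(ip_in_range("192.168.0.100", "192.168.0.1", "192.168.0.255"))  # True
print(ip_in_range("192.168.1.5", "192.168.0.1", "192.168.0.255"))    # False
print(cidr_bounds("192.168.0.0/24"))  # ('192.168.0.0', '192.168.0.255')
```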

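Object

Use case: a JSON object that contains other fields.
Examples: user information, address information.

The object type is a plain inner object. Its sub-fields are indexed as ordinary fields under dotted names (for example user_info.name), and no relationship between them is kept; in an array of objects, the values of each sub-field are merged across all elements.

It suits simple nested structures whose fields do not need to be matched together element by element.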

PUT /my_index
{
  "mappings": {
    "properties": {
      "user_info": {
        "type": "object",
        "properties": {
          "name": { "type": "text" },
          "age": { "type": "integer" },
          "email": { "type": "keyword" }
        }
      }
    }
  }
}

POST /my_index/_doc/1
{
  "user_info": {
    "name": "John Doe",
    "age": 30,
    "email": "john.doe@example.com"
  }
}

GET /my_index/_search
{
  "query": {
    "match": {
      "user_info.name": "John"
    }
  }
}

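Nested

Use case: arrays of objects whose fields must be queried together, object by object.
Examples: the comments on an article, the line items of an order.

The nested type indexes every object in the array as a separate hidden document, so a nested query can apply its conditions to one object at a time. It does not create a parent-child relation; that is the job of the join type.

It suits documents that contain lists of related objects, such as the comments on an article or the line items of an order.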

PUT /my_index
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",
        "properties": {
          "user": { "type": "keyword" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

POST /my_index/_doc/1
{
  "comments": [
    { "user": "user1", "comment": "Great article!" },
    { "user": "user2", "comment": "Interesting points." }
  ]
}

GET /my_index/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": {
          "comments.comment": "article"
        }
      }
    }
  }
}
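
The practical difference between object and nested shows up when two conditions must hold for the same array element. The simulation below is a simplified sketch, not the real index structure: object-style matching pools each sub-field's values across the whole array, while nested-style matching checks one element at a time. The tokenization is deliberately crude.

```python
comments = [
    {"user": "user1", "comment": "Great article!"},
    {"user": "user2", "comment": "Interesting points."},
]

def words(text):
    """Very rough tokenizer: split on spaces, strip punctuation, lowercase."""
    return {w.strip(".!").lower() for w in text.split()}

def object_style_match(items, user, word):
    # Values of each sub-field are pooled, so the link between
    # a user and that user's own comment is lost.
    users = {c["user"] for c in items}
    pooled = set().union(*(words(c["comment"]) for c in items))
    return user in users and word in pooled

def nested_style_match(items, user, word):
    # Each element is checked on its own, like a separate hidden document.
    return any(c["user"] == user and word in words(c["comment"]) for c in items)

# user1 never wrote "interesting", yet the pooled view reports a match.
print(object_style_match(comments, "user1", "interesting"))  # True
print(nested_style_match(comments, "user1", "interesting"))  # False
```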

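Flattened

Use case: indexing a whole object as a single field, which makes objects with many or unpredictable keys easier to search.
Examples: the product details inside an order.

The flattened type maps an entire object to one field. Every leaf value is indexed as a keyword, so values are matched exactly and are neither analyzed nor treated as numbers. It suits deeply nested or loosely structured objects.

Use it when you need basic search and aggregation over such an object without mapping each sub-field.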

PUT /my_index
{
  "mappings": {
    "properties": {
      "order_details": {
        "type": "flattened"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "order_details": {
    "product_name": "Smartphone",
    "price": 499.99,
    "quantity": 2
  }
}

GET /my_index/_search
{
  "query": {
    "match": {
      "order_details.product_name": "Smartphone"
    }
  }
}
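
One way to picture a flattened field is as a map from dotted key paths to string values. The sketch below illustrates that view only; flatten is a hypothetical helper, and the real on-disk encoding differs.

```python
def flatten(obj, prefix=""):
    """Turn a nested dict into {dotted.path: string value} pairs."""
    out = {}
    for key, val in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            out.update(flatten(val, path))
        else:
            out[path] = str(val)   # every leaf is kept as a keyword-like string
    return out

order_details = {"product_name": "Smartphone", "price": 499.99, "quantity": 2}
flat = flatten(order_details)
print(flat)

# Because leaves are strings, comparisons are lexicographic, not numeric.
print("10" < "9")   # True
```

This is why exact lookups such as order_details.product_name work well on a flattened field, while numeric range queries on its values can give surprising results.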

Ip (IP addresses)

Use case: storing IPv4 or IPv6 addresses.
Examples: user IP addresses, server IP addresses.

PUT /my_index
{
  "mappings": {
    "properties": {
      "ip_address": {
        "type": "ip"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "ip_address": "192.168.0.1"
}

GET /my_index/_search
{
  "query": {
    "term": {
      "ip_address": "192.168.0.1"
    }
  }
}

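TokenCount (token counts)

Use case: storing the number of tokens that an analyzer produces from a string.
Examples: the number of words in an article, the number of keywords in a document.

A token_count field takes a string, analyzes it with the configured analyzer (a required parameter), and indexes the resulting count as an integer.

PUT /my_index
{
  "mappings": {
    "properties": {
      "word_count": {
        "type": "token_count",
        "analyzer": "standard"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "word_count": "Elasticsearch is a distributed search engine"
}

GET /my_index/_search
{
  "query": {
    "range": {
      "word_count": {
        "gte": 5
      }
    }
  }
}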
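
What a token_count field stores can be approximated by counting the tokens of a string. The sketch below uses a simple regular expression as a stand-in for an analyzer, which is an assumption; real analyzers handle punctuation, languages, and stop words differently.

```python
import re

def token_count(text):
    """Count word-like tokens, a rough proxy for an analyzer's output."""
    return len(re.findall(r"\w+", text))

sentence = "Elasticsearch is a distributed search engine"
count = token_count(sentence)
print(count)         # 6
print(count >= 5)    # True
```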

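Percolator (stored queries)

Use case: storing queries so that incoming documents can be checked against them.
Examples: saved searches or alerts that are evaluated when a new document arrives.

Every field that a stored query refers to must already exist in the mapping, which is why category is mapped below.

PUT /my_index
{
  "mappings": {
    "properties": {
      "filter_query": {
        "type": "percolator"
      },
      "category": {
        "type": "keyword"
      }
    }
  }
}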

POST /my_index/_doc/1
{
  "filter_query": {
    "match": {
      "category": "Technology"
    }
  }
}

GET /my_index/_search
{
  "query": {
    "percolate": {
      "field": "filter_query",
      "document": {
        "category": "Technology"
      }
    }
  }
}
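
Percolation reverses the usual direction of search: queries are stored, and a document is the input. The toy sketch below models stored queries as plain predicates to show the idea. The second stored query and all names are made up for illustration and say nothing about how the engine implements it.

```python
# Stored "queries", keyed by id. Each one is a predicate over a document.
stored_queries = {
    "1": lambda doc: doc.get("category") == "Technology",
    "2": lambda doc: doc.get("category") == "Sports",   # hypothetical extra query
}

def percolate(doc):
    """Return the ids of all stored queries that match the document."""
    return sorted(qid for qid, matches in stored_queries.items() if matches(doc))

print(percolate({"category": "Technology"}))   # ['1']
print(percolate({"category": "Cooking"}))      # []
```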

Search_As_You_Type (as-you-type suggestions)

Use case: as-you-type completion in a search box.
Examples: live search suggestions, autocompletion in a search box.

PUT /my_index
{
  "mappings": {
    "properties": {
      "suggest_field": {
        "type": "search_as_you_type"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "suggest_field": "Elasticsearch"
}

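A search_as_you_type field is queried with a multi_match query of type bool_prefix over the field and its _2gram and _3gram subfields. The completion suggester requires the separate completion field type.

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Elasti",
      "type": "bool_prefix",
      "fields": [
        "suggest_field",
        "suggest_field._2gram",
        "suggest_field._3gram"
      ]
    }
  }
}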
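
Two ideas sit behind search_as_you_type: the field also indexes word shingles (the _2gram and _3gram subfields), and the last word a user types is treated as a prefix. The sketch below imitates both in plain Python. It is a simplification with invented helper names, using whitespace tokenization and OR semantics between the typed words.

```python
def shingles(text, size):
    """Word n-grams of the given size, as kept in the _2gram/_3gram subfields."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def bool_prefix_hit(indexed_text, typed):
    """Complete words match as terms; the last typed word matches as a prefix."""
    terms = indexed_text.lower().split()
    typed_words = typed.lower().split()
    if not typed_words:
        return False
    *full, last = typed_words
    return any(w in terms for w in full) or any(t.startswith(last) for t in terms)

print(shingles("quick brown fox", 2))               # ['quick brown', 'brown fox']
print(bool_prefix_hit("Elasticsearch", "Elasti"))   # True
print(bool_prefix_hit("Elasticsearch", "Open"))     # False
```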

Rank_Feature, Rank_Features (ranking features)

Use case: storing positive numeric features that boost relevance scores at query time.
Examples: user ratings, product ratings.

PUT /my_index
{
  "mappings": {
    "properties": {
      "user_rating": {
        "type": "rank_feature"
      }
    }
  }
}

POST /my_index/_doc/1
{
  "user_rating": 4.5
}

A rank_feature field can only be queried with the rank_feature query, which turns the stored value into a score; field_value_factor does not work on this field type.

GET /my_index/_search
{
  "query": {
    "rank_feature": {
      "field": "user_rating"
    }
  }
}
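
The rank_feature query scores documents with a function of the stored value; the saturation function, value / (value + pivot), is one of the available choices. The sketch below evaluates that formula for a few ratings. The pivot of 4.5 is an arbitrary value picked for this example.

```python
def saturation(value, pivot):
    """Saturation score: grows with value but always stays below 1."""
    if value <= 0 or pivot <= 0:
        raise ValueError("value and pivot must be positive")
    return value / (value + pivot)

PIVOT = 4.5   # assumed pivot; a value equal to the pivot scores exactly 0.5

for rating in (1.0, 4.5, 9.0):
    print(rating, saturation(rating, PIVOT))
```

Higher ratings always score higher, but the gain flattens out, so a single very large feature value cannot dominate the ranking.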

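Dense_Vector (dense vectors)

Use case: storing numeric vectors, typically for machine-learning scenarios.
Examples: feature vectors, embeddings.

This type is specific to Elasticsearch; OpenSearch stores vectors in the knn_vector type of its k-NN plugin instead.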

PUT /my_index
{
  "mappings": {
    "properties": {
      "feature_vector": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}

POST /my_index/_doc/1
{
  "feature_vector": [0.1, 0.5, 0.8]
}

GET /my_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'feature_vector') + 1.0",
        "params": {
          "query_vector": [0.2, 0.3, 0.7]
        }
      }
    }
  }
}
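
The script in the query above scores each document by cosine similarity plus 1.0, which keeps the score non-negative. The same arithmetic can be checked by hand with the sketch below, using the two vectors from the example. The function name is chosen for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vector = [0.2, 0.3, 0.7]
feature_vector = [0.1, 0.5, 0.8]

# Cosine similarity lies in [-1, 1]; adding 1.0 shifts it into [0, 2].
score = cosine_similarity(query_vector, feature_vector) + 1.0
print(round(score, 4))
```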