Building Short-Video E-commerce from Scratch: OpenSearch/Elasticsearch Query Summary

Contents

- Match Queries (Full-Text Queries)
- Term Queries
- Range Queries
- Wildcard Queries
- Fuzzy Queries
- Prefix Queries
- Nested Queries
- Exists Queries
- Boolean Queries (Compound Queries)
- Filter Queries
- Aggregation Queries
  - 1. Terms Aggregation
  - 2. Range Aggregation
  - 3. Date Histogram Aggregation
- Script Query
- Search Results
  - Pagination
  - Sorting
  - Highlighting

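OpenSearch and Elasticsearch are both search and analytics engines, and they share a very similar query language. Queries fall into several types, which are summarized in the sections below.

Two query languages are commonly used:

Query domain-specific language (DSL): the primary OpenSearch query language, which supports building complex, fully customizable queries.
Dashboards Query Language (DQL): a simple text-based query language for filtering data in OpenSearch Dashboards.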

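Match Queries (Full-Text Queries)

Match queries are among the simplest query types and search text fields for the given terms. The two main variants are match and match_phrase.

A match query analyzes the query string and, by default, returns documents that contain any one of the resulting terms. A match_phrase query requires the terms to appear in the document adjacent to each other and in the same order as in the query.

Match: performs a full-text search.
Match Phrase: matches an exact phrase.
Match Phrase Prefix: matches documents that contain the given phrase, with the last term treated as a prefix.
Match All: matches all documents.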

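// Basic match query
GET /my_index/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch"
    }
  }
}

// match_phrase: the terms must appear in the text field in the same order.
// Roughly LIKE '%Elasticsearch is%', but evaluated on analyzed terms.
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "Elasticsearch is"
    }
  }
}

// match_phrase_prefix: like match_phrase, but the last term is matched as a prefix.
// Roughly a word-level LIKE 'Elastic%'.
GET /my_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "Elastic"
    }
  }
}

// multi_match: runs a match query over several fields.
// Roughly: title matches OR content matches.
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Elasticsearch",
      "fields": ["title", "content"]
    }
  }
}

// match_all: returns every document, like SELECT * FROM index in SQL.
GET /my_index/_search
{
  "query": {
    "match_all": {}
  }
}


// Mapping assumed by the queries above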
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard"
      },
      "title": {
        "type": "text"
      }
    }
  }
}
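
To make the difference between match and match_phrase concrete, here is a minimal Python sketch that imitates both on a single string. The analyze helper is only a rough stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and all function names are illustrative assumptions rather than any OpenSearch API.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match(field_text, query):
    """match with the default OR operator: any one query term is enough."""
    doc_terms = set(analyze(field_text))
    return any(term in doc_terms for term in analyze(query))

def match_phrase(field_text, query):
    """match_phrase: all query terms must be adjacent and in the same order."""
    doc, phrase = analyze(field_text), analyze(query)
    n = len(phrase)
    return n > 0 and any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))

title = "Elasticsearch is a search engine"
print(match(title, "engine Kibana"))            # True: "engine" is present
print(match_phrase(title, "Elasticsearch is"))  # True: adjacent, same order
print(match_phrase(title, "is Elasticsearch"))  # False: wrong order
```

The real engine also scores each hit; this sketch only answers whether a document matches.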

Term Queries

A term query matches the exact term stored in a field, without analyzing the query string. It is best suited to keyword fields: use it when you need an exact value rather than a match on part of a text.

// SQL analogy: SELECT * FROM my_index WHERE typeCode = 'software';
GET /my_index/_search
{
  "query": {
    "term": {
      "typeCode": "software"
    }
  }
}

// terms query (multiple values): typeCode IN ('technology', 'software')
GET /my_index/_search
{
  "query": {
    "terms": {
      "typeCode": ["technology", "software"]
    }
  }
}


// Mapping assumed by the queries above
{
  "mappings": {
    "properties": {
      "typeCode": {
        "type": "keyword"
      }
    }
  }
}

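What is the difference between match_phrase and term?

term query: performs an exact match and is mainly used on keyword fields. A keyword field is not tokenized, so the whole field value is indexed as a single term. The query string is not analyzed either; it is compared to that term as-is.

match_phrase query: mainly used on text fields. The query string goes through the field's analyzer, and the resulting terms must appear in the document adjacent to each other and in the same order. It is typically used for full-text and phrase search.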
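
A common pitfall follows from this: a term query against a text field often finds nothing, because the indexed terms were lowercased by the analyzer while the query string was not. The sketch below imitates that behavior in plain Python; the whitespace-and-lowercase analysis is a simplifying assumption, and the function names are hypothetical.

```python
def term_on_keyword(stored_value, query):
    """keyword field: the whole value is one term, compared as-is."""
    return stored_value == query

def term_on_text(stored_text, query):
    """text field: the value was analyzed at index time (here: lowercased and
    split on whitespace), but the term query string is NOT analyzed."""
    indexed_terms = stored_text.lower().split()
    return query in indexed_terms

print(term_on_keyword("software", "software"))          # True
print(term_on_keyword("software", "Software"))          # False: case differs
print(term_on_text("Open Source Software", "Software")) # False: indexed term is "software"
print(term_on_text("Open Source Software", "software")) # True
```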

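Range Queries

A range query matches values that fall within a given range, such as dates or numbers. gte and lte are inclusive bounds; gt and lt are the exclusive variants.

// create_date BETWEEN '2022-01-01' AND '2022-12-31'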
{
  "query": {
    "range": {
      "create_date": {
        "gte": "2022-01-01",
        "lte": "2022-12-31"
      }
    }
  }
}

{
  "mappings": {
    "properties": {
      "create_date": {
        "type": "date"
      }
    }
  }
}

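Wildcard Queries

A wildcard query matches terms against a pattern: * matches zero or more characters and ? matches exactly one character. Patterns that begin with a wildcard can be slow, because many terms have to be examined.

// title LIKE 'elast%'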
{
  "query": {
    "wildcard": {
      "title": "elast*"
    }
  }
}
// LIKE 'jo%hn'
GET /my_index/_search
{
  "query": {
    "wildcard": {
      "username": "jo*hn"
    }
  }
}
//  LIKE 'john.d_e@example.com'
GET /my_index/_search
{
  "query": {
    "wildcard": {
      "email": "john.d?e@example.com"
    }
  }
}
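
The two wildcard characters can be illustrated by translating a pattern into a regular expression. This is only a sketch of the matching rule for a single term, not how the engine evaluates the query internally; wildcard_match is a hypothetical helper name.

```python
import re

def wildcard_match(term, pattern):
    """Return True if the whole term matches: * = any run of characters, ? = exactly one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.fullmatch("".join(parts), term, re.DOTALL) is not None

print(wildcard_match("elasticsearch", "elast*"))                        # True
print(wildcard_match("john", "jo*hn"))                                  # True: * may match nothing
print(wildcard_match("joseph", "jo*hn"))                                # False
print(wildcard_match("john.doe@example.com", "john.d?e@example.com"))   # True
```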

Fuzzy Queries

A fuzzy query matches terms that are similar to the search term, which tolerates a limited number of spelling mistakes.

{
  "query": {
    "fuzzy": {
      "name": {
        "value": "elasticserch",
        "fuzziness": "AUTO"
      }
    }
  }
}

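The fuzziness parameter sets the tolerance of the match: the maximum edit distance allowed between the search term and a term in the document. An edit is the insertion, deletion, or substitution of one character; by default, swapping two adjacent characters also counts as one edit.

fuzziness accepts one of the following values:

0: exact match; no edits allowed.
1: one edit allowed.
2: two edits allowed. This is the maximum.
AUTO: chooses the distance from the term length: 0 for terms of 1-2 characters, 1 for 3-5 characters, and 2 for longer terms.

Adjusting fuzziness controls how strict the fuzzy query is, so that it matches similar terms within the chosen edit distance.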
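
To see what an edit distance of 1 or 2 means in practice, the following sketch computes the classic Levenshtein distance and the distance that AUTO would allow with its default length thresholds. It is an illustration only: it does not count a transposition as a single edit, and the function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

def auto_fuzziness(term):
    """Allowed distance under AUTO with the default thresholds (3 and 6)."""
    n = len(term)
    if n <= 2:
        return 0
    if n <= 5:
        return 1
    return 2

query = "elasticserch"
print(edit_distance(query, "elasticsearch"))  # 1: one missing "a"
print(auto_fuzziness(query))                  # 2: 12 characters
```

Because the distance (1) is within the allowed maximum (2), the misspelled query from the example above would match the term elasticsearch.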

Prefix Queries

A prefix query matches documents in which the field contains a term that starts with the given prefix.

// tag LIKE 'tech%'
{
  "query": {
    "prefix": {
      "tag": "tech"
    }
  }
}

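Nested Queries

A nested query searches nested objects. Each object in a nested field is indexed as a separate hidden document, so all conditions of the query must be satisfied by the same object.

// Create the index and sample data: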
PUT /my_index
{
  "mappings": {
    "properties": {
      "people": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "text"
          },
          "age": {
            "type": "integer"
          }
        }
      }
    }
  }
}

POST /my_index/_doc/1
{
  "people": [
    {"name": "John", "age": 30},
    {"name": "Alice", "age": 25}
  ]
}

POST /my_index/_doc/2
{
  "people": [
    {"name": "Bob", "age": 35},
    {"name": "Eve", "age": 28}
  ]
} 
// Nested query
// SELECT *
//	FROM my_index
//        JOIN unnest(people) AS person ON true
//  WHERE person.name = 'John' AND person.age >= 30;

GET /my_index/_search
{
  "query": {
    "nested": {
      "path": "people",
      "query": {
        "bool": {
          "must": [
            { "match": { "people.name": "John" }},
            { "range": { "people.age": { "gte": 30 }}}
          ]
        }
      }
    }
  }
}
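
Why the nested type matters can be shown with a small simulation. Without nested, an array of objects is flattened into independent value lists per field, so conditions can be satisfied by different objects. The sketch below contrasts the two behaviors on document 1 from the example; the function names are hypothetical and exact name equality stands in for the match query.

```python
def matches_flattened(doc, name, min_age):
    """Plain object array: fields are flattened, so name and age
    may come from different people."""
    names = {p["name"] for p in doc["people"]}
    ages = [p["age"] for p in doc["people"]]
    return name in names and any(age >= min_age for age in ages)

def matches_nested(doc, name, min_age):
    """nested field: both conditions must hold for the same person."""
    return any(p["name"] == name and p["age"] >= min_age for p in doc["people"])

doc1 = {"people": [{"name": "John", "age": 30}, {"name": "Alice", "age": 25}]}

print(matches_flattened(doc1, "Alice", 30))  # True: Alice's name + John's age (false positive)
print(matches_nested(doc1, "Alice", 30))     # False: Alice is 25
print(matches_nested(doc1, "John", 30))      # True
```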

Exists Queries

An exists query matches documents that contain an indexed value for the given field.

// age IS NOT NULL
{
  "query": {
    "exists": {
      "field": "age"
    }
  }
}

Boolean Queries (Compound Queries)

A bool query combines multiple queries to express more complex logic. Its clauses are must (AND), should (OR), must_not (NOT), and filter (AND, without scoring).

{
  "query": {
    "bool": {
      "must": [
        { "match": { "field1": "value1" } },
        { "range": { "field2": { "gte": 10, "lte": 20 } } }
      ],
      "must_not": [
        { "term": { "field3": "value3" } }
      ],
      "should": [
        { "term": { "field4": "value4" } }
      ]
    }
  }
}

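Filter Queries

A filter is a clause of the bool query that restricts which documents match without affecting the relevance score. Filters are typically used for exact-match and range conditions, and their results can be cached.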

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "field1": "value1" } },
        { "range": { "field2": { "gte": 10, "lte": 20 } } }
      ]
    }
  }
}
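
The interaction of the four bool clauses can be sketched as a small evaluator. It is a simplification under stated assumptions: each clause is a Python predicate, and the "score" simply counts matching must and should clauses instead of computing real relevance. The name bool_query and its parameters are illustrative only.

```python
def bool_query(doc, must=(), filter_=(), must_not=(), should=()):
    """Return (matched, score) for one document.
    must/filter_: all must match; only must contributes to the score.
    must_not: none may match. should: optional when must or filter_ is
    present, otherwise at least one should clause has to match."""
    if not all(clause(doc) for clause in must):
        return (False, 0)
    if not all(clause(doc) for clause in filter_):
        return (False, 0)
    if any(clause(doc) for clause in must_not):
        return (False, 0)
    should_hits = sum(1 for clause in should if clause(doc))
    if should and not must and not filter_ and should_hits == 0:
        return (False, 0)
    return (True, len(must) + should_hits)

doc = {"field1": "value1", "field2": 15, "field3": "other", "field4": "value4"}
is_value1 = lambda d: d["field1"] == "value1"
in_range = lambda d: 10 <= d["field2"] <= 20
is_value3 = lambda d: d["field3"] == "value3"
is_value4 = lambda d: d["field4"] == "value4"

print(bool_query(doc, must=[is_value1, in_range], must_not=[is_value3], should=[is_value4]))
# (True, 3): both must clauses and the should clause count toward the score
print(bool_query(doc, filter_=[is_value1, in_range]))
# (True, 0): filters select documents but add nothing to the score
```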

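Aggregation Queries

Aggregation queries compute summaries over the data, such as averages, sums, minimums, and maximums, or group documents into buckets.

Suppose we have a collection of product documents, each containing the product name (name), category (category), price (price), and sale timestamp (timestamp).

Index mapping: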

PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "category": {
        "type": "keyword"
      },
      "price": {
        "type": "float"
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}

Sample documents:

POST /products/_bulk
{"index":{}}
{"name": "Laptop", "category": "Electronics", "price": 1200.00, "timestamp": "2023-01-01T08:00:00"}
{"index":{}}
{"name": "Camera", "category": "Electronics", "price": 500.00, "timestamp": "2023-01-01T09:30:00"}
{"index":{}}
{"name": "Backpack", "category": "Fashion", "price": 50.00, "timestamp": "2023-01-02T12:45:00"}
{"index":{}}
{"name": "Headphones", "category": "Electronics", "price": 100.00, "timestamp": "2023-01-02T14:15:00"}
{"index":{}}
{"name": "Jacket", "category": "Fashion", "price": 150.00, "timestamp": "2023-01-03T16:30:00"}

1. Terms Aggregation

Group products by category:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category"
      }
    }
  }
}

Result:

{
  "aggregations": {
    "categories": {
      "buckets": [
        {"key": "Electronics", "doc_count": 3},
        {"key": "Fashion", "doc_count": 2}
      ]
    }
  }
}

2. Range Aggregation

Count products per price range. Each range includes its from value and excludes its to value:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          {"from": 0, "to": 100},
          {"from": 100, "to": 200},
          {"from": 200, "to": 1000}
        ]
      }
    }
  }
}

Result (abbreviated; the 1200.00 Laptop falls outside every range):

{
  "aggregations": {
    "price_ranges": {
      "buckets": [
        {"key": "0.0-100.0", "doc_count": 1},
        {"key": "100.0-200.0", "doc_count": 2},
        {"key": "200.0-1000.0", "doc_count": 1}
      ]
    }
  }
}

3. Date Histogram Aggregation

Group products by month:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "monthly_activity": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "month",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Result (all five sample documents are dated January 2023, so there is a single bucket):

{
  "aggregations": {
    "monthly_activity": {
      "buckets": [
        {"key_as_string": "2023-01-01", "key": 1672531200000, "doc_count": 5}
      ]
    }
  }
}
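
As a cross-check, the three aggregations can be recomputed over the five sample documents in plain Python. This is a sketch of the bucketing rules only (terms: group by value; range: from inclusive, to exclusive; date histogram: truncate to the first day of the month); the helper names are assumptions.

```python
from collections import Counter
from datetime import datetime

products = [
    {"name": "Laptop", "category": "Electronics", "price": 1200.0, "timestamp": "2023-01-01T08:00:00"},
    {"name": "Camera", "category": "Electronics", "price": 500.0, "timestamp": "2023-01-01T09:30:00"},
    {"name": "Backpack", "category": "Fashion", "price": 50.0, "timestamp": "2023-01-02T12:45:00"},
    {"name": "Headphones", "category": "Electronics", "price": 100.0, "timestamp": "2023-01-02T14:15:00"},
    {"name": "Jacket", "category": "Fashion", "price": 150.0, "timestamp": "2023-01-03T16:30:00"},
]

def terms_agg(docs, field):
    """Buckets per distinct value, largest doc_count first."""
    return Counter(d[field] for d in docs).most_common()

def range_agg(docs, field, ranges):
    """Each range is [from, to): from is inclusive, to is exclusive."""
    return {f"{lo:.1f}-{hi:.1f}": sum(1 for d in docs if lo <= d[field] < hi)
            for lo, hi in ranges}

def monthly_histogram(docs, field):
    """Truncate each timestamp to the first day of its month and count."""
    months = Counter(datetime.fromisoformat(d[field]).strftime("%Y-%m-01") for d in docs)
    return sorted(months.items())

print(terms_agg(products, "category"))
# [('Electronics', 3), ('Fashion', 2)]
print(range_agg(products, "price", [(0, 100), (100, 200), (200, 1000)]))
# {'0.0-100.0': 1, '100.0-200.0': 2, '200.0-1000.0': 1}
print(monthly_histogram(products, "timestamp"))
# [('2023-01-01', 5)]
```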

Script Query

A script query filters documents with custom logic written as a script.

{
  "query": {
    "script": {
      "script": {
        "source": "doc['popularity'].value > params.threshold",
        "params": {
          "threshold": 1000
        }
      }
    }
  }
}

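Search Results

Pagination

There are several ways to paginate search results, each with its own trade-offs. The common ones are:

1. From/Size:

The from parameter is the offset of the first result to return (starting at 0).

The size parameter is the number of results to return. Together they return a subset of the search results.

Deep pagination with from and size is expensive, and by default OpenSearch rejects requests where from + size exceeds 10,000 (the index.max_result_window setting).

Pros: simple and clear; well suited to paging through small result sets.
Cons: performance degrades on large result sets, because every shard must collect and sort from + size documents just to discard all but the requested page.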

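2. Scroll API:

A scroll keeps the search context alive for a set period of time, so the results are not affected by changes to the data.

Pros: the client can retrieve further batches on demand while the context is alive; suitable for walking through large result sets.
Cons: consumes server resources, because the server must maintain the search context; it may not suit high-concurrency applications.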

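3. Search After:

Search After fetches the next page using the sort values of the last document on the previous page.

Unlike scroll, search_after is stateless, so the order of documents can change between requests as documents are added or deleted.

Pros: works like a cursor that moves to the next page without keeping a search context; suitable for distributed search.
Cons: the client must keep track of the sort values of the last document on the previous page.

Point in Time (PIT) combined with search_after is the preferred pagination method in OpenSearch, especially for deep pagination.

A PIT operates on a frozen view of the data and is not bound to a single query, so it supports consistent paging forward and backward.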

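About Point in Time (PIT)

PIT lets you run different queries against a dataset that is fixed at a point in time.
Normally, running the same query on an index several times can return different results, because documents are continually indexed, updated, and deleted.
With PIT, the state of the data is preserved for your searches even if the underlying data is modified in the meantime.
Its main use is deep pagination of search results in combination with search_after.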

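Limitations:

Scroll API:
- Search results are frozen at the time of the request, but the scroll is bound to one specific query.
- A scroll can only move forward. If a request for a page fails, the next request skips that page and returns the following one.
from and size parameters:
- Search results are not frozen in time, so pages can be inconsistent when documents are indexed or deleted.
- Not recommended for deep pagination, because each page request has to process all preceding results and discard them to get the requested page.
search_after (without PIT):
- Search results are not frozen in time, so pages can be inconsistent when documents are concurrently indexed or deleted.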

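Recommendations:

For most interactive use cases, Search After or From/Size is the practical choice. Which one to use depends on your requirements and performance constraints.

If the result set is small and performance is not a concern, From/Size is fine.
If the result set is large or you need to support many concurrent searches, use Search After (ideally with PIT); it handles large volumes of data without putting much pressure on server resources.

For paging through large result sets, prefer cursor-style pagination (Search After or Scroll) to avoid performance problems. For small result sets, the simpler From/Size approach may be more convenient.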

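1. Search After pagination example

// First request
// Sort on a field with unique values so the cursor is unambiguous. _uid was removed in
// Elasticsearch 7.0, so this example sorts on _id and assumes zero-padded IDs such as "001".
GET /your_index/_search
{
  "query": {
    "match_all": {}
  },
  "size": 3,
  "sort": [
    {"_id": "asc"}
  ]
}

// Response
{
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "hits": [
      {"_source": {"title": "Document 1"}, "sort": ["001"]},
      {"_source": {"title": "Document 2"}, "sort": ["002"]},
      {"_source": {"title": "Document 3"}, "sort": ["003"]}
    ]
  }
}

// Second request: search_after carries the sort values of the last hit on the previous page
GET /your_index/_search
{
  "query": {
    "match_all": {}
  },
  "size": 3,
  "sort": [
    {"_id": "asc"}
  ],
  "search_after": ["003"]
}

// Response
{
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "hits": [
      {"_source": {"title": "Document 4"}, "sort": ["004"]},
      {"_source": {"title": "Document 5"}, "sort": ["005"]},
      {"_source": {"title": "Document 6"}, "sort": ["006"]}
    ]
  }
}

When paginating with Search After, each request must supply the sort values of the last document from the previous page. The value is an array with one entry per sort field. In the example above, sorting on _id guarantees uniqueness; in production, a dedicated unique keyword field may be a better tiebreaker than _id.

Search After is generally more efficient than the Scroll API for this kind of paging, especially on large datasets.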
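
The practical difference between an offset (from/size) and a cursor (search_after) shows up when the data changes between two page requests. The following in-memory sketch is an illustration under simple assumptions: documents are dictionaries sorted by a unique id key, and both helper functions are hypothetical stand-ins for the two request styles.

```python
def from_size_page(sorted_docs, from_, size):
    """Offset paging: skip from_ documents, then take size."""
    return sorted_docs[from_:from_ + size]

def search_after_page(sorted_docs, size, after=None):
    """Cursor paging: take size documents whose sort value is greater than `after`."""
    start = 0
    if after is not None:
        start = next((i for i, d in enumerate(sorted_docs) if d["id"] > after),
                     len(sorted_docs))
    return sorted_docs[start:start + size]

docs = [{"id": i} for i in range(1, 11)]
page1 = search_after_page(docs, 3)
print([d["id"] for d in page1])  # [1, 2, 3]

# A new document that sorts first is indexed before page 2 is requested.
docs_after_insert = [{"id": 0}] + docs

print([d["id"] for d in from_size_page(docs_after_insert, 3, 3)])
# [3, 4, 5]: document 3 is returned a second time
print([d["id"] for d in search_after_page(docs_after_insert, 3, after=page1[-1]["id"])])
# [4, 5, 6]: the cursor continues right after the last document seen
```

The cursor avoids duplicates here, but it still reflects later changes to the index; a frozen view requires PIT or scroll.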

2. From/Size pagination example

// First request
GET /your_index/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 3
}

{
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "hits": [
      {"_source": {"title": "Document 1"}},
      {"_source": {"title": "Document 2"}},
      {"_source": {"title": "Document 3"}}
    ]
  }
}

// Second request
GET /your_index/_search
{
  "query": {
    "match_all": {}
  },
  "from": 3,
  "size": 3
}

{
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "hits": [
      {"_source": {"title": "Document 4"}},
      {"_source": {"title": "Document 5"}},
      {"_source": {"title": "Document 6"}}
    ]
  }
}

3. Scroll pagination example

// First request
POST /your_index/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "size": 3
}

// Response: OpenSearch returns a scroll ID that is used to fetch the following batches.
{
  "_scroll_id": "your_scroll_id",
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "hits": [
      {"_source": {"title": "Document 1"}},
      {"_source": {"title": "Document 2"}},
      {"_source": {"title": "Document 3"}}
    ]
  }
}

// Second request
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "your_scroll_id"
}

{
  "_scroll_id": "your_new_scroll_id",
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "hits": [
      {"_source": {"title": "Document 4"}},
      {"_source": {"title": "Document 5"}},
      {"_source": {"title": "Document 6"}}
    ]
  }
}

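Sorting

By default, full-text queries sort results by relevance score. You can sort by any field value in ascending or descending order by setting the order parameter to asc or desc.

For example, the following query searches for lines from the play Henry IV and sorts them by line_id in descending order: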

GET shakespeare/_search
{
  "query": {
    "term": {
      "play_name": {
        "value": "Henry IV"
      }
    }
  },
  "sort": [
    {
      "line_id": {
        "order": "desc"
      }
    }
  ]
}

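The sort parameter is an array, so you can specify multiple fields in order of priority.

If two documents have the same value for line_id, OpenSearch uses speech_number as the secondary sort key: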

GET shakespeare/_search
{
  "query": {
    "term": {
      "play_name": {
        "value": "Henry IV"
      }
    }
  },
  "sort": [
    {
      "line_id": {
        "order": "desc"
      }
    },
    {
      "speech_number": {
        "order": "desc"
      }
    }
  ]
}

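You can continue to sort by any number of fields. Sorting is not limited to numeric fields; you can also sort by date or timestamp fields.

To sort by a text field, use its raw keyword subfield. For example, sort by play_name.keyword: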

GET shakespeare/_search
{
  "query": {
    "term": {
      "play_name": {
        "value": "Henry IV"
      }
    }
  },
  "sort": [
    {
      "play_name.keyword": {
        "order": "desc"
      }
    }
  ]
}

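You can combine the search_after parameter with sort for more efficient paging. Make sure the search_after array has the same number of values as the sort array, in the same order.

The sort mode applies when sorting by arrays or multivalued fields. It specifies which array value is used to sort the document. For numeric fields that contain an array of numbers, you can sort by the avg, sum, or median mode. For both numeric and string data types, you can sort by the min or max mode.

The missing parameter specifies how documents without a value for the sort field are handled. It can be set to _last (list those documents last) or _first (list them first). The default is _last. You can also specify a custom value to use as the sort value for those documents.

If a field is not mapped, sorting on it fails by default. The unmapped_type parameter tells OpenSearch to ignore the missing mapping and treat the field as the given type, which is useful when searching across indexes where only some of them map the field.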
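
How sort mode and missing interact can be sketched with a small in-memory sort. This is an illustration, not the engine's implementation: the field name prices and the helper sort_docs are hypothetical, and only the _first and _last values of missing are modeled.

```python
import statistics

# Each mode reduces a multivalued field to the single value used for sorting.
MODES = {
    "min": min,
    "max": max,
    "sum": sum,
    "avg": statistics.mean,
    "median": statistics.median,
}

def sort_docs(docs, field, mode="min", order="asc", missing="_last"):
    """Sort by a multivalued numeric field; documents without values go first or last."""
    present = [d for d in docs if d.get(field)]
    absent = [d for d in docs if not d.get(field)]
    present.sort(key=lambda d: MODES[mode](d[field]), reverse=(order == "desc"))
    return absent + present if missing == "_first" else present + absent

docs = [
    {"name": "a", "prices": [5, 50]},
    {"name": "b", "prices": [20, 30]},
    {"name": "c"},  # no prices at all
]

print([d["name"] for d in sort_docs(docs, "prices", mode="min")])
# ['a', 'b', 'c']: 5 < 20, missing last
print([d["name"] for d in sort_docs(docs, "prices", mode="avg")])
# ['b', 'a', 'c']: 25 < 27.5
print([d["name"] for d in sort_docs(docs, "prices", mode="min", missing="_first")])
# ['c', 'a', 'b']
```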

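Highlighting

Highlighting marks the parts of each search hit that match the query.

No special index setting is required to enable highlighting. With the default highlighter, the field only needs to be searchable (for example, mapped as text) and its original value must be available, normally from _source.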

// content is mapped as text with a keyword subfield, so content.keyword can be used for sorting and similar operations.
PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

// A match query finds documents that contain the keyword "Elasticsearch"; the highlight parameter names the field to highlight, here content.
GET /my_index/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

{
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "hits": [
      {
        "_source": {
          "id": 1,
          "name": "John",
          "content": "Elasticsearch is a powerful search engine."
        },
        "highlight": {
          "content": [
            "<em>Elasticsearch</em> is a powerful search engine."
          ]
        }
      },
      {
        "_source": {
          "id": 3,
          "name": "Bob",
          "content": "Elasticsearch is part of the ELK Stack."
        },
        "highlight": {
          "content": [
            "<em>Elasticsearch</em> is part of the ELK Stack."
          ]
        }
      }
    ]
  }
}
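
Conceptually, the highlight section of the response is the field text with each matched term wrapped in tags. The sketch below reproduces that effect for simple word matches; it is an assumption-laden illustration (whole-word, case-insensitive matching, no fragmenting), and the pre_tag and post_tag parameters merely mirror the idea of configurable tags.

```python
import re

def highlight(text, query, pre_tag="<em>", post_tag="</em>"):
    """Wrap every whole-word, case-insensitive occurrence of a query term in tags."""
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)

print(highlight("Elasticsearch is a powerful search engine.", "elasticsearch"))
# <em>Elasticsearch</em> is a powerful search engine.
```

The stored _source is never modified; the tagged text is returned separately in the highlight section of each hit.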