💝💝💝欢迎来到我的博客,很高兴能够在这里和您见面!希望您在这里可以感受到一份轻松愉快的氛围,不仅可以获得有趣的内容和知识,也可以畅所欲言、分享您的想法和见解。
- 推荐:kwan 的首页,持续学习,不断总结,共同进步,活到老学到老
- 导航
非常期待和您一起在这个小小的网络世界里共同探索、学习和成长。💝💝💝 ✨✨ 欢迎订阅本专栏 ✨✨
博客目录
-
-
- [1.评分机制 TF\IDF](#1.评分机制 TF\IDF)
- [2.score 是如何被计算出来的](#2.score 是如何被计算出来的)
- 3.分析如何被匹配上
- [4.Doc value](#4.Doc value)
- [5.query phase](#5.query phase)
- [6.replica shard 提升吞吐量](#6.replica shard 提升吞吐量)
- [7.fetch phbase 工作流程](#7.fetch phbase 工作流程)
- 8.搜索参数小总结
- [9.bucket 和 metric](#9.bucket 和 metric)
-
1.评分机制 TF\IDF
TF-IDF
(Term Frequency-Inverse Document Frequency)是一种用于信息检索和文本挖掘的统计方法,用以评估一个词在一个文档集中一个特定文档的重要程度。这个评分机制考虑了一个词语在特定文档中的出现频率(Term Frequency,TF)和在整个文档集中的逆文档频率(Inverse Document Frequency,IDF)。
TF(Term Frequency)
词频(Term Frequency,TF)表示一个词在一个特定文档
中出现的频率。这通常是该词在文档中出现次数与文档的总词数之比。
IDF(Inverse Document Frequency)
逆文档频率(Inverse Document Frequency,IDF)是一个词在文档集
中的重要性的度量。如果一个词很常见,出现在很多文档中(例如"和","是"等),那么它可能不会携带有用的信息。IDF 度量就是为了降低这些常见词在文档相似性度量中的权重。
2.score 是如何被计算出来的
apl
GET /book/_search?explain=true
{
"query": {
"match": {
"description": "java程序员"
}
}
}
返回
json
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 2.137549,
"hits": [
{
"_shard": "[book][0]",
"_node": "MDA45-r6SUGJ0ZyqyhTINA",
"_index": "book",
"_type": "_doc",
"_id": "3",
"_score": 2.137549,
"_source": {
"name": "spring开发基础",
"description": "spring 在java领域非常流行,java程序员都在用。",
"studymodel": "201001",
"price": 88.6,
"timestamp": "2019-08-24 19:11:35",
"pic": "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
"tags": ["spring", "java"]
},
"_explanation": {
"value": 2.137549,
"description": "sum of:",
"details": [
{
"value": 0.7936629,
"description": "weight(description:java in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.7936629,
"description": "score(freq=2.0), product of:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.47000363,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 3,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.7675597,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 2.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 12.0,
"description": "dl, length of field",
"details": []
},
{
"value": 35.333332,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 1.3438859,
"description": "weight(description:程序员 in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 1.3438859,
"description": "score(freq=1.0), product of:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.98082924,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 1,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 3,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.6227967,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 12.0,
"description": "dl, length of field",
"details": []
},
{
"value": 35.333332,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
},
{
"_shard": "[book][0]",
"_node": "MDA45-r6SUGJ0ZyqyhTINA",
"_index": "book",
"_type": "_doc",
"_id": "2",
"_score": 0.57961315,
"_source": {
"name": "java编程思想",
"description": "java语言是世界第一编程语言,在软件开发领域使用人数最多。",
"studymodel": "201001",
"price": 68.6,
"timestamp": "2019-08-25 19:11:35",
"pic": "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
"tags": ["java", "dev"]
},
"_explanation": {
"value": 0.57961315,
"description": "sum of:",
"details": [
{
"value": 0.57961315,
"description": "weight(description:java in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.57961315,
"description": "score(freq=1.0), product of:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 0.47000363,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 3,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.56055,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "freq, occurrences of term within document",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 19.0,
"description": "dl, length of field",
"details": []
},
{
"value": 35.333332,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
}
]
}
}
3.分析如何被匹配上
分析一个 document 是如何被匹配上的
- 最终得分
- IDF 得分
apl
GET /book/_explain/3
{
"query": {
"match": {
"description": "java程序员"
}
}
}
4.Doc value
搜索的时候,要依靠倒排索引;排序的时候,需要依靠正排索引,看到每个 document 的每个 field,然后进行排序,所谓的正排索引,其实就是 doc values
在建立索引的时候,一方面会建立倒排索引,以供搜索用;一方面会建立正排索引,也就是 doc values,以供排序,聚合,过滤等操作使用
doc values 是被保存在磁盘上的,此时如果内存足够,os 会自动将其缓存在内存中,性能还是会很高;如果内存不足够,os 会将其写入磁盘上
倒排索引
doc1: hello world you and me
doc2: hi, world, how are you
term | doc1 | doc2 |
---|---|---|
hello | * | |
world | * | * |
you | * | * |
and | * | |
me | * | |
hi | * | |
how | * | |
are | * |
搜索时:
hello you --> hello, you
hello --> doc1
you --> doc1,doc2
doc1: hello world you and me
doc2: hi, world, how are you
sort by 出现问题
正排索引
doc1: { "name": "jack", "age": 27 }
doc2: { "name": "tom", "age": 30 }
document | name | age |
---|---|---|
doc1 | jack | 27 |
doc2 | tom | 30 |
5.query phase
-
搜索请求发送到某一个 coordinate node,构构建一个 priority queue,长度以 paging 操作 from 和 size 为准,默认为 10
-
coordinate node 将请求转发到所有 shard,每个 shard 本地搜索,并构建一个本地的 priority queue
-
各个 shard 将自己的 priority queue 返回给 coordinate node,并构建一个全局的 priority queue
6.replica shard 提升吞吐量
replica shard 如何提升搜索吞吐量
一次请求要打到所有 shard 的一个 replica/primary 上去,如果每个 shard 都有多个 replica,那么同时并发过来的搜索请求可以同时打到其他的 replica 上去
7.fetch phbase 工作流程
-
coordinate node 构建完 priority queue 之后,就发送 mget 请求去所有 shard 上获取对应的 document
-
各个 shard 将 document 返回给 coordinate node
-
coordinate node 将合并后的 document 结果返回给 client 客户端
一般搜索,如果不加 from 和 size,就默认搜索前 10 条,按照_score 排序
8.搜索参数小总结
preference:
决定了哪些 shard 会被用来执行搜索操作
_primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3
bouncing results 问题,两个 document 排序,field 值相同;不同的 shard 上,可能排序不同;每次请求轮询打到不同的 replica shard 上;每次页面上看到的搜索结果的排序都不一样。这就是 bouncing result,也就是跳跃的结果。
搜索的时候,是轮询将搜索请求发送到每一个 replica shard(primary shard),但是在不同的 shard 上,可能 document 的排序不同
解决方案就是将 preference 设置为一个字符串,比如说 user_id,让每个 user 每次搜索的时候,都使用同一个 replica shard 去执行,就不会看到 bouncing results 了
timeout:
主要就是限定在一定时间内,将部分获取到的数据直接返回,避免查询耗时过长
routing:
document 文档路由,_id 路由,routing=user_id,这样的话可以让同一个 user 对应的数据到一个 shard 上去
search_type:
default:query_then_fetch
dfs_query_then_fetch,可以提升 revelance sort 精准度
9.bucket 和 metric
bucket:一个数据分组
city name
北京 张三
北京 李四
天津 王五
天津 赵六
天津 王麻子
划分出来两个 bucket,一个是北京 bucket,一个是天津 bucket
北京 bucket:包含了 2 个人,张三,李四
上海 bucket:包含了 3 个人,王五,赵六,王麻子
metric:对一个数据分组执行的统计
metric,就是对一个 bucket 执行的某种聚合分析的操作,比如说求平均值,求最大值,求最小值
sql
select count(*) from book group by studymodel
bucket
:group by studymodel --> 那些 studymodel 相同的数据,就会被划分到一个 bucket 中metric
:count(*),对每个 user_id bucket 中所有的数据,计算一个数量。还有 avg(),sum(),max(),min()
觉得有用的话点个赞
👍🏻
呗。❤️❤️❤️本人水平有限,如有纰漏,欢迎各位大佬评论批评指正!😄😄😄
💘💘💘如果觉得这篇文对你有帮助的话,也请给个点赞、收藏下吧,非常感谢!👍 👍 👍
🔥🔥🔥Stay Hungry Stay Foolish 道阻且长,行则将至,让我们一起加油吧!🌙🌙🌙