Elasticsearch interval queries: why they are true positional queries, and how to transition from span queries

Author: Mayya Sharipova, Elastic

This article explains why interval queries are true positional queries and how to transition to them from span queries.

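Span queries have long been the tool for ordered and proximity search. They are especially useful in specific domains such as legal or patent search. But the relatively newer interval queries are actually better suited to the job. Unlike span queries, interval queries are true positional queries: they score documents based only on positional proximity (expanded on below).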

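As of Elasticsearch v8.16, interval queries have been brought to feature parity with span queries. Specifically:

  • Interval queries now support the "range" and "regexp" rules.
  • Multi-term interval rules can now expand to up to indices.query.bool.max_clause_count terms, as span queries do, instead of the previous limit of 128.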

Our plan is to eventually deprecate span queries in favor of interval queries, which cover the same functionality in a more user-friendly way.

Further reading: Elasticsearch: Using the intervals query - return documents based on the order and proximity of matching terms

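Advantages of interval queries over span queries

Interval queries rank documents based on the order and proximity of matching terms. Some of their advantages:

  • They are true positional queries
  • They are grounded in academic research: based on the minimal interval semantics paper, they use proven algorithms that run in time linear in the number of positions
  • Simpler syntax
  • Slightly faster (no need to compute scores from corpus statistics)
  • Scripts can be used for specialized use cases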

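Interval queries are true positional queries: when scoring a document they consider only positional information (the score is inversely proportional to the interval length). Span queries, by contrast, also factor in standard metrics such as TF-IDF. The following example shows how interval queries produce a better ranking.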

PUT docs
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}


PUT docs/_doc/1
{
  "content" : "She sells beautiful seashells by the seashore, their smooth shapes shining in the sun, catching the light with every curve. The girl's bright smile is just as inviting, drawing people in as they stop to admire the shells, each one a little piece of the ocean she loves. Her gentle voice, like the sound of the waves, adds to the peaceful charm of the moment."
}

PUT docs/_doc/2
{
  "content" : "She plays; her father sells seashells. "
}

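We want to find documents in which the word "she" appears close to the word "sells". The desired ranking returns the first document and then the second, because the two words are closer together in the first document than in the second.

However, if we run a span query, we get a different ranking, [doc2, doc1], because in addition to proximity a span query also factors in corpus statistics such as TF and IDF, which distort a ranking that should depend on proximity alone.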

GET docs/_search?explain=true
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "content": "she"
          }
        },
        {
          "span_term": {
            "content": "sells"
          }
        }
      ],
      "slop": 10,
      "in_order": true
    }
  }
}

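By contrast, an interval query computes its score from proximity alone, without considering corpus statistics or document length. We get the desired ranking: [doc1, doc2].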

GET docs/_search?explain=true
{
  "query": {
    "intervals": {
      "content": {
        "match": {
          "query": "she sells",
          "max_gaps": 10,
          "ordered" : true
        }
      }
    }
  }
}

This makes interval queries ideal for true proximity queries.
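To make the idea of position-only scoring concrete, here is a toy Python sketch. It is not Lucene's actual scoring function: the tokenizer, the helper names, and the 1 / length formula are all simplifying assumptions chosen only to mirror the statement that the score is inversely proportional to the interval length. The first document is abbreviated to its opening words.

```python
import re

def positions(text):
    """Map each lowercased word to the list of its token positions."""
    index = {}
    for pos, token in enumerate(re.findall(r"[a-z0-9']+", text.lower())):
        index.setdefault(token, []).append(pos)
    return index

def best_ordered_interval(text, first, second, max_gaps):
    """Return the narrowest (start, end) where `first` precedes `second`, or None."""
    index = positions(text)
    best = None
    for start in index.get(first, []):
        for end in index.get(second, []):
            gaps = end - start - 1  # unmatched positions between the two terms
            if end > start and gaps <= max_gaps:
                if best is None or end - start < best[1] - best[0]:
                    best = (start, end)
    return best

def proximity_score(interval):
    """Toy score (an assumption): inverse of the interval length in positions."""
    start, end = interval
    return 1.0 / (end - start + 1)

doc1 = "She sells beautiful seashells by the seashore"  # abbreviated
doc2 = "She plays; her father sells seashells."

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    interval = best_ordered_interval(doc, "she", "sells", max_gaps=10)
    print(name, interval, proximity_score(interval))
```

Because only positions enter the calculation, the document with the adjacent match gets the higher value regardless of how long either document is or how rare the terms are.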

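Interval queries let us extract a proximity score as one signal in the overall relevance score. They are optimized to be combined with other relevance signals such as BM25, for example: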

GET docs/_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": {
                        "query": "she sells",
                        "boost": "{{bm25_boost}}"
                    }
                }
            },
            "should": {
                "intervals": {
                    "content": {
                        "match": {
                            "query": "she sells",
                            "max_gaps": 10
                        },
                        "boost": "{{proximity_boost}}"
                    }
                }
            }
        }
    }
}

Note the two parameters above, bm25_boost and proximity_boost. They are search template placeholders, so each signal can be weighted separately.
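As a client-side alternative to a search template, the two weights can be filled in before the request is sent. The sketch below only builds the request body with the standard library; the helper name hybrid_query and the example boost values are hypothetical, and sending the body to docs/_search is left to whichever client you use.

```python
import json

def hybrid_query(text, field, bm25_boost, proximity_boost, max_gaps=10):
    """Build a bool query: BM25 match in `must`, intervals proximity in `should`.

    Hypothetical helper; the boost values are tuning knobs, not recommendations.
    """
    return {
        "query": {
            "bool": {
                "must": {
                    "match": {field: {"query": text, "boost": bm25_boost}}
                },
                "should": {
                    "intervals": {
                        field: {
                            "match": {"query": text, "max_gaps": max_gaps},
                            "boost": proximity_boost,
                        }
                    }
                },
            }
        }
    }

body = hybrid_query("she sells", "content", bm25_boost=1.0, proximity_boost=2.0)
print(json.dumps(body, indent=2))
```

Keeping the intervals clause in should means proximity only adds to the score of documents that already match the BM25 clause.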

Note that this also applies to rescoring: we can run a first pass with BM25 alone and then add a rescorer that combines BM25 with the interval score.
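A minimal sketch of such a rescoring request, again as a plain Python dictionary. The helper name rescore_body, the window size, and the weights are illustrative assumptions; the body uses the standard rescore structure (window_size, rescore_query, query_weight, rescore_query_weight).

```python
import json

def rescore_body(text, field, window_size=50,
                 bm25_weight=1.0, proximity_weight=2.0, max_gaps=10):
    """First pass: BM25 match. Second pass: intervals score added on the top hits.

    Hypothetical helper; all default values are placeholders to tune.
    """
    return {
        "query": {"match": {field: text}},
        "rescore": {
            "window_size": window_size,  # number of top hits per shard to rescore
            "query": {
                "rescore_query": {
                    "intervals": {
                        field: {"match": {"query": text, "max_gaps": max_gaps}}
                    }
                },
                "query_weight": bm25_weight,
                "rescore_query_weight": proximity_weight,
            },
        },
    }

print(json.dumps(rescore_body("she sells", "content"), indent=2))
```

With the default score_mode (total), each rescored hit ends up with query_weight times its BM25 score plus rescore_query_weight times its interval score, which is the BM25 + interval combination described above.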

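Note also that if we need to model how span queries behave, matching and scoring by both BM25 and proximity, we can combine an interval query and a BM25 query as must clauses of a bool query and set appropriate boosts.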

Transition guide

Below we show how to transition from the following span queries to equivalent interval queries:

  • span_containing
  • span_field_masking
  • span_first
  • span_multi
  • span_near
  • span_not
  • span_or
  • span_term
  • span_within

The examples below use the following index and sample documents.

PUT parks
{
  "mappings": {
    "properties": {
      "park": {
        "type": "text"
      },
      "park_rules": {
        "type": "text"
      }
    }
  }
}

PUT parks/_doc/1
{
  "park" : "Sunny Meadows Park",
  "park_rules" : "Children are encouraged to enjoy our playground equipment, including slides, swings, and climbing structures. Feeding the ducks and fish in the pond is allowed, but only with approved feed available at the park office. Children are not permitted to climb trees or enter the park's fountains and water features. Please do not bring glass containers, sharp objects, or personal sports equipment into the park."
}

PUT parks/_doc/2
{
  "park" : "Greenwood Forest Park",
  "park_rules" : "Children are welcome to explore our nature trails, participate in organized activities, and use the designated picnic areas. Picking flowers, disturbing wildlife, or leaving the designated trails is not allowed. Children must be accompanied by an adult when using the park's grills and fire pits. Please refrain from bringing pets, bicycles, or scooters into the park."
}

PUT parks/_doc/3
{
  "park" : "Happy Haven Playground",
  "park_rules" : "Children can enjoy our sandbox, jungle gym, and seesaws, as well as participate in organized games and activities. Running, shouting, or playing rough games near the playground equipment is not permitted. Children must be supervised by an adult at all times and should use the equipment according to their age and size. Please do not bring food, drinks, or chewing gum into the playground area."
}

PUT parks/_doc/4
{
  "park" : "Lakeside Recreation Park",
  "park_rules" : "Children can enjoy fishing at the lake with an adult, using the sports fields for organized games, and playing in the designated play areas. Swimming, wading, or boating in the lake is strictly prohibited. Children must wear appropriate safety gear when using the sports fields and play equipment. Please do not bring alcohol, tobacco products, or illegal substances into the park."
}

PUT parks/_doc/5
{
  "park" : "Adventure Land Park",
  "park_rules" : "Children are encouraged to use our zip lines, ropes courses, and climbing walls under adult supervision and with proper safety equipment. Running, pushing, or engaging in horseplay near the adventure equipment is not allowed. Children must follow all height, weight, and age restrictions for each activity. Please do not bring personal items, such as cell phones or cameras, onto the adventure equipment."
}

SPAN NEAR

GET parks/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "park_rules": "prohibited"
          }
        },
        {
          "span_term": {
            "park_rules": "swimming"
          }
        }
      ],
      "slop": 10,
      "in_order": false
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

GET parks/_search
{
  "query": {
    "intervals": {
      "park_rules": {
        "match": {
          "query": "swimming prohibited",
          "max_gaps": 10,
          "ordered" : false  
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}
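For bulk migration, the span_near mapping shown above (slop becomes max_gaps, in_order becomes ordered) can be mechanized. The following Python sketch is a hypothetical helper, not part of any Elasticsearch client. It assumes that every clause is a span_term on the same field, that joining the terms with spaces yields the same tokens after analysis, and that omitted slop and in_order values fall back to 0 and true respectively.

```python
def span_near_to_intervals(span_near):
    """Convert the body of a span_near made of span_term clauses into an
    intervals query using a single match rule (hypothetical helper)."""
    fields = set()
    terms = []
    for clause in span_near["clauses"]:
        field, term = next(iter(clause["span_term"].items()))
        # span_term also accepts the long form {"value": "..."}
        if isinstance(term, dict):
            term = term["value"]
        fields.add(field)
        terms.append(term)
    if len(fields) != 1:
        raise ValueError("all span_term clauses must target the same field")
    return {
        "intervals": {
            fields.pop(): {
                "match": {
                    "query": " ".join(terms),
                    # Assumed defaults when the span query omits these keys.
                    "max_gaps": span_near.get("slop", 0),
                    "ordered": span_near.get("in_order", True),
                }
            }
        }
    }

span = {
    "clauses": [
        {"span_term": {"park_rules": "prohibited"}},
        {"span_term": {"park_rules": "swimming"}},
    ],
    "slop": 10,
    "in_order": False,
}
print(span_near_to_intervals(span))
```

Nested span clauses (span_or, span_multi, and so on) are deliberately out of scope here; they would need the any_of, all_of, and filter rules shown in the sections that follow.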

SPAN FIRST

GET parks/_search
{
  "query": {
    "span_first": {
      "match": {
        "span_term": { "park_rules": "sandbox" }
      },
      "end": 5
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}


GET parks/_search
{
  "query": {
    "intervals" : {
      "park_rules" : {
        "match" : {
          "query" : "sandbox",
          "filter" : {
            "script" : {
              "source" : "interval.end < 5"
            }
          }
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

SPAN OR

GET parks/_search
{
  "query": {
    "span_or" : {
      "clauses" : [
        { "span_term" : { "park_rules" : "prohibited" } },
        { "span_near": {"clauses": [{"span_term": {"park_rules": "not"}}, {"span_term": {"park_rules": "allowed"}}], "in_order": true}},
        { "span_near": {"clauses": [{"span_term": {"park_rules": "not"}}, {"span_term": {"park_rules": "permitted"}}], "in_order": true}}
      ]
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

GET parks/_search
{
  "query": {
    "intervals" : {
      "park_rules" : {
        "any_of" : {
          "intervals" : [
            { "match" : { "query" : "prohibited" } },
            { "match" : { "query" : "not allowed", "max_gaps" : 0, "ordered" : true } },
            { "match" : { "query" : "not permitted", "max_gaps" : 0, "ordered" : true } }
          ]
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

SPAN CONTAINING

GET parks/_search
{
  "query": {
    "span_containing": {
      "little": {
        "span_term": {
          "park_rules": "sports"
        }
      },
      "big": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "park_rules": "children"
              }
            },
            {
              "span_term": {
                "park_rules": "park"
              }
            }
          ],
          "slop": 50,
          "in_order": false
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

GET parks/_search
{
  "query": {
    "intervals": {
      "park_rules": {
        "match": {
          "query": "children park",
          "max_gaps": 50,
          "filter" : {
            "containing" : {
              "match" : {
                "query" : "sports"
              }
            }
          }
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

SPAN WITHIN

GET parks/_search
{
  "query": {
    "span_within": {
      "little": {
        "span_term": {
          "park_rules": "sports"
        }
      },
      "big": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "park_rules": "children"
              }
            },
            {
              "span_term": {
                "park_rules": "park"
              }
            }
          ],
          "slop": 50,
          "in_order": false
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {
      }
    }
  }
}

GET parks/_search
{
  "query": {
    "intervals": {
      "park_rules": {
        "match": {
          "query": "sports",
          "filter" : {
            "contained_by" : {
              "match" : {
                "query" : "children park",
                "max_gaps": 50
              }
            }
          }
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {
        "number_of_fragments": 0
      }
    }
  }
}

SPAN NOT

GET parks/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_term": { "park_rules": "allowed" }
      },
      "exclude": {
        "span_near": {
          "clauses": [
            { "span_term": { "park_rules": "not" } },
            { "span_term": { "park_rules": "allowed" } }
          ],
          "slop": 0,
          "in_order": true
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

GET parks/_search
{
  "query": {
    "intervals": {
      "park_rules": {
        "match": {
          "query": "allowed",
          "filter": {
            "not_contained_by": {
              "match": {
                "query": "not allowed",
                "max_gaps": 0,
                "ordered" : true
              }
            }
          }
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "park_rules": {}
    }
  }
}

SPAN MULTI

wildcard

GET parks/_search
{
    "query": {
        "span_multi": {
            "match": {
                "wildcard": {
                    "park_rules": {"value": "sand*" }
                }
            }
        }
    }
}

GET parks/_search
{
    "query": {
        "intervals": {
            "park_rules": {
                "wildcard": {
                    "pattern": "sand*"
                }
            }
        }
    }
}

fuzzy

GET parks/_search
{
    "query": {
        "span_multi": {
            "match": {
                "fuzzy": {
                    "park_rules": {"value": "sandbo" }
                }
            }
        }
    }
}

GET parks/_search
{
    "query": {
        "intervals": {
            "park_rules": {
                "fuzzy": {
                    "term": "sandbo"
                }
            }
        }
    }
}

prefix

GET parks/_search
{
    "query": {
        "span_multi": {
            "match": {
                "prefix": {
                    "park_rules": {"value": "sandbo" }
                }
            }
        }
    }
}

GET parks/_search
{
    "query": {
        "intervals": {
            "park_rules": {
                "prefix": {
                    "prefix": "sandbo"
                }
            }
        }
    }
}

regexp

GET parks/_search
{
    "query": {
        "span_multi": {
            "match": {
                "regexp": {
                    "park_rules": {"value": "sand.*" }
                }
            }
        }
    }
}

GET parks/_search
{
    "query": {
        "intervals": {
            "park_rules": {
                "regexp": {
                    "pattern": "sand.*"
                }
            }
        }
    }
}

range

GET parks/_search
{
    "query": {
        "span_multi": {
            "match": {
                "range": {
                    "park": {
                        "gte" : "a",
                        "lte": "h"
                    }
                }
            }
        }
    }
}

GET parks/_search
{
    "query": {
        "intervals": {
            "park": {
                "range": {
                    "gte" : "a",
                    "lte" : "h"
                }
            }
        }
    }
}

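SPAN FIELD MASKING

Use the use_field parameter of the intervals match rule. This example queries a stemmed subfield, park_rules.stemmed. The parks mapping shown earlier does not define that subfield, so add one (for example a text subfield with a stemming analyzer) before running these queries.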

GET parks/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "park_rules": "nature"
          }
        },
        {
          "span_field_masking": {
            "query": {
              "span_term": {
                "park_rules.stemmed": "trail"
              }
            },
            "field": "park_rules" 
          }
        }
      ],
      "slop": 5
    }
  }
}


GET parks/_search
{
  "query": {
    "intervals" : {
      "park_rules" : {
        "all_of" : {
          "ordered" : true,
          "max_gaps" : 5, 
          "intervals" : [
            {
              "match" : {
                "query" : "nature"
              }
            },
            {
              "match" : {
                "query" : "trail",
                "use_field" : "park_rules.stemmed"
              }
            }
          ]
        }
      }
    }
  }
}

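Conclusion

Interval queries are a powerful tool for true positional search. Starting with version 8.16, try them out with their extended functionality.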


Original article: Interval queries: why they are true positional queries, and how to transition from Span - Search Labs
