[Es_1] Introduction | Features | Graph Algorithms | Trie | FST

Programming is an art of constant trial and error. Don't be afraid of making mistakes; real knowledge comes from practice.


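What is Elasticsearch?

Elasticsearch is a distributed, free and open-source search and analytics engine for all types of data, including text, numbers, geospatial data, and structured and unstructured data.

Elasticsearch is built on top of Apache Lucene. However, Elasticsearch is more than Lucene, and more than just a full-text search engine. It can be accurately described as:

  • A distributed real-time document store in which every field can be indexed and searched
  • A distributed search engine with real-time analytics
  • Able to scale to hundreds of server nodes and to handle petabytes of structured or unstructured data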

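Three key features

1. Powerful full-text search

2. Rich aggregation and statistics

  • Built-in support for real-time aggregation over large volumes of data

(this covers what is awkward to do in MySQL: roll-up and drill-down, totals and subtotals)

3. Schema-free design for unstructured data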


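Applications

Data warehousing


Design concepts

  • A unique ID for every document

Features and algorithms

Elasticsearch is written mainly in Java, but since I usually practice algorithm problems in C++, I'll present the related algorithms in C++.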

1. Powerful full-text search

Implementation

A C++ implementation of a trie (prefix tree)

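#include <memory>
#include <string>
#include <unordered_map>

// One trie node: outgoing edges keyed by character, plus an end-of-word flag.
struct TrieNode {
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
    bool isEnd = false;
};

class Trie {
public:
    // Insert a word, creating nodes along its path as needed.
    void insert(const std::string& word) {
        TrieNode* curr = &root;
        for (char c : word) {
            auto& child = curr->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            curr = child.get();
        }
        curr->isEnd = true;
    }

    // Return true only if this exact word was inserted (a mere prefix does not count).
    bool search(const std::string& word) const {
        const TrieNode* curr = &root;
        for (char c : word) {
            auto it = curr->children.find(c);
            if (it == curr->children.end()) return false;
            curr = it->second.get();
        }
        return curr->isEnd;
    }

private:
    TrieNode root;  // held by value; the unique_ptr children free the whole tree
};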

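Data structure: trie

Optimization 1: an FST shares suffixes as well as prefixes, which turns the tree into a DAG

Optimization 2: attach an offset to each term as its output, and use that offset to fetch the IDs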
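To make Optimization 1 concrete, here is a small sketch that builds a plain trie and then counts how many nodes are left once identical subtrees are merged. That merging is the suffix sharing that turns the tree into a DAG. This is only an illustration under my own assumptions, not how Lucene builds its FST: the names `Node`, `add_word`, `signature`, `count_trie_nodes`, and `count_shared_nodes` are made up for this example, and a real FST also carries outputs on its arcs.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Plain trie node; std::map keeps children ordered so signatures are canonical.
struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    bool is_end = false;
};

void add_word(Node& root, const std::string& word) {
    Node* cur = &root;
    for (char c : word) {
        auto& child = cur->children[c];
        if (!child) child = std::make_unique<Node>();
        cur = child.get();
    }
    cur->is_end = true;
}

// Number of nodes in a plain trie, root included.
std::size_t count_nodes(const Node& n) {
    std::size_t total = 1;
    for (const auto& kv : n.children) total += count_nodes(*kv.second);
    return total;
}

// Give every distinct subtree a small integer ID, bottom-up. Two nodes get
// the same ID exactly when they accept the same set of suffixes, so the
// number of IDs equals the node count after suffix sharing.
int signature(const Node& n, std::map<std::string, int>& ids) {
    std::string key = n.is_end ? "1" : "0";
    for (const auto& kv : n.children) {
        key += kv.first;                                   // edge label
        key += std::to_string(signature(*kv.second, ids)); // ID of the child subtree
        key += ';';
    }
    auto it = ids.find(key);
    if (it != ids.end()) return it->second;
    int id = static_cast<int>(ids.size());
    ids.emplace(key, id);
    return id;
}

std::size_t count_trie_nodes(const std::vector<std::string>& words) {
    Node root;
    for (const auto& w : words) add_word(root, w);
    return count_nodes(root);
}

std::size_t count_shared_nodes(const std::vector<std::string>& words) {
    Node root;
    for (const auto& w : words) add_word(root, w);
    std::map<std::string, int> ids;
    signature(root, ids);
    return ids.size();
}
```

For the words "tap" and "top", the plain trie has 6 nodes. After merging, the two "p" leaves collapse into one, and so do the "a" and "o" nodes above them, leaving 4.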

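Inverted index

I covered inverted indexes in last year's search engine project: [Project walkthrough][Boost search engine #2] Building the index | Installing the cppjieba tokenizer | Implementing the inverted index

• A forward index suits lookups by a known ID (e.g. SELECT * FROM table WHERE id=123)
• An inverted index suits keyword search (e.g. finding the documents that contain "5G phone")

Scenarios:

• Library management system: the forward index finds a book by its call number; the inverted index finds books by subject

• E-commerce platform: the forward index serves the product detail page; the inverted index powers search


Example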
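As a minimal sketch of the two lookups contrasted above: the forward index maps a document ID to its text, and the inverted index maps each term to the IDs of the documents containing it. Splitting on whitespace is a stand-in for a real tokenizer (such as the cppjieba setup from the earlier project), and `ForwardIndex`, `InvertedIndex`, `build_inverted`, and `lookup` are names I made up for the illustration.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Forward index: document ID -> document text.
using ForwardIndex = std::map<int, std::string>;
// Inverted index: term -> sorted set of IDs of the documents containing it.
using InvertedIndex = std::map<std::string, std::set<int>>;

// Build the inverted index by splitting every document on whitespace.
InvertedIndex build_inverted(const ForwardIndex& docs) {
    InvertedIndex index;
    for (const auto& entry : docs) {
        std::istringstream words(entry.second);
        std::string term;
        while (words >> term) index[term].insert(entry.first);
    }
    return index;
}

// Keyword lookup: IDs of the documents that contain `term`, in ascending order.
std::vector<int> lookup(const InvertedIndex& index, const std::string& term) {
    auto it = index.find(term);
    if (it == index.end()) return {};
    return std::vector<int>(it->second.begin(), it->second.end());
}
```

With documents 1 = "5g phone case", 2 = "phone charger", and 3 = "5g router", `docs.at(2)` is the forward lookup by ID, while `lookup(index, "5g")` is the keyword search and returns documents 1 and 3.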


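Ranking and scoring algorithms

1. TF-IDF

#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed types (the post does not show them): a posting records one
// document and the positions where the term occurs in it.
struct Posting {
    int doc_id;
    std::vector<int> positions;
};
using InvertedIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Term frequency (TF), log-scaled: 1 + ln(count), or 0 if the term is absent.
double compute_tf(const Posting& post) {
    return post.positions.empty()
        ? 0.0
        : 1.0 + std::log(static_cast<double>(post.positions.size()));
}

// Inverse document frequency (IDF): ln(N / (1 + df)), where df is the number
// of documents containing the term. The +1 avoids division by zero; the
// result goes negative when df + 1 exceeds total_docs.
double compute_idf(const InvertedIndex& index,
                   const std::string& term,
                   int total_docs) {
    auto it = index.find(term);
    if (it == index.end()) return 0.0;
    return std::log(total_docs / (1.0 + it->second.size()));
}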

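2. BM25, a refinement of TF-IDF

// BM25 score of one term in one document. Uses the same Posting type as the
// TF-IDF snippet.
class BM25 {
public:
    // k1 controls term-frequency saturation; b controls length normalization.
    BM25(double k1 = 1.2, double b = 0.75) : k1_(k1), b_(b) {}

    double score(const Posting& post,
                 double avgdl,     // average document length
                 int doc_length,   // length of this document
                 double idf) const {
        double tf = static_cast<double>(post.positions.size());
        double numerator = tf * (k1_ + 1);
        double denominator = tf + k1_ * (1 - b_ + b_ * doc_length / avgdl);
        return idf * (numerator / denominator);
    }

private:
    double k1_, b_;
};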
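To see what BM25 changes compared with the log-scaled TF above, the sketch below isolates the term-frequency part of each formula. The functions `log_tf` and `bm25_tf` are my own helper names for this comparison, and the idf factor is deliberately left out so only the saturation and length normalization are visible.

```cpp
#include <cmath>

// Log-scaled term frequency, as in the TF-IDF snippet: 1 + ln(tf), 0 if absent.
double log_tf(double tf) {
    return tf > 0 ? 1.0 + std::log(tf) : 0.0;
}

// Term-frequency component of BM25 (idf omitted on purpose).
// doc_length / avgdl penalizes documents that are longer than average.
double bm25_tf(double tf, double doc_length, double avgdl,
               double k1 = 1.2, double b = 0.75) {
    double norm = k1 * (1.0 - b + b * doc_length / avgdl);
    return tf * (k1 + 1.0) / (tf + norm);
}
```

With k1 = 1.2, the BM25 component can never exceed k1 + 1 = 2.2 no matter how often the term repeats, while 1 + ln(tf) keeps growing. For the same tf, a document twice the average length also scores lower than one of average length.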

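3. Key points of the FST optimization

#include <vector>

struct FSTNode {
    char c;                       // label of this node
    int output;                   // output value carried by this node
    std::vector<FSTNode*> next;   // outgoing transitions

    // Shared-suffix check: two nodes are candidates for merging when they
    // carry the same label and point at exactly the same successor nodes.
    bool has_common_suffix(const FSTNode* other) const {
        return (this->c == other->c) &&
            (this->next == other->next);
    }
};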

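Finding documents with skip lists

1. Skip lists for faster lookup

#include <vector>

// Skip list node
struct SkipNode {
    int doc_id;                      // document ID stored in this node
    std::vector<SkipNode*> levels;   // levels[i] is the next node on level i

    SkipNode(int id, int level) : doc_id(id) {
        levels.resize(level, nullptr);
    }
};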
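The node alone does not show where the speedup comes from, so here is a rough in-memory sketch of a skip list built on the same `SkipNode` shape. The `SkipList` class, its `advance` method, the fixed maximum of 8 levels, and the use of -1 for "not found" (which assumes non-negative document IDs) are all my own choices for illustration. Lucene's on-disk skip data is laid out differently.

```cpp
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

// Same shape as the SkipNode above: levels[i] is the next node on level i.
struct SkipNode {
    int doc_id;
    std::vector<SkipNode*> levels;
    SkipNode(int id, int level) : doc_id(id), levels(level, nullptr) {}
};

class SkipList {
public:
    SkipList() : head_(-1, kMaxLevel), rng_(42) {}

    void insert(int doc_id) {
        // On every level, find the last node whose ID is below doc_id.
        std::vector<SkipNode*> prev(kMaxLevel, &head_);
        SkipNode* cur = &head_;
        for (int lvl = kMaxLevel - 1; lvl >= 0; --lvl) {
            while (cur->levels[lvl] && cur->levels[lvl]->doc_id < doc_id)
                cur = cur->levels[lvl];
            prev[lvl] = cur;
        }
        if (cur->levels[0] && cur->levels[0]->doc_id == doc_id) return;  // already present

        // Coin flips pick the height: each extra level has probability 1/2.
        int height = 1;
        while (height < kMaxLevel && coin_(rng_)) ++height;

        nodes_.push_back(std::make_unique<SkipNode>(doc_id, height));
        SkipNode* node = nodes_.back().get();
        for (int lvl = 0; lvl < height; ++lvl) {
            node->levels[lvl] = prev[lvl]->levels[lvl];
            prev[lvl]->levels[lvl] = node;
        }
    }

    // Smallest stored ID that is >= target, or -1 if there is none.
    // Upper levels jump over long runs of IDs; level 0 finishes the search.
    int advance(int target) const {
        const SkipNode* cur = &head_;
        for (int lvl = kMaxLevel - 1; lvl >= 0; --lvl) {
            while (cur->levels[lvl] && cur->levels[lvl]->doc_id < target)
                cur = cur->levels[lvl];
        }
        const SkipNode* next = cur->levels[0];
        return next ? next->doc_id : -1;
    }

private:
    static constexpr int kMaxLevel = 8;
    SkipNode head_;                                 // sentinel before all real IDs
    std::mt19937 rng_;
    std::bernoulli_distribution coin_{0.5};
    std::vector<std::unique_ptr<SkipNode>> nodes_;  // owns every node
};
```

An `advance`-style operation is what makes intersecting two posting lists cheap: when one list is at document 25 and the other only holds 10, 20, 30, ..., the second list can jump straight to 30 instead of visiting every entry.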

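2. Delta encoding for compression

#include <cstddef>
#include <vector>

// Delta-encode a sorted list of document IDs in place: the first ID is kept
// and every later entry is replaced by its gap from the previous ID.
void delta_encode(std::vector<int>& ids) {
    if (ids.empty()) return;

    int prev = ids[0];
    for (std::size_t i = 1; i < ids.size(); ++i) {
        int temp = ids[i];   // remember the original ID before overwriting it
        ids[i] -= prev;
        prev = temp;
    }
}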
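Delta encoding only pays off when it is paired with a decoder and with a storage format in which small numbers take less space. The sketch below adds both, under my own assumptions: `delta_encoded` and `delta_decoded` are copy-returning helpers I named for this example, and variable-byte coding (7 payload bits per byte) is just one common choice of format, used here to count bytes.

```cpp
#include <cstddef>
#include <vector>

// Gap-encode a sorted ID list: keep the first ID, then store differences.
// Works backwards so each subtraction still sees the original neighbour.
std::vector<int> delta_encoded(std::vector<int> ids) {
    for (std::size_t i = ids.size(); i > 1; --i) {
        ids[i - 1] -= ids[i - 2];
    }
    return ids;
}

// Inverse operation: a running sum restores the original IDs.
std::vector<int> delta_decoded(std::vector<int> gaps) {
    for (std::size_t i = 1; i < gaps.size(); ++i) {
        gaps[i] += gaps[i - 1];
    }
    return gaps;
}

// Bytes needed to store v with 7 payload bits per byte (variable-byte coding).
std::size_t varint_bytes(unsigned v) {
    std::size_t n = 1;
    while (v >= 128) {
        v >>= 7;
        ++n;
    }
    return n;
}

// Total variable-byte size of a list of non-negative values.
std::size_t total_varint_bytes(const std::vector<int>& values) {
    std::size_t total = 0;
    for (int v : values) total += varint_bytes(static_cast<unsigned>(v));
    return total;
}
```

For the IDs 1000000, 1000005, 1000007, 1000020, the gaps are 1000000, 5, 2, 13. Stored as variable-byte integers, the raw IDs take 12 bytes and the gaps take 6.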

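2. Rich aggregation and statistics

3. Schema-free design for unstructured data

These two features and the algorithms behind them will be covered in the next post.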


Wiki

Elasticsearch is a search engine based on Apache Lucene, a free and open-source search engine. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Official clients are available in Java,[2] .NET[3] (C#), PHP,[4] Python,[5] Ruby[6] and many other languages.[7] According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine.[8]

History

Shay Banon created the precursor to Elasticsearch, called Compass, in 2004.[9] While thinking about the third version of Compass he realized that it would be necessary to rewrite big parts of Compass to "create a scalable search solution".[9] So he created "a solution built from the ground up to be distributed" and used a common interface, JSON over HTTP, suitable for programming languages other than Java as well.[9] Shay Banon released the first version of Elasticsearch in February 2010.[10]

Elastic NV was founded in 2012 to provide commercial services and products around Elasticsearch and related software.[11] In June 2014, the company announced raising $70 million in a Series C funding round, just 18 months after forming the company. The round was led by [New Enterprise Associates](https://en.wikipedia.org/wiki/New_Enterprise_Associates "New Enterprise Associates") (NEA). Additional funders include [Benchmark Capital](https://en.wikipedia.org/wiki/Benchmark_Capital "Benchmark Capital") and [Index Ventures](https://en.wikipedia.org/wiki/Index_Ventures "Index Ventures"). This round brought total funding to $104M.[12]

In March 2015, the company Elasticsearch changed its name to Elastic.[13]

In June 2018, Elastic filed for an initial public offering with an estimated valuation of between 1.5 and 3 billion dollars.[14] On 5 October 2018, Elastic was listed on the New York Stock Exchange.[15]

Developed from the Found acquisition by Elastic in 2015,[16] Elastic Cloud is a family of Elasticsearch-powered SaaS offerings which include the Elasticsearch Service, as well as Elastic App Search Service, and Elastic Site Search Service which were developed from Elastic's acquisition of Swiftype.[17] In late 2017, Elastic formed partnerships with Google to offer Elastic Cloud in Google Cloud Platform (GCP), and Alibaba to offer Elasticsearch and Kibana in Alibaba Cloud.

Elasticsearch Service users can create secure deployments with partners, Google Cloud Platform (GCP) and Alibaba Cloud.[18]

Licensing changes

In January 2021, Elastic announced that starting with version 7.11, they would be relicensing their Apache 2.0 licensed code in Elasticsearch and Kibana to be dual licensed under Server Side Public License and the Elastic License, neither of which is recognized as an open-source license.[19][20] Elastic blamed Amazon Web Services (AWS) for this change, objecting to AWS offering Elasticsearch and Kibana as a service directly to consumers and claiming that AWS was not appropriately collaborating with Elastic.[20][21] Critics of the re-licensing decision predicted that it would harm Elastic's ecosystem and noted that Elastic had previously promised to "never....change the license of the Apache 2.0 code of Elasticsearch, Kibana, Beats, and Logstash". Amazon responded with plans to fork the projects and continue development under Apache License 2.0.[22][23] Other users of the Elasticsearch ecosystem, including Logz.io, CrateDB and Aiven, also committed to the need for a fork, leading to a discussion of how to coordinate the open source efforts.[24][25][26] Due to potential trademark issues with using the name "Elasticsearch", AWS rebranded their fork as OpenSearch in April 2021.[27][28]

In August 2024 the GNU Affero General Public License was added to ElasticSearch version 8.16.0 as an option, making Elasticsearch free and open-source again.[22][29]

Features

Elasticsearch can be used to search any kind of document. It provides scalable search, has near real-time search, and supports multitenancy.[30] "Elasticsearch is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas. Each node hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically".[30] Related data is often stored in the same index, which consists of one or more primary shards, and zero or more replica shards. Once an index has been created, the number of primary shards cannot be changed.[31]
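The rule that the number of primary shards cannot be changed follows from how documents are routed. As a rough sketch (my own simplification, not Elasticsearch's code): the target shard is a hash of the routing key, by default the document ID, taken modulo the number of primary shards. Elasticsearch uses a Murmur3 hash and, in recent versions, an additional routing factor. The FNV-1a hash and the names `fnv1a` and `shard_for` below are stand-ins chosen only to keep the example self-contained.

```cpp
#include <cstdint>
#include <string>

// FNV-1a, a simple stand-in hash (Elasticsearch itself uses Murmur3).
std::uint32_t fnv1a(const std::string& key) {
    std::uint32_t h = 2166136261u;   // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;              // FNV prime
    }
    return h;
}

// Simplified routing: shard = hash(routing key) mod number of primary shards.
std::uint32_t shard_for(const std::string& routing_key,
                        std::uint32_t num_primary_shards) {
    return fnv1a(routing_key) % num_primary_shards;
}
```

Because the shard number depends on the shard count, the same document ID would generally map to a different shard if the count changed, and existing documents could no longer be found where they were written.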

Elasticsearch is developed alongside the data collection and log-parsing engine Logstash, the analytics and visualization platform Kibana, and the collection of lightweight data shippers called Beats. The four products are designed for use as an integrated solution, referred to as the "Elastic Stack".[32] (Formerly the "ELK stack", short for "Elasticsearch, Logstash, Kibana".)

Elasticsearch uses Lucene and tries to make all its features available through the JSON and Java API. It supports facetting and percolating (a form of prospective search),[33][34] which can be useful for notifying if new documents match for registered queries. Another feature, "gateway", handles the long-term persistence of the index;[35] for example, an index can be recovered from the gateway in the event of a server crash. Elasticsearch supports real-time GET requests, which makes it suitable as a NoSQL datastore,[36] but it lacks distributed transactions.[37]

On 20 May 2019, Elastic made the core security features of the Elastic Stack available free of charge, including TLS for encrypted communications, file and native realm for creating and managing users, and role-based access control for controlling user access to cluster APIs and indexes.[38] The corresponding source code is available under the "Elastic License", a source-available license.[39] In addition, Elasticsearch now offers SIEM[40] and Machine Learning[41] as part of its offered services.
