Elasticsearch: From ES|QL to Python DataFrames

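In my previous article, "Elasticsearch：ES|QL 查询展示" (ES|QL query showcase), I showed how to use ES|QL in Kibana to query an index and compute statistics. In many cases, however, we need to query the data from a client application. How do we do that? We use one of the Elasticsearch clients. In this article, I show how to query the data with Python.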

Note: ES|QL requires Elastic Stack 8.11 or later. The examples in this article use version 8.12.1 of the Elasticsearch Python client.

Installation

If you have not yet installed Elasticsearch and Kibana, refer to the following links to install them:

For the installation, choose Elastic Stack 8.x. Note in particular that ES|QL is only available in Elastic Stack 8.11 and later, so you need to download and install version 8.11 or later.
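The version requirement can also be checked in code. The following is a minimal sketch, assuming the version is available as a string such as "8.12.1"; `supports_esql` and `MIN_ESQL_VERSION` are hypothetical names introduced here for illustration, not part of any Elasticsearch library.

```python
import re

MIN_ESQL_VERSION = (8, 11)  # from the article: ES|QL requires 8.11 or later


def supports_esql(version_number: str) -> bool:
    """Return True if a version string such as "8.12.1" is at least 8.11."""
    match = re.match(r"(\d+)\.(\d+)", version_number)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_number!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= MIN_ESQL_VERSION


# With a connected client, the string would typically come from
# es.info()["version"]["number"]; literals are used here instead.
print(supports_esql("8.12.1"))  # True
print(supports_esql("8.10.4"))  # False
```

Comparing `(major, minor)` tuples avoids the pitfall of comparing version strings lexically, where "8.9" would sort after "8.11".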

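When Elasticsearch starts for the first time, we see output like the following:

We need to write down the password of the Elasticsearch superuser elastic.

We also need to install the Elasticsearch Python package: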

pip3 install elasticsearch==8.12.1

We can then check the installed version:

$ pip3 list | grep elasticsearch
elasticsearch                8.12.1

Preparing the data

Following the earlier article "Elasticsearch：ES|QL 查询展示", we create the index:

PUT sample_data
{
  "mappings": {
    "properties": {
      "client.ip": {
        "type": "ip"
      },
      "message": {
        "type": "keyword"
      }
    }
  }
}

Next, we index the following documents with the bulk API:

PUT sample_data/_bulk
{"index": {}}
{"@timestamp": "2023-10-23T12:15:03.360Z", "client.ip": "172.21.2.162", "message": "Connected to 10.1.0.3", "event.duration": 3450233}
{"index": {}}
{"@timestamp": "2023-10-23T12:27:28.948Z", "client.ip": "172.21.2.113", "message": "Connected to 10.1.0.2", "event.duration": 2764889}
{"index": {}}
{"@timestamp": "2023-10-23T13:33:34.937Z", "client.ip": "172.21.0.5", "message": "Disconnected", "event.duration": 1232382}
{"index": {}}
{"@timestamp": "2023-10-23T13:51:54.732Z", "client.ip": "172.21.3.15", "message": "Connection error", "event.duration": 725448}
{"index": {}}
{"@timestamp": "2023-10-23T13:52:55.015Z", "client.ip": "172.21.3.15", "message": "Connection error", "event.duration": 8268153}
{"index": {}}
{"@timestamp": "2023-10-23T13:53:55.832Z", "client.ip": "172.21.3.15", "message": "Connection error", "event.duration": 5033755}
{"index": {}}
{"@timestamp": "2023-10-23T13:55:01.543Z", "client.ip": "172.21.3.15", "message": "Connected to 10.1.0.1", "event.duration": 1756467}
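If you would rather prepare the sample data from Python than type it into Kibana, the bulk request body can be built programmatically. The following is a minimal sketch using only the standard library. `build_bulk_body` is a hypothetical helper, not part of the Elasticsearch client, and only two of the seven documents are shown. Sending the body to the cluster is left out.

```python
import json

# Two of the sample documents from the bulk request above.
docs = [
    {"@timestamp": "2023-10-23T12:15:03.360Z", "client.ip": "172.21.2.162",
     "message": "Connected to 10.1.0.3", "event.duration": 3450233},
    {"@timestamp": "2023-10-23T13:33:34.937Z", "client.ip": "172.21.0.5",
     "message": "Disconnected", "event.duration": 1232382},
]


def build_bulk_body(documents):
    """Build an NDJSON bulk body: one {"index": {}} action line per document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(docs)
print(body)
```

Each document is preceded by its own action line, which mirrors the alternating structure of the request shown above.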

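Querying with the Elasticsearch client

The Elasticsearch Query Language (ES|QL) provides a powerful way to filter, transform, and analyze data stored in Elasticsearch. It is designed to be easy to learn and use for end users, SRE teams, application developers, and administrators. It is also a good fit for data scientists who are familiar with Pandas and other dataframe-based frameworks.

In fact, an ES|QL query produces a table with named columns, that is, a dataframe. But how do we process this data in Python? At the time of writing, ES|QL has no Apache Arrow output, but CSV output is a good start.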
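To see how a CSV table maps onto a dataframe without needing a running cluster, here is a small offline sketch. The CSV text is an illustrative stand-in for a query result (a header row of column names followed by data rows), built from three of the sample documents.

```python
from io import StringIO

import pandas as pd

# Illustrative CSV text: a header row with the column names,
# then one row per result.
csv_text = """@timestamp,client.ip,event.duration,message
2023-10-23T12:15:03.360Z,172.21.2.162,3450233,Connected to 10.1.0.3
2023-10-23T12:27:28.948Z,172.21.2.113,2764889,Connected to 10.1.0.2
2023-10-23T13:33:34.937Z,172.21.0.5,1232382,Disconnected
"""

df = pd.read_csv(StringIO(csv_text))

# CSV carries no type information, so timestamps arrive as plain strings;
# convert them explicitly if date arithmetic is needed.
df["@timestamp"] = pd.to_datetime(df["@timestamp"], utc=True)

print(df.dtypes)
print(df["event.duration"].mean())
```

Because CSV is untyped, pandas infers the numeric columns on its own, while dates and IP addresses need explicit handling if they are to be more than strings.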

We use the following test program:

esql.py

from io import StringIO
import os

from elasticsearch import Elasticsearch
import numpy as np
import pandas as pd

# Connection settings are read from environment variables
endpoint = os.getenv("ES_SERVER")
username = os.getenv("ES_USER")
password = os.getenv("ES_PASSWORD")
fingerprint = os.getenv("ES_FINGERPRINT")

url = f"https://{endpoint}:9200"

es = Elasticsearch(
    url,
    basic_auth=(username, password),
    ssl_assert_fingerprint=fingerprint,
    http_compress=True,
)

# print(es.info())

# Query 1: fetch all documents as CSV and load them into a DataFrame
response = es.esql.query(query="FROM sample_data", format="csv")
df = pd.read_csv(StringIO(response.body))
print(df)
print("==================================================================")

# Query 2: commands run in order, so LIMIT 5 is applied before the filters
response = es.esql.query(
    query="""
    FROM sample_data
    | LIMIT 5
    | SORT @timestamp DESC
    | WHERE event.duration > 3000000
    | WHERE message LIKE "Connection *"
    """,
    format="csv"
)

# Do not assign to pd.DataFrame here: that would overwrite the pandas class
df = pd.read_csv(StringIO(response.body))

print(df)
print("==================================================================")

# Query 3: average duration and document count per client IP
response = es.esql.query(
    query="""
    FROM sample_data
    | STATS avg=AVG(event.duration), count=COUNT(*) BY client.ip
    | SORT count
    """,
    format="csv"
)

# Set explicit dtypes for the aggregated columns
df = pd.read_csv(
    StringIO(response.body),
    dtype={"count": "Int64", "avg": np.float64}
)

print(df)
print("==================================================================")
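ES|QL applies its commands in the order they are written, so the position of LIMIT in the second query matters. The following offline sketch illustrates this with plain Python lists. It assumes the documents arrive in insertion order, which Elasticsearch does not guarantee without an explicit SORT, and `matches` is a hypothetical stand-in for the two WHERE clauses.

```python
# (event.duration, message) pairs, assumed to arrive in insertion order.
rows = [
    (3450233, "Connected to 10.1.0.3"),
    (2764889, "Connected to 10.1.0.2"),
    (1232382, "Disconnected"),
    (725448, "Connection error"),
    (8268153, "Connection error"),
    (5033755, "Connection error"),
    (1756467, "Connected to 10.1.0.1"),
]


def matches(row):
    """Mimic WHERE event.duration > 3000000 and message LIKE "Connection *"."""
    duration, message = row
    return duration > 3000000 and message.startswith("Connection ")


# LIMIT 5 first, then filter: only the first five rows are ever considered.
limit_then_filter = [row for row in rows[:5] if matches(row)]

# Filter first, then LIMIT 5: every row is considered.
filter_then_limit = [row for row in rows if matches(row)][:5]

print(len(limit_then_filter))  # 1
print(len(filter_then_limit))  # 2
```

If the intent is "the five most recent matching events", the filters and the sort would need to come before the limit.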

Before running esql.py, we need to set the corresponding environment variables in the terminal:

export ES_SERVER="localhost"
export ES_USER="elastic"
export ES_PASSWORD="q2rqAIphl-fx9ndQ36CO"
export ES_FINGERPRINT="bce66ed55097f255fc8e4420bdadafc8d609cc8027038c2dd09d805668f3459e"
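Since `os.getenv` silently returns `None` for an unset variable, a forgotten export only shows up later as a confusing connection error. One possible guard, sketched here with the standard library only, is to validate the variables up front. `load_es_settings` and `REQUIRED_VARS` are hypothetical names, and the values in the usage example are placeholders rather than real credentials.

```python
import os

REQUIRED_VARS = ("ES_SERVER", "ES_USER", "ES_PASSWORD", "ES_FINGERPRINT")


def load_es_settings(env=None):
    """Return the required settings, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Usage with an explicit mapping (placeholder values, not real credentials).
settings = load_es_settings({
    "ES_SERVER": "localhost",
    "ES_USER": "elastic",
    "ES_PASSWORD": "changeme",
    "ES_FINGERPRINT": "0123abcd",
})
print(f"https://{settings['ES_SERVER']}:9200")
```

Calling `load_es_settings()` with no argument reads from the real environment, so the same function works both in the script and in tests.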

Then we run the program with the following command:

python3 esql.py

$ python3 esql.py
/Users/liuxg/python/esql/esql.py:22: ElasticsearchWarning: No limit defined, adding default limit of [500]
  response = es.esql.query(query="FROM sample_data", format="csv")
                 @timestamp     client.ip  event.duration                message
0  2023-10-23T12:15:03.360Z  172.21.2.162         3450233  Connected to 10.1.0.3
1  2023-10-23T12:27:28.948Z  172.21.2.113         2764889  Connected to 10.1.0.2
2  2023-10-23T13:33:34.937Z    172.21.0.5         1232382           Disconnected
3  2023-10-23T13:51:54.732Z   172.21.3.15          725448       Connection error
4  2023-10-23T13:52:55.015Z   172.21.3.15         8268153       Connection error
5  2023-10-23T13:53:55.832Z   172.21.3.15         5033755       Connection error
6  2023-10-23T13:55:01.543Z   172.21.3.15         1756467  Connected to 10.1.0.1
==================================================================
                 @timestamp    client.ip  event.duration           message
0  2023-10-23T13:52:55.015Z  172.21.3.15         8268153  Connection error
==================================================================
/Users/liuxg/python/esql/esql.py:44: ElasticsearchWarning: No limit defined, adding default limit of [500]
  response = es.esql.query(
          avg  count     client.ip
0  1232382.00      1    172.21.0.5
1  3450233.00      1  172.21.2.162
2  2764889.00      1  172.21.2.113
3  3945955.75      4   172.21.3.15
==================================================================

Clearly, we get the expected results.
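For readers coming from pandas, the per-IP statistics can be cross-checked locally. The following sketch rebuilds the seven sample documents as a DataFrame and uses `groupby` as a rough counterpart of the STATS ... BY query. The two are equivalent here only because no value is missing: pandas `count` skips nulls, whereas `COUNT(*)` counts rows.

```python
import pandas as pd

# The seven sample documents, reduced to the two fields the query uses.
df = pd.DataFrame({
    "client.ip": ["172.21.2.162", "172.21.2.113", "172.21.0.5", "172.21.3.15",
                  "172.21.3.15", "172.21.3.15", "172.21.3.15"],
    "event.duration": [3450233, 2764889, 1232382, 725448,
                       8268153, 5033755, 1756467],
})

# Rough pandas counterpart of:
#   STATS avg=AVG(event.duration), count=COUNT(*) BY client.ip | SORT count
stats = (
    df.groupby("client.ip")["event.duration"]
    .agg(avg="mean", count="count")
    .sort_values("count")
    .reset_index()
)
print(stats)
```

The row for 172.21.3.15 should show a count of 4 and an average of 3945955.75, matching the ES|QL result. The order of the three rows with a count of 1 may differ, since neither side defines a tie-breaker.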