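In a previous article, I described in detail how to use an ingest pipeline to enrich data. In this article, we walk through a concrete example in more depth. For the official documentation, see Enrich your data | Elasticsearch Guide [8.8] | Elastic.
What is data enrichment
Simply put, enrichment adds data from another dataset to an existing dataset, so that the final dataset also carries the other dataset's fields for analysis. If the extra data lives in a database outside Elasticsearch, we would typically use Logstash for this. When the data sits in two different Elasticsearch indices, we can use the enrich processor to perform a "join" between the two datasets. For example: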
data:image/s3,"s3://crabby-images/593b1/593b1b5c5c6285401966d071df360a0902aae38b" alt=""
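As shown above, we have two datasets, registration and customer, both expressed as JSON. We can relate them through the email field and end up with the enriched data shown on the right of the figure. That data contains the fields from registration as well as the fields from customer. In other words, we join the two datasets on email to form a single enriched dataset, which is very useful for later analysis.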
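The join described above can be sketched in a few lines of Python. This is only a conceptual illustration and not how Elasticsearch implements it. The sample records, and every field name other than email, are hypothetical, because the exact values in the figure are not reproduced here.

```python
# Conceptual sketch of an email-keyed "join"; all sample values are hypothetical.
registration = [
    {"email": "a@example.com", "event": "meetup"},
    {"email": "b@example.com", "event": "webinar"},
]
customer = [
    {"email": "a@example.com", "name": "Alice", "city": "Boston"},
]

# Index the lookup dataset by the shared key.
customer_by_email = {c["email"]: c for c in customer}


def enrich(record):
    """Return a copy of record with the matching customer attached, if any."""
    enriched = dict(record)
    match = customer_by_email.get(record["email"])
    if match is not None:
        enriched["customer"] = match
    return enriched


enriched_registrations = [enrich(r) for r in registration]
```

Records without a matching email pass through unchanged, which is also what we would expect from a lookup-style join.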
The Elasticsearch enrich processor works as follows:
data:image/s3,"s3://crabby-images/7177b/7177bad9d5bd841970c829adfe46a0fcbcef4088" alt=""
Data description
Suppose we have a signup application for an event. At signup time, the application collects the following information:
data:image/s3,"s3://crabby-images/c09af/c09afd51cbbd75a5b341622278463656f989126b" alt=""
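As shown above, it contains only four fields: email, location_id, paid_amount, and product. This information is stored in a CSV file. Later, the marketing department gave us additional tables containing location, member_type, and user information.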
data:image/s3,"s3://crabby-images/d4350/d4350100f67c91bebca019d4f1458a285b5fb2ea" alt=""
To make the example easier to follow, the corresponding data files can be found at GitHub - liu-xiao-guo/elasticsearch-ingest:
data:image/s3,"s3://crabby-images/d2072/d207201ca9acf8c4e95347680f6f1d62dd4c8e9b" alt=""
data:image/s3,"s3://crabby-images/3d18a/3d18ab92cfe50f9c80281948dfd992d2e4f35ac6" alt=""
data:image/s3,"s3://crabby-images/81cde/81cdeb40ceaa5a1dd9a4ef0c13992a5d5ea18bb4" alt=""
data:image/s3,"s3://crabby-images/56f47/56f47b90a77f5fcf4df592576bccf638b6c2d643" alt=""
With the enrich processor, we want to end up with a dataset like the following:
data:image/s3,"s3://crabby-images/18334/1833486d134642c344bef876e6c7601484a9e7fa" alt=""
In other words, the enrich processor adds the matching user, location, and member_type information to each document, enriching the original data.
Importing the data
We can use Kibana to import the data:
Import user.csv
data:image/s3,"s3://crabby-images/cad26/cad26399500ce9e490937b318ddf980a0a437f40" alt=""
data:image/s3,"s3://crabby-images/49575/49575d13d086356bd45cffae877a11cd0a9d6ba9" alt=""
data:image/s3,"s3://crabby-images/45f8e/45f8ed16a7a1135745422e1540928783cd97f14b" alt=""
data:image/s3,"s3://crabby-images/a1648/a1648da544046e0a0075d325b6241871a408da0f" alt=""
Import location.csv
data:image/s3,"s3://crabby-images/d0c79/d0c79979732b9022da32178fd518c89cd8fefb64" alt=""
data:image/s3,"s3://crabby-images/d6956/d69560507e3689ab37ca5747a0a5091a792a592d" alt=""
In the step above, we need to change the data type of point to geo_point.
data:image/s3,"s3://crabby-images/ac1ad/ac1ad8ac157bde6386ae7b3f59abb8d1b3daff21" alt=""
data:image/s3,"s3://crabby-images/85342/85342702fe5c7f904560720db77a4e155e2fe6ad" alt=""
data:image/s3,"s3://crabby-images/669bd/669bd7cf25103a31c7e6622aab83e40df1f4ab26" alt=""
Import member_type.csv
In the same way, we import member_type.csv:
data:image/s3,"s3://crabby-images/522a6/522a6f008c5c8c844da9405d06b5407b102ee6f6" alt=""
data:image/s3,"s3://crabby-images/27c0d/27c0d629a904d6d1f0d8ee4825ea4bfef2c6ff81" alt=""
data:image/s3,"s3://crabby-images/f42e6/f42e69b669bb442905c5fb572eb24d7b837ef8db" alt=""
In the step above, we added the following json processor:
{
  "json": {
    "field": "price_range"
  }
}
data:image/s3,"s3://crabby-images/e7a5f/e7a5f4389b33ecd7af97c9af264a801aad2488e3" alt=""
data:image/s3,"s3://crabby-images/c50f9/c50f9924d8d2f394578b31f2ec2fa829d26f8ee1" alt=""
data:image/s3,"s3://crabby-images/808d2/808d24f684311e6608e9fc2ba768aae1ecf9c03b" alt=""
Creating the enrich policies
See: https://github.com/liu-xiao-guo/elasticsearch-ingest/blob/master/part-2/policy/user.txt
// Create users policy
PUT /_enrich/policy/user_policy
{
  "match": {
    "indices": "user",
    "match_field": "email",
    "enrich_fields": ["first_name", "last_name", "city", "zip", "state"]
  }
}

PUT /_enrich/policy/user_policy/_execute
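We run the commands above in Kibana. The user_policy matches on the email field of the user index: when an incoming document has a matching email, the first_name, last_name, city, zip, and state fields of the matching user document are added to it.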
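Executing a policy builds an enrich index, which is a snapshot of the source index. If the user index changes later, the policy has to be executed again before the enrich processor sees the new values. The Python sketch below only models that idea and is not Elasticsearch's implementation. The function name execute_policy and the sample document are hypothetical.

```python
# Conceptual sketch: "executing" a match policy snapshots the source index.
# The sample document is hypothetical; this is not Elasticsearch's implementation.

ENRICH_FIELDS = ["first_name", "last_name", "city", "zip", "state"]


def execute_policy(source_docs, match_field, enrich_fields):
    """Build a lookup keyed by match_field, keeping match_field plus enrich_fields."""
    keep = [match_field] + list(enrich_fields)
    return {
        doc[match_field]: {k: doc[k] for k in keep if k in doc}
        for doc in source_docs
    }


user_index = [
    {"email": "a@example.com", "first_name": "Ann", "last_name": "Lee",
     "city": "Austin", "zip": 73301, "state": "TX", "phone": "555-0100"},
]

lookup = execute_policy(user_index, "email", ENRICH_FIELDS)

# A later change to the source index is not visible in the existing snapshot...
user_index[0]["city"] = "Dallas"
stale_city = lookup["a@example.com"]["city"]

# ...until the policy is executed again.
lookup = execute_policy(user_index, "email", ENRICH_FIELDS)
fresh_city = lookup["a@example.com"]["city"]
```

Note that the lookup keeps the match field (email) next to the enrich fields, which is why the enriched documents at the end of this article contain user.email.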
See: https://github.com/liu-xiao-guo/elasticsearch-ingest/blob/master/part-2/policy/location.txt
PUT /_enrich/policy/location_policy
{
  "match": {
    "indices": "location",
    "match_field": "location_id",
    "enrich_fields": ["point"]
  }
}

PUT /_enrich/policy/location_policy/_execute
Run the commands above in Kibana.
See: https://github.com/liu-xiao-guo/elasticsearch-ingest/blob/master/part-2/policy/member_type.txt
// Create member_type policy
PUT /_enrich/policy/member_type_policy
{
  "range": {
    "indices": "member_type",
    "match_field": "price_range",
    "enrich_fields": ["member_type"]
  }
}

PUT /_enrich/policy/member_type_policy/_execute
We run the commands above in Kibana.
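Unlike the two match policies, member_type_policy is a range policy: it matches a number in the incoming document against a range stored in the source index. The Python sketch below illustrates the idea only and is not Elasticsearch's implementation. The two ranges mirror the price_range values that appear in the enriched output at the end of this article; the helper names are hypothetical.

```python
# Conceptual sketch of range matching; not Elasticsearch's implementation.
import operator

# Map each range bound name to the comparison it implies.
BOUND_CHECKS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

# Mirrors the price_range values shown in the enriched output of this article.
member_types = [
    {"member_type": "regular", "price_range": {"lte": 5}},
    {"member_type": "premium", "price_range": {"gt": 5}},
]


def in_range(value, bounds):
    """True when value satisfies every bound present in the range object."""
    return all(BOUND_CHECKS[name](value, limit) for name, limit in bounds.items())


def match_member_type(paid_amount):
    """Return the first entry whose price_range contains paid_amount, else None."""
    for entry in member_types:
        if in_range(paid_amount, entry["price_range"]):
            return entry
    return None
```

With these ranges, a paid_amount of 5 falls into the regular range and a paid_amount of 10 into the premium range, which agrees with the search results shown later.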
We can see the newly created enrich indices in Index Management:
data:image/s3,"s3://crabby-images/9202f/9202f6199745b253c0b272b76b42292ff7c651c3" alt=""
Import signup.csv
data:image/s3,"s3://crabby-images/764e6/764e6d1dc33586fa984ff6334afe405dede6b7dc" alt=""
data:image/s3,"s3://crabby-images/9fc6f/9fc6fe9de917698b54ae714713429a43fe6a1e53" alt=""
data:image/s3,"s3://crabby-images/b5e0f/b5e0f8748633658f48b80616d06bb4795b14da47" alt=""
data:image/s3,"s3://crabby-images/f6e29/f6e2973ae5485b1d736e8d2e20af511599d7c925" alt=""
Recall the mapping of the result we wanted earlier:
data:image/s3,"s3://crabby-images/38879/38879db1a752f3152107323849b270e55f3d50d8" alt=""
We need to add the geo field to the mapping:
"geo": {
  "properties": {
    "point": {
      "type": "geo_point"
    }
  }
}
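The enriched documents at the end of this article show point values as WKT strings such as POINT(-71.61 42.28), one of the formats that geo_point accepts. In WKT the longitude comes first and the latitude second, which is easy to get backwards. As a small sketch, the hypothetical helper below parses such a string in Python; it handles only plain decimal coordinates.

```python
# Sketch: parse a WKT point string into (longitude, latitude).
import re

# Matches "POINT(<lon> <lat>)" with plain decimal numbers only.
WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_wkt_point(text):
    """Return (lon, lat) from a string such as 'POINT(-71.61 42.28)'."""
    m = WKT_POINT.match(text)
    if m is None:
        raise ValueError(f"not a WKT point: {text!r}")
    return float(m.group(1)), float(m.group(2))


lon, lat = parse_wkt_point("POINT(-71.61 42.28)")
```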
We also need to modify the ingest pipeline. See: https://github.com/liu-xiao-guo/elasticsearch-ingest/blob/master/part-2/pipeline/signup.json
data:image/s3,"s3://crabby-images/78dd3/78dd350feaba4bda0dbe5cdea308e4495e528531" alt=""
We need to add the following three enrich processors:
{
  "enrich": {
    "description": "Add 'user' data based on 'email'",
    "policy_name": "user_policy",
    "field": "email",
    "target_field": "user",
    "max_matches": "1"
  }
},
{
  "enrich": {
    "description": "Add 'member_type' data based on 'paid_amount'",
    "policy_name": "member_type_policy",
    "field": "paid_amount",
    "target_field": "member_type",
    "max_matches": "1"
  }
},
{
  "enrich": {
    "description": "Add 'geo' data based on 'location_id'",
    "policy_name": "location_policy",
    "field": "location_id",
    "target_field": "geo",
    "max_matches": "1"
  }
},
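Each processor looks up the value of field in the enrich index of the named policy and writes the match under target_field. With max_matches set to 1 the target is a single object; with a higher value it becomes an array. The Python sketch below models just that behavior. It is not Elasticsearch's implementation, it omits error handling such as missing input fields, and the function name apply_enrich is hypothetical. The sample values are taken from the search results shown later in this article, and only the two match-based lookups are modeled.

```python
# Conceptual sketch of the enrich processor's target_field / max_matches behavior.

def apply_enrich(doc, lookup, field, target_field, max_matches=1):
    """lookup maps a key to a list of matching enrich documents."""
    matches = lookup.get(doc.get(field), [])[:max_matches]
    if not matches:
        return doc  # no match: the document is left unchanged
    result = dict(doc)
    # One object when max_matches is 1, otherwise a list of matches.
    result[target_field] = matches[0] if max_matches == 1 else matches
    return result


signup = {
    "email": "martymcfly@backtothefuture.com",
    "location_id": 2351,
    "paid_amount": 5,
    "product": "earlybird",
}
user_lookup = {
    "martymcfly@backtothefuture.com": [{
        "email": "martymcfly@backtothefuture.com",
        "first_name": "Marty", "last_name": "Fly",
        "city": "Arleta", "zip": 9303, "state": "CA",
    }],
}
location_lookup = {
    2351: [{"location_id": 2351, "point": "POINT(-71.61 42.28)"}],
}

enriched = apply_enrich(signup, user_lookup, "email", "user")
enriched = apply_enrich(enriched, location_lookup, "location_id", "geo")
```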
Click the Import button shown above:
data:image/s3,"s3://crabby-images/2c44b/2c44b0c455dbb7ea9afe38a3e69edccfa0bc61e1" alt=""
data:image/s3,"s3://crabby-images/43b77/43b776d532949afbc2466ef6a8dda380a74d7287" alt=""
data:image/s3,"s3://crabby-images/2a776/2a77652a0bb2c635e4227777815ed99c18421040" alt=""
Next, we run a search against the signup index:
GET signup/_search?filter_path=**.hits
The command above returns:
{
"hits": {
"hits": [
{
"_index": "signup",
"_id": "Q9mvgokBWubr9hCu1VXI",
"_score": 1,
"_source": {
"member_type": {
"member_type": "regular",
"price_range": {
"lte": 5
}
},
"geo": {
"location_id": 2351,
"point": "POINT(-71.61 42.28)"
},
"product": "earlybird",
"paid_amount": 5,
"user": {
"zip": 9303,
"city": "Arleta",
"last_name": "Fly",
"state": "CA",
"first_name": "Marty",
"email": "martymcfly@backtothefuture.com"
},
"email": "martymcfly@backtothefuture.com",
"location_id": 2351
}
},
{
"_index": "signup",
"_id": "RNmvgokBWubr9hCu1VXI",
"_score": 1,
"_source": {
"member_type": {
"member_type": "regular",
"price_range": {
"lte": 5
}
},
"geo": {
"location_id": 2322,
"point": "POINT(-71.63 42.56)"
},
"product": "earlybird",
"paid_amount": 5,
"user": {
"zip": 58008,
"city": "Springfield",
"last_name": "Simpson",
"state": "OR",
"first_name": "Homer",
"email": "homersimpson@springfield.com"
},
"email": "homersimpson@springfield.com",
"location_id": 2322
}
},
{
"_index": "signup",
"_id": "RdmvgokBWubr9hCu1VXI",
"_score": 1,
"_source": {
"member_type": {
"member_type": "premium",
"price_range": {
"gt": 5
}
},
"geo": {
"location_id": 2019,
"point": "POINT(-72.68 42.2)"
},
"product": "regular",
"paid_amount": 10,
"user": {
"zip": 99686,
"city": "Valdez",
"last_name": "Riker",
"state": "AK",
"first_name": "Will",
"email": "willriker@federation.com"
},
"email": "willriker@federation.com",
"location_id": 2019
}
}
]
}
}
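From the output above, we can see that the signup index has been enriched successfully: every document now carries the user, geo, and member_type fields.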
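For a larger index, checking the result by eye is impractical. As a minimal sketch, the hypothetical helper below takes the parsed JSON body of a search response as a Python dict and reports hits that lack any of the three enriched fields. How the response is fetched is left out. The first sample hit is abbreviated from the output above; the second is made up to show an unenriched document.

```python
# Sketch: find search hits that are missing enriched fields.

def missing_enrichment(response, required=("user", "geo", "member_type")):
    """Return {_id: [missing field names]} for hits lacking any required field."""
    problems = {}
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        missing = [name for name in required if name not in source]
        if missing:
            problems[hit["_id"]] = missing
    return problems


sample = {
    "hits": {
        "hits": [
            {   # abbreviated from the output above
                "_id": "Q9mvgokBWubr9hCu1VXI",
                "_source": {
                    "member_type": {"member_type": "regular"},
                    "geo": {"location_id": 2351},
                    "user": {"first_name": "Marty"},
                    "email": "martymcfly@backtothefuture.com",
                },
            },
            {   # hypothetical document with no matching enrich data
                "_id": "example-unenriched",
                "_source": {"email": "nobody@example.com"},
            },
        ]
    }
}

report = missing_enrichment(sample)
```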