Day 07: Statistical Tags
I. PSM Tag Development (must master)
1. Introduction to PSM
The PSM (Price Sensitivity Measurement) model is one of the simplest and most practical of the many pricing-research models in use today, and it is recognized by most market research firms. With PSM you can derive not only an optimal price but also a reasonable price range.
2. Requirements Analysis
Note: in real-world work, if a requirement or a formula is unclear, you can ask the project manager to explain it.
```
tdonr  discounted-order ratio = discounted orders / total orders
adar   average-discount ratio = average discount amount / average receivable amount per order
tdar   total-discount ratio   = total discount amount / total receivable amount
psm = discounted-order ratio + average-discount ratio + total-discount ratio

The formulas and parameters break down as follows:
tdonr = tdon / ton
    tdon: total discount order number -- number of discounted orders
    ton:  total order number -- total number of orders
adar = ada / ara
    ada: average discount amount
    ara: average receivable amount per order
tdar = tda / tra
    tda: total discount amount
    tra: total receivable amount

ra: receivable amount
da: discount amount
pa: practical amount (actually paid)

psm = tdonr + adar + tdar
    = tdon / ton + ada / ara + tda / tra

Once the score is computed, bucket users into ranges:
1~3      extremely sensitive
0.6~1    quite sensitive
0.4~0.6  moderately sensitive
0.2~0.4  not very sensitive
0.0~0.2  extremely insensitive
```
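To make the arithmetic concrete, here is a minimal Python sketch of the formula with made-up numbers (the function name psm_score and all sample values are illustrative, not project code):

```python
def psm_score(tdon, ton, ada, ara, tda, tra):
    """psm = tdonr + adar + tdar = tdon/ton + ada/ara + tda/tra"""
    return tdon / ton + ada / ara + tda / tra

# e.g. 3 discounted orders out of 10, average discount 5 vs average
# receivable 50 per order, total discount 15 vs total receivable 500:
print(psm_score(tdon=3, ton=10, ada=5.0, ara=50.0, tda=15.0, tra=500.0))
# 0.43 -> falls into the 0.4~0.6 "moderately sensitive" bucket
```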
```
1- The four-level tag ID for PSM is 52
2- Read the tag configuration table and extract the four-level tag's rule:
inType=Hive##nodes=up01:9083##table=dwm.dwm_sell_o2o_order_i##selectFields=zt_id,order_no,order_total_amount,discount_amount,real_paid_amount##range=90
    inType: storage type of the business data -- Hive
    nodes: connection info for the storage -- up01:9083
    table: exact location of the business data -- dwm.dwm_sell_o2o_order_i
    selectFields: business fields involved in computing the tag -- zt_id,order_no,order_total_amount,discount_amount,real_paid_amount
    range: scope of the business data -- 90, i.e. only the last 90 days are analyzed
3- Parse the rule and read the business data according to the parsed object (see the sketch after this block)
4- Read the five-level tag configuration where pid = PSM's four-level tag ID:
    id,name,rule
    53,extremely sensitive,1~3
    54,quite sensitive,0.6~1
    55,moderately sensitive,0.4~0.6
    56,not very sensitive,0.2~0.4
    57,extremely insensitive,0.0~0.2
5- Develop a Spark program that joins the business data with the five-level tag configuration and tags each user:
    5.1- Mark whether an order is discounted by checking discount_amount > 0
    5.2- Following the formula above, compute tdon / ton + ada / ara + tda / tra step by step
    5.3- Arrive at the final PSM value
    5.4- Split each five-level tag's rule into the start and end of its range
    5.5- Join the processed business data with the five-level tag configuration to assign the concrete tag
6- Write the result data to ES
```
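Step 3 hinges on parsing the ##-separated rule string. A minimal sketch of such a parser, assuming the format is exactly key=value##key=value (the helper name parse_rule is hypothetical, not the project's actual API):

```python
def parse_rule(rule: str) -> dict:
    """Split 'k1=v1##k2=v2##...' into a dict of rule fields."""
    return dict(kv.split("=", 1) for kv in rule.split("##"))

rule = ("inType=Hive##nodes=up01:9083##table=dwm.dwm_sell_o2o_order_i"
        "##selectFields=zt_id,order_no,order_total_amount,discount_amount,real_paid_amount"
        "##range=90")
parsed = parse_rule(rule)
print(parsed["table"])                    # dwm.dwm_sell_o2o_order_i
print(parsed["selectFields"].split(","))  # business fields to read
print(int(parsed["range"]))               # 90-day window
```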
3. Code Implementation
```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

from tags.base.abstract_tag_base import AbstractTagBase


class PSMTag(AbstractTagBase):
    def mark_tag(self, business_df: DataFrame, five_tag_df: DataFrame):
        # 1- Mark an order as discounted by checking discount_amount > 0
        # Columns: zt_id, order_no, order_total_amount, discount_amount, real_paid_amount, is_discount
        # Either of the following two styles works:
        # business_df.withColumn("is_discount", F.when(F.expr("discount_amount>0"), 1).otherwise(0))
        is_discount_df = business_df.withColumn(
            "is_discount",
            F.when(business_df.discount_amount > 0, 1).otherwise(0)
        )
        # is_discount_df.show()

        # 2- Following the formula, compute tdon / ton + ada / ara + tda / tra step by step
        # 2.1- Alias the columns so the later formula is easier to read
        alias_df = is_discount_df.select(
            "zt_id",
            "order_no",
            "is_discount",
            is_discount_df['order_total_amount'].alias('ra'),  # receivable amount
            is_discount_df['discount_amount'].alias('da'),     # discount amount
            is_discount_df['real_paid_amount'].alias('pa')     # actually paid amount
        )
        # 2.2- Compute the numerators and denominators
        agg_business_df = alias_df.groupBy("zt_id").agg(
            F.sum("is_discount").alias("tdon"),  # number of discounted orders
            F.count("order_no").alias("ton"),    # total number of orders
            F.avg("da").alias("ada"),            # average discount amount
            F.avg("ra").alias("ara"),            # average receivable amount per order
            F.sum("da").alias("tda"),            # total discount amount
            F.sum("ra").alias("tra")             # total receivable amount
        )

        # 3- Compute the final PSM value
        new_business_df = agg_business_df.select(
            "zt_id",
            (agg_business_df.tdon / agg_business_df.ton
             + agg_business_df.ada / agg_business_df.ara
             + agg_business_df.tda / agg_business_df.tra).alias("psm")
        )

        # 4- Split each five-level tag's rule into the start and end of its range
        new_five_tag_df = five_tag_df.select(
            "id",
            F.split("rule", "~")[0].alias("start"),
            F.split("rule", "~")[1].alias("end")
        )

        # 5- Join the processed business data with the five-level tag configuration
        #    to assign the concrete tag.
        # Using "psm between start and end" here would be wrong: the five-level
        # rule ranges share their boundary values, and BETWEEN ... AND is closed
        # on both ends, so a boundary score would match two tags.
        result_df = new_business_df.join(new_five_tag_df).where(
            "psm>=start and psm<end"
        ).select(
            new_business_df['zt_id'].alias('user_id'),
            new_five_tag_df['id'].alias('tags_id_times')
        )

        return result_df


if __name__ == '__main__':
    condition = " and datediff(`current_date`(),to_date(trade_date))<=90 and zt_id!=0 and zt_id is not null "
    tag_obj = PSMTag()
    tag_obj.execute(app_name="psm_tag", partitions=5, four_tag_id=52, condition=condition)
```
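Before wiring the class into the full pipeline, the aggregation arithmetic can be sanity-checked locally. A minimal sketch, assuming a local SparkSession and two made-up orders for one user (not project data):

```python
# Reproduce steps 1-3 of mark_tag on a tiny hand-made DataFrame.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("psm_check").getOrCreate()

# zt_id, order_no, ra (receivable), da (discount), pa (paid)
df = spark.createDataFrame(
    [(1, "o1", 100.0, 10.0, 90.0),
     (1, "o2", 50.0, 0.0, 50.0)],
    ["zt_id", "order_no", "ra", "da", "pa"],
)

psm_df = (df.withColumn("is_discount", F.when(df.da > 0, 1).otherwise(0))
            .groupBy("zt_id")
            .agg((F.sum("is_discount") / F.count("order_no")
                  + F.avg("da") / F.avg("ra")
                  + F.sum("da") / F.sum("ra")).alias("psm")))

psm_df.show()  # psm = 1/2 + 5/75 + 10/150 = 0.6333..., i.e. the 0.6~1 bucket
```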
4. Verifying the Results
- HiveSQL statement:
```sql
with alias_df as (
    select
        zt_id,
        order_no,
        order_total_amount as ra,
        discount_amount as da,
        real_paid_amount as pa,
        if(discount_amount>0,1,0) as is_discount
    from dwm.dwm_sell_o2o_order_i
    where datediff(`current_date`(),to_date(trade_date))<=90 and zt_id!=0 and zt_id is not null
      and zt_id in (8678035,16471564,3735457)
),
agg_business_df as (
    select
        zt_id,
        sum(is_discount) as tdon,
        count(order_no) as ton,
        avg(da) as ada,
        avg(ra) as ara,
        sum(da) as tda,
        sum(ra) as tra
    from alias_df group by zt_id
)
select
    zt_id,
    (agg_business_df.tdon / agg_business_df.ton + agg_business_df.ada / agg_business_df.ara + agg_business_df.tda / agg_business_df.tra) as psm
from agg_business_df
```
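If you prefer to drive the check from Python, here is a hedged sketch of running SQL against the same Hive metastore through a local SparkSession; the metastore URI up01:9083 comes from the rule string above, everything else is an assumption:

```python
# Run ad-hoc verification SQL from Python; assumes the Hive metastore at
# up01:9083 is reachable and pyspark is installed locally.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("psm_verify")
         .config("hive.metastore.uris", "thrift://up01:9083")
         .enableHiveSupport()
         .getOrCreate())

# Paste the verification CTE query above into spark.sql(...) to compare
# its psm values with the tag program's output.
spark.sql("show databases").show()
```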
II. Connecting PyCharm to Spark
- Start Spark's ThriftServer, which makes it convenient to develop and test Spark SQL statements:
```bash
cd /export/server/spark/sbin
./start-thriftserver.sh \
--hiveconf hive.server2.thrift.port=10001 \
--hiveconf hive.server2.thrift.bind.host=up01 \
--hiveconf spark.sql.warehouse.dir=hdfs://up01:8020/user/hive/warehouse \
--master local[*]
```
Note: the start-thriftserver block above is a single command; the backslashes continue it across lines.
- PyCharm connection settings: note that the port is 10001.
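Besides PyCharm's database tool, the same ThriftServer can be queried from plain Python. A minimal sketch, assuming the pyhive package (plus its thrift/sasl dependencies) is installed and that root is an acceptable username in your environment:

```python
# Query the Spark ThriftServer started above (host up01, port 10001).
from pyhive import hive

conn = hive.Connection(host="up01", port=10001, username="root")
cursor = conn.cursor()
cursor.execute("show databases")
print(cursor.fetchall())  # each database appears as a one-element tuple
conn.close()
```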
III. New/Old Member Tag Development (for awareness)
1. Tag Introduction
A new member is within 30 days of their first purchase; an old member is more than 30 days past their first purchase; in addition, members who have never made a purchase must be distinguished. Tagging members this way allows different marketing strategies to be applied to different member groups.
2. Requirements Analysis
```
Use the Hive table dwm.dwm_mem_first_buy_i: if the first purchase happened within the last 30 days the user is a new member, otherwise an old member; users absent from dwm.dwm_mem_first_buy_i are non-purchasing members.
1- If a purchase record is found in dwm.dwm_mem_first_buy_i, the user has purchased on the platform at least once
    1.1- If the purchase record is at most 30 days old, the user is a new member
    1.2- If the purchase record is more than 30 days old, the user is an old member
2- If no purchase record is found in dwm.dwm_mem_first_buy_i, the user is a non-purchasing member
```
3. Code Implementation
Because the result of tagging users with a SQL statement cannot currently be written straight to Elasticsearch, we first store the SQL tag results in a Hive table and then export them from Hive to Elasticsearch with SeaTunnel.
- Hive DDL (note: run it in Hive):
```sql
-- Create the database and table in Hive
create database if not exists ads;
create table if not exists ads.ads_mem_new_old_user_i
(
    user_id bigint comment 'user id',
    tags_id string comment 'tag id'
)
comment 'new/old member tag table'
partitioned by (dt string comment 'creation date')
row format delimited fields terminated by ','
stored as orc
tblproperties ('orc.compress' = 'SNAPPY')
;
```
- HiveSQL/SparkSQL tag computation code:
```sql
with first_buy as (
    -- Find the new members
    select
        zt_id
    from dwm.dwm_mem_first_buy_i
    where datediff(current_date(), to_date(dt)) <= 30
)
-- insert overwrite: the job can be re-run repeatedly while keeping only one copy of the result
insert overwrite table ads.ads_mem_new_old_user_i
select
    all_user.zt_id,
    case
        when first_buy.zt_id is not null and all_first_buy.zt_id is not null then 86 -- new member
        when first_buy.zt_id is null and all_first_buy.zt_id is not null then 87 -- old member
        else 88 -- non-purchasing member
    end as tags_id,
    '2025-01-19' as dt
from dwd.dwd_mem_member_union_i as all_user -- member registration table (covers new, old, and non-purchasing members)
left join dwm.dwm_mem_first_buy_i as all_first_buy on all_user.zt_id=all_first_buy.zt_id -- matches anyone who has ever purchased
left join first_buy on all_user.zt_id=first_buy.zt_id; -- matches only first purchases within 30 days
```
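To see why the two left joins are enough, here is the same CASE logic as a tiny Python truth table (classify and its arguments are illustrative names, not project code):

```python
def classify(in_first_buy_30d: bool, ever_bought: bool) -> int:
    # in_first_buy_30d: matched the first_buy CTE (first purchase within 30 days)
    # ever_bought:      matched dwm_mem_first_buy_i at all
    if in_first_buy_30d and ever_bought:
        return 86  # new member
    if not in_first_buy_30d and ever_bought:
        return 87  # old member
    return 88      # non-purchasing member

print(classify(True, True), classify(False, True), classify(False, False))  # 86 87 88
```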
4. Verifying the Results
```sql
select * from ads.ads_mem_new_old_user_i where tags_id=86 limit 1;
select * from ads.ads_mem_new_old_user_i where tags_id=87 limit 1;
select * from ads.ads_mem_new_old_user_i where tags_id=88 limit 1;

select
    zt_id, datediff(current_date(), to_date(dt)) as day_diff
from dwm.dwm_mem_first_buy_i
where zt_id in (15533440,16681269);
```
IV. RFM Tag Development (for awareness)
1. Introduction to RFM
Suppose an e-commerce site is running a marketing campaign and needs to segment customers by value: it would recommend watches, jewelry, and other high-end goods to high-value users, and discounted, inexpensive goods to low-value users.
Among the many customer segmentation models used by traditional enterprises and e-commerce companies, RFM is the one most widely cited and applied.
The RFM model is an important tool for measuring both a user's current value and their potential value.
RFM combines the initials of three metrics: Recency (most recent purchase), Frequency (purchase frequency), and Monetary (purchase amount).
2. Requirements Analysis
```
This part is developed with Spark SQL: partly because the DSL version of this computation is fairly involved, partly to practice a second computation style, and partly to learn Spark-based data warehouse development.
In this project, RFM is defined as follows:
R: within the last month, 1 if the most recent order is at most 7 days old, else 0;
F: within the last month, 1 if the order count is at or above the mean order count, else 0;
M: within the last month, 1 if the order spend is at or above the mean order spend, else 0.
mean order count = total orders across the platform / total users
mean order spend = total order amount across the platform / total users
Note: this requirement only considers users who placed an order in the last month; users who ordered earlier but not in the last month are out of scope.
```
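As a quick illustration of these scoring rules, a small Python sketch with made-up inputs (function and argument names are illustrative; the mean values echo the ones computed in the verification section below):

```python
def rfm_flags(days_since_last_order, order_count, order_amount,
              avg_count, avg_amount):
    r = 1 if days_since_last_order <= 7 else 0   # recent order within 7 days
    f = 1 if order_count >= avg_count else 0     # at or above mean frequency
    m = 1 if order_amount >= avg_amount else 0   # at or above mean spend
    return r, f, m

# e.g. last order 3 days ago, 2 orders totalling 80 vs the platform means:
print(rfm_flags(3, 2, 80.0, avg_count=1.83, avg_amount=60.67))  # (1, 1, 1) -> tag 114
```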
3. Code Implementation
- Hive DDL (note: run it in Hive):
```sql
-- Create the database and table in Hive
create database if not exists ads;
create table if not exists ads.ads_mem_user_rfm_i
(
    user_id bigint comment 'user id',
    tags_id string comment 'tag id'
)
comment 'member RFM value table'
partitioned by (dt string comment 'creation date')
row format delimited fields terminated by ','
stored as orc
tblproperties ('orc.compress' = 'SNAPPY')
;
```
- HiveSQL/SparkSQL tag computation code:
```sql
insert overwrite table ads.ads_mem_user_rfm_i
select
    mem.zt_id,
    if(t.zt_id is null, '121', t.tags_id) as tags_id, -- 121 serves as the default tag value
    '2025-01-19' as dt
from dwd.dwd_mem_member_union_i mem
left join (
    select
        zt_id,
        case when r = 1 and f = 1 and m = 1 then '114'
             when r = 1 and f = 0 and m = 1 then '115'
             when r = 0 and f = 1 and m = 1 then '116'
             when r = 0 and f = 0 and m = 1 then '117'
             when r = 1 and f = 1 and m = 0 then '118'
             when r = 1 and f = 0 and m = 0 then '119'
             when r = 0 and f = 1 and m = 0 then '120'
             when r = 0 and f = 0 and m = 0 then '121'
        end as tags_id
    from (
        select
            zt_id,
            if(min(r) over (partition by zt_id) <= 7, 1, 0) as r, -- take the smallest day gap; 1 if within 7 days
            if(f >= sum_f * 1.00 / user_count, 1, 0) as f, -- 1 if at or above the platform mean
            if(m >= sum_m * 1.00 / user_count, 1, 0) as m, -- 1 if at or above the platform mean
            row_number() over (partition by zt_id order by dt desc ) as rn -- order rows by date, newest first
        from (
            -- compute each user's raw R, F, M values here
            select
                zt_id,
                datediff(current_date(), to_date(t1.dt)) as r, -- days since the behavior date
                sum(consume_times) over (partition by zt_id) as f, -- orders completed by this user
                sum(consume_times) over () as sum_f, -- orders completed by all users
                sum(consume_amount) over (partition by zt_id) as m, -- amount spent by this user
                sum(consume_amount) over () as sum_m, -- amount spent by all users
                t2.user_count, -- total number of users
                t1.dt
            from dwm.dwm_mem_member_behavior_day_i as t1 -- member behavior table: orders, payments, views, returns, etc.
            cross join (
                select count(distinct zt_id) as user_count -- total number of users
                from dwm.dwm_mem_member_behavior_day_i
                where datediff(current_date(), to_date(dt)) <= 30
            ) t2
            where datediff(current_date(), to_date(t1.dt)) <= 30
        ) t3
    ) t4 where rn = 1 -- keep only the most recent row per user
) t on mem.zt_id = t.zt_id
where mem.start_date <= '2025-01-19' and mem.end_date > '2025-01-19';
```
Why the cross join? The subquery t2 yields exactly one row (the total user count), and a cross join is the simplest way to attach that single-row aggregate to every behavior row so it can take part in the per-row F and M comparisons, as the sketch below illustrates.
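A tiny PySpark illustration of that trick, with made-up rows (toy data, not the project tables):

```python
# Attach a one-row aggregate (total user count) to every detail row so it
# can be used in per-row comparisons, mirroring the SQL's cross join.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

behavior = spark.createDataFrame([(1, 5), (2, 3)], ["zt_id", "consume_times"])
totals = behavior.agg(F.countDistinct("zt_id").alias("user_count"))  # exactly one row

behavior.crossJoin(totals).show()  # every row now carries user_count = 2
```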
- Verifying the results:
```sql
select
    sum(consume_times)/count(distinct zt_id) as avg_times,  -- mean order count: 1.8280254777070064
    sum(consume_amount)/count(distinct zt_id) as avg_amount -- mean order spend: 60.670076
from dwm.dwm_mem_member_behavior_day_i -- member behavior table: orders, payments, views, returns, etc.
where datediff(current_date(), to_date(dt)) <= 30;

select * from ads.ads_mem_user_rfm_i where tags_id=116 limit 1; -- 202025
select * from ads.ads_mem_user_rfm_i where tags_id=117 limit 1; -- 1761246
select * from ads.ads_mem_user_rfm_i where tags_id=120 limit 1; -- 1452990

select
    zt_id,
    datediff(current_date(), to_date(dt)) as r, -- days since the behavior date
    sum(consume_times) over (partition by zt_id) as f, -- orders completed by this user
    sum(consume_amount) over (partition by zt_id) as m -- amount spent by this user
from dwm.dwm_mem_member_behavior_day_i
where zt_id in (202025,1761246,1452990);
```
Note: the SQL jobs do not implement the merge-and-update of existing tag results in ES that the earlier programs performed. Therefore the results of the various SQL tagging jobs must first be merged in Hive and then written to the tags_id_once field of the ES user-profile tag result table; tags_id_times must not be used.
V. Importing SQL Results into ES
First merge all the tags computed with Spark SQL, then use SeaTunnel to import the combined result into ES.
The workflow is as follows:
- Hive DDL:
```sql
-- Create the database and table in Hive
create database if not exists ads;
create table if not exists ads.ads_mem_tags_i
(
    user_id bigint comment 'user id',
    tags_id_once string comment 'tag id'
)
comment 'member tag table'
partitioned by (dt string comment 'creation date')
row format delimited fields terminated by ','
stored as orc
tblproperties ('orc.compress' = 'SNAPPY')
;
```
- Merge the RFM and new/old member tag results, i.e. combine the rows of ads_mem_user_rfm_i and ads_mem_new_old_user_i and write them into ads_mem_tags_i:
```sql
insert overwrite table ads.ads_mem_tags_i
select
    user_id,
    concat_ws(',', collect_list(tags_id)) as tags_id_once, -- like ",".join(list) in Python
    '2025-01-19' as dt
from (
    select * from ads.ads_mem_new_old_user_i where dt='2025-01-19'
    union all
    select * from ads.ads_mem_user_rfm_i where dt='2025-01-19'
) as tmp
group by user_id
```
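To confirm the merge semantics, here is a small local check of concat_ws(',', collect_list(...)) with made-up tag rows for one user:

```python
# One user holding both a new/old-member tag and an RFM tag should end up
# with a single comma-joined tags_id_once value.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

tags = spark.createDataFrame([(1, "86"), (1, "116")], ["user_id", "tags_id"])
(tags.groupBy("user_id")
     .agg(F.concat_ws(",", F.collect_list("tags_id")).alias("tags_id_once"))
     .show())  # e.g. 1 | 86,116 (collect_list order is not guaranteed)
```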
- Data export: in the directory /export/server/apache-seatunnel-2.3.5/config/job on the VM, create the config file hive2es.config with the following contents:
```bash
cd /export/server/apache-seatunnel-2.3.5/config/job
vim hive2es.config
```
```hocon
env {
    job.name = "hive2es"
    execution.parallelism = 4
    job.mode = "BATCH"
    checkpoint.interval = 10000
}
source {
    Hive {
        table_name = "ads.ads_mem_tags_i"
        metastore_uri = "thrift://up01:9083"
        read_partitions = ["dt="${pt}""]
    }
}
transform {
    FieldMapper {
        field_mapper = {
            user_id = user_id
            tags_id_once = tags_id_once
        }
    }
}
sink {
    Elasticsearch {
        hosts = ["up01:9200"]
        index = "user_profile_tags"
        primary_keys = ["user_id"]
        tls_verify_certificate = "false"
        tls_verify_hostname = "true"
        max_retry_count = 3
        max_batch_size = 1000
    }
}
```
For a detailed explanation of the configuration parameters, see section 4.5.3 of the document 07_小兔智购用户画像及推荐系统_统计类标签开发.docm.
- Start SeaTunnel:
```bash
cd /export/server/apache-seatunnel-2.3.5
./bin/seatunnel.sh --config ./config/job/hive2es.config -e local -i pt=2025-01-19
```
Screenshot of a successful run:
A possible error:
```
Cause: the SeaTunnel installation directory is missing the jar that integrates with Elasticsearch.
Fix: upload connector-elasticsearch-2.3.5.jar to /export/server/apache-seatunnel-2.3.5/connectors and re-run. Download: https://repo.maven.apache.org/maven2/org/apache/seatunnel/connector-elasticsearch/2.3.5/
```
![](https://i-blog.csdnimg.cn/direct/6a56be3381894fb2bfbe0c0c9a821f98.png)
With that, every statistical tag we can build has been built.
![](https://i-blog.csdnimg.cn/direct/2169abc632194978aa137bd6b8713b75.png)