PySpark之金融数据分析(Spark RDD、SQL练习题)

目录

一、数据来源

[二、PySpark RDD编程](#二、PySpark RDD编程)

1、查询特定日期的资金流入和流出情况

2、活跃用户分析

[三、PySpark SQL编程](#三、PySpark SQL编程)

1、按城市统计2014年3月1日的平均余额

2、统计每个城市总流量前3高的用户

四、总结


一、数据来源

本文使用的数据来源于天池大赛数据集,由蚂蚁金服提供,包含用户基本信息、申购赎回记录、收益率、银行间拆借利率等多个维度,本文通过PySpark实现对该数据集的简单分析。
数据来源:天池-资金流入流出预测-挑战Baseline

二、PySpark RDD编程

数据都已上传到HDFS的/data目录下,对于分析结果,保存至/output目录下。

1、查询特定日期的资金流入和流出情况

使用user_balance_table,计算出所有用户在每一天的总资金流入和总资金流出量。

输出格式如下:

<日期> <资金流入量> <资金流出量>

代码如下:

python 复制代码
from pyspark.sql.functions import col, mean, round, row_number, sum
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from datetime import datetime

# 创建对象
conf = SparkConf().setAppName("data analysis")
sc = SparkContext(conf=conf)

# 读取CSV文件为RDD
lines_rdd = sc.textFile("/data/user_balance_table.csv")
# 获取表头
header = lines_rdd.first()
data_rdd = lines_rdd.filter(lambda row: row != header)
# 每行转为列表
split_rdd = data_rdd.map(lambda line: line.split(","))

# 提取列索引
extracted_rdd = split_rdd.map(lambda fields: (fields[1], (int(fields[4]), int(fields[8]))))
# 分组聚合,分别计算流入和流出总额
aggregated_rdd = extracted_rdd.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# 结果转换
result_rdd = aggregated_rdd.map(lambda x: "\t".join([x[0], str(x[1][0]), str(x[1][1])]))

# 插入表头行
final_rdd = sc.parallelize(["日期\t资金流入量\t资金流出量"]) + result_rdd
final_rdd.saveAsTextFile("/output/result1.txt")

print("已完成1、查询特定日期的资金流入和流出情况")

输出结果如下:

python 复制代码
日期	资金流入量	资金流出量
20140805	394780870	221706539
20140808	233903717	311648757
20140811	331550471	418603336
20140814	257702660	211939431
20140820	308378692	202452782
20140823	141412027	199377531
20140826	306945089	285478563
20140829	267554713	273756380
20140830	199708772	196374134
20140827	302194801	468164147
20140821	251763517	219963356
20140818	298499146	259169016
20140815	244551620	236516007
20140812	258493673	309754858
20140803	173825397	127125217
20140728	371762756	345986909
20140725	181641088	262874791
20140722	243084133	369043423
20140719	210318023	155464283
20140716	394890140	234775948
20140713	179759885	199459990
20140710	283095921	326009240
20140707	272182847	317612569
20140409	383347565	289330278
20140403	363877120	266605457
20140331	398884905	423852634
20140328	225966355	405443946
20140831	275090213	292943033
20140825	309574223	312413411
20140822	246316056	179349206
20140816	215059736	219214339
20140813	261506619	303975517
20140810	259534870	189909225
20140804	330640884	322907524
20140729	228093046	303480103
20140723	265461894	308353077
20140717	253011280	298279385
20140714	254797524	284753279
20140711	208671021	240050748
20140708	224240103	340453063
20140630	334054112	456547794
20140321	282351818	259655286
20140324	313180334	437825259
20140402	355347118	272612066
20140414	309853269	415986984
20140417	355792647	265341592
20140420	191259529	161057781
20140501	193045106	143362755
20140504	303087562	413222034
20140513	275241493	257918375
20140519	259077930	293791406
20140522	344636549	251108485
20140603	270887462	385622582
20140612	332365185	236467885
20140615	166080126	116623756
20140323	167456369	186443311
20140326	272935544	450254233
20140329	160250985	155006056
20140407	196936223	176966561
20140410	386567460	286914864
20140416	387847838	255914640
20140422	285248757	268810141
20140425	220927432	227764292
20140506	318002728	341108696
20140515	313367089	238307643
20140524	160073254	154409868
20140527	299049555	321965845
20140530	226547701	312802179
20140605	380042567	355645445
20140614	181574530	229916191
20140623	232227670	373779624
20140415	428681231	285293076
20140427	146837951	191915377
20140502	125336258	121222064
20140505	370924149	309330781
20140508	392838756	273187499
20140514	305474522	316225442
20140517	170983868	145854372
20140526	344890868	274707572
20140529	320549863	313827800
20140622	147180819	361285652
20140625	264663201	547295931
20140628	153826161	283400236
20140706	199569025	195530758
20140709	278005555	269642881
20140712	177644343	149081488
20140718	208959595	208671287
20140724	277044480	347622431
20140727	151406251	166610652
20140730	209917272	250117716
20140802	189092130	172250225
20140217	706118060	405809267
20131120	166163002	86951218
20131111	147986439	227124951
20131102	62132300	26705506
20131024	69530830	58183921
20131015	85704304	35099198
20131012	81901136	31322983
20131013	71149981	31850475
20131029	102050144	43052468
20131104	300027403	130970051
20131110	164813471	85293644
20131113	174554976	79121440
20131116	118085705	28996272
20131119	131365419	70308235
20131122	128528687	82717921
20131125	147143602	70113127
20131203	268666856	101662045
20131209	174652812	133355299
20131214	129045596	52234651
20131208	124886903	76314292
20131205	190825287	118256819
20131124	102394991	85555947
20131121	221166938	74733962
20131118	195766477	136916782
20131112	148469647	131263524
20131106	181474910	131707820
20131028	129993375	55089464
20131025	69936453	52225802
20131022	145611496	96172537
20131019	53148443	26684697
20140216	331324628	183057224
20131216	196489931	156157326
20131213	164672690	90778214
20131210	179471612	147147341
20131207	138882605	71916047
20131123	89109018	48882131
20140223	337053711	195214052
20140313	346286910	316973542
20140310	497338076	308040624
20140317	339008082	440624592
20140314	315897431	311575572
20140311	430500816	496039886
20140302	276202230	246199417
20140225	563505889	299155352
20140309	244752519	206312503
20140315	287407002	242799048
20140318	435479117	410240726
20140221	353981687	178914448
20140218	593563145	271125324
20140212	763009770	354585919
20140209	341911234	262954049
20140206	227860276	111635336
20140205	209043990	95407704
20140208	358255468	183432237
20140211	818803205	243058162
20140220	478925191	230855410
20140207	578818221	273482213
20140201	64287172	61527060
20140117	345426853	127385210
20140116	374947083	153167426
20140119	185082022	117629351
20140125	440854623	185524856
20140127	657199484	280645861
20131220	133867400	153678963
20131217	193980797	116537843
20140112	228255344	141505812
20131221	112000779	83151594
20131222	154543217	83998249
20131228	192808006	61383220
20140102	434956739	190155450
20140105	206030707	156781996
20140113	447207050	178923772
20140110	237797636	117259153
20140107	589726496	137972793
20131224	305406334	118607911
20130824	44367268	14316598
20131011	82674757	27666473
20131002	11562708	7233874
20130930	28398050	30181353
20130921	32406340	26596179
20130906	38923186	33417903
20130831	47655303	22012016
20130828	46602319	33696861
20130816	41683536	33739844
20130813	48720919	28790328
20130804	45745254	8263965
20130729	53512076	18599364
20130723	58136147	24404051
20131010	72094613	52244531
20130820	53675509	30131225
20130817	17670519	4674983
20130811	55519543	15372680
20130805	43632203	15797507
20130730	47481243	13048582
20130724	48422518	36258592
20130725	57433418	38212836
20130731	54569637	9534040
20130803	21595789	20701797
20130806	50866184	16401387
20130815	42421222	52659327
20130818	59703092	15218300
20130827	45268028	60137633
20130830	56488844	20630752
20130908	38796909	14919122
20130911	98944459	71674082
20130914	30975443	36312660
20131004	19907689	23412350
20130904	33366708	27452600
20130907	34076183	20344424
20130910	94684143	37128363
20130919	24778048	11418512
20131003	10101194	6223263
20131009	72422299	33999376
20130711	44075197	3508800
20130714	22615303	2784107
20130720	20439079	4601143
20130718	24234505	11765356
20130715	48128555	13107943
20130712	34183904	8492573
20130706	36751272	1616635
20130710	30696506	2597169
20130713	15164717	3482829
20130719	33680124	9244769
20130722	40448896	19144267
20140817	149978271	139564084
20140824	130195484	191080151
20140809	160262764	163611708
20140806	288821016	282346594
20140731	191728916	277194379
20140704	211649838	264494550
20140412	177642053	123295320
20140406	129477254	139576683
20140325	314345006	312710515
20140322	191700135	138039412
20140828	245082751	297893861
20140819	266401973	254929877
20140807	247646474	253659514
20140801	374884735	252540858
20140726	128268053	282653341
20140720	176449304	174462836
20140705	169383796	272535138
20140702	384555819	328950951
20140327	266231082	359071642
20140330	205533934	264714811
20140405	202336542	163199682
20140408	354770149	250015131
20140411	237829882	277077434
20140423	313677307	278470936
20140426	151625415	158122962
20140429	330607104	307578349
20140507	417327518	239372999
20140510	287240171	147248328
20140516	231967423	282094916
20140525	166943526	231004758
20140528	276134813	415891684
20140531	146823669	142862063
20140606	301413900	274862067
20140609	366114374	287520152
20140618	335262709	421920230
20140621	177999186	231003775
20140624	245450766	428471509
20140627	264282703	399444352
20140320	365011495	336076380
20140401	453320585	277429358
20140404	251895894	200192637
20140413	208172985	178934722
20140419	268729366	146374940
20140428	324937272	327724735
20140503	185094488	199568043
20140509	281479009	247743971
20140512	325108597	293952908
20140518	164419642	153440019
20140521	297799722	250223726
20140602	158219402	170409506
20140608	302171269	169525332
20140611	327661453	246127540
20140617	270350693	502560223
20140620	251582530	286583065
20140626	297628039	418742109
20140629	158801540	261170799
20140701	384428753	374164541
20140418	239300383	225952909
20140421	301134667	295635256
20140424	318358891	224536754
20140430	260091330	281835975
20140511	182424063	152945581
20140520	453955303	260040720
20140523	249546195	229000787
20140601	183489775	149829253
20140604	274460744	303978838
20140607	187801995	146203577
20140610	354031597	298190025
20140613	216923770	386799040
20140616	387308484	492349489
20140619	338609087	284956260
20140703	297894643	289009780
20140715	334810012	261722182
20140721	378088594	434191479
20140213	626698794	362757986
20131117	111786622	42358155
20131114	167996058	52554150
20131108	231749471	92856307
20131105	180904814	82689864
20131030	109663260	63908701
20131027	68007543	46520570
20131021	106030290	45781403
20131018	98151450	15938355
20131014	130315300	40652625
20131017	87203217	23274764
20131020	47766681	50884342
20131023	106093112	73788462
20131026	55752506	49357211
20131101	150904945	66933119
20131107	167963464	83306452
20131128	139760425	61453591
20131206	152197159	82625571
20131212	140360007	92090275
20131215	205882387	54757146
20140215	255393179	121463693
20131211	239916977	110314592
20131202	345313258	128465086
20131130	116865492	119660311
20131127	134774128	86129498
20131115	124270704	58210254
20131109	126467028	45885771
20131103	68707870	30895754
20131031	90926701	55607833
20131016	100948091	51261982
20131204	219456372	69981329
20131201	137889992	94632233
20131129	122536344	123982166
20131126	258037702	65802983
20140226	680801145	284819682
20140301	362865580	211279011
20140304	524146340	250562978
20140319	359014064	429298917
20140316	269387391	390584547
20140307	380139779	291087220
20140308	243274169	140323202
20140305	454295491	209072753
20140227	672909288	492786036
20140224	656317045	473470156
20140204	83653646	44927333
20140222	282022706	134735118
20140228	428721754	322030204
20140303	505305862	513017360
20140306	561787770	243149884
20140312	377007897	416491268
20140203	84329848	46109194
20140131	87324175	48132389
20140202	57912761	24395720
20140214	490710434	263261899
20140219	408367966	323990451
20140210	647465140	431753411
20140126	528402520	331752519
20130929	34433848	40295202
20140129	952479658	215164666
20140123	481450097	209024328
20140120	335761027	442996084
20140114	356907128	159778389
20130721	21142394	2681331
20140122	498730606	292856188
20140128	843424742	248334316
20140130	227916211	105288105
20140124	614103894	345829001
20140121	412854611	302825136
20140118	233200688	142869842
20140115	391408718	177618247
20131229	161473347	55129989
20131226	220287409	241852564
20140103	342074805	127714255
20131223	198747367	138478245
20140106	442494042	190917629
20140109	280698487	140391237
20131218	180215804	202220872
20131219	233631357	164023344
20131225	380076148	172755480
20131231	265851934	134232530
20140108	264025160	213880074
20140111	243403438	71530182
20140104	186085910	99869074
20140101	330926565	77367755
20131230	286358778	179826624
20131227	211118967	163522548
20130925	47274168	74838162
20131008	109806912	77856096
20131005	26364686	13966453
20130927	42166121	27639603
20130924	74250859	56680792
20130918	42926608	47203221
20130915	37059683	25020715
20130912	68573684	45147220
20130909	49312473	45621186
20130903	80507880	39328144
20130825	43306288	18117808
20130822	89130737	25000672
20130819	109184209	28339260
20130810	20615307	26614408
20130807	43908081	29708706
20130801	53369962	18864468
20130726	44721817	39192369
20130829	47518666	35944968
20130901	56239802	59339949
20130826	74496541	20637986
20130823	37878575	25680323
20130814	30541167	15031683
20130808	44493490	29551691
20130802	38064536	40150769
20130727	17194451	15058893
20130728	36255382	7683211
20130809	33425186	30131015
20130812	94520502	28586669
20130821	38184446	29983342
20130902	140844739	17785524
20130905	38716505	21800904
20130917	76204815	58260798
20130920	28365762	15188254
20130923	54658160	58285879
20130926	68718693	48467529
20131001	19137499	12813267
20131007	42797733	28894400
20130913	71655946	60512675
20130916	161656210	45184589
20130922	67514169	47962055
20130928	39995798	58105720
20131006	38730653	13464528
20130717	29015682	10911513
20130705	11648749	2763587
20130702	29037390	2554548
20130708	57258266	8347729
20130709	26798941	3473059
20130703	27270770	5953867
20130701	32488348	5525022
20130704	18321185	6410729
20130707	8962232	3982735
20130716	50622847	11864981

2、活跃用户分析

使用 user_balance_table ,定义活跃用户为在指定月份内有至少5天记录的用户,统计2014年8月的活跃用户总数。

输出格式如下:

<活跃用户总数>

代码如下:

python 复制代码
# 提取user_id、report_date字段
mapped_rdd = split_rdd.map(lambda fields: (fields[0], datetime.strptime(fields[1], "%Y%m%d")))
# 筛选出2014年8月的记录
filtered_rdd = mapped_rdd.filter(lambda x: x[1].year == 2014 and x[1].month == 8)
# 聚合
grouped_rdd = filtered_rdd.groupByKey()
# 统计记录数量,过滤活跃用户
active_users_rdd = grouped_rdd.filter(lambda x: len(list(x[1])) >= 5)

# 统计活跃用户总数
total_active_users = active_users_rdd.map(lambda x: x[0]).distinct().count()

print("已完成2、活跃用户分析\n2014年8月期间的活跃用户总数为:", total_active_users)

三、PySpark SQL编程

1、按城市统计2014年3月1日的平均余额

计算每个城市在2014年3月1日的用户平均余额(tBalance),按平均余额降序排列。

输出格式如下:

<城市ID> <平均余额>

代码如下:

python 复制代码
spark = SparkSession.builder.getOrCreate()
# 从HDFS读取CSV文件
df_bal = spark.read.csv("/data/user_balance_table.csv", header=True, inferSchema=True)
df_city = spark.read.csv("/data/user_profile_table.csv", header=True, inferSchema=True)

df = df_bal.select("user_id", "report_date", "tBalance")\
           .filter(col("report_date") == "20140301")\
           .join(df_city.select("user_id","city"), on="user_id")

result_df = df.groupBy("city").agg(mean(col("tBalance")).cast("int").alias("tBalance_mean"))\
           .sort(col("tBalance_mean").desc())

# 将结果转换为RDD,并进行格式调整(添加表头,并以tab分隔每列)
data_rdd = result_df.rdd.map(lambda row: "\t".join(map(str, row)))
result2 = sc.parallelize(["城市ID\t平均余额"] + data_rdd.collect())
result2.saveAsTextFile("/output/result2.txt")
print("已完成1、按城市统计2014年3月1日的平均余额")

输出结果如下:

python 复制代码
城市ID	平均余额
6281949	2795923
6301949	2650775
6081949	2643912
6481949	2087617
6411949	1929838
6412149	1896363
6581949	1526555

2、统计每个城市总流量前3高的用户

统计每个城市中每个用户在2014年8月的总流量(定义为total purchase_amt+total_redeem_amt),并输出每个城市总流量排名前三的用户ID及其总流量。

输出格式如下:

<城市ID> <用户ID> <总流量>

代码如下:

python 复制代码
windowSpec = Window.partitionBy("city").orderBy(col("total_amt").desc())
df = df_bal.select("user_id", "report_date", "total_purchase_amt", "total_redeem_amt")\
           .filter(df_bal.report_date.like("201408%"))\
           .join(df_city.select("user_id","city"), on="user_id")

result_df = df.groupBy("city", "user_id") \
            .agg((sum(col("total_purchase_amt")) + sum(col("total_redeem_amt"))).alias("total_amt"))\
            .filter(col("total_amt") > 0)\
            .withColumn("rn", row_number().over(windowSpec))\
            .filter(col("rn") <= 3)\
            .sort(col("city"), col("rn"))\
            .select("city", "user_id", "total_amt")

# 输出结果
data_rdd = result_df.rdd.map(lambda row: "\t".join(map(str, row)))
result3 = sc.parallelize(["城市ID\t用户ID\t总流量"] + data_rdd.collect())
result3.saveAsTextFile("/output/result3.txt")
print("已完成2、统计每个城市总流量前3高的用户")

sc.stop()
spark.stop()

输出结果如下:

python 复制代码
城市ID	用户ID	总流量
6081949	27235	108475680
6081949	27746	76065458
6081949	18945	55304049
6281949	15118	149311909
6281949	11397	124293438
6281949	25814	104428054
6301949	2429	109171121
6301949	26825	95374030
6301949	10932	74016744
6411949	662	75162566
6411949	21030	49933641
6411949	16769	49383506
6412149	22585	200516731
6412149	14472	138262790
6412149	25147	70594902
6481949	12026	51161825
6481949	670	49626204
6481949	14877	34488733
6581949	9494	38854436
6581949	26876	23449539
6581949	21761	21136440

四、总结

RDD编程主要练习了filter、map、reduceByKey、saveAsTextFile、groupByKey等算子的使用,Spark SQL编程主要练习了DataFrame操作、聚合函数、窗口函数等内容。

相关推荐
查士丁尼·绵2 小时前
面试-字符串1
python
Ase5gqe2 小时前
大数据-259 离线数仓 - Griffin架构 修改配置 pom.xml sparkProperties 编译启动
xml·大数据·架构
村口蹲点的阿三2 小时前
Spark SQL 中对 Map 类型的操作函数
javascript·数据库·hive·sql·spark
史嘉庆2 小时前
Pandas 数据分析(二)【股票数据】
大数据·数据分析·pandas
小兜全糖(xdqt)3 小时前
python中单例模式
开发语言·python·单例模式
Python数据分析与机器学习3 小时前
python高级加密算法AES对信息进行加密和解密
开发语言·python
noravinsc3 小时前
python md5加密
前端·javascript·python
fajianchen3 小时前
记一次线上SQL死锁事故:如何避免死锁?
数据库·sql
chengpei1473 小时前
实现一个自己的spring-boot-starter,基于SQL生成HTTP接口
java·数据库·spring boot·sql·http