PySpark DataFrame
df.foreach — apply a function to every row
df.foreach(f) is equivalent to df.rdd.foreach(f)
df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
def func(row):
    print(row.name)

# each Row object is passed into func
df.foreach(func)
Alice
Bob
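Note that foreach runs on the executors, so on a real cluster the print output lands in the executor logs rather than on the driver (the output above is from local mode). Its typical use is per-row side effects such as pushing records to an external system. A minimal sketch, where send_to_queue is a hypothetical helper standing in for whatever sink is actually used:

def send_row(row):
    # send_to_queue is hypothetical -- replace with your HTTP call, message-queue client, etc.
    send_to_queue({"name": row.name, "age": row.age})

df.foreach(send_row)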
foreachPartition — apply a function to each partition, iterating over its rows
df.show()
+---+-----+
|age| name|
+---+-----+
| 14| Tom|
| 23|Alice|
| 16| Bob|
+---+-----+
def func(itr):
    for person in itr:
        print(person.name)
df.foreachPartition(func)
Tom
Alice
Bob
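The point of foreachPartition is that setup cost is paid once per partition instead of once per row. A minimal sketch of the usual batch-write pattern, where get_connection is a hypothetical helper for your own data store:

def save_partition(itr):
    conn = get_connection()  # hypothetical: open one connection per partition
    for person in itr:
        conn.insert({"name": person.name, "age": person.age})
    conn.close()

df.foreachPartition(save_partition)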
freqItems — find frequent items in columns
df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
df.show()
+---+---+
| c1| c2|
+---+---+
| 1| 11|
| 1| 11|
| 3| 10|
| 4| 8|
| 4| 8|
+---+---+
df.freqItems(["c1", "c2"]).show()
+------------+------------+
|c1_freqItems|c2_freqItems|
+------------+------------+
| [1, 3, 4]| [8, 10, 11]|
+------------+------------+
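freqItems uses an approximate algorithm: by default it looks for items appearing in at least 1% of rows (support=0.01), and the result may contain false positives. The threshold can be raised; a sketch against the same df:

# keep only items that appear in at least 40% of rows
df.freqItems(["c1", "c2"], support=0.4).show()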
groupBy — group rows for aggregation
df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 2| Bob|
| 2| Bob|
| 5| Bob|
+---+-----+
df.groupBy("name").agg({"age": "sum"}).show()
+-----+--------+
| name|sum(age)|
+-----+--------+
| Bob| 9|
|Alice| 2|
+-----+--------+
df.groupBy("name").agg({"age": "max"}).withColumnRenamed('max(age)','new_age').sort('new_age').show()
+-----+-------+
| name|new_age|
+-----+-------+
|Alice| 2|
| Bob| 5|
+-----+-------+
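The dict form of agg plus withColumnRenamed works, but the same result is usually written with column functions and alias. A sketch against the same df:

from pyspark.sql import functions as F

df.groupBy("name").agg(F.max("age").alias("new_age")).sort("new_age").show()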
head — return the first n rows
df.head(2)
[Row(age=2, name='Alice'), Row(age=2, name='Bob')]
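Called without an argument, head() returns just the first Row (or None for an empty DataFrame) rather than a list:

df.head()
Row(age=2, name='Alice')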
hint — pass a hint, such as a broadcast join hint, to the query optimizer
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])
df.join(df2, "name").explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [name#1641, age#1640L, height#1644L]
   +- SortMergeJoin [name#1641], [name#1645], Inner
      :- Sort [name#1641 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(name#1641, 200), ENSURE_REQUIREMENTS, [plan_id=1916]
      :     +- Filter isnotnull(name#1641)
      :        +- Scan ExistingRDD[age#1640L,name#1641]
      +- Sort [name#1645 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(name#1645, 200), ENSURE_REQUIREMENTS, [plan_id=1917]
            +- Filter isnotnull(name#1645)
               +- Scan ExistingRDD[height#1644L,name#1645]
df.join(df2.hint("broadcast"), "name").explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [name#1641, age#1640L, height#1644L]
   +- BroadcastHashJoin [name#1641], [name#1645], Inner, BuildRight, false
      :- Filter isnotnull(name#1641)
      :  +- Scan ExistingRDD[age#1640L,name#1641]
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false]),false), [plan_id=1946]
         +- Filter isnotnull(name#1645)
            +- Scan ExistingRDD[height#1644L,name#1645]
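The same effect can be had with the broadcast function from pyspark.sql.functions, and join hints other than "broadcast" (for example "merge", "shuffle_hash", "shuffle_replicate_nl") are accepted as well. A sketch:

from pyspark.sql.functions import broadcast

# equivalent to df2.hint("broadcast"); should produce the same BroadcastHashJoin plan
df.join(broadcast(df2), "name").explain()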