Table of Contents
- [pyspark dataframe Column](#pyspark dataframe Column)
- [alias: rename a column (name)](#alias: rename a column (name))
- [astype (alias of cast): change the column type](#astype (alias of cast): change the column type)
- [between: filter by range](#between: filter by range)
- [bitwiseAND: bitwise AND](#bitwiseAND: bitwise AND)
- [contains: substring match](#contains: substring match)
- [withField: add/replace a struct field](#withField: add/replace a struct field)
- [dropFields: drop struct fields](#dropFields: drop struct fields)
- [getField: get a field by name](#getField: get a field by name)
- [startswith: string prefix match](#startswith: string prefix match)
- [endswith: string suffix match](#endswith: string suffix match)
- [like: SQL LIKE pattern match (case-sensitive)](#like: SQL LIKE pattern match (case-sensitive))
- [ilike: case-insensitive pattern match](#ilike: case-insensitive pattern match)
- [rlike: regex match](#rlike: regex match)
- [isNotNull: the column is not null](#isNotNull: the column is not null)
- [isNull: the column is null](#isNull: the column is null)
- [isin: filter by a set of values](#isin: filter by a set of values)
- [when used with otherwise](#when used with otherwise)
- [substr: return a substring of the column](#substr: return a substring of the column)
pyspark dataframe Column
alias: rename a column (name)
df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.select(df.age.alias("age2")).show()
+----+
|age2|
+----+
| 2|
| 5|
+----+
astype (alias of cast): change the column type
data.schema
StructType([StructField('name', StringType(), True), StructField('age', StringType(), True), StructField('id', StringType(), True), StructField('gender', StringType(), True)])
data.printSchema()
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
# Add a new column, using cast to change its type
from pyspark.sql.types import LongType
data.withColumn('age2',data['age'].cast(LongType())).show()
+-----+---+---+------+----+
| name|age| id|gender|age2|
+-----+---+---+------+----+
| ldsx| 12| 1| 男| 12|
|test1| 20| 1| 女| 20|
|test2| 26| 1| 男| 26|
|test3| 19| 1| 女| 19|
|test4| 51| 1| 女| 51|
|test5| 13| 1| 男| 13|
+-----+---+---+------+----+
# View the schema
data.withColumn('age2',data['age'].cast(LongType())).schema
StructType([StructField('name', StringType(), True), StructField('age', StringType(), True), StructField('id', StringType(), True), StructField('gender', StringType(), True), StructField('age2', LongType(), True)])
data.withColumn('age2',data['age'].cast(LongType())).printSchema()
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- age2: long (nullable = true)
# Combine with alias
data.select(data.age.cast("string").alias('ages')).collect()
between: filter by range
df.select('name',df.age.between(1,5)).show()
+-----+---------------------------+
| name|((age >= 1) AND (age <= 5))|
+-----+---------------------------+
|Alice| true|
| Bob| true|
+-----+---------------------------+
df.select('name',df.age.between(1,5).alias('ldsx')).show()
+-----+----+
| name|ldsx|
+-----+----+
|Alice|true|
| Bob|true|
+-----+----+
# filter: filter rows by the boolean value
df.filter(df.age.between(1,5)).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
bitwiseAND: bitwise AND
from pyspark.sql import Row
df = spark.createDataFrame([Row(a=170, b=75)])
df.show()
+---+---+
| a| b|
+---+---+
|170| 75|
+---+---+
# Bitwise AND between two columns
df.select(df.a.bitwiseAND(df.b)).show()
+-------+
|(a & b)|
+-------+
| 10|
+-------+
contains: substring match
Returns a boolean column based on a string match: whether the column's value contains the given substring.
data.select(data.name.contains('test')).show()
+--------------------+
|contains(name, test)|
+--------------------+
| false|
| true|
| true|
| true|
| true|
| true|
+--------------------+
data.filter(data.name.contains('test')).show()
+-----+---+---+------+
| name|age| id|gender|
+-----+---+---+------+
|test1| 20| 1| 女|
|test2| 26| 1| 男|
|test3| 19| 1| 女|
|test4| 51| 1| 女|
|test5| 13| 1| 男|
+-----+---+---+------+
withField: add/replace a struct field
An expression that adds/replaces a field in a StructType by name.
df = spark.createDataFrame([Row(a=Row(b=1, c=2))])
df.show()
+------+
| a|
+------+
|{1, 2}|
+------+
# Replace the value of field b inside the Row object in column a
from pyspark.sql.functions import lit
df.withColumn('a',df['a'].withField('b',lit(10086))).show()
+----------+
| a|
+----------+
|{10086, 2}|
+----------+
dropFields: drop struct fields
An expression that drops fields in a StructType by name. This is a no-op if the schema does not contain the field name.
df = spark.createDataFrame([Row(a=Row(b=1, c=2, d=3, e=Row(f=4, g=5, h=6)))])
df.show()
+--------------------+
| a|
+--------------------+
|{1, 2, 3, {4, 5, 6}}|
+--------------------+
# Modify column a: drop fields b and c from the Row object
df.withColumn('a', df['a'].dropFields('b', 'c')).show()
+--------------+
| a|
+--------------+
|{3, {4, 5, 6}}|
+--------------+
# Drop nested fields inside the Row object
from pyspark.sql.functions import col
df.withColumn("a", col("a").dropFields("e.g", "e.h")).show()
+--------------+
| a|
+--------------+
|{1, 2, 3, {4}}|
+--------------+
getField: get a field by name
An expression that gets a field by name in a StructType.
df = spark.createDataFrame([Row(r=Row(a=1, b="b"))])
df.show()
+------+
| r|
+------+
|{1, b}|
+------+
df.select(df.r.getField("b")).show()
+---+
|r.b|
+---+
| b|
+---+
df.select(df.r.b).show()
+---+
|r.b|
+---+
| b|
+---+
startswith: string prefix match
Returns a boolean column based on a string match: whether the column's value starts with the given string.
endswith: string suffix match
Returns a boolean column based on a string match: whether the column's value ends with the given string.
df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
df.filter(df.name.endswith('ice')).collect()
[Row(age=2, name='Alice')]
df.select(df.name.endswith('ice')).show()
+-------------------+
|endswith(name, ice)|
+-------------------+
| true|
| false|
+-------------------+
like: SQL LIKE pattern match (case-sensitive)
# ilike is case-insensitive, so 'ali%' matches 'Alice'
df.filter(df.name.ilike('ali%')).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
+---+-----+
# like is case-sensitive: 'ali%' matches no rows
df.filter(df.name.like('ali%')).show()
+---+----+
|age|name|
+---+----+
+---+----+
ilike: case-insensitive pattern match
Case-insensitive version of like.
df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.filter(df.name.ilike('%Ice')).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
+---+-----+
df.select(df.name.ilike('%Ice')).show()
+-----------------+
|ilike(name, %Ice)|
+-----------------+
| true|
| false|
+-----------------+
rlike: regex match
df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name='Alice')]
isNotNull: the column is not null
df = spark.createDataFrame([Row(name='Tom', height=80), Row(name='Alice', height=None)])
df.filter(df.height.isNotNull()).show()
+----+------+
|name|height|
+----+------+
| Tom| 80|
+----+------+
df.select(df.height.isNotNull()).show()
+--------------------+
|(height IS NOT NULL)|
+--------------------+
| true|
| false|
+--------------------+
isNull: the column is null
df = spark.createDataFrame([Row(name='Tom', height=80), Row(name='Alice', height=None)])
df.filter(df.height.isNull()).show()
+-----+------+
| name|height|
+-----+------+
|Alice| null|
+-----+------+
df.select(df.height.isNull()).show()
+----------------+
|(height IS NULL)|
+----------------+
| false|
| true|
+----------------+
isin: filter by a set of values
df.select( df.age.isin([1,2,3])).show()
+------------------+
|(age IN (1, 2, 3))|
+------------------+
| true|
| false|
+------------------+
df.filter( df.age.isin([1,2,3])).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
+---+-----+
when used with otherwise
If Column.otherwise() is not invoked, None is returned for unmatched conditions.
df = spark.createDataFrame(
[(2, "Alice"), (5, "Bob")], ["age", "name"])
df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
# Filter with a condition; when not paired with otherwise, unmatched rows default to null
from pyspark.sql.functions import when
df.select(df.name, when(df.age > 3, 1)).show()
+-----+------------------------------+
| name|CASE WHEN (age > 3) THEN 1 END|
+-----+------------------------------+
|Alice| null|
| Bob| 1|
+-----+------------------------------+
# Use otherwise to supply a value in place of null
df.select(df.name, when(df.age > 3, 1).otherwise(0)).show()
+-----+-------------------------------------+
| name|CASE WHEN (age > 3) THEN 1 ELSE 0 END|
+-----+-------------------------------------+
|Alice| 0|
| Bob| 1|
+-----+-------------------------------------+
substr: return a substring of the column
substr(startPos, length): the start position is 1-based, and 0 and 1 both start at the first character; the second argument is the length, not an end index.
df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
# Return the first two characters of name
df.select(df.name.substr(1, 2)).show()
+---------------------+
|substring(name, 1, 2)|
+---------------------+
| Al|
| Bo|
+---------------------+
df.select(df.name.substr(2, 10).alias("col")).show()
+----+
| col|
+----+
|lice|
| ob|
+----+