int in PySpark

In PySpark, integer types differ from the int of Python or pandas because they come from Spark SQL's data type system. The sections below describe them in detail.

1. PySpark integer types

PySpark mainly uses IntegerType (32-bit) and LongType (64-bit) to represent integers, corresponding to INT and BIGINT in SQL:

PySpark type	SQL type	Width	Value range	Storage
`IntegerType`	`INT`	32-bit	`-2,147,483,648` to `2,147,483,647`	4 bytes
`LongType`	`BIGINT`	64-bit	`-9,223,372,036,854,775,808` to `9,223,372,036,854,775,807`	8 bytes
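
The ranges in the table follow from the bit width of a signed two's-complement integer. A minimal sketch in plain Python; the helper name signed_range is made up for illustration and is not part of PySpark:

```python
def signed_range(bits):
    """Return (min, max) of a signed two's-complement integer with `bits` bits."""
    half = 2 ** (bits - 1)
    return -half, half - 1


INT_MIN, INT_MAX = signed_range(32)    # IntegerType / INT
LONG_MIN, LONG_MAX = signed_range(64)  # LongType / BIGINT

print(f"{INT_MIN:,} to {INT_MAX:,}")
print(f"{LONG_MIN:,} to {LONG_MAX:,}")
```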

2. How to specify an integer type

In PySpark, an integer type can be set explicitly through a StructType schema or with withColumn:

(1) When creating a DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("int_example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]

# Define the schema explicitly with StructType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),  # IntegerType (32-bit)
])

df = spark.createDataFrame(data, schema)
df.printSchema()

Output:

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)  # 32-bit integer

(2) Converting a column's type

from pyspark.sql.functions import col

# Cast the age column from IntegerType to LongType (64-bit)
df = df.withColumn("age", col("age").cast("long"))  # or .cast(LongType())
df.printSchema()

Output:

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)  # 64-bit integer

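3. Default integer type

When no schema is supplied, PySpark infers Python int values as LongType (64-bit), not IntegerType:
- The inferred type does not depend on magnitude: 100 and 3000000000 both produce a long column.
- To get IntegerType, declare it in the schema or cast the column, as in section 2.

Example: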

data = [("A", 100), ("B", 3000000000)]  # 3000000000 does not fit in 32 bits
df = spark.createDataFrame(data, ["name", "value"])
df.printSchema()

Output:

root
 |-- name: string (nullable = true)
 |-- value: long (nullable = true)  # Python int is inferred as LongType

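4. Choosing between `IntegerType` and `LongType`

Scenario	Recommended type	Reason
Memory optimization	`IntegerType`	A 32-bit value is half the raw size of a 64-bit one; actual savings depend on the storage format and compression
Large values	`LongType`	Avoids overflow (e.g. IDs, timestamps, large monetary amounts)
Compatibility	`LongType`	Some database columns (e.g. MySQL `BIGINT`) need 64 bits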
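
One way to apply the table is to check the observed minimum and maximum before fixing a type. The sketch below is plain Python; pick_integer_type is a hypothetical helper, and the strings it returns are the Spark SQL type names that could then be passed to cast().

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def pick_integer_type(values):
    """Return "int" if every value fits in 32 bits, "bigint" if 64 bits are needed.

    Raises ValueError for an empty input or a value outside the 64-bit range.
    """
    values = list(values)
    if not values:
        raise ValueError("no values to inspect")
    lo, hi = min(values), max(values)
    if INT_MIN <= lo and hi <= INT_MAX:
        return "int"
    if LONG_MIN <= lo and hi <= LONG_MAX:
        return "bigint"
    raise ValueError("value does not fit in 64 bits")


print(pick_integer_type([25, 30, 35]))
print(pick_integer_type([100, 3_000_000_000]))
```

A sample only bounds the values seen so far; when future values may be larger, LongType remains the safer choice.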

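5. FAQ

(1) How does PySpark's `int` differ from Python's `int`?

Python int: arbitrary precision. It is not a fixed 64-bit type and can grow as large as memory allows.
PySpark IntegerType: fixed at 32 bits, like int in Java and on most C platforms.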
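
The difference can be seen without Spark. The sketch below simulates two's-complement wrap-around in plain Python; wrap_int32 is a hypothetical helper. What Spark itself does on overflow depends on the operation and on settings such as ANSI mode, so treat this only as an illustration of what a fixed width means.

```python
def wrap_int32(x):
    """Reduce a Python int to the value a 32-bit two's-complement integer would hold."""
    return (x + 2**31) % 2**32 - 2**31


largest = 2**31 - 1              # largest 32-bit value

print(largest + 1)               # a Python int simply grows
print(wrap_int32(largest + 1))   # a 32-bit integer wraps around to the minimum
print(2**100)                    # still exact in Python
```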

(2) How do I check a column's type?

df.schema["age"].dataType  # returns IntegerType() or LongType()

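(3) Why does PySpark sometimes produce `LongType` automatically?

Schema inference maps every Python int to LongType, so a column built from Python integers without an explicit schema is long whether or not its values exceed the IntegerType range (about ±2.1 billion).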

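6. Summary

Feature	`IntegerType` (32-bit)	`LongType` (64-bit)
Storage	4 bytes	8 bytes
Range	about ±2.1 billion	about ±9.22 × 10^18
Default behaviour	Only when declared in a schema or cast	Inferred for Python int
Use cases	Memory optimization, small to medium values	Large values, IDs, timestamps

Recommendations:

If the data range is known and small, prefer IntegerType to save memory.
For IDs, timestamps, or values of uncertain range, use LongType to avoid overflow.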
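
As a rough illustration of the memory argument, the raw size of a column is the row count times the bytes per value. The helper below is hypothetical and deliberately ignores compression, null tracking, and Spark's internal row layout, so real savings may be smaller.

```python
BYTES_PER_VALUE = {"int": 4, "bigint": 8}


def raw_column_bytes(n_rows, sql_type):
    """Raw bytes needed for n_rows values of the given Spark SQL integer type."""
    return n_rows * BYTES_PER_VALUE[sql_type]


rows = 100_000_000
int_bytes = raw_column_bytes(rows, "int")
long_bytes = raw_column_bytes(rows, "bigint")

print(f"int:    {int_bytes / 1e6:.0f} MB")
print(f"bigint: {long_bytes / 1e6:.0f} MB")
print(f"saved:  {1 - int_bytes / long_bytes:.0%}")
```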
