pandas销售数据分析
数据保存在data目录
- 消费者数据:customers.csv
- 商品数据:products.csv
- 交易数据:transactions.csv
customers.csv数据结构:
字段 | 描述 |
---|---|
customer_id | 客户ID |
gender | 性别 |
age | 年龄 |
region | 地区 |
membership_date | 会员日期 |
products.csv数据结构:
字段 | 描述 |
---|---|
product_id | 产品ID |
category | 产品类别 |
brand | 品牌 |
price | 价格 |
transactions.csv数据结构:
字段 | 描述 |
---|---|
transaction_id | 交易ID |
customer_id | 客户ID |
product_id | 产品ID |
quantity | 购买数量 |
transaction_date | 交易日期 |
price | 交易价格 |
amount | 交易金额 |
加载CSV数据,编写代码完成以下需求:
- 计算每个客户的总消费金额
- 计算每个客户的平均订单金额
- 按产品类别统计销售总额和销售量
- 按性别统计客户数量
- 创建年龄分布直方图数据
- 计算每个月的销售总额(时间序列分析)
- 找出最畅销的10种产品
- 找出消费最高的10个客户
- 计算不同品牌产品的平均价格
- 创建产品类别和性别之间的交叉表
- 创建产品类别和年龄组之间的交叉表
- 创建区域和产品类别之间的交叉表
- 创建性别和区域之间的交叉表
- 计算每个客户的首次购买日期和最近购买日期
- 计算客户生命周期价值(CLV)假设为一年
- 创建一个透视表,显示每个区域、每个类别的销售总额
- 创建一个透视表,显示每个月、每个类别的销售总额
- 创建一个透视表,显示每个区域、每个性别在各个类别上的平均消费
- 计算每个客户的购买频率(每年购买次数)
- 分析会员时长与消费金额之间的关系
导包
python
import pandas as pd # 导入pandas库,用于数据处理和分析
import numpy as np # 导入numpy库,用于数值计算
from datetime import datetime, timedelta # 导入datetime和timedelta模块,用于处理日期和时间
import os # 导入os库,用于操作系统相关功能,如文件和目录操作
# 创建results目录(如果不存在)
os.makedirs('results', exist_ok=True)
加载数据
python
# 加载数据
customers = pd.read_csv('data/customers.csv') # 加载客户数据
products = pd.read_csv('data/products.csv') # 加载产品数据
transactions = pd.read_csv('data/transactions.csv') # 加载交易数据
# 将日期列转换为datetime类型
customers['membership_date'] = pd.to_datetime(customers['membership_date']) # 将会员日期列转换为datetime类型
transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date']) # 将交易日期列转换为datetime类型
1. 计算每个客户的总消费金额
python
# 1. 计算每个客户的总消费金额
customer_spending = transactions.groupby('customer_id')['amount'].sum().reset_index() # 按客户ID分组,计算每个客户的总消费金额
customer_spending.columns = ['customer_id', 'total_spending'] # 重命名列名
customer_spending.to_csv('results/customer_spending.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(customer_spending)
customer_id total_spending
0 1 1124.02
1 2 1710.33
2 3 1595.10
3 4 3881.27
4 5 3871.16
.. ... ...
995 996 1880.70
996 997 2683.27
997 998 1631.01
998 999 2473.05
999 1000 2608.60
[1000 rows x 2 columns]
2. 计算每个客户的平均订单金额
python
# 2. 计算每个客户的平均订单金额
avg_order_amount = transactions.groupby('customer_id').agg(
total_spending=('amount', 'sum'), # 计算每个客户的总消费金额
num_transactions=('transaction_id', 'count') # 计算每个客户的交易次数
).reset_index()
avg_order_amount['avg_order_amount'] = avg_order_amount['total_spending'] / avg_order_amount[
'num_transactions'] # 计算平均订单金额
avg_order_amount.to_csv('results/avg_order_amount.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(avg_order_amount)
customer_id total_spending num_transactions avg_order_amount
0 1 1124.02 10 112.402000
1 2 1710.33 9 190.036667
2 3 1595.10 11 145.009091
3 4 3881.27 14 277.233571
4 5 3871.16 13 297.781538
.. ... ... ... ...
995 996 1880.70 10 188.070000
996 997 2683.27 11 243.933636
997 998 1631.01 9 181.223333
998 999 2473.05 10 247.305000
999 1000 2608.60 8 326.075000
[1000 rows x 4 columns]
3. 按产品类别统计销售总额和销售量
python
# 3. 按产品类别统计销售总额和销售量
category_sales = pd.merge(transactions, products, on='product_id', how='left') # 将交易数据和产品数据按产品ID合并
category_sales = category_sales.groupby('category').agg(
total_sales=('amount', 'sum'), # 按产品类别分组,计算销售总额
total_quantity=('quantity', 'sum') # 按产品类别分组,计算销售量
).reset_index()
category_sales.to_csv('results/category_sales.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(category_sales)
category total_sales total_quantity
0 Books 279650.98 5903
1 Clothing 453744.32 4887
2 Electronics 552730.76 7566
3 Food 306883.36 5883
4 Home 382538.62 5863
4. 按性别统计客户数量
python
# 4. 按性别统计客户数量
gender_distribution = customers['gender'].value_counts().reset_index() # 统计不同性别的客户数量
gender_distribution.columns = ['gender', 'count'] # 重命名列名
gender_distribution.to_csv('results/gender_distribution.csv', index=False) # 将结果保存为CSV文件,不保存索引
5. 创建年龄分布直方图数据
python
# 5. 创建年龄分布直方图数据
age_bins = [18, 25, 35, 45, 55, 65, 80] # 定义年龄分组区间
age_labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+'] # 定义年龄分组标签
customers['age_group'] = pd.cut(customers['age'], bins=age_bins, labels=age_labels) # 将客户年龄分组
age_distribution = customers['age_group'].value_counts().sort_index().reset_index() # 统计每个年龄组的客户数量
age_distribution.columns = ['age_group', 'count'] # 重命名列名
age_distribution.to_csv('results/age_distribution.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(age_distribution)
age_group count
0 18-25 132
1 26-35 349
2 36-45 325
3 46-55 126
4 56-65 21
5 66+ 0
6. 计算每个月的销售总额(时间序列分析)
python
# 6. 计算每个月的销售总额(时间序列分析)
transactions['month'] = transactions['transaction_date'].dt.to_period('M') # 提取交易日期的月份
monthly_sales = transactions.groupby('month')['amount'].sum().reset_index() # 按月份分组,计算每个月的销售总额
monthly_sales['month'] = monthly_sales['month'].astype(str) # 将月份转换为字符串类型
monthly_sales.to_csv('results/monthly_sales.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(monthly_sales)
month amount
0 2024-07 120410.93
1 2024-08 163879.27
2 2024-09 147267.93
3 2024-10 168343.63
4 2024-11 159979.17
5 2024-12 180396.79
6 2025-01 161902.65
7 2025-02 147212.28
8 2025-03 170734.04
9 2025-04 168111.73
10 2025-05 177653.05
11 2025-06 163866.35
12 2025-07 45790.22
7. 找出最畅销的10种产品
python
# 7. 找出最畅销的10种产品
top_products = transactions.groupby('product_id').agg(
total_quantity=('quantity', 'sum'), # 按产品ID分组,计算每种产品的销售总量
total_sales=('amount', 'sum') # 按产品ID分组,计算每种产品的销售总额
).reset_index()
top_products = pd.merge(top_products, products[['product_id', 'category', 'brand']], on='product_id', how='left') # 将产品信息合并到统计结果中
top_products = top_products.sort_values('total_sales', ascending=False).head(10) # 按销售总额降序排序,取前10种产品
top_products.to_csv('results/top_products.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(top_products)
product_id total_quantity total_sales category brand
38 39 627 132033.66 Clothing BrandB
36 37 623 103498.99 Electronics BrandE
15 16 605 92038.65 Clothing BrandE
26 27 588 87329.76 Home BrandE
40 41 645 82269.75 Electronics BrandA
48 49 518 69924.82 Food BrandD
23 24 659 69419.06 Electronics BrandA
4 5 684 54473.76 Electronics BrandD
25 26 606 54412.74 Electronics BrandE
10 11 654 53850.36 Electronics BrandD
8. 找出消费最高的10个客户
python
# 8. 找出消费最高的10个客户
top_customers = customer_spending.sort_values('total_spending', ascending=False).head(10) # 按总消费金额降序排序,取前10个客户
top_customers = pd.merge(top_customers, customers[['customer_id', 'gender', 'age', 'region']], on='customer_id', how='left') # 将客户信息合并到统计结果中
top_customers.to_csv('results/top_customers.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(top_customers)
customer_id total_spending gender age region
0 903 4568.08 Female 23 East
1 763 4421.28 Male 50 West
2 708 4409.81 Female 35 North
3 18 4405.35 Female 35 West
4 841 4353.57 Male 35 East
5 421 4266.48 Female 37 West
6 694 4037.89 Female 43 East
7 870 3987.35 Female 22 North
8 791 3925.87 Female 18 East
9 741 3888.52 Male 38 West
9. 计算不同品牌产品的平均价格
python
# 9. 计算不同品牌产品的平均价格
brand_prices = products.groupby('brand')['price'].agg(['mean', 'min', 'max', 'std']).reset_index() # 按品牌分组,计算每种品牌产品的平均价格、最低价格、最高价格和标准差
brand_prices.columns = ['brand', 'avg_price', 'min_price', 'max_price', 'price_std'] # 重命名列名
brand_prices.to_csv('results/brand_prices.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(brand_prices)
brand avg_price min_price max_price price_std
0 BrandA 62.249000 24.06 127.55 35.000554
1 BrandB 75.945000 17.22 210.58 70.256401
2 BrandC 48.707143 24.97 71.86 17.941207
3 BrandD 59.621875 14.78 134.99 28.791657
4 BrandE 80.880909 22.71 166.13 51.561905
10. 创建产品类别和性别之间的交叉表
python
# 10. 创建产品类别和性别之间的交叉表
merged_data = pd.merge(transactions, products, on='product_id', how='left') # 将交易数据和产品数据按产品ID合并
merged_data = pd.merge(merged_data, customers, on='customer_id', how='left') # 将合并后的数据和客户数据按客户ID合并
category_gender_crosstab = pd.crosstab(merged_data['category'], merged_data['gender']) # 创建产品类别和性别之间的交叉表
category_gender_crosstab.to_csv('results/category_gender_crosstab.csv') # 将结果保存为CSV文件
print(category_gender_crosstab)
gender Female Male
category
Books 998 964
Clothing 835 803
Electronics 1273 1207
Food 1050 918
Home 976 976
11. 创建产品类别和年龄组之间的交叉表
python
category_age_crosstab = pd.crosstab(merged_data['category'], merged_data['age_group']) # 创建产品类别和年龄组之间的交叉表
category_age_crosstab.to_csv('results/category_age_crosstab.csv') # 将结果保存为CSV文件
print(category_age_crosstab)
age_group 18-25 26-35 36-45 46-55 56-65
category
Books 267 670 623 273 44
Clothing 231 581 499 213 38
Electronics 337 884 796 308 50
Food 269 662 664 254 32
Home 251 709 616 232 39
12. 创建区域和产品类别之间的交叉表
python
# 12. 创建区域和产品类别之间的交叉表
region_category_crosstab = pd.crosstab(merged_data['region'], merged_data['category']) # 创建区域和产品类别之间的交叉表
region_category_crosstab.to_csv('results/region_category_crosstab.csv') # 将结果保存为CSV文件
print(region_category_crosstab)
category Books Clothing Electronics Food Home
region
East 474 391 592 467 467
North 448 390 588 463 468
South 472 419 601 485 465
West 568 438 699 553 552
13. 创建性别和区域之间的交叉表
python
# 13. 创建性别和区域之间的交叉表
gender_region_crosstab = pd.crosstab(customers['gender'], customers['region']) # 创建性别和区域之间的交叉表
gender_region_crosstab.to_csv('results/gender_region_crosstab.csv') # 将结果保存为CSV文件
print(gender_region_crosstab)
region East North South West
gender
Female 117 119 133 141
Male 123 121 110 136
14. 计算每个客户的首次购买日期和最近购买日期
python
# 14. 计算每个客户的首次购买日期和最近购买日期
customer_dates = transactions.groupby('customer_id').agg(
first_purchase_date=('transaction_date', 'min'), # 按客户ID分组,计算每个客户的首次购买日期
last_purchase_date=('transaction_date', 'max') # 按客户ID分组,计算每个客户的最近购买日期
).reset_index()
customer_dates.to_csv('results/customer_dates.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(customer_dates)
customer_id first_purchase_date last_purchase_date
0 1 2024-08-06 16:32:05.579393 2025-03-07 16:32:05.579393
1 2 2024-11-02 16:32:05.579393 2025-07-04 16:32:05.579393
2 3 2024-07-11 16:32:05.579393 2025-05-11 16:32:05.579393
3 4 2024-07-15 16:32:05.579393 2025-07-03 16:32:05.579393
4 5 2024-08-16 16:32:05.579393 2025-07-08 16:32:05.579393
.. ... ... ...
995 996 2024-07-11 16:32:05.579393 2025-06-24 16:32:05.579393
996 997 2024-08-19 16:32:05.579393 2025-07-06 16:32:05.579393
997 998 2024-08-26 16:32:05.579393 2025-07-05 16:32:05.579393
998 999 2024-08-11 16:32:05.579393 2025-06-23 16:32:05.579393
999 1000 2024-08-01 16:32:05.579393 2025-04-09 16:32:05.579393
[1000 rows x 3 columns]
15. 计算客户生命周期价值(CLV)假设为一年
python
# 15. 计算客户生命周期价值(CLV)假设为一年
clv_data = pd.merge(customer_spending, customer_dates, on='customer_id', how='left') # 将客户消费数据和购买日期数据按客户ID合并
clv_data['customer_lifetime'] = (clv_data['last_purchase_date'] - clv_data[
'first_purchase_date']).dt.days / 365 # 计算客户生命周期(年)
clv_data['clv'] = clv_data['total_spending'] / (clv_data['customer_lifetime'] + 0.001) # 计算客户生命周期价值,避免除零错误
clv_data.to_csv('results/customer_clv.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(clv_data)
customer_id total_spending first_purchase_date \
0 1 1124.02 2024-08-06 16:32:05.579393
1 2 1710.33 2024-11-02 16:32:05.579393
2 3 1595.10 2024-07-11 16:32:05.579393
3 4 3881.27 2024-07-15 16:32:05.579393
4 5 3871.16 2024-08-16 16:32:05.579393
.. ... ... ...
995 996 1880.70 2024-07-11 16:32:05.579393
996 997 2683.27 2024-08-19 16:32:05.579393
997 998 1631.01 2024-08-26 16:32:05.579393
998 999 2473.05 2024-08-11 16:32:05.579393
999 1000 2608.60 2024-08-01 16:32:05.579393
last_purchase_date customer_lifetime clv
0 2025-03-07 16:32:05.579393 0.583562 1922.842547
1 2025-07-04 16:32:05.579393 0.668493 2554.663925
2 2025-05-11 16:32:05.579393 0.832877 1912.872702
3 2025-07-03 16:32:05.579393 0.967123 4009.065838
4 2025-07-08 16:32:05.579393 0.893151 4329.426869
.. ... ... ...
995 2025-06-24 16:32:05.579393 0.953425 1970.506509
996 2025-07-06 16:32:05.579393 0.879452 3047.604904
997 2025-07-05 16:32:05.579393 0.857534 1899.761141
998 2025-06-23 16:32:05.579393 0.865753 2853.233607
999 2025-04-09 16:32:05.579393 0.687671 3787.874207
[1000 rows x 6 columns]
16. 创建一个透视表,显示每个区域、每个类别的销售总额
python
# 16. 创建一个透视表,显示每个区域、每个类别的销售总额
region_category_pivot = pd.pivot_table(
merged_data,
values='amount', # 透视表的值为交易金额
index='region', # 透视表的行索引为区域
columns='category', # 透视表的列索引为产品类别
aggfunc='sum', # 聚合函数为求和
fill_value=0 # 缺失值填充为0
)
region_category_pivot.to_csv('results/region_category_pivot.csv') # 将结果保存为CSV文件
print(region_category_pivot)
category Books Clothing Electronics Food Home
region
East 69158.99 108802.65 128664.17 73067.98 88262.81
North 62821.45 108363.43 130980.59 74564.24 90982.75
South 68916.92 110948.88 136255.84 73532.48 93670.87
West 78753.62 125629.36 156830.16 85718.66 109622.19
17. 创建一个透视表,显示每个月、每个类别的销售总额
python
# 17. 创建一个透视表,显示每个月、每个类别的销售总额
month_category_pivot = pd.pivot_table(
merged_data,
values='amount', # 透视表的值为交易金额
index='month', # 透视表的行索引为月份
columns='category', # 透视表的列索引为产品类别
aggfunc='sum', # 聚合函数为求和
fill_value=0 # 缺失值填充为0
)
month_category_pivot.index = month_category_pivot.index.astype(str) # 将月份索引转换为字符串类型
month_category_pivot.to_csv('results/month_category_pivot.csv') # 将结果保存为CSV文件
print(month_category_pivot)
category Books Clothing Electronics Food Home
month
2024-07 19049.16 28345.99 32982.40 15546.22 24487.16
2024-08 22482.08 39866.22 44932.19 25444.10 31154.68
2024-09 23453.31 33326.84 42242.67 20403.07 27842.04
2024-10 24327.48 36415.63 44971.30 26200.94 36428.28
2024-11 20299.31 34942.33 46316.72 24822.69 33598.12
2024-12 24097.06 47046.58 50235.43 26123.97 32893.75
2025-01 26473.88 34748.91 42122.97 27490.10 31066.79
2025-02 20271.88 31493.03 40852.26 26059.20 28535.91
2025-03 20973.53 37191.20 48798.25 29035.07 34735.99
2025-04 22417.78 41892.14 45848.84 24966.29 32986.68
2025-05 24946.51 43113.17 58419.63 23434.67 27739.07
2025-06 25288.91 37229.08 41895.60 29361.03 30091.73
2025-07 5570.09 8133.20 13112.50 7996.01 10978.42
18. 创建一个透视表,显示每个区域、每个性别在各个类别上的平均消费
python
# 18. 创建一个透视表,显示每个区域、每个性别在各个类别上的平均消费
region_gender_category_pivot = pd.pivot_table(
merged_data,
values='amount', # 透视表的值为交易金额
index=['region', 'gender'], # 透视表的行索引为区域和性别
columns='category', # 透视表的列索引为产品类别
aggfunc='mean', # 聚合函数为求平均值
fill_value=0 # 缺失值填充为0
)
region_gender_category_pivot.to_csv('results/region_gender_category_pivot.csv') # 将结果保存为CSV文件
print(region_gender_category_pivot)
category Books Clothing Electronics Food Home
region gender
East Female 140.329458 270.118182 234.613919 159.747362 185.777671
Male 151.623590 286.628238 200.062331 152.545305 191.844758
North Female 139.379052 257.678474 229.761571 151.541577 200.558000
Male 140.980886 297.022600 214.836884 171.363604 187.933465
South Female 143.690119 257.582944 232.432237 152.286984 199.536932
Male 148.668227 273.655426 220.863434 150.884807 203.678037
West Female 132.917729 272.563935 229.317895 149.335281 191.643158
Male 144.845751 300.700676 219.072189 161.880280 205.052832
19. 计算每个客户的购买频率(每年购买次数)
python
# 19. 计算每个客户的购买频率(每年购买次数)
purchase_frequency = transactions.groupby('customer_id').agg(
num_transactions=('transaction_id', 'count'), # 按客户ID分组,计算每个客户的交易次数
first_purchase_date=('transaction_date', 'min'), # 按客户ID分组,计算每个客户的首次购买日期
last_purchase_date=('transaction_date', 'max') # 按客户ID分组,计算每个客户的最近购买日期
).reset_index()
purchase_frequency['customer_lifetime'] = (purchase_frequency['last_purchase_date'] - purchase_frequency[
'first_purchase_date']).dt.days / 365 # 计算客户生命周期(年)
purchase_frequency['frequency_per_year'] = purchase_frequency['num_transactions'] / (
purchase_frequency['customer_lifetime'] + 0.001) # 计算购买频率(每年购买次数),避免除零错误
purchase_frequency.to_csv('results/purchase_frequency.csv', index=False) # 将结果保存为CSV文件,不保存索引
print(purchase_frequency)
customer_id num_transactions first_purchase_date \
0 1 10 2024-08-06 16:32:05.579393
1 2 9 2024-11-02 16:32:05.579393
2 3 11 2024-07-11 16:32:05.579393
3 4 14 2024-07-15 16:32:05.579393
4 5 13 2024-08-16 16:32:05.579393
.. ... ... ...
995 996 10 2024-07-11 16:32:05.579393
996 997 11 2024-08-19 16:32:05.579393
997 998 9 2024-08-26 16:32:05.579393
998 999 10 2024-08-11 16:32:05.579393
999 1000 8 2024-08-01 16:32:05.579393
last_purchase_date customer_lifetime frequency_per_year
0 2025-03-07 16:32:05.579393 0.583562 17.106836
1 2025-07-04 16:32:05.579393 0.668493 13.443005
2 2025-05-11 16:32:05.579393 0.832877 13.191398
3 2025-07-03 16:32:05.579393 0.967123 14.460968
4 2025-07-08 16:32:05.579393 0.893151 14.538936
.. ... ... ...
995 2025-06-24 16:32:05.579393 0.953425 10.477516
996 2025-07-06 16:32:05.579393 0.879452 12.493582
997 2025-07-05 16:32:05.579393 0.857534 10.482983
998 2025-06-23 16:32:05.579393 0.865753 11.537307
999 2025-04-09 16:32:05.579393 0.687671 11.616574
[1000 rows x 6 columns]
20. 分析会员时长与消费金额之间的关系
python
# 20. 分析会员时长与消费金额之间的关系
customers['membership_days'] = (datetime.now() - customers['membership_date']).dt.days # 计算会员时长(天)
customer_membership = pd.merge(customers, customer_spending, on='customer_id', how='left') # 将客户数据和消费数据按客户ID合并
membership_spending_correlation = customer_membership[['membership_days', 'total_spending']].corr() # 计算会员时长和消费金额之间的相关性
customer_membership.to_csv('results/customer_membership.csv', index=False) # 将合并后的数据保存为CSV文件,不保存索引
membership_spending_correlation.to_csv('results/membership_spending_correlation.csv') # 将相关性结果保存为CSV文件
print(membership_spending_correlation)
membership_days total_spending
membership_days 1.000000 -0.008689
total_spending -0.008689 1.000000
python
print("数据分析完成,结果已保存到results目录下的CSV文件中。")
数据分析完成,结果已保存到results目录下的CSV文件中。