Sales Data Analysis with pandas

The data is stored in the data directory:

Customer data: customers.csv
Product data: products.csv
Transaction data: transactions.csv

Structure of customers.csv:

Field	Description
customer_id	Customer ID
gender	Gender
age	Age
region	Region
membership_date	Membership date

Structure of products.csv:

Field	Description
product_id	Product ID
category	Product category
brand	Brand
price	Price

Structure of transactions.csv:

Field	Description
transaction_id	Transaction ID
customer_id	Customer ID
product_id	Product ID
quantity	Quantity purchased
transaction_date	Transaction date
price	Transaction price
amount	Transaction amount

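Load the CSV data and write code to meet the following requirements:

Calculate each customer's total spending
Calculate each customer's average order amount
Compute total sales and total quantity sold by product category
Count customers by gender
Build age-distribution histogram data
Calculate total sales for each month (time-series analysis)
Find the 10 best-selling products
Find the 10 highest-spending customers
Calculate the average price of products for each brand
Create a crosstab of product category by gender
Create a crosstab of product category by age group
Create a crosstab of region by product category
Create a crosstab of gender by region
Find each customer's first and most recent purchase dates
Calculate customer lifetime value (CLV), assuming a one-year horizon
Create a pivot table of total sales for each region and category
Create a pivot table of total sales for each month and category
Create a pivot table of average spending per category for each region and gender
Calculate each customer's purchase frequency (purchases per year)
Analyze the relationship between membership duration and spending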

Imports

import pandas as pd  # pandas, for data processing and analysis
import numpy as np  # numpy, for numerical computation
from datetime import datetime, timedelta  # datetime and timedelta, for working with dates and times
import os  # os, for file and directory operations

# Create the results directory if it does not exist
os.makedirs('results', exist_ok=True)

Loading the data

# Load the data
customers = pd.read_csv('data/customers.csv')  # Customer data
products = pd.read_csv('data/products.csv')  # Product data
transactions = pd.read_csv('data/transactions.csv')  # Transaction data

# Convert the date columns to datetime
customers['membership_date'] = pd.to_datetime(customers['membership_date'])  # Membership date
transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date'])  # Transaction date

1. Calculate each customer's total spending

# 1. Calculate each customer's total spending
customer_spending = transactions.groupby('customer_id')['amount'].sum().reset_index()  # Group by customer ID and sum the transaction amounts
customer_spending.columns = ['customer_id', 'total_spending']  # Rename the columns
customer_spending.to_csv('results/customer_spending.csv', index=False)  # Save the result as CSV without the index
print(customer_spending)

Output:

     customer_id  total_spending
0              1         1124.02
1              2         1710.33
2              3         1595.10
3              4         3881.27
4              5         3871.16
..           ...             ...
995          996         1880.70
996          997         2683.27
997          998         1631.01
998          999         2473.05
999         1000         2608.60

[1000 rows x 2 columns]
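
The grouping and the column naming can also be done in one step. The following is a minimal sketch using named aggregation on a made-up `toy_transactions` frame (hypothetical data, not the article's CSV files):

```python
import pandas as pd

# Hypothetical toy transactions; only the columns used here are included.
toy_transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 15.5, 7.25, 3.0, 4.0, 5.0],
})

# Named aggregation sets the output column name inside agg(),
# so no separate column rename is needed afterwards.
toy_spending = (
    toy_transactions
    .groupby("customer_id", as_index=False)
    .agg(total_spending=("amount", "sum"))
)
print(toy_spending)
```

The result has the same two columns as `customer_spending`, so either form can feed the later steps.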

2. Calculate each customer's average order amount

# 2. Calculate each customer's average order amount
avg_order_amount = transactions.groupby('customer_id').agg(
  total_spending=('amount', 'sum'),  # Total spending per customer
  num_transactions=('transaction_id', 'count')  # Number of transactions per customer
).reset_index()
avg_order_amount['avg_order_amount'] = (
  avg_order_amount['total_spending'] / avg_order_amount['num_transactions']
)  # Average order amount = total spending / number of transactions
avg_order_amount.to_csv('results/avg_order_amount.csv', index=False)  # Save the result as CSV without the index
print(avg_order_amount)

Output:

     customer_id  total_spending  num_transactions  avg_order_amount
0              1         1124.02                10        112.402000
1              2         1710.33                 9        190.036667
2              3         1595.10                11        145.009091
3              4         3881.27                14        277.233571
4              5         3871.16                13        297.781538
..           ...             ...               ...               ...
995          996         1880.70                10        188.070000
996          997         2683.27                11        243.933636
997          998         1631.01                 9        181.223333
998          999         2473.05                10        247.305000
999         1000         2608.60                 8        326.075000

[1000 rows x 4 columns]
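
Dividing the sum by the count gives the same number as asking for the mean directly, as long as neither column has missing values. A small sketch on hypothetical toy data that computes both and compares them:

```python
import pandas as pd

# Hypothetical toy transactions with no missing values.
toy_transactions = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104, 105],
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 40.0, 9.0, 12.0, 30.0],
})

toy_avg = toy_transactions.groupby("customer_id").agg(
    total_spending=("amount", "sum"),
    num_transactions=("transaction_id", "count"),
    avg_order_amount=("amount", "mean"),  # mean computed directly by pandas
).reset_index()

# Cross-check: sum / count should match the direct mean.
manual_avg = toy_avg["total_spending"] / toy_avg["num_transactions"]
print(toy_avg)
print((manual_avg - toy_avg["avg_order_amount"]).abs().max())
```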

3. Compute total sales and total quantity sold by product category

# 3. Compute total sales and total quantity sold by product category
category_sales = pd.merge(transactions, products, on='product_id', how='left')  # Join transactions to products on product ID
category_sales = category_sales.groupby('category').agg(
  total_sales=('amount', 'sum'),  # Total sales per category
  total_quantity=('quantity', 'sum')  # Total quantity sold per category
).reset_index()
category_sales.to_csv('results/category_sales.csv', index=False)  # Save the result as CSV without the index
print(category_sales)

Output:

      category  total_sales  total_quantity
0        Books    279650.98            5903
1     Clothing    453744.32            4887
2  Electronics    552730.76            7566
3         Food    306883.36            5883
4         Home    382538.62            5863

4. Count customers by gender

# 4. Count customers by gender
gender_distribution = customers['gender'].value_counts().reset_index()  # Count the customers of each gender
gender_distribution.columns = ['gender', 'count']  # Rename the columns
gender_distribution.to_csv('results/gender_distribution.csv', index=False)  # Save the result as CSV without the index

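5. Build age-distribution histogram data

# 5. Build age-distribution histogram data
age_bins = [18, 25, 35, 45, 55, 65, 80]  # Bin edges for the age groups
age_labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']  # Labels for the age groups
# Bins are right-closed, so the first bin is (18, 25]; include_lowest=True keeps age 18 in it.
# The counts printed below sum to 953 of 1000 customers, which suggests they come from a run
# without include_lowest, where customers aged exactly 18 were left ungrouped.
customers['age_group'] = pd.cut(customers['age'], bins=age_bins, labels=age_labels, include_lowest=True)  # Assign each customer to an age group
age_distribution = customers['age_group'].value_counts().sort_index().reset_index()  # Count the customers in each age group
age_distribution.columns = ['age_group', 'count']  # Rename the columns
age_distribution.to_csv('results/age_distribution.csv', index=False)  # Save the result as CSV without the index
print(age_distribution)

Output: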

  age_group  count
0     18-25    132
1     26-35    349
2     36-45    325
3     46-55    126
4     56-65     21
5       66+      0
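
How `pd.cut` treats the lowest bin edge is easy to get wrong: bins are right-closed by default, so the first interval is (18, 25] and an age of exactly 18 is only binned when `include_lowest=True`. A small sketch on a hypothetical `toy_ages` series, reusing the same edges and labels, shows the difference:

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges.
toy_ages = pd.Series([18, 19, 25, 26, 65, 66])
age_bins = [18, 25, 35, 45, 55, 65, 80]
age_labels = ["18-25", "26-35", "36-45", "46-55", "56-65", "66+"]

# Default: right-closed bins, lowest edge excluded, so 18 becomes NaN.
default_groups = pd.cut(toy_ages, bins=age_bins, labels=age_labels)

# include_lowest=True closes the first bin on the left, so 18 lands in "18-25".
inclusive_groups = pd.cut(
    toy_ages, bins=age_bins, labels=age_labels, include_lowest=True
)

print(pd.DataFrame({
    "age": toy_ages,
    "default": default_groups,
    "include_lowest": inclusive_groups,
}))
```

Comparing the sum of the group counts with the number of customers is a quick way to spot rows that fell outside every bin.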

6. Calculate total sales for each month (time-series analysis)

# 6. Calculate total sales for each month (time-series analysis)
transactions['month'] = transactions['transaction_date'].dt.to_period('M')  # Extract the month of each transaction as a Period
monthly_sales = transactions.groupby('month')['amount'].sum().reset_index()  # Group by month and sum the transaction amounts
monthly_sales['month'] = monthly_sales['month'].astype(str)  # Convert the month to a string
monthly_sales.to_csv('results/monthly_sales.csv', index=False)  # Save the result as CSV without the index
print(monthly_sales)

Output:

      month     amount
0   2024-07  120410.93
1   2024-08  163879.27
2   2024-09  147267.93
3   2024-10  168343.63
4   2024-11  159979.17
5   2024-12  180396.79
6   2025-01  161902.65
7   2025-02  147212.28
8   2025-03  170734.04
9   2025-04  168111.73
10  2025-05  177653.05
11  2025-06  163866.35
12  2025-07   45790.22
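
Grouping by a `Period` column only produces rows for months that contain at least one transaction. If a continuous monthly series is wanted, one option is `pd.Grouper` with a month-start frequency, which also emits empty months. The sketch below uses hypothetical toy dates with a deliberate gap in September:

```python
import pandas as pd

# Hypothetical toy data with no transactions in September 2024.
toy_transactions = pd.DataFrame({
    "transaction_date": pd.to_datetime(
        ["2024-07-20", "2024-07-31", "2024-08-01", "2024-10-15"]
    ),
    "amount": [10.0, 5.0, 7.0, 3.0],
})

# Period-based grouping: only months that actually occur appear.
by_period = (
    toy_transactions
    .groupby(toy_transactions["transaction_date"].dt.to_period("M"))["amount"]
    .sum()
)

# Grouper with freq="MS" (month start): every month in the range appears,
# and a month without transactions gets a sum of 0.
toy_monthly = (
    toy_transactions
    .groupby(pd.Grouper(key="transaction_date", freq="MS"))["amount"]
    .sum()
)

print(by_period)
print(toy_monthly)
```

The first and last months in the table above are much smaller than the rest, which may simply mean the data covers only part of those months; that is worth checking before reading them as a trend.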

7. Find the 10 best-selling products

# 7. Find the 10 best-selling products
top_products = transactions.groupby('product_id').agg(
  total_quantity=('quantity', 'sum'),  # Total quantity sold per product
  total_sales=('amount', 'sum')  # Total sales per product
).reset_index()
top_products = pd.merge(top_products, products[['product_id', 'category', 'brand']], on='product_id', how='left')  # Attach category and brand to the aggregated result
top_products = top_products.sort_values('total_sales', ascending=False).head(10)  # Sort by total sales in descending order and keep the top 10
top_products.to_csv('results/top_products.csv', index=False)  # Save the result as CSV without the index
print(top_products)

Output:

    product_id  total_quantity  total_sales     category   brand
38          39             627    132033.66     Clothing  BrandB
36          37             623    103498.99  Electronics  BrandE
15          16             605     92038.65     Clothing  BrandE
26          27             588     87329.76         Home  BrandE
40          41             645     82269.75  Electronics  BrandA
48          49             518     69924.82         Food  BrandD
23          24             659     69419.06  Electronics  BrandA
4            5             684     54473.76  Electronics  BrandD
25          26             606     54412.74  Electronics  BrandE
10          11             654     53850.36  Electronics  BrandD
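
"Best-selling" here means highest revenue. Ranking by units sold can give a different list: in the table above, product 5 has the largest quantity of the ten (684) yet sits eighth by sales. A short sketch with `nlargest` on hypothetical toy totals shows both rankings side by side:

```python
import pandas as pd

# Hypothetical per-product totals (not taken from the article's data).
toy_product_totals = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "total_quantity": [684, 627, 300, 650],
    "total_sales": [54000.0, 132000.0, 90000.0, 20000.0],
})

# Top 2 by revenue and top 2 by units sold.
top_by_revenue = toy_product_totals.nlargest(2, "total_sales")
top_by_quantity = toy_product_totals.nlargest(2, "total_quantity")

print(top_by_revenue["product_id"].tolist())
print(top_by_quantity["product_id"].tolist())
```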

8. Find the 10 highest-spending customers

# 8. Find the 10 highest-spending customers
top_customers = customer_spending.sort_values('total_spending', ascending=False).head(10)  # Sort by total spending in descending order and keep the top 10
top_customers = pd.merge(top_customers, customers[['customer_id', 'gender', 'age', 'region']], on='customer_id', how='left')  # Attach customer attributes to the result
top_customers.to_csv('results/top_customers.csv', index=False)  # Save the result as CSV without the index
print(top_customers)

Output:

   customer_id  total_spending  gender  age region
0          903         4568.08  Female   23   East
1          763         4421.28    Male   50   West
2          708         4409.81  Female   35  North
3           18         4405.35  Female   35   West
4          841         4353.57    Male   35   East
5          421         4266.48  Female   37   West
6          694         4037.89  Female   43   East
7          870         3987.35  Female   22  North
8          791         3925.87  Female   18   East
9          741         3888.52    Male   38   West

9. Calculate the average price of products for each brand

# 9. Calculate the average price of products for each brand
brand_prices = products.groupby('brand')['price'].agg(['mean', 'min', 'max', 'std']).reset_index()  # Group by brand and compute the mean, minimum, maximum, and standard deviation of price
brand_prices.columns = ['brand', 'avg_price', 'min_price', 'max_price', 'price_std']  # Rename the columns
brand_prices.to_csv('results/brand_prices.csv', index=False)  # Save the result as CSV without the index
print(brand_prices)

Output:

    brand  avg_price  min_price  max_price  price_std
0  BrandA  62.249000      24.06     127.55  35.000554
1  BrandB  75.945000      17.22     210.58  70.256401
2  BrandC  48.707143      24.97      71.86  17.941207
3  BrandD  59.621875      14.78     134.99  28.791657
4  BrandE  80.880909      22.71     166.13  51.561905

10. Create a crosstab of product category by gender

# 10. Create a crosstab of product category by gender
merged_data = pd.merge(transactions, products, on='product_id', how='left')  # Join transactions to products on product ID
merged_data = pd.merge(merged_data, customers, on='customer_id', how='left')  # Join the result to customers on customer ID
category_gender_crosstab = pd.crosstab(merged_data['category'], merged_data['gender'])  # Count transactions by category and gender
category_gender_crosstab.to_csv('results/category_gender_crosstab.csv')  # Save the result as CSV
print(category_gender_crosstab)

Output:

gender       Female  Male
category                 
Books           998   964
Clothing        835   803
Electronics    1273  1207
Food           1050   918
Home            976   976
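
Raw counts are hard to compare when the rows have different totals. `pd.crosstab` accepts a `normalize` argument that turns counts into proportions; the sketch below applies it per row on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy transactions, one row per purchase.
toy = pd.DataFrame({
    "category": ["Books", "Books", "Books", "Food", "Food",
                 "Home", "Home", "Home", "Home"],
    "gender": ["Female", "Male", "Female", "Male", "Male",
               "Female", "Female", "Male", "Male"],
})

# Plain counts, as in the article.
counts = pd.crosstab(toy["category"], toy["gender"])

# normalize="index" divides each row by its total, so every row sums to 1.
shares = pd.crosstab(toy["category"], toy["gender"], normalize="index")

print(counts)
print(shares.round(3))
```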

11. Create a crosstab of product category by age group

# 11. Create a crosstab of product category by age group
category_age_crosstab = pd.crosstab(merged_data['category'], merged_data['age_group'])  # Count transactions by category and age group
category_age_crosstab.to_csv('results/category_age_crosstab.csv')  # Save the result as CSV
print(category_age_crosstab)

Output:

age_group    18-25  26-35  36-45  46-55  56-65
category                                      
Books          267    670    623    273     44
Clothing       231    581    499    213     38
Electronics    337    884    796    308     50
Food           269    662    664    254     32
Home           251    709    616    232     39
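
In the table above the empty 66+ group does not appear as a column. When downstream code expects a fixed set of columns, one way to restore missing groups is to reindex the crosstab against the full label list. A sketch on hypothetical toy data:

```python
import pandas as pd

age_labels = ["18-25", "26-35", "36-45", "46-55", "56-65", "66+"]

# Hypothetical toy data in which only two age groups occur.
toy = pd.DataFrame({
    "category": ["Books", "Food", "Food"],
    "age_group": pd.Categorical(
        ["18-25", "26-35", "26-35"], categories=age_labels
    ),
})

table = pd.crosstab(toy["category"], toy["age_group"])

# Use plain string column labels, then add any missing age groups as zero columns.
table.columns = table.columns.astype(str)
full_table = table.reindex(columns=age_labels, fill_value=0)

print(full_table)
```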

12. Create a crosstab of region by product category

# 12. Create a crosstab of region by product category
region_category_crosstab = pd.crosstab(merged_data['region'], merged_data['category'])  # Count transactions by region and category
region_category_crosstab.to_csv('results/region_category_crosstab.csv')  # Save the result as CSV
print(region_category_crosstab)

Output:

category  Books  Clothing  Electronics  Food  Home
region                                            
East        474       391          592   467   467
North       448       390          588   463   468
South       472       419          601   485   465
West        568       438          699   553   552

13. Create a crosstab of gender by region

# 13. Create a crosstab of gender by region
gender_region_crosstab = pd.crosstab(customers['gender'], customers['region'])  # Count customers by gender and region
gender_region_crosstab.to_csv('results/gender_region_crosstab.csv')  # Save the result as CSV
print(gender_region_crosstab)

Output:

region  East  North  South  West
gender                          
Female   117    119    133   141
Male     123    121    110   136

14. Find each customer's first and most recent purchase dates

# 14. Find each customer's first and most recent purchase dates
customer_dates = transactions.groupby('customer_id').agg(
  first_purchase_date=('transaction_date', 'min'),  # Earliest transaction date per customer
  last_purchase_date=('transaction_date', 'max')  # Latest transaction date per customer
).reset_index()
customer_dates.to_csv('results/customer_dates.csv', index=False)  # Save the result as CSV without the index
print(customer_dates)

Output:

     customer_id        first_purchase_date         last_purchase_date
0              1 2024-08-06 16:32:05.579393 2025-03-07 16:32:05.579393
1              2 2024-11-02 16:32:05.579393 2025-07-04 16:32:05.579393
2              3 2024-07-11 16:32:05.579393 2025-05-11 16:32:05.579393
3              4 2024-07-15 16:32:05.579393 2025-07-03 16:32:05.579393
4              5 2024-08-16 16:32:05.579393 2025-07-08 16:32:05.579393
..           ...                        ...                        ...
995          996 2024-07-11 16:32:05.579393 2025-06-24 16:32:05.579393
996          997 2024-08-19 16:32:05.579393 2025-07-06 16:32:05.579393
997          998 2024-08-26 16:32:05.579393 2025-07-05 16:32:05.579393
998          999 2024-08-11 16:32:05.579393 2025-06-23 16:32:05.579393
999         1000 2024-08-01 16:32:05.579393 2025-04-09 16:32:05.579393

[1000 rows x 3 columns]

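15. Calculate customer lifetime value (CLV), assuming a one-year horizon

# 15. Calculate customer lifetime value (CLV), assuming a one-year horizon
clv_data = pd.merge(customer_spending, customer_dates, on='customer_id', how='left')  # Join spending totals to purchase dates on customer ID
clv_data['customer_lifetime'] = (
  clv_data['last_purchase_date'] - clv_data['first_purchase_date']
).dt.days / 365  # Customer lifetime in years (first to last purchase)
# CLV here is spending per year of lifetime. The 0.001 offset avoids division by zero,
# but it inflates the value for a customer whose purchases all fall on the same day.
clv_data['clv'] = clv_data['total_spending'] / (clv_data['customer_lifetime'] + 0.001)
clv_data.to_csv('results/customer_clv.csv', index=False)  # Save the result as CSV without the index
print(clv_data)

Output: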

     customer_id  total_spending        first_purchase_date  \
0              1         1124.02 2024-08-06 16:32:05.579393   
1              2         1710.33 2024-11-02 16:32:05.579393   
2              3         1595.10 2024-07-11 16:32:05.579393   
3              4         3881.27 2024-07-15 16:32:05.579393   
4              5         3871.16 2024-08-16 16:32:05.579393   
..           ...             ...                        ...   
995          996         1880.70 2024-07-11 16:32:05.579393   
996          997         2683.27 2024-08-19 16:32:05.579393   
997          998         1631.01 2024-08-26 16:32:05.579393   
998          999         2473.05 2024-08-11 16:32:05.579393   
999         1000         2608.60 2024-08-01 16:32:05.579393   

            last_purchase_date  customer_lifetime          clv  
0   2025-03-07 16:32:05.579393           0.583562  1922.842547  
1   2025-07-04 16:32:05.579393           0.668493  2554.663925  
2   2025-05-11 16:32:05.579393           0.832877  1912.872702  
3   2025-07-03 16:32:05.579393           0.967123  4009.065838  
4   2025-07-08 16:32:05.579393           0.893151  4329.426869  
..                         ...                ...          ...  
995 2025-06-24 16:32:05.579393           0.953425  1970.506509  
996 2025-07-06 16:32:05.579393           0.879452  3047.604904  
997 2025-07-05 16:32:05.579393           0.857534  1899.761141  
998 2025-06-23 16:32:05.579393           0.865753  2853.233607  
999 2025-04-09 16:32:05.579393           0.687671  3787.874207  

[1000 rows x 6 columns]
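
The small offset in the denominator prevents division by zero, but a customer with a zero-day lifetime ends up with spending multiplied by roughly 1000. One possible alternative, sketched below on hypothetical toy data, is to put a floor under the lifetime instead; the 30-day floor (`MIN_LIFETIME_YEARS`) is an arbitrary assumption for illustration, not a value from the article:

```python
import pandas as pd

# Hypothetical customers: one active for a full year, one with a single purchase day.
toy = pd.DataFrame({
    "customer_id": [1, 2],
    "total_spending": [1200.0, 100.0],
    "first_purchase_date": pd.to_datetime(["2024-07-01", "2025-01-10"]),
    "last_purchase_date": pd.to_datetime(["2025-07-01", "2025-01-10"]),
})
toy["customer_lifetime"] = (
    toy["last_purchase_date"] - toy["first_purchase_date"]
).dt.days / 365

# Offset approach, as in the article's code.
toy["clv_offset"] = toy["total_spending"] / (toy["customer_lifetime"] + 0.001)

# Assumed alternative: treat any lifetime shorter than 30 days as 30 days.
MIN_LIFETIME_YEARS = 30 / 365
toy["clv_floored"] = toy["total_spending"] / toy["customer_lifetime"].clip(
    lower=MIN_LIFETIME_YEARS
)

print(toy[["customer_id", "customer_lifetime", "clv_offset", "clv_floored"]])
```

For the single-day customer the offset version reports about 100,000 while the floored version reports about 1,217, which gives a sense of how much the choice matters for short-lived customers.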

16. Create a pivot table of total sales for each region and category

# 16. Create a pivot table of total sales for each region and category
region_category_pivot = pd.pivot_table(
  merged_data,
  values='amount',  # Values: transaction amount
  index='region',  # Rows: region
  columns='category',  # Columns: product category
  aggfunc='sum',  # Aggregate by summing
  fill_value=0  # Fill missing combinations with 0
)
region_category_pivot.to_csv('results/region_category_pivot.csv')  # Save the result as CSV
print(region_category_pivot)

Output:

category     Books   Clothing  Electronics      Food       Home
region                                                         
East      69158.99  108802.65    128664.17  73067.98   88262.81
North     62821.45  108363.43    130980.59  74564.24   90982.75
South     68916.92  110948.88    136255.84  73532.48   93670.87
West      78753.62  125629.36    156830.16  85718.66  109622.19
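
Row and column totals can be added to a pivot table with the `margins` option of `pd.pivot_table`. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy transactions already joined to region and category.
toy = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "category": ["Books", "Food", "Books", "Books", "Food"],
    "amount": [10.0, 20.0, 5.0, 7.0, 8.0],
})

toy_pivot = pd.pivot_table(
    toy,
    values="amount",
    index="region",
    columns="category",
    aggfunc="sum",
    fill_value=0,
    margins=True,          # append a totals row and a totals column
    margins_name="Total",  # label used for both
)
print(toy_pivot)
```

The bottom-right cell is the grand total, which should equal the sum of all transaction amounts.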

17. Create a pivot table of total sales for each month and category

# 17. Create a pivot table of total sales for each month and category
month_category_pivot = pd.pivot_table(
  merged_data,
  values='amount',  # Values: transaction amount
  index='month',  # Rows: month
  columns='category',  # Columns: product category
  aggfunc='sum',  # Aggregate by summing
  fill_value=0  # Fill missing combinations with 0
)
month_category_pivot.index = month_category_pivot.index.astype(str)  # Convert the month index to strings
month_category_pivot.to_csv('results/month_category_pivot.csv')  # Save the result as CSV
print(month_category_pivot)

Output:

category     Books  Clothing  Electronics      Food      Home
month                                                        
2024-07   19049.16  28345.99     32982.40  15546.22  24487.16
2024-08   22482.08  39866.22     44932.19  25444.10  31154.68
2024-09   23453.31  33326.84     42242.67  20403.07  27842.04
2024-10   24327.48  36415.63     44971.30  26200.94  36428.28
2024-11   20299.31  34942.33     46316.72  24822.69  33598.12
2024-12   24097.06  47046.58     50235.43  26123.97  32893.75
2025-01   26473.88  34748.91     42122.97  27490.10  31066.79
2025-02   20271.88  31493.03     40852.26  26059.20  28535.91
2025-03   20973.53  37191.20     48798.25  29035.07  34735.99
2025-04   22417.78  41892.14     45848.84  24966.29  32986.68
2025-05   24946.51  43113.17     58419.63  23434.67  27739.07
2025-06   25288.91  37229.08     41895.60  29361.03  30091.73
2025-07    5570.09   8133.20     13112.50   7996.01  10978.42

18. Create a pivot table of average spending per category for each region and gender

# 18. Create a pivot table of average spending per category for each region and gender
region_gender_category_pivot = pd.pivot_table(
  merged_data,
  values='amount',  # Values: transaction amount
  index=['region', 'gender'],  # Rows: region and gender
  columns='category',  # Columns: product category
  aggfunc='mean',  # Aggregate by averaging
  fill_value=0  # Fill missing combinations with 0
)
region_gender_category_pivot.to_csv('results/region_gender_category_pivot.csv')  # Save the result as CSV
print(region_gender_category_pivot)

Output:

category            Books    Clothing  Electronics        Food        Home
region gender                                                             
East   Female  140.329458  270.118182   234.613919  159.747362  185.777671
       Male    151.623590  286.628238   200.062331  152.545305  191.844758
North  Female  139.379052  257.678474   229.761571  151.541577  200.558000
       Male    140.980886  297.022600   214.836884  171.363604  187.933465
South  Female  143.690119  257.582944   232.432237  152.286984  199.536932
       Male    148.668227  273.655426   220.863434  150.884807  203.678037
West   Female  132.917729  272.563935   229.317895  149.335281  191.643158
       Male    144.845751  300.700676   219.072189  161.880280  205.052832

19. Calculate each customer's purchase frequency (purchases per year)

# 19. Calculate each customer's purchase frequency (purchases per year)
purchase_frequency = transactions.groupby('customer_id').agg(
  num_transactions=('transaction_id', 'count'),  # Number of transactions per customer
  first_purchase_date=('transaction_date', 'min'),  # Earliest transaction date per customer
  last_purchase_date=('transaction_date', 'max')  # Latest transaction date per customer
).reset_index()
purchase_frequency['customer_lifetime'] = (
  purchase_frequency['last_purchase_date'] - purchase_frequency['first_purchase_date']
).dt.days / 365  # Customer lifetime in years (first to last purchase)
purchase_frequency['frequency_per_year'] = purchase_frequency['num_transactions'] / (
  purchase_frequency['customer_lifetime'] + 0.001
)  # Purchases per year; the 0.001 offset avoids division by zero
purchase_frequency.to_csv('results/purchase_frequency.csv', index=False)  # Save the result as CSV without the index
print(purchase_frequency)

Output:

     customer_id  num_transactions        first_purchase_date  \
0              1                10 2024-08-06 16:32:05.579393   
1              2                 9 2024-11-02 16:32:05.579393   
2              3                11 2024-07-11 16:32:05.579393   
3              4                14 2024-07-15 16:32:05.579393   
4              5                13 2024-08-16 16:32:05.579393   
..           ...               ...                        ...   
995          996                10 2024-07-11 16:32:05.579393   
996          997                11 2024-08-19 16:32:05.579393   
997          998                 9 2024-08-26 16:32:05.579393   
998          999                10 2024-08-11 16:32:05.579393   
999         1000                 8 2024-08-01 16:32:05.579393   

            last_purchase_date  customer_lifetime  frequency_per_year  
0   2025-03-07 16:32:05.579393           0.583562           17.106836  
1   2025-07-04 16:32:05.579393           0.668493           13.443005  
2   2025-05-11 16:32:05.579393           0.832877           13.191398  
3   2025-07-03 16:32:05.579393           0.967123           14.460968  
4   2025-07-08 16:32:05.579393           0.893151           14.538936  
..                         ...                ...                 ...  
995 2025-06-24 16:32:05.579393           0.953425           10.477516  
996 2025-07-06 16:32:05.579393           0.879452           12.493582  
997 2025-07-05 16:32:05.579393           0.857534           10.482983  
998 2025-06-23 16:32:05.579393           0.865753           11.537307  
999 2025-04-09 16:32:05.579393           0.687671           11.616574  

[1000 rows x 6 columns]

20. Analyze the relationship between membership duration and spending

# 20. Analyze the relationship between membership duration and spending
customers['membership_days'] = (datetime.now() - customers['membership_date']).dt.days  # Membership duration in days, measured from the time the script runs
customer_membership = pd.merge(customers, customer_spending, on='customer_id', how='left')  # Join customers to spending totals on customer ID
membership_spending_correlation = customer_membership[['membership_days', 'total_spending']].corr()  # Pearson correlation between membership duration and spending
customer_membership.to_csv('results/customer_membership.csv', index=False)  # Save the merged data as CSV without the index
membership_spending_correlation.to_csv('results/membership_spending_correlation.csv')  # Save the correlation matrix as CSV
print(membership_spending_correlation)

Output:

                 membership_days  total_spending
membership_days         1.000000       -0.008689
total_spending         -0.008689        1.000000
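
A coefficient this close to zero indicates essentially no linear relationship between membership duration and spending in this data. Two small refinements are sketched below on hypothetical toy data: measuring duration from a fixed reference date so that `membership_days` is reproducible between runs, and also computing a rank-based (Spearman) correlation, which picks up monotonic but non-linear relationships. `REFERENCE_DATE` and the toy values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical customers with spending already attached.
toy_customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "membership_date": pd.to_datetime(
        ["2021-01-01", "2022-06-15", "2023-03-01", "2024-12-31"]
    ),
    "total_spending": [900.0, 1500.0, 1200.0, 400.0],
})

# Assumed fixed reference date, used in place of the current time.
REFERENCE_DATE = pd.Timestamp("2025-07-10")
toy_customers["membership_days"] = (
    REFERENCE_DATE - toy_customers["membership_date"]
).dt.days

cols = toy_customers[["membership_days", "total_spending"]]
pearson = cols.corr().loc["membership_days", "total_spending"]
spearman = cols.corr(method="spearman").loc["membership_days", "total_spending"]

# Moving the reference date shifts every duration by the same constant,
# so the correlation itself does not change.
later_days = (pd.Timestamp("2026-01-01") - toy_customers["membership_date"]).dt.days
pearson_later = later_days.corr(toy_customers["total_spending"])

print(round(pearson, 4), round(spearman, 4), round(pearson_later, 4))
```

The reference date therefore affects only the saved `membership_days` values, not the correlation.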

print("Analysis complete. The results have been saved as CSV files in the results directory.")

Output:

Analysis complete. The results have been saved as CSV files in the results directory.