基于Python实现亚马逊销售数据分析与预测

介绍

亚马逊销售数据集生动地展现了现代商业的一个横截面，揭示了消费者行为和订单履行的趋势。在本笔记本中，我们将探索数据的各个方面------从清理和可视化到开发一个预测交易总额的模型。

python 复制代码

# Import necessary libraries and set configurations
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib
matplotlib.use('Agg')  # Ensure we use the Agg backend
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style="whitegrid")

# Print a quick version check (uncomment if needed)
# print('Pandas version:', pd.__version__)
# print('NumPy version:', np.__version__)

python 复制代码

# Load the Amazon sales dataset
df = pd.read_csv('/Amazon.csv', encoding='ascii', delimiter=',')

# Display the first few rows to verify the data loaded correctly
print('Dataset Shape:', df.shape)
df.head()

复制代码

Dataset Shape: (100000, 20)

|---|------------|------------|------------|---------------|-----------|---------------------|-----------------|------------|----------|-----------|----------|-------|--------------|-------------|------------------|-------------|-------------|-------|---------------|-----------|
| | OrderID | OrderDate | CustomerID | CustomerName | ProductID | ProductName | Category | Brand | Quantity | UnitPrice | Discount | Tax | ShippingCost | TotalAmount | PaymentMethod | OrderStatus | City | State | Country | SellerID |
| 0 | ORD0000001 | 2023-01-31 | CUST001504 | Vihaan Sharma | P00014 | Drone Mini | Books | BrightLux | 3 | 106.59 | 0.00 | 0.00 | 0.09 | 319.86 | Debit Card | Delivered | Washington | DC | India | SELL01967 |
| 1 | ORD0000002 | 2023-12-30 | CUST000178 | Pooja Kumar | P00040 | Microphone | Home & Kitchen | UrbanStyle | 1 | 251.37 | 0.05 | 19.10 | 1.74 | 259.64 | Amazon Pay | Delivered | Fort Worth | TX | United States | SELL01298 |
| 2 | ORD0000003 | 2022-05-10 | CUST047516 | Sneha Singh | P00044 | Power Bank 20000mAh | Clothing | UrbanStyle | 3 | 35.03 | 0.10 | 7.57 | 5.91 | 108.06 | Debit Card | Delivered | Austin | TX | United States | SELL00908 |
| 3 | ORD0000004 | 2023-07-18 | CUST030059 | Vihaan Reddy | P00041 | Webcam Full HD | Home & Kitchen | Zenith | 5 | 33.58 | 0.15 | 11.42 | 5.53 | 159.66 | Cash on Delivery | Delivered | Charlotte | NC | India | SELL01164 |
| 4 | ORD0000005 | 2023-02-04 | CUST048677 | Aditya Kapoor | P00029 | T-Shirt | Clothing | KiddoFun | 2 | 515.64 | 0.25 | 38.67 | 9.23 | 821.36 | Credit Card | Cancelled | San Antonio | TX | Canada | SELL01411 |

数据加载与预处理

在本节中，我们将对数据进行清理，并执行预处理步骤。请注意，"OrderDate"列原本是字符串类型，将被转换为日期时间对象，但在格式不符合预期的情况下，这可能会成为产生错误的原因。如果您遇到日期转换问题，请确保在 pd.to_datetime 函数中的格式参数被正确指定。

python 复制代码

# Convert OrderDate to datetime format
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')

# Check for missing values in the dataset
missing_values = df.isnull().sum()
print('Missing values in each column:')
print(missing_values)

# (Optional) Drop rows or fill missing values if any - here we simply show the counts
# df.dropna(inplace=True)  # Uncomment if you wish to drop missing values entirely

# Display info to verify data types
df.info()

复制代码

Missing values in each column:
OrderID          0
OrderDate        0
CustomerID       0
CustomerName     0
ProductID        0
ProductName      0
Category         0
Brand            0
Quantity         0
UnitPrice        0
Discount         0
Tax              0
ShippingCost     0
TotalAmount      0
PaymentMethod    0
OrderStatus      0
City             0
State            0
Country          0
SellerID         0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   OrderID        100000 non-null  object        
 1   OrderDate      100000 non-null  datetime64[ns]
 2   CustomerID     100000 non-null  object        
 3   CustomerName   100000 non-null  object        
 4   ProductID      100000 non-null  object        
 5   ProductName    100000 non-null  object        
 6   Category       100000 non-null  object        
 7   Brand          100000 non-null  object        
 8   Quantity       100000 non-null  int64         
 9   UnitPrice      100000 non-null  float64       
 10  Discount       100000 non-null  float64       
 11  Tax            100000 non-null  float64       
 12  ShippingCost   100000 non-null  float64       
 13  TotalAmount    100000 non-null  float64       
 14  PaymentMethod  100000 non-null  object        
 15  OrderStatus    100000 non-null  object        
 16  City           100000 non-null  object        
 17  State          100000 non-null  object        
 18  Country        100000 non-null  object        
 19  SellerID       100000 non-null  object        
dtypes: datetime64[ns](1), float64(5), int64(1), object(13)
memory usage: 15.3+ MB

探索性数据分析

在这里，我们对数据集进行了多种可视化处理，包括相关性热图、配对图以及分布图表等。这有助于在开发任何预测模型之前，增强我们对数据的直观理解。

我们采用了多种可视化方法，包括热图、直方图、关联图、箱形图等等，以全面呈现情况。

python 复制代码

# Select numeric columns for correlation analysis
numeric_df = df.select_dtypes(include=[np.number])

# Only plot if we have at least four numeric columns
if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(10, 8))
    corr_matrix = numeric_df.corr()
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()
else:
    print('Not enough numeric columns for a correlation heatmap.')

python 复制代码

# Create a pair plot for selected numerical features
# Using a subset of features that are believed to impact the TotalAmount
pairplot_cols = ['Quantity', 'UnitPrice', 'Discount', 'Tax', 'ShippingCost', 'TotalAmount']
if set(pairplot_cols).issubset(df.columns):
    sns.pairplot(df[pairplot_cols].dropna())
    plt.suptitle('Pair Plot for Selected Numerical Features', y=1.02)
    plt.show()
else:
    print('Some of the required columns for pair plot are missing.')

python 复制代码

# Plot a histogram for the TotalAmount to get an idea of its distribution
plt.figure(figsize=(8, 5))
sns.histplot(df['TotalAmount'].dropna(), bins=30, kde=True, color='skyblue')
plt.title('Distribution of TotalAmount')
plt.xlabel('TotalAmount')
plt.ylabel('Frequency')
plt.show()

python 复制代码

# Plot a count plot for PaymentMethod to see the frequency of each method
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='PaymentMethod', palette='viridis')
plt.title('Count Plot of Payment Methods')
plt.xlabel('Payment Method')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

python 复制代码

# Box plot to understand the distribution of UnitPrice
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, y='UnitPrice', color='lightgreen')
plt.title('Box Plot of UnitPrice')
plt.ylabel('UnitPrice')
plt.show()

python 复制代码

# Boxen plot to visualize the Discount across different Categories
plt.figure(figsize=(10, 6))
sns.boxenplot(data=df, x='Category', y='Discount', palette='Pastel1')
plt.title('Discount Distribution by Category')
plt.xlabel('Category')
plt.ylabel('Discount')
plt.xticks(rotation=45)
plt.show()

python 复制代码

# Violin plot to compare ShippingCost distribution among Payment Methods
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='PaymentMethod', y='ShippingCost', palette='muted')
plt.title('ShippingCost Distribution by PaymentMethod')
plt.xlabel('PaymentMethod')
plt.ylabel('ShippingCost')
plt.xticks(rotation=45)
plt.show()

python 复制代码

# Strip plot to show TotalAmount variation across different OrderStatus values
plt.figure(figsize=(10, 6))
sns.stripplot(data=df, x='OrderStatus', y='TotalAmount', jitter=True, palette='Spectral')
plt.title('TotalAmount by OrderStatus')
plt.xlabel('OrderStatus')
plt.ylabel('TotalAmount')
plt.xticks(rotation=45)
plt.show()

预测建模

在本节中，我们试图预测订单的总金额。我们选择了那些为数值型且可能对总销售额产生影响的特征：数量、单价、折扣、税费以及运费。

由于"总金额"是一个连续变量，所以我们采用回归模型。在回归分析中，准确性最好通过 R² 分数来衡量。如果遇到与特征编码或缺失数据相关的错误，请务必进行适当的预处理。

python 复制代码

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Define features and target
feature_cols = ['Quantity', 'UnitPrice', 'Discount', 'Tax', 'ShippingCost']
target_col = 'TotalAmount'

# Drop rows with missing values in the features/target columns
model_df = df[feature_cols + [target_col]].dropna()

# Split the data into training and testing sets
X = model_df[feature_cols]
y = model_df[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate R^2 score for prediction accuracy
r2 = r2_score(y_test, y_pred)
print('R² score on the test set:', r2)

# A side note: In regression tasks, we measure prediction quality with metrics like R². Unlike classification, accuracy score is not applicable.

复制代码

R² score on the test set: 0.9095060550803165

总结与未来工作

这篇内容对亚马逊销售数据集进行了全面的分析，从数据清洗和预处理开始，接着运用多种可视化技术进行了深入的探索性数据分析，最后还开发了一个能够预测总金额的回归模型。

多种可视化图表（热图、配对图、直方图、箱线图和细线图）揭示了数据中的关键模式，而预测模型也达到了一定的 R² 值，从而为订单总额的可预测性提供了初步的见解。

未来分析的方向可能包括：

对其他功能进行了试验，包括从日期字段中提取的定制功能（例如，月份或星期几）。

尝试使用更高级的建模技术，如集成方法或梯度提升，以提高预测的准确性。

进行更详细的错误分析，以了解并解决异常值或数据质量问题。