MATLAB Data Analysis Learning Guide

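Data analysis in MATLAB, from fundamentals to advanced applications

About this guide

Welcome to the MATLAB Data Analysis Learning Guide! This guide starts with the basic concepts of data analysis and works up to advanced applications, using worked examples to show how MATLAB handles common data science tasks.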

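Learning path

1. Data import and exploration: import data in various formats and take a first look at it

2. Data cleaning and preprocessing: handle missing values, outliers, and data transformations

3. Data visualization: build publication-quality charts in MATLAB

4. Statistical analysis: apply statistical methods to extract useful information from data

Why use MATLAB for data analysis?

MATLAB combines strong numerical computing with a broad set of data analysis toolboxes, which makes it a good fit for engineering and scientific data. Its matrix-oriented syntax and built-in plotting keep the analysis workflow short.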

Quick example: importing and exporting data

% Import a CSV file
data = readtable('data.csv');

% Import an Excel file (this overwrites the table read above)
data = readtable('data.xlsx', 'Sheet', 'Sheet1');

% Export the table to CSV
writetable(data, 'processed_data.csv');

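Data analysis basics

Before starting an analysis, we need the basic MATLAB operations for working with data.

Data import and export

MATLAB can import many data formats, including CSV, Excel, and plain text files.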

% Import a CSV file
data = readtable('data.csv');

% Import an Excel file (this overwrites the table read above)
data = readtable('data.xlsx', 'Sheet', 'Sheet1');

% Export the table to CSV
writetable(data, 'processed_data.csv');
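
The import and export calls above need existing files. As a self-contained sketch, the round trip can be tried on a small table built in memory; the table contents and the temporary file name below are made up for illustration.

```matlab
% Build a small table in memory (hypothetical sample values)
sample = table([1; 2; 3], [10.5; 20.25; 30], ...
    'VariableNames', {'id', 'value'});

% Write it to a temporary CSV file, then read it back
csv_path = [tempname, '.csv'];
writetable(sample, csv_path);
loaded = readtable(csv_path);

% Confirm that variable names and values survived the round trip
names_match = isequal(loaded.Properties.VariableNames, ...
                      sample.Properties.VariableNames);
values_match = isequal(loaded.value, sample.value);
fprintf('Names match: %d, values match: %d\n', names_match, values_match);

% Remove the temporary file
delete(csv_path);
```

Checking a round trip like this on a few rows is a cheap way to spot delimiter or header problems before importing a large file.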

Data exploration

The first step in any analysis is to understand the data's basic characteristics and summary statistics.

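% Print a summary of every variable in the table
summary(data)

% Compute basic statistics on the numeric columns only
% (data{:,:} raises an error when the table mixes numeric and text columns)
numeric_data = table2array(data(:, vartype('numeric')));
mean_values = mean(numeric_data);
std_values = std(numeric_data);
min_values = min(numeric_data);
max_values = max(numeric_data);

% Show the first few rows
head(data)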
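
Tables often mix text and numeric columns, so column statistics usually need a numeric subset first. The sketch below uses a small hypothetical table (the variable names `label`, `x`, and `y` are illustrative) to show one way to do that with `vartype`.

```matlab
% Hypothetical table mixing a text column with two numeric columns
T = table({'a'; 'b'; 'c'; 'd'}, [1; 2; 3; 4], [2; 4; 6; 8], ...
    'VariableNames', {'label', 'x', 'y'});

% Keep only the numeric variables, then convert them to a plain matrix
numeric_part = table2array(T(:, vartype('numeric')));

% Column-wise statistics for x and y
col_means = mean(numeric_part, 1);
col_stds  = std(numeric_part, 0, 1);   % sample standard deviation (n-1)

fprintf('Means: %.2f, %.2f\n', col_means);
fprintf('Standard deviations: %.2f, %.2f\n', col_stds);
```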

Example: exploring the iris dataset

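% Load the iris dataset (ships with Statistics and Machine Learning Toolbox)
load fisheriris

% Check the data dimensions
fprintf('Data dimensions: %d x %d\n', size(meas));

% Compute per-feature statistics
feature_means = mean(meas);
feature_stds = std(meas);

% Print the results
fprintf('Feature means: %.2f, %.2f, %.2f, %.2f\n', feature_means)
fprintf('Feature standard deviations: %.2f, %.2f, %.2f, %.2f\n', feature_stds)

Expected output:

Data dimensions: 150 x 4
Feature means: 5.84, 3.06, 3.76, 1.20
Feature standard deviations: 0.83, 0.44, 1.77, 0.76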

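Data preprocessing

Preprocessing is a critical step in data analysis: missing or extreme values left in the data can distort every statistic computed afterwards.

Handling missing values

MATLAB offers several ways to handle missing values.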

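% Create sample data containing missing values
data = [1, 2, NaN, 4, 5, NaN, 7, 8, 9, 10];

% Locate the missing values
missing_indices = isnan(data);

% Fill missing values with the mean of the observed values
data_filled = data;
data_filled(missing_indices) = mean(data(~missing_indices));

% Fill missing values by linear interpolation
data_interp = fillmissing(data, 'linear');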
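
To make the difference between strategies concrete, the sketch below applies three of them to the same vector as above. The variable names are illustrative; `ismissing`, `rmmissing`, and `fillmissing` are standard MATLAB functions.

```matlab
v = [1, 2, NaN, 4, 5, NaN, 7, 8, 9, 10];

% Count the missing entries
n_missing = sum(ismissing(v));

% Strategy 1: drop the missing entries entirely
v_dropped = rmmissing(v);

% Strategy 2: carry the previous observation forward
v_prev = fillmissing(v, 'previous');

% Strategy 3: interpolate linearly between neighbors
v_lin = fillmissing(v, 'linear');

fprintf('Missing entries: %d\n', n_missing);
disp(v_dropped)
disp(v_prev)
disp(v_lin)
```

Dropping values shortens the series, forward fill repeats the last observation (2 and 5 here), and linear interpolation restores 3 and 6 because the underlying sequence is evenly spaced.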

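Data standardization

Rescaling puts features on a comparable scale, so that later steps are not dominated by whichever feature has the largest units.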

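% Z-score standardization
% (zscore returns all NaN if the input contains NaN, so use the
%  interpolated vector data_interp from the previous step)
data_zscore = zscore(data_interp);

% Min-max scaling to the range [0, 1]
data_minmax = (data_interp - min(data_interp)) / (max(data_interp) - min(data_interp));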
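
As a check on what the two rescalings do, the sketch below computes them by hand on a small made-up vector and verifies the resulting properties. It assumes the sample standard deviation (division by n-1), which is what `std` and `zscore` use by default.

```matlab
x = [2, 4, 4, 4, 5, 5, 7, 9];   % hypothetical sample, mean 5

% Z-score: subtract the mean, divide by the sample standard deviation
z = (x - mean(x)) / std(x);

% Min-max: map the smallest value to 0 and the largest to 1
mm = (x - min(x)) / (max(x) - min(x));

fprintf('z-score: mean %.4f, std %.4f\n', mean(z), std(z));
fprintf('min-max: min %.4f, max %.4f\n', min(mm), max(mm));
```

After z-scoring, the data has mean 0 and standard deviation 1; after min-max scaling, it spans exactly [0, 1]. Min-max scaling is more sensitive to outliers because a single extreme value sets the range.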

Example: preprocessing stock price data

% Simulate daily stock prices (with missing values and outliers)
dates = datetime(2020,1,1):caldays(1):datetime(2020,12,31);
prices = 100 + cumsum(randn(length(dates),1)*2);

% Insert missing values at random positions
missing_indices = randperm(length(prices), 10);
prices(missing_indices) = NaN;

% Insert outliers at random positions
outlier_indices = randperm(length(prices), 5);
prices(outlier_indices) = prices(outlier_indices) * 2;

% Handle missing values with forward fill; a NaN in the first position has
% no previous value, so back-fill anything that is still missing
prices_filled = fillmissing(prices, 'previous');
prices_filled = fillmissing(prices_filled, 'next');

% Detect and handle outliers with the 3-sigma rule
mean_price = mean(prices_filled);
std_price = std(prices_filled);
outliers = abs(prices_filled - mean_price) > 3*std_price;
prices_clean = prices_filled;
prices_clean(outliers) = mean_price;

% Compute daily returns
returns = diff(prices_clean) ./ prices_clean(1:end-1);

fprintf('Statistics after preprocessing:\n');
fprintf('Mean: %.2f\n', mean(prices_clean));
fprintf('Standard deviation: %.2f\n', std(prices_clean));
fprintf('Missing values: %d\n', sum(isnan(prices_clean)));
fprintf('Outliers detected: %d\n', sum(outliers));
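
The global 3-sigma rule can miss spikes in a drifting series, because the outliers themselves inflate the mean and standard deviation. One alternative, sketched below on a separate simulated series, is a moving-median test with `isoutlier` and `filloutliers`. The window length of 11 samples and the spike positions are arbitrary choices for illustration, not values from the example above.

```matlab
rng(0);                                   % reproducible simulation
series = 100 + cumsum(randn(200, 1));     % hypothetical drifting series

% Plant three obvious spikes at known positions
spike_idx = [40; 120; 180];
series(spike_idx) = series(spike_idx) * 2;

% Flag points that deviate strongly from the median of an 11-sample window
is_spike = isoutlier(series, 'movmedian', 11);

% Replace flagged points by linear interpolation between their neighbors
cleaned = filloutliers(series, 'linear', 'movmedian', 11);

fprintf('Flagged %d points; planted spikes found: %d of %d\n', ...
    sum(is_spike), sum(is_spike(spike_idx)), numel(spike_idx));
```

A local test like this may also flag a few ordinary points, so the flagged count can exceed the number of planted spikes.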

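Data visualization

Visualization is a key tool for understanding data and communicating what it shows.

Basic charts

MATLAB provides a wide range of plotting functions for different chart types.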

% Create sample data
x = 1:10;
y1 = x.^2;
y2 = 2*x + 5;

% Draw a line chart with two series
figure;
plot(x, y1, 'b-o', x, y2, 'r--s', 'LineWidth', 2);
legend('y = x^2', 'y = 2x+5');
xlabel('X axis');
ylabel('Y axis');
title('Function comparison');
grid on;
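
Line charts show how one quantity varies with another; for the distribution of a single variable, a histogram is the usual choice. A minimal sketch follows, using simulated data and an arbitrary choice of 20 bins.

```matlab
rng(1);                                  % reproducible simulation
samples = 50 + 10*randn(500, 1);         % hypothetical measurements

% Compute the bin counts explicitly so they can be inspected
[counts, edges] = histcounts(samples, 20);

% Draw the histogram using the same bin edges
figure;
histogram(samples, edges);
xlabel('Value');
ylabel('Count');
title('Distribution of samples');
grid on;

fprintf('Bins: %d, total count: %d\n', numel(counts), sum(counts));
```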

Advanced visualization

More complex charts help when the data has several dimensions.

% Scatter plot matrix, grouped by species
load fisheriris
figure;
gplotmatrix(meas, [], species);

% 3D scatter plot, colored by species index
figure;
scatter3(meas(:,1), meas(:,2), meas(:,3), 50, grp2idx(species), 'filled');
xlabel('Sepal length');
ylabel('Sepal width');
zlabel('Petal length');
title('Iris data in 3D');
colorbar;
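
Another compact view of multidimensional data is a correlation heatmap. The sketch below builds one for the same iris measurements with `corrcoef` and `imagesc`; the axis labels are the conventional feature names for this dataset.

```matlab
load fisheriris

% Pairwise correlation coefficients between the four features
R = corrcoef(meas);

feature_names = {'Sepal L', 'Sepal W', 'Petal L', 'Petal W'};

% Show the correlation matrix as a color-coded grid
figure;
imagesc(R);
colorbar;
set(gca, 'XTick', 1:4, 'XTickLabel', feature_names, ...
         'YTick', 1:4, 'YTickLabel', feature_names);
title('Feature correlation matrix');
```

The diagonal is always 1, and the matrix is symmetric, so only the upper or lower triangle carries information.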

Example: building a multi-panel data dashboard

% Create sample sales data
months = {'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', ...
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'};
sales = [120, 135, 148, 165, 180, 210, 240, 235, 220, 200, 185, 160];
expenses = [80, 85, 90, 95, 100, 110, 120, 115, 105, 100, 95, 90];
profit = sales - expenses;

% Create the dashboard figure
figure('Position', [100, 100, 1200, 600]);

% Panel 1: sales and expense trends
subplot(2,3,1);
plot(1:12, sales, 'b-o', 'LineWidth', 2);
hold on;
plot(1:12, expenses, 'r-s', 'LineWidth', 2);
xlabel('Month');
ylabel('Amount (thousand CNY)');
title('Sales and expense trends');
legend('Sales', 'Expenses', 'Location', 'northwest');
set(gca, 'XTick', 1:12, 'XTickLabel', months);
grid on;

% Panel 2: profit bar chart
subplot(2,3,2);
bar(1:12, profit, 'FaceColor', [0.2, 0.6, 0.8]);
xlabel('Month');
ylabel('Profit (thousand CNY)');
title('Monthly profit');
set(gca, 'XTick', 1:12, 'XTickLabel', months);
grid on;

% Panel 3: pie chart of each month's share of sales
subplot(2,3,3);
pie(sales, months);
title('Share of sales by month');

% Panel 4: cumulative profit
subplot(2,3,4);
cumulative_profit = cumsum(profit);
area(1:12, cumulative_profit, 'FaceColor', [0.4, 0.8, 0.4], 'FaceAlpha', 0.7);
xlabel('Month');
ylabel('Cumulative profit (thousand CNY)');
title('Cumulative profit trend');
set(gca, 'XTick', 1:12, 'XTickLabel', months);
grid on;

% Panel 5: scatter plot of sales against expenses, colored by month
subplot(2,3,5);
scatter(sales, expenses, 100, 1:12, 'filled');
xlabel('Sales');
ylabel('Expenses');
title('Sales vs. expenses');
colorbar;
grid on;

% Panel 6: box plot
subplot(2,3,6);
boxplot([sales', expenses'], 'Labels', {'Sales', 'Expenses'});
ylabel('Amount (thousand CNY)');
title('Data distribution');

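Statistical analysis

Statistical methods extract meaningful information and patterns from data.

Descriptive statistics

Descriptive statistics summarize the center, spread, and shape of a distribution.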

% Generate sample data
data = randn(1000,1) * 10 + 50; % normal distribution with mean 50 and standard deviation 10

% Compute descriptive statistics
data_mean = mean(data);
data_median = median(data);
data_std = std(data);
data_skewness = skewness(data);
data_kurtosis = kurtosis(data); % equals 3 for a normal distribution (not excess kurtosis)

% Print the results
fprintf('Mean: %.2f\n', data_mean);
fprintf('Median: %.2f\n', data_median);
fprintf('Standard deviation: %.2f\n', data_std);
fprintf('Skewness: %.2f\n', data_skewness);
fprintf('Kurtosis: %.2f\n', data_kurtosis);
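
Skewness and kurtosis are easier to interpret once the formulas are visible. The sketch below computes both from central moments on a small made-up sample, using the biased (divide-by-n) moments that `skewness` and `kurtosis` use by default.

```matlab
x = [2; 4; 4; 4; 5; 5; 7; 9];    % hypothetical sample, mean 5

% Central moments (deviations from the mean, averaged over n)
d  = x - mean(x);
m2 = mean(d.^2);                 % second central moment (biased variance)
m3 = mean(d.^3);                 % third central moment
m4 = mean(d.^4);                 % fourth central moment

% Standardized moments
skew_manual = m3 / m2^1.5;       % > 0 means a longer right tail
kurt_manual = m4 / m2^2;         % 3 for a normal distribution

fprintf('Skewness: %.5f\n', skew_manual);   % prints 0.65625
fprintf('Kurtosis: %.5f\n', kurt_manual);   % prints 2.78125
```

With Statistics and Machine Learning Toolbox installed, `skewness(x)` and `kurtosis(x)` should agree with these values under their default settings.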

Hypothesis testing

Statistical tests check whether the data supports a hypothesis.

% Generate two sample groups
group1 = randn(50,1) * 5 + 10;
group2 = randn(50,1) * 5 + 12;

% Two-sample t-test: are the two group means equal?
[h, p, ci, stats] = ttest2(group1, group2);

fprintf('t-test results:\n');
fprintf('Test decision h: %d (1 means the null hypothesis is rejected)\n', h);
fprintf('p-value: %.4f\n', p);
fprintf('Confidence interval: [%.2f, %.2f]\n', ci(1), ci(2));
fprintf('t statistic: %.2f\n', stats.tstat);
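
To see what the two-sample t-test computes, the sketch below works out the pooled-variance t statistic by hand on two small made-up groups. It assumes equal variances in both groups, which is also the default assumption of `ttest2`.

```matlab
a = [10; 12; 9; 11; 13];     % hypothetical group A, mean 11
b = [14; 15; 13; 16; 12];    % hypothetical group B, mean 14

n1 = numel(a);
n2 = numel(b);

% Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1)*var(a) + (n2 - 1)*var(b)) / (n1 + n2 - 2);

% Standard error of the difference in means
se = sqrt(sp2 * (1/n1 + 1/n2));

% t statistic and degrees of freedom
t_manual = (mean(a) - mean(b)) / se;
df = n1 + n2 - 2;

fprintf('t = %.4f with %d degrees of freedom\n', t_manual, df);
```

Here both sample variances are 2.5, so the standard error is exactly 1 and t is -3 with 8 degrees of freedom. With the toolbox available, `[~, ~, ~, stats] = ttest2(a, b)` should report the same `stats.tstat`, and the two-sided p-value can be obtained as `2*tcdf(-abs(t_manual), df)`.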

Example: regression analysis

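% Generate sample data: relationship between house area and price
% (the variable is named house_area to avoid shadowing the built-in area function)
rng(42); % fix the random seed so the results are reproducible
house_area = 50 + 200*rand(100,1); % house area (50-250 square meters)
noise = randn(100,1) * 20;
price = 50 + 2*house_area + noise; % house price (10,000 CNY)

% Fit a linear regression
X = [ones(size(house_area)), house_area]; % add the intercept column
[b, bint, r, rint, stats] = regress(price, X);

% Print the regression results
fprintf('Regression equation: price = %.2f + %.2f * area\n', b(1), b(2));
fprintf('R-squared: %.4f\n', stats(1));
fprintf('F statistic: %.2f\n', stats(2));
fprintf('p-value: %.4g\n', stats(3));

% Plot the fit
figure;
scatter(house_area, price, 50, 'filled', 'MarkerFaceAlpha', 0.6);
hold on;
x_fit = linspace(min(house_area), max(house_area), 100);
y_fit = b(1) + b(2)*x_fit;
plot(x_fit, y_fit, 'r-', 'LineWidth', 2);
xlabel('House area (square meters)');
ylabel('House price (10,000 CNY)');
title('House area vs. price');
legend('Observed data', 'Regression line', 'Location', 'northwest');
grid on;

% Predict the price of a new house
new_area = 150; % area of the new house
predicted_price = b(1) + b(2)*new_area;
fprintf('\nPrediction: a %.0f square meter house costs about %.2f (10,000 CNY)\n', new_area, predicted_price);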
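
The same least-squares fit can be computed without any toolbox by solving the linear system with the backslash operator. The sketch below does this on five made-up points and derives R-squared from the residuals; the data values are illustrative only.

```matlab
x = [1; 2; 3; 4; 5];
y = [5.1; 6.9; 9.2; 10.8; 13.0];    % hypothetical, roughly y = 3 + 2x

% Design matrix with an intercept column
A = [ones(size(x)), x];

% Least-squares solution of A*coef = y
coef = A \ y;

% Goodness of fit: share of variance explained by the line
residuals = y - A*coef;
ss_res = sum(residuals.^2);
ss_tot = sum((y - mean(y)).^2);
r_squared = 1 - ss_res/ss_tot;

fprintf('Intercept: %.2f, slope: %.2f\n', coef(1), coef(2));   % 3.09 and 1.97
fprintf('R-squared: %.4f\n', r_squared);
```

For a single predictor this matches the textbook formulas: the slope is the covariance of x and y divided by the variance of x, and the line passes through the point of means.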

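Complete case study: customer segmentation

This case study uses cluster analysis to segment customers.

Problem statement

A retail company wants to segment its customers by purchasing behavior so that it can design more targeted marketing strategies.

Data preparation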

% Generate simulated customer data
rng(123); % fix the random seed so the results are reproducible
n_customers = 300;

% Generate three distinct customer groups
% Group 1: low frequency, low spending
group1_freq = 2 + randn(n_customers/3,1)*0.5;
group1_spend = 50 + randn(n_customers/3,1)*10;

% Group 2: medium frequency, medium spending
group2_freq = 5 + randn(n_customers/3,1)*1;
group2_spend = 150 + randn(n_customers/3,1)*30;

% Group 3: high frequency, high spending
group3_freq = 10 + randn(n_customers/3,1)*2;
group3_spend = 400 + randn(n_customers/3,1)*50;

% Combine the groups
customer_freq = [group1_freq; group2_freq; group3_freq];
customer_spend = [group1_spend; group2_spend; group3_spend];

% Build the data matrix (column 1: frequency, column 2: spending)
customer_data = [customer_freq, customer_spend];

% Add some random noise to make the data more realistic
customer_data = customer_data + randn(size(customer_data)) .* [0.2, 5];

Data exploration

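% Plot the raw data
% (use customer_data, which includes the added noise and is what gets clustered)
figure;
scatter(customer_data(:,1), customer_data(:,2), 50, 'filled', 'MarkerFaceAlpha', 0.6);
xlabel('Purchase frequency (purchases/month)');
ylabel('Average spending (CNY)');
title('Distribution of customer purchasing behavior');
grid on;

% Compute descriptive statistics
fprintf('Purchase frequency:\n');
fprintf('  Mean: %.2f\n', mean(customer_data(:,1)));
fprintf('  Standard deviation: %.2f\n', std(customer_data(:,1)));
fprintf('  Minimum: %.2f\n', min(customer_data(:,1)));
fprintf('  Maximum: %.2f\n\n', max(customer_data(:,1)));

fprintf('Spending:\n');
fprintf('  Mean: %.2f\n', mean(customer_data(:,2)));
fprintf('  Standard deviation: %.2f\n', std(customer_data(:,2)));
fprintf('  Minimum: %.2f\n', min(customer_data(:,2)));
fprintf('  Maximum: %.2f\n', max(customer_data(:,2)));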

Data standardization

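% Standardize the data (k-means is distance-based, so without this step
% the spending column would dominate the frequency column)
customer_data_scaled = zscore(customer_data);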

Choosing the number of clusters

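% Use the elbow method to choose the number of clusters
max_clusters = 8;
inertia = zeros(max_clusters, 1);

for k = 1:max_clusters
    % Several replicates reduce the chance of a poor local minimum,
    % which would otherwise make the elbow curve noisy
    [~, ~, sumd] = kmeans(customer_data_scaled, k, 'Replicates', 5);
    inertia(k) = sum(sumd);
end

% Plot the elbow curve
figure;
plot(1:max_clusters, inertia, 'bo-', 'LineWidth', 2, 'MarkerSize', 8);
xlabel('Number of clusters');
ylabel('Within-cluster sum of squares');
title('Elbow method: choosing the number of clusters');
grid on;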
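
Reading an elbow off a curve is subjective, so it can help to cross-check with a second criterion. The sketch below scores each candidate k by its mean silhouette value on a separate synthetic dataset with three well-separated blobs; the blob centers, the spread of 0.3, and the range of k are arbitrary choices for illustration.

```matlab
rng(7);   % reproducible simulation

% Three hypothetical, well-separated 2-D blobs of 50 points each
X = [randn(50, 2)*0.3; ...
     randn(50, 2)*0.3 + [3 3]; ...
     randn(50, 2)*0.3 + [0 3]];

k_values = 2:6;
mean_sil = zeros(size(k_values));

for j = 1:numel(k_values)
    idx = kmeans(X, k_values(j), 'Replicates', 5);
    % Mean silhouette: close to 1 when clusters are compact and well separated
    mean_sil(j) = mean(silhouette(X, idx));
end

% Pick the k with the highest mean silhouette
[~, best] = max(mean_sil);
best_k = k_values(best);
fprintf('Best k by mean silhouette: %d\n', best_k);
```

Unlike the within-cluster sum of squares, the silhouette score does not keep improving as k grows, so its maximum can be read off directly. It is undefined for k = 1, which is why the loop starts at 2.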

Running K-means clustering

% Choose 3 clusters based on the elbow plot
k = 3;
[idx, centroids] = kmeans(customer_data_scaled, k, 'Replicates', 5);

% Convert the standardized centroids back to the original scale
centroids_original = centroids .* std(customer_data) + mean(customer_data);
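
The back-transformation of the centroids relies on z-scoring being exactly invertible. The sketch below checks that on a small made-up matrix, using the column means and standard deviations that `zscore` returns as extra outputs.

```matlab
% Hypothetical data: column 1 is a frequency, column 2 an amount
X = [2 50; 5 150; 10 400; 4 120];

% zscore also returns the column means and standard deviations it used
[Z, mu, sigma] = zscore(X);

% Undo the standardization: scale by sigma, then shift by mu
X_back = Z .* sigma + mu;

max_error = max(abs(X_back(:) - X(:)));
fprintf('Largest round-trip error: %.2e\n', max_error);
```

Keeping `mu` and `sigma` from the original call is safer than recomputing them later, because any point (such as a centroid or a new customer) must be transformed with the same values that were used for clustering.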

Visualizing and analyzing the results

% Plot the clustering result
figure;
colors = ['r', 'g', 'b', 'c', 'm', 'y'];
for i = 1:k
    cluster_points = customer_data(idx == i, :);
    scatter(cluster_points(:,1), cluster_points(:,2), 70, colors(i), 'filled', 'MarkerFaceAlpha', 0.7);
    hold on;
end

% Plot the cluster centroids
scatter(centroids_original(:,1), centroids_original(:,2), 200, 'k', 'x', 'LineWidth', 3);
legend('Group 1', 'Group 2', 'Group 3', 'Centroids', 'Location', 'northwest');
xlabel('Purchase frequency (purchases/month)');
ylabel('Average spending (CNY)');
title('Customer segmentation result');
grid on;

% Analyze the characteristics of each group
fprintf('\n--- Customer group analysis ---\n');
for i = 1:k
    cluster_data = customer_data(idx == i, :);
    fprintf('\nGroup %d (%d customers):\n', i, size(cluster_data,1));
    fprintf('  Average purchase frequency: %.2f purchases/month\n', mean(cluster_data(:,1)));
    fprintf('  Average spending: %.2f CNY\n', mean(cluster_data(:,2)));
    
    % Label the group according to its characteristics
    avg_freq = mean(cluster_data(:,1));
    avg_spend = mean(cluster_data(:,2));
    
    if avg_freq < 4 && avg_spend < 100
        fprintf('  Profile: low-value customers\n');
    elseif avg_freq < 7 && avg_spend < 250
        fprintf('  Profile: medium-value customers\n');
    else
        fprintf('  Profile: high-value customers\n');
    end
end
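
The per-group loop above prints one group at a time. When the summary is needed as data, for example to export it, the same numbers can be collected in a table with `accumarray`. The sketch below uses a tiny made-up dataset and cluster assignment; the column names are illustrative.

```matlab
% Hypothetical customers: column 1 frequency, column 2 spending
X   = [1 10; 2 20; 9 90; 11 110];
idx = [1; 1; 2; 2];              % hypothetical cluster assignment

% Group-wise size and means, one row per cluster
group_size = accumarray(idx, 1);
mean_freq  = accumarray(idx, X(:,1), [], @mean);
mean_spend = accumarray(idx, X(:,2), [], @mean);

% Collect the results in a table
cluster_summary = table((1:max(idx))', group_size, mean_freq, mean_spend, ...
    'VariableNames', {'cluster', 'n_customers', 'mean_freq', 'mean_spend'});
disp(cluster_summary)
```

A table like this can be written out with `writetable`, in the same way as any other table in this guide.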