How the K-Means Clustering Algorithm Works
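K-Means is a distance-based unsupervised learning algorithm that iteratively partitions data into K clusters. The core steps are:
- Initialize the centers: randomly pick K samples as the initial cluster centers (centroids).
- Assign samples: compute the distance from each sample to every centroid (usually Euclidean distance) and assign the sample to the nearest cluster.
- Update centroids: recompute the mean of each cluster and use it as the new centroid.
- Iterate: repeat the assignment and update steps until the centroids no longer change significantly or the maximum number of iterations is reached.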
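The four steps above can be sketched in plain NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans_sketch`, its parameters (`max_iter`, `tol`, `seed`), and the choice to keep an empty cluster's old centroid are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Illustrative K-Means loop; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids (plain random init).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every sample to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: each centroid becomes the mean of its members.
        # Assumption: an empty cluster simply keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids have (almost) stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # labels come from the last assignment step that was executed.
    return labels, centroids

# Two well-separated toy blobs (hypothetical data for the demo).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
```

Because the initialization is random, a different `seed` may swap the cluster indices, but on data this well separated the grouping itself should not change.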
Mathematical formulas:
Euclidean distance:
d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}
Centroid update (centroid c_j of the j-th cluster S_j):
c_j = \frac{1}{|S_j|}\sum_{x \in S_j} x
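As a quick check of the two formulas, the following sketch evaluates them on hand-picked toy values. The points and the cluster `S_j` are hypothetical and chosen only so that the arithmetic is easy to verify by hand.

```python
import numpy as np

# Hypothetical toy values.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean distance: sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5
d = np.sqrt(np.sum((x - y) ** 2))

# Centroid of a cluster S_j: the coordinate-wise mean of its members.
S_j = np.array([[1.0, 1.0], [2.0, 3.0], [3.0, 2.0]])
c_j = S_j.sum(axis=0) / len(S_j)

print(d)    # 5.0
print(c_j)  # [2. 2.]
```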
Hands-On Steps (Python Example)
Data preparation
Generate synthetic data with sklearn.datasets:
```python
from sklearn.datasets import make_blobs

# 300 two-dimensional samples drawn around 4 centers; y holds the true blob index.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
```
Model training
Call sklearn.cluster.KMeans:
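```python
from sklearn.cluster import KMeans

# Fit K-Means with K=4; random_state fixes the initialization for reproducibility.
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_              # cluster index assigned to each sample
centroids = kmeans.cluster_centers_  # array of shape (4, 2)
```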
Visualizing the result
Plot the clustering result with matplotlib:
```python
import matplotlib.pyplot as plt

# Samples colored by assigned cluster.
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
# Centroids drawn as large red X markers.
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title("K-Means Clustering")
plt.show()
```
Key Tuning and Optimization
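- Choosing K:
  - Elbow method: plot the SSE (sum of squared errors) against K and look for the point where the curve flattens.
  - Silhouette score: ranges from -1 to 1; values closer to 1 indicate better-separated clusters.
- Better initialization:
  - Use init='k-means++' (the default in sklearn) to reduce the risk of converging to a poor local optimum after an unlucky random start.
- Distance metric:
  - For high-dimensional data, consider cosine similarity or Mahalanobis distance. Note that sklearn's KMeans itself uses Euclidean distance only.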
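The two K-selection heuristics can be sketched as a loop over candidate values of K. The candidate range 2 to 8, the variable names, and the explicit `n_init=10` are assumptions for this illustration; the data is regenerated so the block runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_k, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

sse = {}         # K -> sum of squared distances to the nearest centroid
silhouette = {}  # K -> mean silhouette coefficient over all samples
for k in range(2, 9):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X_k)
    sse[k] = model.inertia_  # scikit-learn exposes the SSE as inertia_
    silhouette[k] = silhouette_score(X_k, model.labels_)

# Candidate K with the highest silhouette score.
best_k = max(silhouette, key=silhouette.get)

for k in sse:
    print(k, round(sse[k], 1), round(silhouette[k], 3))
```

The SSE keeps shrinking as K grows, so it is the bend in the curve that matters rather than the minimum; the silhouette score, by contrast, can be compared directly across values of K.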
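Use Cases
- Customer segmentation (for example, with the RFM model)
- Image compression (color quantization)
- Anomaly detection (samples far from every cluster center)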
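As one illustration of the anomaly-detection use case, the sketch below fits K-Means on clean data and flags new points whose distance to the nearest centroid is unusually large. The 99th-percentile threshold and the test point at (30, 30) are assumptions chosen for the demo, not a recommended rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_train, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_train)

# transform() returns the distance from each sample to every centroid;
# the row-wise minimum is the distance to the nearest centroid.
train_dist = model.transform(X_train).min(axis=1)
threshold = np.percentile(train_dist, 99)  # assumed cut-off

# Two new points: one sitting exactly on a centroid, one far from all clusters.
X_new = np.array([model.cluster_centers_[0], [30.0, 30.0]])
new_dist = model.transform(X_new).min(axis=1)
is_anomaly = new_dist > threshold

print(is_anomaly)  # [False  True]
```

The model is fitted on clean data only. If the outlier were part of the training set, it could attract a centroid of its own and end up at distance zero from it.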
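Limitations
- K must be specified in advance, and the algorithm performs poorly on non-spherical clusters.
- It is sensitive to noise and outliers; K-Medoids or DBSCAN are alternatives worth trying.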
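The non-spherical limitation can be seen on the two-moons toy dataset. The sketch below compares K-Means with DBSCAN using the adjusted Rand index against the known labels; the DBSCAN settings `eps=0.2` and `min_samples=5` are assumed values picked for this particular dataset and would need tuning elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are clearly not spherical.
X_m, y_m = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_m)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_m)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_m, km_labels)
ari_dbscan = adjusted_rand_score(y_m, db_labels)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

K-Means separates the points with a straight boundary between its two centroids and so cuts across both moons, whereas a density-based method can follow each curved shape.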