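K-means clustering of the Iris dataset using its first two features

Through iterative optimization, the 150 samples are partitioned into K clusters. The program below tries every K from 2 to 6.

Objective function: minimize J, the sum of squared distances from every sample to the center of the cluster it is assigned to.
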
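Written out, the objective can be stated as follows. The symbols $x_i$ (sample), $C_k$ (set of samples in cluster $k$) and $\mu_k$ (center of cluster $k$) are notation assumed here for illustration; only $J$ and $K$ appear in the original text.

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$

With only two features per sample, each term reduces to $\lVert x_i - \mu_k \rVert^2 = (x_{i1} - \mu_{k1})^2 + (x_{i2} - \mu_{k2})^2$, which is the quantity the `J_calculate` function in the program accumulates.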
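Algorithm steps:

1. Randomly initialize K cluster centers.
2. Assign each sample to its nearest center.
3. Recompute each cluster center as the mean of the samples assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change (convergence).
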
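The four steps can be sketched compactly with NumPy. This is a minimal illustration rather than the program used for the results in this report: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for the example. Unlike a fixed iteration count, it implements step 4 literally by stopping as soon as no sample changes cluster.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-means loop that stops once assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for n_iter in range(1, max_iter + 1):
        # Step 2: assign each sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)
        # Step 4: stop when no sample changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its samples.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    # Objective: sum of squared distances to the assigned centers.
    J = float(np.sum((X - centers[labels]) ** 2))
    return labels, centers, J, n_iter


# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers, J, n_iter = kmeans_sketch(X, k=2)
print(labels, J, n_iter)
```

On this toy data the loop separates the two groups within a few iterations, and J equals the within-group scatter of the two triangles.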
Program code:

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn import datasets

data = datasets.load_iris().data
print('Data shape', data.shape)
features = data[:, :2]  # first two features of each sample
print('Features', features)

num_clusters = 6  # largest K to try
epoch = 150       # number of K-means iterations per K
J_sum = []        # final J for each K


def J_calculate(features, divide_re, center):
    """Sum of squared distances from each sample to its assigned center."""
    diff = features - center[divide_re]
    return float(np.sum(diff ** 2))


def decision(features, divide_re, center, epoch):
    """Run `epoch` rounds of assignment and center update; record J per round."""
    J_best = []
    for _ in range(epoch):
        # Assignment step: the nearest center minimizes each sample's term in J.
        distances = np.linalg.norm(features[:, np.newaxis, :] - center, axis=2)
        divide_re = np.argmin(distances, axis=1)
        # Update step: move each center to the mean of its samples.
        for i in range(len(center)):
            members = features[divide_re == i]
            if len(members) > 0:  # an empty cluster keeps its old center (avoids NaN)
                center[i] = members.mean(axis=0)
        J_best.append(J_calculate(features, divide_re, center))
    return features, divide_re, center, J_best


for i in range(2, num_clusters + 1):
    print(f'\n{i} clusters:\n')
    # Pick i distinct samples as initial centers (fancy indexing returns a copy).
    center = features[np.random.choice(features.shape[0], i, replace=False)]
    print('Initial centers', center)
    distances = np.linalg.norm(features[:, np.newaxis, :] - center, axis=2)
    divide_re = np.argmin(distances, axis=1)
    print('Initial sample assignment', divide_re.tolist())
    features, divide_re, center, J_best = decision(features, divide_re, center, epoch)
    print(f'Best J for {i} clusters:', J_best[-1])
    J_sum.append(J_best[-1])
    plt.scatter(features[:, 0], features[:, 1], c=divide_re, cmap='viridis', edgecolors='k')
    plt.scatter(center[:, 0], center[:, 1], marker='x', s=30, linewidths=3, color='red')
    plt.title(f'C-means clustering result, {i} clusters')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

# J as a function of the number of clusters.
plt.figure()
plt.plot(range(2, num_clusters + 1), J_sum, marker='o')
plt.title('J versus number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('J_sum')
plt.show()
```
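The program draws its initial centers at random, so results vary from run to run, and two distinct samples can carry identical coordinates (the Iris measurements contain repeats). A minimal sketch of a seeded initializer that draws from the unique rows only is shown here. The helper name `init_centers` and its `seed` parameter are hypothetical and not part of the program above; the four sample rows are illustrative.

```python
import numpy as np


def init_centers(X, k, seed=None):
    """Pick k initial centers with pairwise distinct coordinates (illustrative helper)."""
    rng = np.random.default_rng(seed)
    unique_rows = np.unique(X, axis=0)  # drop samples with identical coordinates
    if k > len(unique_rows):
        raise ValueError("k exceeds the number of distinct samples")
    idx = rng.choice(len(unique_rows), size=k, replace=False)
    return unique_rows[idx]


# Four samples, two of which coincide.
X = np.array([[5.4, 3.4], [5.4, 3.4], [7.7, 2.8], [5.1, 3.5]])
centers = init_centers(X, k=3, seed=42)
print(centers)
```

Passing the same `seed` reproduces the same centers, which makes individual runs comparable.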
Output:
Data shape (150, 4)
Features [[5.1 3.5]
 [4.9 3. ]
 [4.7 3.2]
 [4.6 3.1]
 [5.  3.6]
 [5.4 3.9]
 [4.6 3.4]
 [5.  3.4]
 [4.4 2.9]
 [4.9 3.1]
 [5.4 3.7]
 [4.8 3.4]
 [4.8 3. ]
 [4.3 3. ]
 [5.8 4. ]
 [5.7 4.4]
 [5.4 3.9]
 [5.1 3.5]
 [5.7 3.8]
 [5.1 3.8]
 [5.4 3.4]
 [5.1 3.7]
 [4.6 3.6]
 [5.1 3.3]
 [4.8 3.4]
 [5.  3. ]
 [5.  3.4]
 [5.2 3.5]
 [5.2 3.4]
 [4.7 3.2]
 [4.8 3.1]
 [5.4 3.4]
 [5.2 4.1]
 [5.5 4.2]
 [4.9 3.1]
 [5.  3.2]
 [5.5 3.5]
 [4.9 3.6]
 [4.4 3. ]
 [5.1 3.4]
 [5.  3.5]
 [4.5 2.3]
 [4.4 3.2]
 [5.  3.5]
 [5.1 3.8]
 [4.8 3. ]
 [5.1 3.8]
 [4.6 3.2]
 [5.3 3.7]
 [5.  3.3]
 [7.  3.2]
 [6.4 3.2]
 [6.9 3.1]
 [5.5 2.3]
 [6.5 2.8]
 [5.7 2.8]
 [6.3 3.3]
 [4.9 2.4]
 [6.6 2.9]
 [5.2 2.7]
 [5.  2. ]
 [5.9 3. ]
 [6.  2.2]
 [6.1 2.9]
 [5.6 2.9]
 [6.7 3.1]
 [5.6 3. ]
 [5.8 2.7]
 [6.2 2.2]
 [5.6 2.5]
 [5.9 3.2]
 [6.1 2.8]
 [6.3 2.5]
 [6.1 2.8]
 [6.4 2.9]
 [6.6 3. ]
 [6.8 2.8]
 [6.7 3. ]
 [6.  2.9]
 [5.7 2.6]
 [5.5 2.4]
 [5.5 2.4]
 [5.8 2.7]
 [6.  2.7]
 [5.4 3. ]
 [6.  3.4]
 [6.7 3.1]
 [6.3 2.3]
 [5.6 3. ]
 [5.5 2.5]
 [5.5 2.6]
 [6.1 3. ]
 [5.8 2.6]
 [5.  2.3]
 [5.6 2.7]
 [5.7 3. ]
 [5.7 2.9]
 [6.2 2.9]
 [5.1 2.5]
 [5.7 2.8]
 [6.3 3.3]
 [5.8 2.7]
 [7.1 3. ]
 [6.3 2.9]
 [6.5 3. ]
 [7.6 3. ]
 [4.9 2.5]
 [7.3 2.9]
 [6.7 2.5]
 [7.2 3.6]
 [6.5 3.2]
 [6.4 2.7]
 [6.8 3. ]
 [5.7 2.5]
 [5.8 2.8]
 [6.4 3.2]
 [6.5 3. ]
 [7.7 3.8]
 [7.7 2.6]
 [6.  2.2]
 [6.9 3.2]
 [5.6 2.8]
 [7.7 2.8]
 [6.3 2.7]
 [6.7 3.3]
 [7.2 3.2]
 [6.2 2.8]
 [6.1 3. ]
 [6.4 2.8]
 [7.2 3. ]
 [7.4 2.8]
 [7.9 3.8]
 [6.4 2.8]
 [6.3 2.8]
 [6.1 2.6]
 [7.7 3. ]
 [6.3 3.4]
 [6.4 3.1]
 [6.  3. ]
 [6.9 3.1]
 [6.7 3.1]
 [6.9 3.1]
 [5.8 2.7]
 [6.8 3.2]
 [6.7 3.3]
 [6.7 3. ]
 [6.3 2.5]
 [6.5 3. ]
 [6.2 3.4]
 [5.9 3. ]]
2 clusters:
Initial centers [[6.4 3.1]
 [7.2 3.6]]
Initial sample assignment [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Best J for 2 clusters: 58.20409278906674
3 clusters:
Initial centers [[5.4 3.4]
 [5.4 3.4]
 [7.7 2.8]]
Initial sample assignment [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 2, 2, 2, 0, 0, 2, 0, 0, 0, 0, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 0, 0, 0]
Best J for 3 clusters: 58.20409278906674
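The J reported for 3 clusters is identical to the 2-cluster value. The initial centers printed for this run suggest why: the first two are both (5.4, 3.4), so every sample nearest to that point goes to index 0 and cluster 1 starts empty (the initial assignment contains only 0 and 2). If the update step then averages an empty cluster, its center becomes NaN and can never be selected again, which leaves two effective clusters. The sketch below uses NumPy only to reproduce that mechanism. The three centers are copied from the output, the four points are a hand-picked subset of the feature listing, and the explanation itself is an inference from the printed output, not something the run logs.

```python
import warnings

import numpy as np

# Initial centers copied from the 3-cluster output; rows 0 and 1 coincide.
centers = np.array([[5.4, 3.4], [5.4, 3.4], [7.7, 2.8]])
# Four samples taken from the feature listing (illustrative subset).
X = np.array([[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.3, 3.3]])

dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)  # ties go to the lowest index, so index 1 never wins
counts = np.bincount(labels, minlength=len(centers))
print(labels, counts)

# Unguarded update: the mean over zero rows is NaN (NumPy also emits a RuntimeWarning).
with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)
    empty_mean = X[labels == 1].mean(axis=0)
print(empty_mean)

# Any comparison against NaN is False, so a NaN center is never judged "closer".
print(float('nan') < float('inf'))
```

Guarding the update so that an empty cluster keeps its previous center, or re-drawing the initial centers when two of them coincide, avoids this situation.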
4 clusters:
Initial centers [[6.7 3.1]
 [6.4 2.7]
 [6.5 3.2]
 [5.5 2.4]]
Initial sample assignment [3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 0, 2, 0, 3, 1, 3, 2, 3, 0, 3, 3, 1, 3, 1, 3, 0, 3, 3, 1, 3, 2, 1, 1, 1, 1, 0, 0, 0, 1, 3, 3, 3, 3, 1, 3, 2, 0, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, 1, 3, 3, 2, 3, 0, 1, 2, 0, 3, 0, 1, 0, 2, 1, 0, 3, 3, 2, 2, 0, 0, 3, 0, 3, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 2, 2, 1, 0, 0, 0, 3, 0, 0, 0, 1, 2, 2, 1]
Best J for 4 clusters: 28.23339146670904
5 clusters:
Initial centers [[6.3 2.5]
 [5.1 3.5]
 [6.4 3.2]
 [7.1 3. ]
 [5.5 3.5]]
Initial sample assignment [1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1, 1, 1, 4, 4, 4, 1, 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 3, 0, 0, 0, 2, 1, 2, 1, 0, 2, 0, 2, 4, 2, 4, 0, 0, 0, 2, 0, 0, 0, 2, 2, 3, 2, 0, 0, 0, 0, 0, 0, 4, 2, 2, 0, 4, 0, 0, 2, 0, 1, 0, 4, 4, 2, 1, 0, 2, 0, 3, 2, 2, 3, 1, 3, 0, 3, 2, 0, 3, 0, 0, 2, 2, 3, 3, 0, 3, 4, 3, 0, 2, 3, 0, 2, 0, 3, 3, 3, 0, 0, 0, 3, 2, 2, 2, 3, 2, 3, 0, 3, 2, 2, 0, 2, 2, 2]
Best J for 5 clusters: 21.200013093214928
6 clusters:
Initial centers [[6.8 2.8]
 [5.8 2.6]
 [4.4 3. ]
 [6.2 3.4]
 [6.4 3.2]
 [6.  3. ]]
Initial sample assignment [2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 3, 3, 3, 2, 3, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 3, 3, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 0, 4, 0, 1, 0, 1, 3, 2, 0, 1, 1, 5, 1, 5, 1, 4, 5, 1, 1, 1, 5, 5, 1, 5, 4, 4, 0, 0, 5, 1, 1, 1, 1, 1, 1, 3, 4, 1, 5, 1, 1, 5, 1, 1, 1, 5, 1, 5, 1, 1, 3, 1, 0, 5, 4, 0, 2, 0, 0, 4, 4, 0, 0, 1, 1, 4, 4, 0, 0, 1, 0, 1, 0, 5, 4, 0, 5, 5, 0, 0, 0, 0, 0, 5, 1, 0, 3, 4, 5, 0, 4, 0, 1, 4, 4, 0, 1, 4, 3, 5]
Best J for 6 clusters: 18.150987445152886
Process finished with exit code 0
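The program ends by plotting J against the number of clusters. One common way to read such a curve is to look at how much J decreases with each additional cluster; this is an assumption here, since the text does not say how K is chosen. The short sketch below does that arithmetic on the J values recorded in the output above. The variable names are illustrative.

```python
# J values copied from the recorded run, for K = 2..6.
ks = [2, 3, 4, 5, 6]
J_values = [58.20409278906674, 58.20409278906674, 28.23339146670904,
            21.200013093214928, 18.150987445152886]

# Decrease in J when going from K to K + 1 clusters.
drops = [J_values[i] - J_values[i + 1] for i in range(len(J_values) - 1)]
for k, d in zip(ks[:-1], drops):
    print(f"K={k} -> K={k + 1}: J decreases by {d:.2f}")
```

In this particular run the step from 2 to 3 clusters shows no decrease at all, which says more about that run's initialization than about the data, so the curve is best re-examined after re-running with different initial centers.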