基于weka的数据库挖掘➖聚类方法AGNES算法

关于作者

作者介绍

🍓 博客主页：作者主页

🍓 简介：JAVA领域优质创作者🥇、一名初入职场小白🎓、曾在校期间参加各种省赛、国赛，斩获一系列荣誉 🏆

🍓 关注我：关注我学习资料、文档下载统统都有，每日定时更新文章，励志做一名JAVA资深程序猿👨‍💻

目标

掌握AGNES算法的原理和聚类过程

内容

采用AGNES算法，对给出的16个样本数据进行聚类，聚类簇数可自由调整，最后输出簇数为2、3、4的聚类结果。

层次聚类

层次聚类方法对给定的数据集进行层次的分解，直到某种条件满足为止。具体又可分为：

凝聚的层次聚类：一种自底向上的策略，首先将每个对象作为一个簇，然后合并这些原子簇为越来越大的簇，直到所有对象都在一个簇中，或者某个终结条件被满足。
分裂的层次聚类：采用自顶向下的策略，它首先将所有对象置于一个簇中，然后逐渐细分为越来越小的簇，直到每个对象自成一簇，或者达到了某个终结条件。

层次凝聚的代表是AGNES算法。层次分裂的代表是DIANA算法。

AGNES初识

初识AGNES算法（Agglomerative Nesting）是一种聚类算法，用于将数据集中的样本分组成不同的簇。AGNES算法是一种层次聚类算法，它从每个样本作为单独的簇开始，逐步合并最相似的簇，直到满足停止准则为止。

AGNES步骤

初始化：将每个样本作为单独的簇。
计算距离：计算每对簇之间的距离，常用的距离度量方法包括欧氏距离、曼哈顿距离等。
合并最相似的簇：选择距离最近的两个簇进行合并，形成一个新的簇。
更新距离矩阵：更新距离矩阵，反映新的簇与其他簇之间的距离。
重复步骤3和4，直到满足停止准则，例如达到指定的簇数目或距离阈值。

具体实现

Cluster.java

java 复制代码

public class Cluster {
    private List<DataPoint> dataPoints = new ArrayList<DataPoint>(); // 类簇中的样本点
    private String clusterName;

    public List<DataPoint> getDataPoints() {
        return dataPoints;
    }

    public void setDataPoints(List<DataPoint> dataPoints) {
        this.dataPoints = dataPoints;
    }

    public String getClusterName() {
        return clusterName;
    }

    public void setClusterName(String clusterName) {
        this.clusterName = clusterName;
    }
}

ClusterAnalysis.java

java 复制代码

public class ClusterAnalysis {

    public List<Cluster> startAnalysis(List<DataPoint> dataPoints,int ClusterNum){
        List<Cluster> finalClusters=new ArrayList<Cluster>();

        List<Cluster> originalClusters=initialCluster(dataPoints);
        finalClusters=originalClusters;
        while(finalClusters.size()>ClusterNum){
            double min=Double.MAX_VALUE;
            int mergeIndexA=0;
            int mergeIndexB=0;
            for(int i=0;i<finalClusters.size();i++){
                for(int j=0;j<finalClusters.size();j++){
                    if(i!=j){
                        Cluster clusterA=finalClusters.get(i);
                        Cluster clusterB=finalClusters.get(j);

                        List<DataPoint> dataPointsA=clusterA.getDataPoints();
                        List<DataPoint> dataPointsB=clusterB.getDataPoints();

                        for(int m=0;m<dataPointsA.size();m++){
                            for(int n=0;n<dataPointsB.size();n++){
                                double tempDis=getDistance(dataPointsA.get(m),dataPointsB.get(n));
                                if(tempDis<min){
                                    min=tempDis;
                                    mergeIndexA=i;
                                    mergeIndexB=j;
                                }
                            }
                        }
                    }
                } //end for j
            }// end for i
            //合并cluster[mergeIndexA]和cluster[mergeIndexB]
            finalClusters=mergeCluster(finalClusters,mergeIndexA,mergeIndexB);
        }//end while

        return finalClusters;
    }

    private List<Cluster> mergeCluster(List<Cluster> clusters,int mergeIndexA,int mergeIndexB){
        if (mergeIndexA != mergeIndexB) {
            // 将cluster[mergeIndexB]中的DataPoint加入到 cluster[mergeIndexA]
            Cluster clusterA = clusters.get(mergeIndexA);
            Cluster clusterB = clusters.get(mergeIndexB);

            List<DataPoint> dpA = clusterA.getDataPoints();
            List<DataPoint> dpB = clusterB.getDataPoints();

            for (DataPoint dp : dpB) {
                DataPoint tempDp = new DataPoint();
//                tempDp.setDataPointName(dp.getDataPointName());
//                tempDp.setDimensioin(dp.getDimensioin());
//                tempDp.setCluster(clusterA);
                tempDp = dp;
                tempDp.setCluster(clusterA);
                dpA.add(tempDp);
            }

            clusterA.setDataPoints(dpA);

            // List<Cluster> clusters中移除cluster[mergeIndexB]
            clusters.remove(mergeIndexB);
        }

        return clusters;
    }

    // 初始化类簇
    private List<Cluster> initialCluster(List<DataPoint> dataPoints){
        List<Cluster> originalClusters=new ArrayList<Cluster>();
        for(int i=0;i<dataPoints.size();i++){
            DataPoint tempDataPoint=dataPoints.get(i);
            List<DataPoint> tempDataPoints=new ArrayList<DataPoint>();
            tempDataPoints.add(tempDataPoint);

            Cluster tempCluster=new Cluster();
            tempCluster.setClusterName("Cluster "+String.valueOf(i));
            tempCluster.setDataPoints(tempDataPoints);

            tempDataPoint.setCluster(tempCluster);
            originalClusters.add(tempCluster);
        }

        return originalClusters;
    }

    //计算两个样本点之间的欧几里得距离
    private double getDistance(DataPoint dpA, DataPoint dpB){
        double distance=0;
        double[] dimA = dpA.getDimensioin();
        double[] dimB = dpB.getDimensioin();
        if (dimA.length == dimB.length) {
            for (int i = 0; i < dimA.length; i++) {
                double temp=Math.pow((dimA[i]-dimB[i]),2);
                distance=distance+temp;
            }
            distance=Math.pow(distance, 0.5);
        }
        return distance;
    }

    public static void main(String[] args){
        ArrayList<DataPoint> dpoints = new ArrayList<DataPoint>();
//初始化样本数据
        double[] a={2,3};
        double[] b={2,4};
        double[] c={1,4};
        double[] d={1,3};
        double[] e={2,2};
        double[] f={3,2};

        double[] g={8,7};
        double[] h={8,6};
        double[] i={7,7};
        double[] j={7,6};
        double[] k={8,5};

//       double[] l={100,2};//孤立点

        double[] m={8,20};
        double[] n={8,19};
        double[] o={7,18};
        double[] p={7,17};
        double[] q={8,20};

        dpoints.add(new DataPoint(a,"a"));
        dpoints.add(new DataPoint(b,"b"));
        dpoints.add(new DataPoint(c,"c"));
        dpoints.add(new DataPoint(d,"d"));
        dpoints.add(new DataPoint(e,"e"));
        dpoints.add(new DataPoint(f,"f"));

        dpoints.add(new DataPoint(g,"g"));
        dpoints.add(new DataPoint(h,"h"));
        dpoints.add(new DataPoint(i,"i"));
        dpoints.add(new DataPoint(j,"j"));
        dpoints.add(new DataPoint(k,"k"));

        dpoints.add(new DataPoint(m,"m"));
        dpoints.add(new DataPoint(n,"n"));
        dpoints.add(new DataPoint(o,"o"));
        dpoints.add(new DataPoint(p,"p"));
        dpoints.add(new DataPoint(q,"q"));

        int clusterNum=2; //类簇数
        ClusterAnalysis ca=new ClusterAnalysis();
        List<Cluster> clusters=ca.startAnalysis(dpoints, clusterNum);
        for(Cluster cl:clusters){
            System.out.println("------"+cl.getClusterName()+"------");
            List<DataPoint> tempDps=cl.getDataPoints();
            for(DataPoint tempdp:tempDps){
                System.out.println(tempdp.getDataPointName());
            }
        }

    }
}

DataPoint.java

java 复制代码

public class DataPoint {
    String dataPointName; // 样本点名
    Cluster cluster; // 样本点所属类簇
    private double dimensioin[]; // 样本点的维度
    public DataPoint() {
    }
    public DataPoint(double[] dimensioin, String dataPointName) {
        this.dataPointName = dataPointName;
        this.dimensioin = dimensioin;
    }
    public double[] getDimensioin() {
        return dimensioin;
    }
    public void setDimensioin(double[] dimensioin) {
        this.dimensioin = dimensioin;
    }
    public Cluster getCluster() {
        return cluster;
    }
    public void setCluster(Cluster cluster) {
        this.cluster = cluster;
    }
    public String getDataPointName() {
        return dataPointName;
    }
    public void setDataPointName(String dataPointName) {
        this.dataPointName = dataPointName;
    }
}

结果如下

优缺点

优点

简单易实现：AGNES算法的实现相对简单，不需要预先指定簇的数量，因此适用于初学者或快速原型开发。
层次结构信息：AGNES算法生成的聚类结果是一个层次结构，可以提供更丰富的信息。通过层次结构，可以得到不同层次的聚类结果，从整体到局部的视角分析数据。
不受初始参数影响：AGNES算法不需要初始聚类中心或其他参数，因此对初始条件不敏感。

缺点

计算复杂度高：AGNES算法的计算复杂度较高，特别是在处理大规模数据时。需要计算每对簇之间的距离，导致时间和空间开销较大。
对离群点敏感：AGNES算法对离群点比较敏感，离群点可能会影响最终的聚类结果。
难以处理大规模数据：由于需要计算每对簇之间的距离，当数据规模较大时，算法的效率会受到限制。
不适用于非凸形状的簇：AGNES算法倾向于生成凸形状的簇，对于非凸形状的簇效果可能不理想。