“Clustering” algorithms try to partition the samples in a data set into a number of subsets, usually disjoint, each of which is called a “cluster”. Through such a partition, each cluster may correspond to some underlying concept or category.
In the k-means algorithm, the k denotes the number of clusters to form, and “means” indicates that the mean of the data values in each cluster is taken as that cluster’s centre, also called its centroid – in other words, each cluster is described by its centroid.
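For example, the centre (“mean”) of a cluster is simply the per-characteristic average of its members. A tiny illustration in Python, using made-up values:

    cluster = [[5, 2, 4], [4, 2, 3]]   # two points with 3 characteristics each (hypothetical values)
    centre = [sum(v) / len(cluster) for v in zip(*cluster)]
    print(centre)                      # [4.5, 2.0, 3.5]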
Programming assignment: a k-means clustering case study
The K-Town Zoological Institute has 100 types of animals on exhibit (including humans – the zookeepers!). Fortunately the zoo is dedicated to the ethical treatment of animals and participates in many international programs of conservation and preservation, so we need have no worries about unintentionally becoming stars of the next Tiger King series. The Board of Directors has decided that the animals should be grouped together on the basis of shared characteristics. The zoo has 7 buildings, so you are tasked with separating the species into 7 groups.

There are 15 characteristics, each representing an aspect of the animal. Most (such as “airborne” and “milk”) are either 0 or 1, but some (such as “legs”) can take higher values. All the characteristic values are integers.

The method you will use to group the animals is called k-means clustering. It works like this: choose the number of clusters (groups) you want – in this case we are told the number of groups must be 7. Then choose 7 animals at random; these are the initial “cluster centres”. Then repeat the following steps:
1. For each animal, determine which of the cluster centres it is closest to. (See below under Manhattan and Euclid for a discussion of “closest”.) This separates the animals into 7 clusters.
2. For each cluster, compute a new centre by averaging the values of each characteristic for the animals currently in the cluster. This new centre will probably have non-integer values for its characteristics and will almost certainly not correspond exactly to any of the animals. That’s ok! This gives us 7 new cluster centres, so we can go back to Step 1 and repeat the process of assigning the animals to clusters by choosing the nearest cluster centre.
3. We repeat this process until the clusters stabilize (i.e. they stop changing or almost stop changing). I’ll describe ways of measuring stability later in this assignment.
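To make the three steps concrete, here is a minimal sketch in Python, assuming each animal is represented as a plain list of integer characteristic values and using the Manhattan distance discussed below; the function names are mine, not part of the assignment:

    import random

    def manhattan(a, b):
        # Manhattan distance: sum of absolute differences per characteristic.
        return sum(abs(x - y) for x, y in zip(a, b))

    def assign(animals, centres):
        # Step 1: put each animal in the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for animal in animals:
            nearest = min(range(len(centres)), key=lambda i: manhattan(animal, centres[i]))
            clusters[nearest].append(animal)
        return clusters

    def update(clusters, old_centres):
        # Step 2: the new centre is the mean of each characteristic over the cluster.
        new_centres = []
        for cluster, old in zip(clusters, old_centres):
            if cluster:
                new_centres.append([sum(vals) / len(cluster) for vals in zip(*cluster)])
            else:
                new_centres.append(old)  # keep the old centre if a cluster goes empty
        return new_centres

    def kmeans(animals, k, seed=None):
        # Step 3: iterate until the assignment stops changing.
        rng = random.Random(seed)
        centres = rng.sample(animals, k)          # k random animals as initial centres
        clusters = assign(animals, centres)
        while True:
            centres = update(clusters, centres)
            new_clusters = assign(animals, centres)
            if new_clusters == clusters:          # stable: no animal switched cluster
                return clusters, centres
            clusters = new_clusters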
Example:
Suppose we have just 5 animals, and they only have 3 characteristics A, B and C. We can represent the data like this (I picked random numbers for the characteristic values – in the actual data set the values are not randomly assigned):
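The data table itself is missing from this copy of the assignment. For concreteness, the snippets below use one set of values that is consistent with every distance and centre computed in the text – a reconstruction, not the original numbers:

    Animal   A   B   C
    Horse    5   2   4
    Cow      4   2   3
    Moose    1   4   5
    Bear     2   6   1
    Duck     2   6   2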
Suppose that we decide to separate these animals into 2 groups based on the 3 characteristics. We randomly choose 2 initial centres (I’ll pick Horse and Moose), and compute which of these is closer for each of the other animals. I’m using a simple distance measure called the Manhattan Metric (see below). Using that metric, Cow is closer to Horse, and Bear and Duck are both closer to Moose. This gives us two clusters: [Horse, Cow] and [Moose, Bear, Duck]. We compute the new centre for each cluster by averaging the characteristic values for the animals in the cluster. So the updated centre for Cluster 1 is the set of characteristic values [9/2, 2, 7/2] and the updated centre for Cluster 2 is [5/3, 16/3, 8/3].
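With the values assumed above, a few lines of Python reproduce this first pass – the split into two clusters and both updated centres:

    data = {"Horse": [5, 2, 4], "Cow": [4, 2, 3], "Moose": [1, 4, 5],
            "Bear": [2, 6, 1], "Duck": [2, 6, 2]}

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    # Distance from each animal to the two initial centres, Horse and Moose.
    for name, vals in data.items():
        print(name, manhattan(vals, data["Horse"]), manhattan(vals, data["Moose"]))
    # Cow comes out closer to Horse; Bear and Duck come out closer to Moose.

    # Updated centres: per-characteristic averages over each cluster.
    c1 = [sum(v) / 2 for v in zip(data["Horse"], data["Cow"])]                 # [4.5, 2.0, 3.5]
    c2 = [sum(v) / 3 for v in zip(data["Moose"], data["Bear"], data["Duck"])]  # [5/3, 16/3, 8/3]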
Now for all the animals we compute which of the new cluster centres they are closest to. None of them switches to the other cluster, so the clusters do not change. We are done – but if any of the animals had switched to the other cluster we would have iterated again, calculating new centres and seeing if the clusters changed again.
To see this in action, suppose we choose Bear and Duck as the two initial centres. The initial clusters are [Bear] and [Duck, Moose, Horse, Cow]. The updated centre for Cluster 1 is [2, 6, 1] and the updated centre for Cluster 2 is [3, 3.5, 3.5]. When we recompute the distances from the animals to the cluster centres, Duck jumps to Cluster 1, giving [Bear, Duck] and [Moose, Horse, Cow]. Computing the new centres and determining which centre is closest for each animal creates no further changes, so the situation is now stable and we stop.
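Both runs can be replayed with the assign() and update() functions from the sketch above, starting from fixed centres instead of random ones (still using the assumed data):

    def run_from(animals, centres):
        # Iterate assign/update from the given initial centres until stable.
        clusters = assign(animals, centres)
        while True:
            centres = update(clusters, centres)
            new_clusters = assign(animals, centres)
            if new_clusters == clusters:
                return clusters
            clusters = new_clusters

    animals = [data[n] for n in ("Horse", "Cow", "Moose", "Bear", "Duck")]
    # The printed clusters are lists of value lists; the comments name the animals.
    print(run_from(animals, [data["Horse"], data["Moose"]]))  # settles on [Horse, Cow] / [Moose, Bear, Duck]
    print(run_from(animals, [data["Bear"], data["Duck"]]))    # settles on [Bear, Duck] / [Moose, Horse, Cow]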
This example illustrates that the algorithm can settle on different solutions, depending on the starting points. We typically run the algorithm several times with different randomly chosen starting points to see if one solution comes up more frequently than others.
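With the kmeans() sketch above, one way to do this is a simple restart loop that tallies how often each stable grouping appears (a hypothetical harness, not part of the assignment):

    from collections import Counter

    outcomes = Counter()
    for seed in range(20):
        clusters, _ = kmeans(animals, k=2, seed=seed)
        # Canonicalise each clustering so identical groupings compare equal.
        key = tuple(sorted(tuple(map(tuple, sorted(c))) for c in clusters))
        outcomes[key] += 1
    print(outcomes.most_common())  # the most frequent stable grouping comes first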