The K-Town Zoological Institute has 100 types of animals on
exhibit (including humans – the zookeepers!) The Board of Directors Fortunately the zoo is dedicated to the ethical treatment of animals and participates in many international programs of conservation and preservation, so we
need have no worries about unintentionally becoming stars of the next Tiger
King series. has decided that the animals should be grouped together on the basis of shared characteristics. The zoo has 7 buildings, so you are tasked with separating the species into 7 groups. There are 15 characteristics, each representing an aspect of the
animal. Most (such as “airborne” and “milk”) are either 0 or 1. But some (such as “legs”) can take higher values. All the characteristic values are integers. The method you will use to group the animals is called k-means clustering. It works like this: Choose the number of clusters (groups) you want. In this case we are told the number of groups must be 7. Choose 7 animals at random. These are the initial “cluster centres”. Then repeat the following steps:
动物的一个方面。大多数（如 “空气传播 “和 “牛奶”）不是0就是1。但是 有些（如 “腿”）可以取更高的值。所有的特征 值都是整数。你将使用的对动物进行分组的方法叫做k-means 聚类法。它的工作原理是这样的。选择你想要的聚类（组）的数量。在本例中，我们 我们被告知组的数量必须是7。随机选择7个动物。这些是最初的 “集群中心”。然后重复下面的步骤。
1.For each animal, determine which of the cluster centres it is closest to.1 This separates the animals into 7 clusters. 1 See below under Manhattan and Euclid for a discussion of “closest”.
2. For each cluster, compute a new centre by averaging the values of each characteristic for the animals currently in the cluster. This new centre will probably have non-integer values for its characteristics and will almost certainly not correspond exactly to any of the animals. That’s ok!
This gives us 7 new cluster centres, so we can go back to Step
1 and repeat the process of assigning the animals to clusters by choosing the nearest cluster centre.
3. We repeat this process until the clusters stabilize (i.e. they stop changing or almost stop changing). I’ll describe ways of measuring stability later in this assignment.
Suppose we have just 5 animals, and they only have 3 characteristics A, B and C. We can represent the data like this (I picked random numbers for the characteristic values – in the actual data set the values are not randomly assigned):
Suppose that we decide to separate these animals into 2 groups based on the 3 characteristics. We randomly choose 2 initial centres (I’ll pick Horse and Moose), and compute which of these is closer for each of the other animals. I’m using a simple distance measure called the Manhattan Metric2. Using that metric, Cow is closer to 2 See below Horse, and Bear and Duck are both closer to Moose. This gives us two clusters: [Horse, Cow] and [Moose, Bear, Duck]. We compute the new centre for each cluster by averaging the characteristic values for the animals in the cluster. So the updated centre for Cluster 1 is the set of characteristic values [9/2, 2, 7/2] and the updated centre for Cluster 2 is [5/3, 16/3, 8/3]
Now for all the animals we compute which of the new cluster centres they are closest to. None of them switches to the other cluster, so the clusters do not change. We are done – but if any of the animals had switched to the other cluster we would have iterated again, calculating new centres and seeing if the clusters changed again.
To see this in action, suppose we choose Bear and Duck as the two initial centres. The initial clusters are [Bear] and [Duck, Moose, Horse, Cow]. The updated centre for Cluster 1 is [2, 6, 1] and the updated centre for Cluster 2 is [3, 3.5, 3.5]. When we recompute the distances from the animals to the cluster centres, Duck jumps to Cluster 1, giving [Bear, Duck] and [Moose, Horse, Cow]. Computing the new centres and determining which centre is closest for each animals creates no further changes, so the situation is now stable and we stop.
This example illustrates that the algorithm can settle on different solutions, depending on the starting points. We typically run the algorithm several times with different randomly chosen starting points to see if one solution comes up more frequently than others.