Comment on page
- 1.Sensitive to outliers, can skew results (because we rely on the mean)
- basically k-means with a most center object rather than a center virtual point that was based on mean distance from all points, we keep choosing medoids samples based on minimised SSE
- k-medoid is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori.
- Does Not scale to many samples, its O(K*n-K)^2
- Randomized resampling can assure efficiency and quality.
G-means Improves on X-means in the paper: The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution using the Anderson Darling test. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each enter. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center.
In k-means, you carry out the following procedure:
- specify k centroids, initialising their coordinates randomly
- calculate the distance of each data point to each centroid
- assign each data point to its nearest centroid
- update the coordinates of the centroid to the mean of all points assigned to it
- iterate until convergence.
In a GMM, you carry out the following procedure:
- specify k multivariate Gaussians (termed components), initialising their mean and variance randomly
- calculate the probability of each data point being produced by each component (sometimes termed the responsibility each component takes for the data point)
- assign each data point to the component it belongs to with the highest probability
- update the mean and variance of the component to the mean and variance of all data points assigned to it
- iterate until convergence
You may notice the similarity between these two procedures. In fact, k-means is a GMM with fixed-variance components. Under a GMM, the probabilities (I think) you're looking for are the responsibilities each component takes for each data point.
- 7.Optimized dbscans:
(what is?) HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based in the stability of clusters.
- 1.Transform the space according to the density/sparsity.
- 2.Build the minimum spanning tree of the distance weighted graph.
- 3.Construct a cluster hierarchy of connected components.
- 4.Condense the cluster hierarchy based on minimum cluster size.
- 5.Extract the stable clusters from the condensed tree.
- it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density.
- (How?) the points of the database are (linearly) ordered such that points which are spatially closest become neighbors in the ordering.
- a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. (This is represented as a dendrogram.)
An SVM-based clustering algorithm is introduced that clusters data with no a priori knowledge of input classes.
- 1.The algorithm initializes by first running a binary SVM classifier against a data set with each vector in the set randomly labelled, this is repeated until an initial convergence occurs.
- 2.Once this initialization step is complete, the SVM confidence parameters for classification on each of the training instances can be accessed.
- 3.The lowest confidence data (e.g., the worst of the mislabelled data) then has its' labels switched to the other class label.
- 4.The SVM is then re-run on the data set (with partly re-labelled data) and is guaranteed to converge in this situation since it converged previously, and now it has fewer data points to carry with mislabelling penalties.
- 5.This approach appears to limit exposure to the local minima traps that can occur with other approaches. Thus, the algorithm then improves on its weakly convergent result by SVM re-training after each re-labeling on the worst of the misclassified vectors – i.e., those feature vectors with confidence factor values beyond some threshold.
- 6.The repetition of the above process improves the accuracy, here a measure of separability, until there are no misclassifications. Variations on this type of clustering approach are shown.
Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world problem of automatically detecting road lanes from GPS data and observe dramatic increases in performance.
In the context of partitioning algorithms, instance level constraints are a useful way to express a priori knowledge about which instances should or should not be grouped together. Consequently, we consider two types of pairwise constraints: • Must-link constraints specify that two instances have to be in the same cluster. • Cannot-link constraints specify that two instances must not be placed in the same cluster.