I recently submitted a paper named DISCERN with my research partner and one of the greatest professors at our institution. DISCERN is a partitional clustering algorithm that attempts to solve some common problems with the original K-Means and K-Means++. It is no secret that density-based and hierarchical clustering methods have their own specific uses; for instance, no other algorithm comes close to hierarchical clustering when used for text mining, in terms of results. K-Means, on the other hand, has proven suitable for almost any problem when initialized properly, unless there is a great deal of noisy data to deal with. K-Means initialization mainly consists of setting the number of clusters and selecting the initial centroid coordinates. Nevertheless, this initialization can be problematic.

Firstly, we shouldn't expect to always know the number of clusters. In a completely unsupervised problem, we may have an embedding of our original data produced by an embedding algorithm (such as spectral embedding or PCA) or a deep network, but we still don't know the number of unique groups within the data. The problem of determining or estimating the number of clusters has always been open to discussion.

Secondly, the stochastic initialization makes the results unreliable at times, even though it is very useful in specific cases. With a fast implementation and a dataset that is not very large, we can run K-Means or the initialized K-Means++ multiple times and keep the best run, but that becomes problematic when dealing with larger sets of data.
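To make the multi-restart workaround concrete, here's a minimal NumPy sketch (the `kmeans` helper and toy data are illustrative, not from the paper): run plain Lloyd's K-Means from several random initializations and keep the run with the lowest inertia.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=100):
    """Plain Lloyd's algorithm with random centroid initialization."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster went empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Inertia: total squared distance of points to their assigned centroid.
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Three well-separated blobs; run ten restarts and keep the best one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ((0, 0), (5, 5), (0, 5))])
labels, centroids, inertia = min(
    (kmeans(X, k=3, seed=s) for s in range(10)), key=lambda r: r[2])
```

The cost of this safety net scales linearly with the number of restarts, which is exactly why it stops being practical on large datasets.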

To attempt to solve both problems, we proposed DISCERN, which relies on diversity-based measures to estimate the centroids in a non-stochastic manner, yielding exactly the same results every run. It can either be limited to a specific number of clusters or left to estimate that number itself. Compared to other widely used methods for estimating the number of clusters, we found that only two in particular stand out and come close to our method in terms of correctly estimating this number. Nevertheless, we also proved that DISCERN has lower time complexity than those methods.
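DISCERN's actual seeding procedure is described in the paper; as a rough illustration of the general idea of deterministic, diversity-driven seed selection (and explicitly not the paper's exact method), here is the classic farthest-point (max-min) traversal, which also returns the same seeds on every run:

```python
import numpy as np

def farthest_point_seeds(X, k):
    """Deterministically pick k diverse seeds: start from the point farthest
    from the data mean, then repeatedly add the point whose minimum distance
    to the already-chosen seeds is largest (max-min traversal)."""
    first = int(np.linalg.norm(X - X.mean(axis=0), axis=1).argmax())
    seeds = [first]
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(seeds) < k:
        nxt = int(min_dist.argmax())  # farthest point from all current seeds
        seeds.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[seeds]

# Three separated blobs: each seed lands in a different blob, every time.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.2, size=(40, 2))
               for loc in ((0, 0), (4, 0), (0, 4))])
seeds = farthest_point_seeds(X, k=3)
```

Because there is no random draw anywhere, the seeds (and therefore the clustering they initialize) are fully reproducible, which is the property the stochastic K-Means++ initialization lacks.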

If you're interested, you can read the manuscript on arXiv: https://arxiv.org/abs/1910.05933

Stay tuned for more!