These two key differences point toward some situations in which the semi-supervised approach is preferable.

1. In some clustering problems the desired similarity metric may be so different from the default that traditional active learning would make many inefficient queries. This problem also arises when there are many different plausible clusterings. Although the process is less automated, a human browsing the data would do less work by selecting the feedback data points themselves.
2. The intuitive array of possible constraints is easier to apply than labels, especially when the final clusters are not known in advance.
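The second point can be made concrete with a minimal sketch. It assumes the feedback takes the form of pairwise must-link and cannot-link constraints; the document names, the helper `count_violations`, and the example assignments are all hypothetical and serve only as illustration. The sketch shows that such constraints can be stated and checked without naming or even knowing the final clusters.

```python
def count_violations(assignment, must_link, cannot_link):
    """Count pairwise constraints that a candidate clustering breaks.

    assignment  -- dict mapping each point to an arbitrary cluster id
    must_link   -- pairs the user says belong together
    cannot_link -- pairs the user says belong apart
    """
    violated = 0
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            violated += 1
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            violated += 1
    return violated


# Hypothetical feedback: the user only judges pairs, never cluster names.
must_link = [("doc1", "doc2")]
cannot_link = [("doc1", "doc3")]

# Two candidate clusterings of the same three documents.
clustering_a = {"doc1": 0, "doc2": 0, "doc3": 1}
clustering_b = {"doc1": 0, "doc2": 1, "doc3": 0}

# The same partition as clustering_a with different cluster ids.
clustering_a_renamed = {"doc1": 7, "doc2": 7, "doc3": 3}

violations_a = count_violations(clustering_a, must_link, cannot_link)
violations_b = count_violations(clustering_b, must_link, cannot_link)
violations_renamed = count_violations(clustering_a_renamed, must_link, cannot_link)
```

Because the check compares cluster ids only for equality, renaming the clusters leaves the result unchanged. A label, by contrast, would require the user to commit to a fixed set of cluster names before the clustering exists.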


Gates, and Philip Yu consider the problem of using a pre-existing taxonomy of text documents as supervision to improve the clustering algorithm, which is subsequently used to classify text documents into categories. In their experiments, they use the Yahoo! hierarchy as prior knowledge in the supervised clustering scheme and demonstrate that the automated categorization system built by their technique can match, and sometimes exceed, the performance of manually built categorization taxonomies at a fraction of the cost.

