Semi-supervised Clustering for Short Text via Deep Representation Learning (2016)(About) A semi-supervised method for short text clustering, where we represent texts as distributed vectors with neural networks and use a small amount of labeled data to specify our intention for clustering. We design a novel objective that combines the representation learning process and the k-means clustering process, and optimize it with both labeled and unlabeled data iteratively until convergence through three steps:
1. assign each short text to its nearest centroid based on its representation from the current neural networks;
2. re-estimate the cluster centroids based on cluster assignments from step (1);
3. update neural networks according to the objective by keeping centroids and cluster assignments fixed.
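The three steps above amount to an alternating optimization. Below is a minimal sketch of that loop, not the paper's method: a linear map stands in for the neural encoder, the data is random, and only the unsupervised k-means term of the objective is optimized.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))         # toy features for 40 short texts
W = rng.normal(size=(5, 3))          # linear stand-in for the neural encoder
K = 3                                # number of clusters
centroids = rng.normal(size=(K, 3))
losses = []

for _ in range(10):
    Z = X @ W                        # current text representations
    # Step 1: assign each text to its nearest centroid.
    d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d.argmin(axis=1)
    # Step 2: re-estimate centroids from the current assignments.
    for k in range(K):
        if (assign == k).any():
            centroids[k] = Z[assign == k].mean(axis=0)
    losses.append(float(((Z - centroids[assign]) ** 2).sum(axis=1).mean()))
    # Step 3: update the encoder by one gradient step on the k-means
    # objective, keeping centroids and assignments fixed.
    grad = 2 * X.T @ (Z - centroids[assign]) / len(X)
    W -= 0.1 * grad
```

Each of the three steps can only decrease the k-means objective, so the loop converges; the paper additionally mixes in a supervised term from the labeled data, which this sketch omits.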
Semantic Enriched Short Text Clustering | SpringerLink(About) The issue of clustering short texts, which are free answers gathered during brainstorming seminars. These answers are short, often incomplete, and highly biased toward the question, so establishing a notion of proximity between texts is a challenging task. In addition, the number of answers is limited to about a hundred instances, which causes sparsity. We present three text clustering methods in order to choose the best one for this specific task, then we show how the chosen method can be improved by semantic enrichment, including neural-based distributional models and external knowledge resources.
Self-Taught Convolutional Neural Networks for Short Text Clustering (2017)(About) > We propose a flexible short text clustering framework which explores the feasibility and effectiveness of combining CNN and traditional unsupervised dimensionality reduction methods.
> Non-biased deep feature representations can be learned through our self-taught CNN framework, which does not use any external tags/labels or complicated NLP pre-processing.
> The original raw text features are first embedded into compact binary codes using an existing unsupervised dimensionality reduction method. Then, word embeddings are fed into convolutional neural networks to learn deep feature representations, while the output units are trained to fit the pre-trained binary codes. Finally, we obtain the clusters by applying K-means to the learned representations.
[conf paper, same authors](http://www.aclweb.org/anthology/W15-1509) ; [GitHub repo (MATLAB)](https://github.com/jacoxu/STC2)
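The quoted pipeline (binary codes, then a network fit to them, then K-means) can be sketched end-to-end. This is a toy stand-in under loud assumptions, not the STC2 implementation: a median-thresholded SVD projection replaces the paper's unsupervised binary-coding step, a one-layer sigmoid map replaces the CNN, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))        # toy bag-of-words-like features

# Step 1: unsupervised reduction -> compact binary codes
# (median threshold on an SVD projection, as a stand-in).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:8].T
B = (proj > np.median(proj, axis=0)).astype(float)

# Step 2: train a one-layer sigmoid map (stand-in for the CNN)
# whose output units fit the pre-trained binary codes.
W = np.zeros((20, 8))
for _ in range(200):
    H = 1.0 / (1.0 + np.exp(-(X @ W)))      # sigmoid outputs
    W -= 0.05 * X.T @ (H - B) / len(X)      # cross-entropy gradient step

# Step 3: K-means on the learned representations (2 clusters).
feats = 1.0 / (1.0 + np.exp(-(X @ W)))
centroids = feats[:2].copy()
for _ in range(10):
    labels = ((feats[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    for k in range(2):
        if (labels == k).any():
            centroids[k] = feats[labels == k].mean(axis=0)
```

The key design point the sketch preserves is that the binary codes act as self-generated training targets, so no external labels are needed anywhere in the pipeline.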
Learning Semantic Similarity for Very Short Texts (arXiv, submitted on 2 Dec 2015)(About) In order to pair short text fragments (as a concatenation of separate words), an adequate distributed sentence representation is needed. Main contribution: a first step towards a hybrid method that combines the strength of dense distributed representations (as opposed to sparse term matching) with the strength of tf-idf based methods. The combination of word embeddings and tf-idf information might lead to a better model for semantic content within very short text fragments.
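One simple way to realize this hybrid idea, as an illustration rather than the paper's exact scheme, is to weight each word's embedding by its tf-idf score before averaging, so rare informative words dominate the sentence vector. The embeddings and idf values below are invented toy data.

```python
import numpy as np

emb = {                              # toy 3-d word embeddings (invented)
    "neural":  np.array([1.0, 0.2, 0.0]),
    "network": np.array([0.9, 0.3, 0.1]),
    "the":     np.array([0.1, 0.1, 0.1]),
    "deep":    np.array([0.8, 0.1, 0.2]),
}
idf = {"neural": 2.0, "network": 1.8, "the": 0.1, "deep": 1.9}

def sent_vec(words):
    # tf-idf-weighted mean of word vectors; tf is the in-sentence count.
    vecs, weights = [], []
    for w in set(words):
        if w in emb:
            vecs.append(emb[w])
            weights.append(words.count(w) * idf.get(w, 1.0))
    v = np.average(vecs, axis=0, weights=weights)
    return v / np.linalg.norm(v)     # unit-normalize

s1 = sent_vec("the neural network".split())
s2 = sent_vec("deep neural network".split())
sim = float(s1 @ s2)                 # cosine similarity of unit vectors
```

Because "the" has a near-zero idf weight, it barely moves the sentence vector, which is exactly the effect the tf-idf side of the hybrid contributes.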
Short Text Similarity with Word Embeddings(About) We investigate whether determining short text similarity is possible using only semantic features. A novel feature of our approach is that an arbitrary number of word embedding sets can be
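The multi-embedding-set idea can be sketched as follows. This is an assumption about the feature setup, not the paper's pipeline: each embedding set contributes one similarity feature (cosine of mean word vectors), and the resulting feature vector could feed a downstream learner. The two tiny embedding sets are invented.

```python
import numpy as np

sets = [                             # two toy word-embedding sets (invented)
    {"cat": np.array([1.0, 0.0]), "dog": np.array([0.8, 0.6])},
    {"cat": np.array([0.0, 1.0]), "dog": np.array([0.6, 0.8])},
]

def mean_vec(words, emb):
    # unit-normalized mean of the in-vocabulary word vectors
    v = np.mean([emb[w] for w in words if w in emb], axis=0)
    return v / np.linalg.norm(v)

def features(words_a, words_b):
    # one cosine-similarity feature per embedding set
    return [float(mean_vec(words_a, e) @ mean_vec(words_b, e)) for e in sets]

f = features(["cat"], ["dog"])       # [0.8, 0.8] for these toy vectors
```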