Learning Text Similarity with Siamese Recurrent Networks(About) A deep architecture for **learning a similarity metric** on variable-length character sequences. The model combines a stack of character-level bidirectional LSTMs with a Siamese architecture. It learns to project variable-length strings into a fixed-dimensional embedding space **by using only information about the similarity between pairs of strings**. The model is applied to the task of job title normalization based on a manually annotated taxonomy. A small data set is incrementally expanded and augmented with new sources of variance.
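A minimal sketch of the idea (assuming PyTorch; layer sizes, pooling, and loss details are illustrative, not the paper's exact configuration): a character-level BiLSTM encoder shared by both branches of the Siamese network, trained with a contrastive-style loss on the cosine similarity of string pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharBiLSTMEncoder(nn.Module):
    """Projects a character sequence to a fixed-dimensional unit vector."""
    def __init__(self, vocab_size, char_dim=32, hidden_dim=64, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        # stack of bidirectional LSTMs over character embeddings
        self.lstm = nn.LSTM(char_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, char_ids):                      # (batch, seq_len)
        out, _ = self.lstm(self.embed(char_ids))      # (batch, seq_len, 2*hidden)
        pooled = out.mean(dim=1)                      # average over time steps
        return F.normalize(self.proj(pooled), dim=1)

def contrastive_loss(a, b, label, margin=0.5):
    """label = 1 for similar pairs, 0 for dissimilar pairs."""
    cos = F.cosine_similarity(a, b)
    return (label * (1 - cos) + (1 - label) * F.relu(cos - margin)).mean()
```

Both strings of a pair go through the *same* encoder (shared weights); that weight sharing is what makes the network Siamese.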
Unsupervised Similarity Learning from Textual Data (2012)(About) > Two main components of the model are a semantic interpreter of texts and a similarity function whose properties are derived from data. The first one associates particular documents with concepts defined in a knowledge base corresponding to the topics covered by the corpus. It shifts the representation of a meaning of the texts from words that can be ambiguous to concepts with predefined semantics. With this new representation, the similarity function is derived from data using a modification of the dynamic rule-based similarity model, which is adjusted to the unsupervised case.
By same author: [Interactive Document Indexing Method Based on Explicit Semantic Analysis](https://link.springer.com/chapter/10.1007/978-3-642-32115-3_18)
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical Information Retrieval (2017)(About) > Transferring the success of word embeddings to Information Retrieval (IR) task is currently an active research topic. While embedding-based retrieval models could tackle the vocabulary mismatch problem by making use of the embedding’s inherent similarity between distinct words, most of them struggle to compete with the prevalent strong baselines such as TF-IDF and BM25.
Considering a practical ad-hoc IR task composed of two steps, matching and scoring, the paper compares the performance of several techniques that leverage word embeddings in the retrieval models to compute the similarity between the query and the documents (namely word centroid similarity, paragraph vectors, Word Mover's distance, and a novel inverse document frequency (IDF) re-weighted word centroid similarity).
> We confirm that word embeddings can be successfully employed in a practical information retrieval setting. The proposed cosine similarity of IDF re-weighted, aggregated word vectors is competitive to the TF-IDF baseline.
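A quick sketch of the proposed IDF re-weighted word centroid similarity (`emb`, `df`, and `n_docs` are illustrative placeholders for a pretrained embedding lookup, document frequencies, and corpus size; not names from the paper):

```python
import numpy as np

def idf(term, df, n_docs):
    # smoothed inverse document frequency
    return np.log(n_docs / (1 + df.get(term, 0)))

def idf_centroid(tokens, emb, df, n_docs):
    # aggregate word vectors, each re-weighted by its IDF
    vecs = [idf(t, df, n_docs) * emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def centroid_similarity(query_tokens, doc_tokens, emb, df, n_docs):
    q = idf_centroid(query_tokens, emb, df, n_docs)
    d = idf_centroid(doc_tokens, emb, df, n_docs)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
```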
Finding Similar Items(About) **Jaccard similarity**: similarity of sets, based on the relative size of their intersection -> **finding textually similar documents in a large corpus, near duplicates**. [Collaborative Filtering](/tag/collaborative_filtering) as a Similar-Sets Problem (cf. online purchases, movie ratings)
**Shingling** turns the problem of textual similarity of documents into a problem of similarity of sets.
k-shingle: substring of length k found within a document (k = 5 for emails). Shingles can be hashed; they can also be built from words (a stop word + the 2 following words).
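A toy sketch of character k-shingling and Jaccard similarity (illustrative helper names):

```python
def shingles(text, k=5):
    """Set of all character k-shingles of a text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """|intersection| / |union| of two sets."""
    return len(a & b) / len(a | b)

s1 = shingles("the plane was ready for touch down")
s2 = shingles("the plane was ready for touchdown")
print(jaccard(s1, s2))  # high: the two texts are near duplicates
```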
Similarity-Preserving Summaries of Sets: shingle sets are large -> compress large sets into small representations (“signatures”) that preserve similarity: **[Minhashing](/tag/minhash)** - related to Jaccard similarity (good explanation in [wikipedia](https://en.wikipedia.org/wiki/MinHash))
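A sketch of minhash signatures (random affine hash functions modulo a large prime; the fraction of agreeing signature positions estimates the Jaccard similarity):

```python
import random

PRIME = 2147483647  # a large prime for the hash family

def make_hash_family(n, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def minhash_signature(shingle_set, hash_family):
    # one minimum per hash function, over all elements of the set
    return [min((a * hash(s) + b) % PRIME for s in shingle_set)
            for a, b in hash_family]

def estimated_jaccard(sig1, sig2):
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```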
It still may be impossible to find the pairs of docs with greatest similarity efficiently -> **[Locality-Sensitive Hashing](/tag/locality_sensitive_hashing)** for Documents
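A sketch of the banding trick that makes this tractable: split each minhash signature into b bands of r rows and hash each band into buckets; any two sets sharing a bucket in at least one band become a candidate pair (the similarity threshold behaves roughly like (1/b)^(1/r)):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict doc_id -> minhash signature of length b * r."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # the band is its own key
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates
```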
Theory of Locality-Sensitive Functions
LSH families for other distance measures
Applications of Locality-Sensitive Hashing:
- entity resolution
- matching fingerprints
- matching newspaper articles
Methods for High Degrees of Similarity: LSH-based methods are most effective when the degree of similarity we accept is relatively low. When we want to find sets that are almost identical, other methods can be faster.
Word Meaning and Similarity - Stanford University(About) thesaurus-based meaning, distributional models of meaning.
Term-context matrix vs. term-document matrix: for the term-document matrix, use tf-idf instead of raw term counts; for the term-context matrix, use Positive Pointwise Mutual Information (PPMI: do words x and y co-occur more than if they were independent?)
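A sketch of PPMI weighting for a term-context count matrix (rows: words, columns: contexts), using PPMI(x, y) = max(0, log2(P(x, y) / (P(x) P(y)))):

```python
import numpy as np

def ppmi(counts):
    """counts: 2-D array of word-context co-occurrence counts."""
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # word marginals
    p_y = p_xy.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_xy > 0, p_xy / (p_x * p_y), 1.0)  # log2(1) = 0 for empty cells
    return np.maximum(np.log2(ratio), 0.0)
```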
Gensim tutorial: Similarity Queries(About) > "The thing to note here is that documents no. 2 would never be returned by a standard boolean fulltext search, because they do not share any common words with query string"
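The tutorial's pipeline, roughly (LSI model, then a cosine-similarity index over the corpus; the toy documents here are illustrative):

```python
from gensim import corpora, models, similarities

documents = ["Human machine interface for lab abc computer applications",
             "The EPS user interface management system",
             "Graph minors a survey"]
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

query = dictionary.doc2bow("human computer interaction".split())
sims = index[lsi[query]]  # cosine similarities in LSI space
print(sorted(enumerate(sims), key=lambda item: -item[1]))
```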
Short Text Similarity with Word Embeddings(About) We investigate whether determining short text similarity is possible using only semantic features. A novel feature of our approach is that an arbitrary number of word embedding sets can be incorporated.
Quick review on Text Clustering and Text Similarity Approaches(About) Author: Maali Mnasri (PhD @ CEA)
First transform text units to vectors? Not always (e.g. a sentence similarity task using lexical word alignment). But vectors are efficient to process and benefit from existing clustering algorithms such as k-means.
Sentence level or document level? Sentence clustering to summarise large documents.
Thematic clustering vs Semantic clustering: depends on the similarity measure.
Text similarity measures:
- Cosine similarity of tf-idf (suitable to produce thematic clusters); see the sketch after this list
- Knowledge-based measures (WordNet) (quantify semantic relatedness of words)
- Word embeddings
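A minimal sketch of the first measure with scikit-learn (toy documents, illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a cat lay on a rug", "stock markets fell"]
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf))  # pairwise document-document similarity matrix
```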
Examples, sample code:
- using WordNet with NLTK, and the formula to compute sentence similarities from word similarities.
- computing similarities between docs using gensim/word2vec
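Along those lines, a sketch (not the article's exact code) of WordNet word similarity with NLTK, aggregated into a crude sentence similarity by best-match averaging:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def word_sim(w1, w2):
    """Best path similarity over all synset pairs of the two words."""
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max((s for s in scores if s is not None), default=0.0)

def sentence_sim(sent1, sent2):
    """Average, over words of sent1, of the best match in sent2 (asymmetric)."""
    tokens1, tokens2 = sent1.lower().split(), sent2.lower().split()
    best = [max(word_sim(a, b) for b in tokens2) for a in tokens1]
    return sum(best) / len(best)

print(sentence_sim("dogs are animals", "cats are pets"))
```

Symmetrizing (averaging both directions) is a common refinement.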
Which clustering algorithm?
- when we have an approximation of the number of clusters, and when the similarity measure is not expensive to compute, clustering algorithms are suitable and fast. Sample code of k-means clustering using tf-idf vectors with scikit-learn (see the sketch after this list)
- Hierarchical clustering algorithms
- don't need to give the number of clusters
- but time-consuming (a similarity matrix for the sentences must be computed)
- for voluminous data, use an incremental clustering algorithm: sentences are processed one at a time; each new sentence is compared to each of the already formed clusters.
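The k-means-on-tf-idf recipe mentioned above, as a minimal scikit-learn sketch (toy documents and k are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats and dogs", "dogs are pets", "stocks fell", "markets rallied"]
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster id per document
```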