Technique of analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text.
LSI transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality.
A matrix containing word counts per paragraph (rows = words, columns = paragraphs) is constructed from a large piece of text. [Singular value decomposition (SVD)](singular_value_decomposition) is used to reduce the number of rows while preserving the similarity structure among columns. Similarities between words and/or documents can then be evaluated using cosine distance in the low-dimensional space.
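A toy illustration of this pipeline (made-up counts, plain NumPy, not the full LSI machinery):

```python
# Minimal LSA sketch: truncated SVD on a toy term-document count matrix,
# then cosine similarity between documents in the reduced space.
import numpy as np

# rows = terms, columns = documents (invented counts for illustration)
X = np.array([
    [2, 0, 1, 0],   # "car"
    [0, 2, 1, 0],   # "automobile"
    [1, 1, 2, 0],   # "engine"
    [0, 0, 0, 3],   # "banana"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                   # num_topics must be chosen by hand
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # one row per document in latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# documents 0 and 1 share only one term, yet end up very close in the
# latent space, while document 3 stays far from both
print(cosine(docs_k[0], docs_k[1]), cosine(docs_k[0], docs_k[3]))
```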
- alleviates the problem of **synonymy** (note: Wikipedia contradicts itself regarding polysemy; I would say LSI cannot solve that problem)
- can output topics in a **ranked order**.
- **requires a num_topics parameter**.
- dimensions have no easily interpretable meaning in natural language
- SVD is computationally intensive (still a problem even with improved algorithms?)
- wikipedia says that the probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), whereas a Poisson distribution has been observed. A newer alternative, probabilistic latent semantic analysis (pLSA), is based on a multinomial model and is reported to give better results than standard LSA.
[Gensim tutorial about transformations](https://markroxor.github.io/gensim/static/notebooks/Topics_and_Transformations.html) says that "LSI training is unique in that it can continue at any point, simply by providing more training documents."
(LSI or LSA? Truncated SVD applied to document similarity is called Latent Semantic Indexing (LSI); applied to word similarity, it is called Latent Semantic Analysis (LSA).)
4 ways of looking at the Truncated SVD ([cf.](http://www.jair.org/media/2934/live-2934-4846-jair.pdf)):
- Latent meaning: the truncated SVD creates a low-dimensional linear mapping between words (rows) and contexts (columns) which captures the hidden (latent) meaning in the words and contexts
- Noise reduction: the truncated SVD can be seen as a smoothed version of the original matrix (which captures the signal and leaves out the noise)
- A way to discover high-order co-occurrence: when 2 words appear in similar contexts
- Sparsity reduction: the original matrix is sparse, but the truncated SVD is dense. Sparsity may be viewed as a problem of insufficient data, and the truncated SVD as a way of simulating the missing text
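The noise-reduction and sparsity-reduction views can be seen directly by rebuilding the matrix from the truncated factors (toy counts, minimal NumPy sketch):

```python
# The rank-k reconstruction U_k S_k V_k^T is a dense, "smoothed" version
# of a sparse count matrix: cells that were 0 now hold small values.
import numpy as np

X = np.array([
    [2, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # rank-k smoothed matrix

print(np.count_nonzero(X))   # the original matrix is sparse
print(np.round(X_k, 2))      # the reconstruction has more nonzero cells
```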
[See also "Introduction to Information Retrieval" Manning 2008](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html)
Semantics with Dense Vectors(About) > We will introduce three methods of generating very dense, short vectors:
> 1. using dimensionality reduction methods like SVD,
> 2. using neural nets like the popular skip-gram or CBOW approaches.
> 3. a quite different approach based on neighboring words called Brown clustering.
Gensim tutorial: Similarity Queries(About) > "The thing to note here is that documents no. 2 would never be returned by a standard boolean fulltext search, because they do not share any common words with query string"
Latent semantic indexing ("Introduction to Information Retrieval" Manning 2008)(About) VSM: problems with synonymy and polysemy (e.g. synonyms are accorded separate dimensions)
Could we use the co-occurrences of terms to capture the latent semantic associations of terms and alleviate these problems?
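A small NumPy experiment suggesting the answer is yes: two terms that never co-occur, but share a co-occurring neighbour ("second-order co-occurrence"), end up close in the truncated space (toy counts):

```python
# "car" and "automobile" never appear in the same document, but both
# co-occur with "engine"; truncated SVD pulls their word vectors together.
import numpy as np

# rows = terms, columns = documents (invented counts)
X = np.array([
    [2, 0, 0, 0],   # car        (doc 0 only)
    [0, 2, 0, 0],   # automobile (doc 1 only)
    [1, 1, 0, 0],   # engine     (docs 0 and 1)
    [0, 0, 2, 1],   # banana
    [0, 0, 1, 2],   # fruit
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
words_k = U[:, :k] * s[:k]    # word vectors in the latent space

print(cosine(X[0], X[1]))               # 0.0 in the raw count space
print(cosine(words_k[0], words_k[1]))   # high in the latent space
```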
- computational cost of the SVD is significant; it is the biggest obstacle to the widespread adoption of LSI.
- One approach to this obstacle: build the LSI representation on a randomly sampled subset of the documents, following which the remaining documents are "folded in" (cf. Gensim tutorial "[Random Projection (used as an option to speed up LSI)](https://radimrehurek.com/gensim/models/rpmodel.html)")
- As we reduce k, recall tends to increase, as expected.
- **Most surprisingly**, a value of k in the low hundreds can actually increase precision. **This appears to suggest that for a suitable value of *k*, LSI addresses some of the challenges of synonymy**.
- LSI works best in applications where there is little overlap between queries and documents. (--??)
The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations.
LSI shares two basic drawbacks of vector space retrieval:
- no good way of expressing negations
- no way of enforcing Boolean conditions.
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
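One possible NumPy sketch of that soft-clustering reading (toy counts; taking absolute values before normalizing is just one convention, since LSI coordinates can be negative):

```python
# Interpret each latent dimension as a "cluster" and a document's
# normalized coordinates as its fractional cluster memberships.
import numpy as np

X = np.array([
    [2, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # one row per document

w = np.abs(docs_k)
memberships = w / w.sum(axis=1, keepdims=True)
print(np.round(memberships, 2))         # each row sums to 1
```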