Wikipedia Latent Semantic Analysis
Technique for analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. LSI transforms documents from either bag-of-words or (preferably) TF-IDF-weighted space into a latent space of lower dimensionality.

A matrix of word counts, with one row per word and one column per paragraph, is constructed from a large piece of text. [Singular value decomposition (SVD)](singular_value_decomposition) is used to reduce the number of rows while preserving the similarity structure among columns. Similarities between words and/or documents can then be evaluated using cosine similarity in the low-dimensional space.

- pros:
  - alleviates the problem of **synonymy** (note: Wikipedia contradicts itself regarding polysemy; I would say that LSI cannot solve that problem)
  - can output topics in a **ranked order**
- cons:
  - **requires a num_topics parameter**
  - dimensions have no easily interpretable meaning in natural language
  - SVD is computationally intensive (still a problem with improved algorithms?)
  - Wikipedia says that the probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA

[Gensim tutorial about transformations]( says that "LSI training is unique in that it can continue at any point, simply by providing more training documents."

(LSI or LSA? Truncated SVD applied to document similarity is called Latent Semantic Indexing (LSI), but it is called Latent Semantic Analysis (LSA) when applied to word similarity.)
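To make the pipeline above concrete (count matrix, TF-IDF weighting, truncated SVD, cosine similarity), here is a minimal sketch with NumPy. The four-document corpus is made up for illustration, the TF-IDF formula is only one common variant, and a dense `np.linalg.svd` stands in for whatever incremental algorithm a library such as Gensim actually uses.

```python
import numpy as np

# Hypothetical toy corpus: docs 0-1 are about cars, docs 2-3 about cooking.
# "car" and "automobile" never appear in the same document.
docs = [
    "car engine road",
    "automobile engine road",
    "recipe oven flour",
    "recipe oven sugar",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: one row per word, one column per document.
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# TF-IDF weighting (assumed variant: tf * log(N / df)).
df = (counts > 0).sum(axis=1)
idf = np.log(len(docs) / df)
tfidf = counts * idf[:, None]

# Full SVD, then keep only the k largest singular values (num_topics = k).
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]    # one k-dimensional vector per word
doc_vecs = Vt[:k, :].T * s[:k]  # one k-dimensional vector per document


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Similarity of the two synonyms before and after the projection.
raw_sim = cosine(tfidf[index["car"]], tfidf[index["automobile"]])
latent_sim = cosine(term_vecs[index["car"]], term_vecs[index["automobile"]])
print("car vs automobile, raw space:   ", raw_sim)
print("car vs automobile, latent space:", latent_sim)
print("doc 0 vs doc 1, latent space:   ", cosine(doc_vecs[0], doc_vecs[1]))
print("doc 0 vs doc 2, latent space:   ", cosine(doc_vecs[0], doc_vecs[2]))
```

In the raw space the two synonyms have zero similarity because they never share a document. In the 2-dimensional latent space they end up pointing in the same direction because they share the contexts "engine" and "road", which is the synonymy effect listed under pros.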
4 ways of looking at the truncated SVD ([cf.]( :

- Latent meaning: the truncated SVD creates a low-dimensional linear mapping between words (rows) and contexts (columns) that captures the hidden (latent) meaning in the words and contexts
- Noise reduction: the truncated SVD can be seen as a smoothed version of the original matrix, which captures the signal and leaves out the noise
- High-order co-occurrence: the truncated SVD is a way to discover that 2 words appear in similar contexts
- Sparsity reduction: the original matrix is sparse, but the truncated SVD is dense. Sparsity may be viewed as a problem of insufficient data, and the truncated SVD as a way of simulating the missing text

[See also "Introduction to Information Retrieval" Manning 2008](
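The smoothing, co-occurrence and sparsity views above can be checked on a small example. The sketch below uses a made-up 6x4 word-context matrix (all names and counts are hypothetical) and NumPy's dense SVD to build the rank-k approximation, then compares it with the original.

```python
import numpy as np

# Hypothetical word-context counts: rows are words, columns are contexts.
terms = ["car", "automobile", "engine", "road", "oven", "recipe"]
X = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [1, 1, 0, 0],  # road
    [0, 0, 1, 1],  # oven
    [0, 0, 1, 1],  # recipe
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation: keep the k largest singular values, drop the rest.
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Noise reduction view: the Frobenius error of the approximation equals the
# norm of the discarded singular values.
error = np.linalg.norm(X - X_k)
discarded = np.sqrt((s[k:] ** 2).sum())

# Sparsity reduction view: count entries that are (numerically) non-zero.
nonzero_before = int((X != 0).sum())
nonzero_after = int((np.abs(X_k) > 1e-9).sum())

# High-order co-occurrence view: "automobile" never occurs in context 0,
# yet it receives a positive weight there after smoothing.
auto_in_ctx0 = X_k[terms.index("automobile"), 0]

print("approximation error:", error, "discarded singular mass:", discarded)
print("non-zero entries before/after:", nonzero_before, nonzero_after)
print("automobile in context 0 after smoothing:", auto_in_ctx0)
```

The zero for "automobile" in context 0 gets filled in because "automobile" shares "engine" and "road" with "car". On this block-structured toy matrix only some zeros are filled; on real data the rank-k matrix is typically fully dense.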
6 Documents (Long List