Latent semantic indexing (notes on "Introduction to Information Retrieval", Manning et al. 2008)

The vector space model (VSM) has a problem with synonymy and polysemy: for example, synonyms are accorded separate, unrelated dimensions.
Could we use the co-occurrences of terms to capture the latent semantic associations of terms and alleviate these problems?
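LSI's answer is to take a low-rank (truncated) SVD of the term-document matrix, so that terms which co-occur with the same companions end up near each other in the reduced space. A minimal numpy sketch with a toy count matrix (the terms and counts below are made up for illustration, not taken from the book's experiments):

```python
import numpy as np

# Toy term-document count matrix C (rows = terms, columns = documents).
# Terms: "ship", "boat", "ocean", "voyage", "trip" -- values are illustrative.
C = np.array([
    [1, 0, 1, 0, 0, 0],   # ship
    [0, 1, 0, 0, 0, 0],   # boat
    [1, 1, 0, 0, 0, 0],   # ocean
    [1, 0, 0, 1, 1, 0],   # voyage
    [0, 0, 0, 1, 0, 1],   # trip
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Term representations in the latent space (rows of U_k * Sigma_k).
term_vecs = U_k * s_k

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "ship" and "boat" never co-occur in C, so their raw cosine similarity is 0.0,
# yet both co-occur with "ocean", so their latent vectors are typically close.
print(cos(C[0], C[1]))                    # ship vs boat in the raw space: 0.0
print(cos(term_vecs[0], term_vecs[1]))    # ship vs boat in the latent space
```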
- The computational cost of the SVD is significant; this is the biggest obstacle to the widespread adoption of LSI.
- One approach to this obstacle: build the LSI representation on a randomly sampled subset of the documents, after which the remaining documents are "folded in" (cf. the Gensim tutorial "[Random Projection (used as an option to speed up LSI)](https://radimrehurek.com/gensim/models/rpmodel.html)"); see the folding-in sketch after this list.
- As we reduce k, recall tends to increase, as expected.
- **Most surprisingly**, a value of k in the low hundreds can actually increase precision. **This appears to suggest that for a suitable value of *k*, LSI addresses some of the challenges of synonymy**.
- LSI works best in applications where there is little overlap between queries and documents (presumably because that is exactly the regime where matching on literal shared terms fails, so the latent dimensions have the most to add).
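A sketch of the folding-in step mentioned above, assuming a truncated SVD has already been computed from the sampled subset: a new document's raw term vector d is mapped into the existing latent space as d_k = Sigma_k^-1 U_k^T d, without recomputing the SVD. The matrix and the new document below are illustrative.

```python
import numpy as np

# Suppose the LSI space was built from a sampled subset of the collection:
# C_sample is its (toy) term-document count matrix, rows = terms.
C_sample = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(C_sample, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def fold_in(d):
    """Project a new document's raw term-count vector d into the existing
    k-dimensional LSI space without recomputing the SVD:
        d_k = Sigma_k^{-1} @ U_k.T @ d
    """
    return (U_k.T @ d) / s_k

# Hypothetical new document, expressed over the same five terms.
new_doc = np.array([0, 2, 1, 0, 0], dtype=float)
print(fold_in(new_doc))   # its coordinates in the latent space
```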
The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations.
LSI shares two basic drawbacks of vector space retrieval:
- no good way of expressing negations
- no way of enforcing Boolean conditions.
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
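One way to make that reading concrete (an illustrative normalisation, not something prescribed by the book): take each document's coordinates in the reduced space and rescale their absolute values to sum to 1, reading the result as fractional memberships over the k latent "clusters".

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); counts are made up.
C = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 1],
    [0, 1, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row per document, k latent "clusters"

# Soft-clustering reading: normalise each document's absolute weights so they
# sum to 1 and interpret them as fractional cluster memberships.
memberships = np.abs(doc_coords)
memberships /= memberships.sum(axis=1, keepdims=True)
print(memberships)   # each row sums to 1
```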