Semanlink - Latent semantic indexing ("Introduction to Information Retrieval" Manning 2008)

The vector space model (VSM) has problems with synonymy and polysemy (e.g. synonyms are accorded separate dimensions).

Could we use the co-occurrences of terms to capture the latent semantic associations of terms and alleviate these problems?
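A minimal sketch of this idea, assuming a made-up five-term, four-document count matrix and plain NumPy (neither comes from the book): "car" and "automobile" never appear in the same document, but both co-occur with "engine". Truncating the SVD to k = 2 dimensions places the two synonyms close together.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# d0 = {car, engine}, d1 = {automobile, engine}, d2 = {flower, petal}, d3 = {flower}
terms = ["car", "automobile", "engine", "flower", "petal"]
C = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Full SVD, then keep only the k largest singular values (the LSI step).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # each row: one term in the k-dimensional space

raw_sim = cosine(C[0], C[1])                  # car vs. automobile, original space
lsi_sim = cosine(term_vecs[0], term_vecs[1])  # car vs. automobile, reduced space
print(f"raw: {raw_sim:.3f}  LSI (k={k}): {lsi_sim:.3f}")
```

In the original space the two terms are orthogonal because they share no document. In the reduced space their similarity is close to 1, since both load on the dimension they share with "engine".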

Concluding remarks:

- The computational cost of the SVD is significant.
    - This is the biggest obstacle to the widespread adoption of LSI.
    - One approach to this obstacle: build the LSI representation on a randomly sampled subset of the documents, following which the remaining documents are "folded in" (cf. the Gensim tutorial "[Random Projection (used as an option to speed up LSI)](https://radimrehurek.com/gensim/models/rpmodel.html)").
- As we reduce k, recall tends to increase, as expected.
- **Most surprisingly**, a value of k in the low hundreds can actually increase precision on some query benchmarks. **This appears to suggest that for a suitable value of *k*, LSI addresses some of the challenges of synonymy**.
- LSI works best in applications where there is little overlap between queries and documents. (--??)
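The sampling-and-folding-in approach mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the book's code: the matrix is random, the sizes and `k` are arbitrary, and `fold_in` is a hypothetical helper name. It uses the standard projection $\Sigma_k^{-1} U_k^T d$ to map a document's term vector into the reduced space.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((6, 10))  # hypothetical 6-term x 10-document matrix

# Run the (expensive) SVD only on a random sample of the documents.
sample = rng.choice(C.shape[1], size=5, replace=False)
rest = np.setdiff1d(np.arange(C.shape[1]), sample)

U, s, Vt = np.linalg.svd(C[:, sample], full_matrices=False)
k = 3
U_k, s_k = U[:, :k], s[:k]

def fold_in(docs):
    """Map term vectors (a terms x n_docs matrix) into the k-dim LSI space.

    Computes Sigma_k^{-1} U_k^T d for each column d, without recomputing the SVD.
    """
    return (U_k.T @ docs) / s_k[:, None]

sampled_coords = Vt[:k]             # coordinates of the sampled documents
folded_coords = fold_in(C[:, rest]) # coordinates of the remaining documents
```

Folding in a document that was part of the sample reproduces its column of `Vt[:k]`, which is a quick sanity check. Folded-in documents do not influence the latent dimensions themselves, so the representation reflects only the sampled subset.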

The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations.

LSI shares two basic drawbacks of vector space retrieval:
    
- no good way of expressing negations
- no way of enforcing Boolean conditions.

LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
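A rough sketch of this soft-clustering reading, on a made-up toy matrix. One assumption is mine rather than the book's: LSI coordinates can be negative, so the code takes absolute values and normalizes each document's row to sum to 1 to get something that reads as fractional membership.

```python
import numpy as np

# Hypothetical toy matrix: rows are terms, columns are documents d0..d3.
# d0 and d1 are about vehicles, d2 and d3 are about flowers.
C = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
doc_coords = Vt[:k].T  # shape (n_docs, k): each document in the reduced space

# Assumption: absolute value + row normalization as a stand-in for
# "fractional membership" of each document in each of the k clusters.
weights = np.abs(doc_coords)
membership = weights / weights.sum(axis=1, keepdims=True)
dominant = membership.argmax(axis=1)  # cluster with the largest membership

for j, row in enumerate(membership):
    print(f"d{j}: membership={np.round(row, 3)}  dominant cluster={dominant[j]}")
```

On this block-structured toy data the memberships are nearly all-or-nothing: the two vehicle documents share one dimension and the two flower documents share the other. Real collections would give genuinely fractional values.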

About This Document

sl:creationDate : 2017-07-19
sl:creationTime : 2017-07-19T09:54:04Z

File info

Bookmark of: https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html