Documents (Short List)
  • Latent semantic indexing ("Introduction to Information Retrieval" Manning 2008)
    VSM has problems with synonymy and polysemy (e.g., synonyms are accorded separate dimensions). Could we use the co-occurrences of terms to capture the latent semantic associations of terms and alleviate these problems?

    Concluding remarks:
    - The computational cost of the SVD is significant; it is the biggest obstacle to the widespread adoption of LSI.
    - One approach to this obstacle: build the LSI representation on a randomly sampled subset of the documents, following which the remaining documents are "folded in" (cf. the Gensim tutorial "Random Projection (used as an option to speed up LSI)").
    - As we reduce *k*, recall tends to increase, as expected.
    - **Most surprisingly**, a value of *k* in the low hundreds can actually increase precision. **This appears to suggest that for a suitable value of *k*, LSI addresses some of the challenges of synonymy.**
    - LSI works best in applications where there is little overlap between queries and documents. (??)
    - The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations.
    - LSI shares two basic drawbacks of vector space retrieval: no good way of expressing negations, and no way of enforcing Boolean conditions.
    - LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster, and the value that a document has on that dimension as its fractional membership in that cluster.
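    A minimal NumPy sketch of the ideas above, on a made-up toy term-document matrix (all data here is illustrative, not from the book): truncated SVD maps documents into a k-dimensional latent space, where two documents using different synonyms ("car" vs. "auto") become similar because of shared co-occurring terms, and a new query is "folded in" via q_k = Σ_k⁻¹ Uₖᵀ q.

    ```python
    import numpy as np

    # Toy term-document count matrix C (terms x documents), purely illustrative.
    # d0 and d1 are about vehicles but use different synonyms; d2 is about fruit.
    terms = ["car", "auto", "engine", "fruit", "apple"]
    C = np.array([
        [2, 0, 0],   # car    (only in d0)
        [0, 2, 0],   # auto   (only in d1)
        [1, 1, 0],   # engine (co-occurs with both synonyms)
        [0, 0, 2],   # fruit
        [0, 0, 1],   # apple
    ], dtype=float)

    # SVD: C = U @ diag(s) @ Vt; truncate to rank k for the latent space.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Document coordinates in the reduced space: columns of diag(s_k) @ Vt_k.
    docs_k = np.diag(s_k) @ Vt_k

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # In the raw space d0 and d1 share only "engine" (low similarity);
    # in the LSI space the co-occurrence pulls the synonyms together.
    print(cos(C[:, 0], C[:, 1]))          # raw cosine: 0.2
    print(cos(docs_k[:, 0], docs_k[:, 1]))  # LSI cosine: ~1.0

    # "Folding in" a new query q that contains only the term "car":
    # q_k = Sigma_k^{-1} @ U_k.T @ q, then compare against document
    # coordinates in V-space (columns of Vt_k).
    q = np.array([1, 0, 0, 0, 0], dtype=float)
    q_k = np.diag(1.0 / s_k) @ U_k.T @ q
    print(cos(q_k, Vt_k[:, 1]))  # query "car" vs. the "auto" document: ~1.0
    ```

    The last line illustrates the synonymy point: the query and document share no terms at all, yet they are nearly identical in the reduced space. The same Σ⁻¹Uᵀ projection is what makes folding in sampled-out documents cheap: no recomputation of the SVD is needed.
    
    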