Effect of Non-linear Deep Architecture in Sequence Labeling > we show the close connection between CRF and “sequence model” neural nets, and present an empirical investigation to compare their performance on two sequence labeling tasks – Named Entity Recognition and Syntactic Chunking. Our results suggest that **non-linear models are highly effective in low-dimensional distributional spaces. Somewhat surprisingly, we find that a non-linear architecture offers no benefits in a high-dimensional discrete feature space**.
Representations for Language: From Word Embeddings to Sentence Meanings (2017) - YouTube [Slides](/doc/?uri=https%3A%2F%2Fnlp.stanford.edu%2Fmanning%2Ftalks%2FSimons-Institute-Manning-2017.pdf)
What's special about human language? It is the only hope for explainable intelligence.
Symbols are not just an invention of logic / classical AI.
Meaning: a solution via distributional similarity-based representations. One of the most successful ideas of modern NLP.
> You shall know a word by the company it keeps (JR Firth 1957)
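A minimal sketch of the distributional idea (toy corpus and window size invented for illustration): words that keep similar company end up with similar co-occurrence vectors.

```python
import numpy as np

# Toy corpus, invented for illustration
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Word-word co-occurrence counts within a +/-2 token window
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# "cat" and "dog" share contexts, so their rows are more similar
print(cosine(counts[idx["cat"]], counts[idx["dog"]]))
print(cosine(counts[idx["cat"]], counts[idx["mat"]]))
```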
The BiLSTM hegemony
Neural Bag of Words
> "Surprisingly effective for many tasks :-(" [cf "DAN", Deep Averaging Network, Iyyver et al.](/doc/?uri=http%3A%2F%2Fwww.cs.cornell.edu%2Fcourses%2Fcs5740%2F2016sp%2Fresources%2Fdans.pdf)
Christopher Manning - "Building Neural Network Models That Can Reason" (TCSDLS 2017-2018) - YouTube(About) Goal: to enhance DL systems with reasoning capabilities from the ground-up
- allowing them to perform transparent multi-step reasoning processes
- while retaining end-to-end differentiability and scalability to real-world problems
> I get the feeling that if we're going to make further progress in AI, we actually have to get back to some of these problems of knowledge representation and reasoning
- From ML to machine reasoning
- the CLEVR task
- Memory-Attention-Composition Networks
What is reasoning? (Bottou 2011)
- manipulating previously acquired knowledge in order to answer a question
- not necessarily achieved by making logical inference (e.g. algebraic manipulations of matrices)
- composition rules -> combination of operations to address new tasks
Latent semantic indexing ("Introduction to Information Retrieval" Manning 2008)(About) VSM : problem with synonymy and polysemy (eg. synonyms are accorded separate dimensions)
Could we use the co-occurrences of terms to capture the latent semantic associations of terms and alleviate these problems?
- the computational cost of the SVD is significant
- this is the biggest obstacle to the widespread adoption of LSI.
- One approach to this obstacle: build the LSI representation on a randomly sampled subset of the documents, following which the remaining documents are "folded in" (cf. the Gensim tutorial "[Random Projection (used as an option to speed up LSI)](https://radimrehurek.com/gensim/models/rpmodel.html)")
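As a hedged illustration (the corpus and parameters below are invented, and scikit-learn's randomized solver is just one way to keep the SVD cost manageable, in the same spirit as the speed-ups mentioned above):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented corpus: "car" and "automobile" never co-occur,
# but they appear in the same kinds of contexts.
docs = [
    "car engine wheels",
    "automobile engine wheels",
    "apple fruit orchard",
    "fruit orchard tree",
]

# Document-term count matrix
X = CountVectorizer().fit_transform(docs)

# LSI: project documents into a k-dimensional latent space
k = 2
lsi = TruncatedSVD(n_components=k, algorithm="randomized", random_state=0)
doc_vectors = lsi.fit_transform(X)   # shape: (n_docs, k)

# The two car/automobile documents land close together in the latent space
# even though their synonym terms occupy separate dimensions in the raw VSM.
print(np.round(doc_vectors, 2))
```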
- As we reduce k, recall tends to increase, as expected.
- **Most surprisingly**, a value of k in the low hundreds can actually increase precision. **This appears to suggest that for a suitable value of *k*, LSI addresses some of the challenges of synonymy**.
- LSI works best in applications where there is little overlap between queries and documents. (--??)
The experiments also documented some modes where LSI failed to match the effectiveness of more traditional indexes and score computations.
LSI shares two basic drawbacks of vector space retrieval:
- no good way of expressing negations
- no way of enforcing Boolean conditions.
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
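A rough formalization of this view, following the book's truncated-SVD notation (the membership reading is an interpretation, not a formula from the text):

$$
C \approx C_k = U_k \Sigma_k V_k^{\top}, \qquad \hat{d} = \Sigma_k^{-1} U_k^{\top}\, \vec{d}
$$

where $C$ is the term-document matrix, $\hat{d}$ is the $k$-dimensional representation of document $\vec{d}$, and the $i$-th coordinate of $\hat{d}$ can be read as that document's (unnormalized) degree of membership in latent "cluster" $i$.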