Semanlink - NLP@IBM

> **One-sentence Summary**: we suggest adding an unsupervised intermediate classification step, before finetunning and after pretraining BERT, and show it improves performance for data-constrained cases.

> for text classification cold start (when labeled
data is scarce), **add an intermediate unsupervised
classification task**, between the pretraining
and fine-tuning phases:
> perform clustering and
train the pre-trained model on predicting the
cluster labels.

> this additional
classification phase can significantly improve
performance, mainly for **topical classification**
tasks

> we use an efficient clustering technique,
that relies on simple Bag Of Words (BOW)
representations, to partition the unlabeled training
data into relatively homogeneous clusters of text
instances.
>
> Next, we treat these clusters as labeled
data for an intermediate text classification task, and
train the pre-trained model – with or without additional
MLM pretraining – with respect to this
multi-class problem, prior to the final fine-tuning
over the actual target-task labels

> The underlying
intuition is that inter-training the model
over a related text classification task would be more
beneficial compared to MLM inter-training, which
focuses on different textual entities, namely predicting
the identity of a single token.

2022-04-06 About

Leshem Choshen sur Twitter : "Labelled data is scarce, what can we do?..."

Tags:

2022-04-06 About

[2108.13934] Robust Retrieval Augmented Generation for Zero-shot Slot Filling

Tags:

2022-01-19 About

A Survey of Text Clustering Algorithms - C. C. Aggarwal (2012)

Tags:

2021-04-20 About

IBM Research addressing Enterprise NLP challenges in 2020

Tags:

2020-06-12 About

Matching Resumes to Jobs via Deep Siamese Network | Companion Proceedings of the The Web Conference 2018

Tags:

2020-02-10 About

Advancing Natural Language Processing (NLP) for Enterprise Domains

Tags:

2020-01-07 About

Project Debater - IBM Research AI

Tags:

2019-11-06 About

[1909.04120] Span Selection Pre-training for Question Answering

Tags:

> a **new pre-training task inspired by reading
comprehension** and an **effort to avoid encoding general knowledge in the transformer network itself**

Current transformer architectures store general knowledge -> large models, long pre-training time. Better to offload the requirement of general knowledge to a sparsely activated network.

"Span selection" as an additional auxiliary task: the query is a sentence drawn from a corpus
with a term replaced with a special token: [BLANK]. The term replaced by the blank is the answer term. The passage is
relevant as determined by a BM25 search, and answer-bearing (containing the answer
term). Unlike BERT’s cloze task, where the answer must be drawn from the model itself, the answer is found in a passage
using language understanding.

> **We hope to progress to a model of general purpose language modeling that uses an indexed long
term memory to retrieve world knowledge, rather than holding it in the densely activated transformer encoder layers.**

2019-09-18 About

Word Mover's Embedding: From Word2Vec to Document Embedding (2018)

Tags:

2018-11-10 About