About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Aditi Chaudhary
- sl:arxiv_num : 2010.12566
- sl:arxiv_published : 2020-10-23T17:53:11Z
- sl:arxiv_summary : Pre-trained multilingual language models such as mBERT have shown immense gains for several natural language processing (NLP) tasks, especially in the zero-shot cross-lingual setting. Most, if not all, of these pre-trained models rely on the masked-language modeling (MLM) objective as the key language learning objective. The principle behind these approaches is that predicting the masked words with the help of the surrounding text helps learn potent contextualized representations. Despite the strong representation learning capability enabled by MLM, we demonstrate an inherent limitation of MLM for multilingual representation learning. In particular, by requiring the model to predict the language-specific token, the MLM objective disincentivizes learning a language-agnostic representation -- which is a key goal of multilingual pre-training. Therefore, to encourage better cross-lingual representation learning, we propose the DICT-MLM method. DICT-MLM works by incentivizing the model to predict not just the original masked word, but potentially any of its cross-lingual synonyms as well. Our empirical analysis on multiple downstream tasks spanning 30+ languages demonstrates the efficacy of the proposed approach and its ability to learn better multilingual representations.@en
- sl:arxiv_title : DICT-MLM: Improved Multilingual Pre-Training using Bilingual Dictionaries@en
- sl:arxiv_updated : 2020-10-23T17:53:11Z
- sl:bookmarkOf : https://arxiv.org/abs/2010.12566
- sl:creationDate : 2021-09-06
- sl:creationTime : 2021-09-06T18:27:44Z
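The abstract above describes the DICT-MLM objective only in prose, so here is a minimal sketch of one way such a loss could be realized. It assumes cross-lingual synonyms come from a bilingual dictionary keyed by token id, and it spreads a uniform soft target over the masked word and its synonyms; the names `dict_mlm_loss` and `synonym_table` are hypothetical, and the paper's exact formulation (masking scheme, weighting of synonyms) may differ.

```python
# A sketch of a DICT-MLM-style loss, assuming a bilingual dictionary
# that maps a token id to the ids of its cross-lingual synonyms.
# Hypothetical names; not the paper's reference implementation.
import torch
import torch.nn.functional as F


def dict_mlm_loss(logits, target_ids, synonym_table, vocab_size):
    """Soft cross-entropy over {original masked token} union {its synonyms}.

    logits:        (num_masked, vocab_size) model scores at masked positions.
    target_ids:    (num_masked,) original token ids that were masked out.
    synonym_table: dict[int, list[int]] from token id to cross-lingual
                   synonym ids (derived from a bilingual dictionary).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    losses = []
    for i, tok in enumerate(target_ids.tolist()):
        # Reward the model for predicting the original token OR any of its
        # dictionary synonyms: one simple realization is a uniform soft
        # target distribution over the valid set.
        valid = [tok] + synonym_table.get(tok, [])
        target = torch.zeros(vocab_size)
        target[valid] = 1.0 / len(valid)
        losses.append(-(target * log_probs[i]).sum())
    return torch.stack(losses).mean()


# Toy usage with made-up token ids, e.g. id 3 ("dog") with synonym id 5
# ("perro"); in practice logits would come from the masked LM head.
vocab_size = 10
logits = torch.randn(2, vocab_size)   # scores at 2 masked positions
targets = torch.tensor([3, 7])        # original (masked) token ids
synonyms = {3: [5], 7: [1, 2]}        # hypothetical dictionary entries
loss = dict_mlm_loss(logits, targets, synonyms, vocab_size)
```

Because the target mass is shared across all cross-lingual synonyms, the model is no longer pushed to prefer the language-specific surface form, which is the intuition behind the paper's claim that this encourages more language-agnostic representations.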