Semanlink - [2302.01398] The unreasonable effectiveness of few-shot learning for machine translation

About This Document

sl:arxiv_author :
sl:arxiv_firstAuthor : Xavier Garcia
sl:arxiv_num : 2302.01398
sl:arxiv_published : 2023-02-02T20:19:46Z
sl:arxiv_summary : We demonstrate the potential of few-shot translation systems, trained with unpaired language data, for both high and low-resource language pairs. We show that with only 5 examples of high-quality translation data shown at inference, a transformer decoder-only model trained solely with self-supervised learning is able to match specialized supervised state-of-the-art models as well as more general commercial translation systems. In particular, we outperform the best performing system on the WMT'21 English - Chinese news translation task by only using five examples of English - Chinese parallel data at inference. Moreover, our approach in building these models does not necessitate joint multilingual training or back-translation, is conceptually simple and shows the potential to extend to the multilingual setting. Furthermore, the resulting models are two orders of magnitude smaller than state-of-the-art language models. We then analyze the factors which impact the performance of few-shot translation systems, and highlight that the quality of the few-shot demonstrations heavily determines the quality of the translations generated by our models. Finally, we show that the few-shot paradigm also provides a way to control certain attributes of the translation -- we show that we are able to control for regional varieties and formality using only five examples at inference, paving the way towards controllable machine translation systems.@en
sl:arxiv_title : The unreasonable effectiveness of few-shot learning for machine translation@en
sl:arxiv_updated : 2023-02-02T20:19:46Z
sl:bookmarkOf : https://arxiv.org/abs/2302.01398
sl:creationDate : 2023-02-07
sl:creationTime : 2023-02-07T18:49:52Z
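
The summary above describes showing five translation pairs to a decoder-only model at inference time. A minimal sketch of how such a prompt might be assembled follows. The `build_few_shot_prompt` helper, the `Language: text` layout, and the example pairs are illustrative assumptions, not the format used in the paper.

```python
def build_few_shot_prompt(demos, source, src_lang="English", tgt_lang="French"):
    """Concatenate demonstration pairs, then the sentence to translate.

    The "<language>: <text>" layout is a hypothetical choice for illustration.
    """
    blocks = [f"{src_lang}: {s}\n{tgt_lang}: {t}" for s, t in demos]
    # The final block leaves the target side empty for the model to complete.
    blocks.append(f"{src_lang}: {source}\n{tgt_lang}:")
    return "\n\n".join(blocks)


# Five demonstrations, matching the five-shot setting in the summary.
demos = [
    ("Good morning.", "Bonjour."),
    ("Thank you.", "Merci."),
    ("See you tomorrow.", "À demain."),
    ("Where is the station?", "Où est la gare ?"),
    ("I do not understand.", "Je ne comprends pas."),
]
prompt = build_few_shot_prompt(demos, "How much does this cost?")
print(prompt)
```

Since the summary stresses that demonstration quality heavily determines translation quality, the choice of `demos` matters more than the exact layout; swapping in pairs of a given formality or regional variety is how the same mechanism would be used for control.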

Documents with similar tags (experimental)

[2307.02486] LongNet: Scaling Transformers to 1,000,000,000 Tokens

2023-07-06 About

[2305.07185] MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

2023-07-01 About

[2306.07536] TART: A plug-and-play Transformer module for task-agnostic reasoning

2023-06-15 About

[2305.11778] Cross-Lingual Supervision improves Large Language Models Pre-training

2023-05-22 About

[2202.06991] Transformer Memory as a Differentiable Search Index

2022-10-25 About

[2209.11055] Efficient Few-Shot Learning Without Prompts

2022-09-23 About

[2208.01066] What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

2022-09-17 About

[2104.09224] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

2022-09-16 About

[2209.01975] Selective Annotation Makes Language Models Better Few-Shot Learners

2022-09-07 About

[2208.03299] Few-shot Learning with Retrieval Augmented Language Model

2022-08-08 About

[2205.03983] Building Machine Translation Systems for the Next Thousand Languages

2022-05-10 About

[2109.06270] STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

2022-04-14 About

[2004.07180] SPECTER: Document-level Representation Learning using Citation-informed Transformers

2022-01-29 About

[2110.06176] Mention Memory: incorporating textual knowledge into Transformers through entity mention attention

2021-10-13 About

[2010.02353] Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

2021-08-25 About

[2010.06467] Pretrained Transformers for Text Ranking: BERT and Beyond

2021-07-09 About

[2104.14690] Entailment as Few-Shot Learner

2021-05-03 About

[2012.15723] Making Pre-trained Language Models Better Few-shot Learners

2021-01-02 About

[2011.06993] FLERT: Document-Level Features for Named Entity Recognition

2020-12-01 About

[1909.01259] Neural Attentive Bag-of-Entities Model for Text Classification

A model that performs **text classification using entities in a knowledge base**.

> Entities provide unambiguous and relevant semantic signals that are beneficial for capturing semantics in texts. We combine **simple high-recall entity detection based on a dictionary** (word->list of entities), to detect entities in a document, with a novel neural **attention mechanism that enables the model to focus on a small number of unambiguous and relevant entities**.

2 steps:

1. Entity detection
2. Classification using the detected entities (+text) as inputs
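
The first step, high-recall dictionary lookup, can be sketched as follows. This is a minimal illustration assuming whitespace tokens, lowercase dictionary keys, and a small `max_len` n-gram window; the `detect_entities` helper and the toy dictionary are hypothetical, not the paper's implementation.

```python
def detect_entities(tokens, dictionary, max_len=3):
    """Return (start, end, entity) for every n-gram that is a dictionary key.

    `dictionary` maps a lowercase surface form to a list of candidate
    entities (word -> list of entities). All candidates are kept, which
    favors recall over precision.
    """
    mentions = []
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            name = " ".join(tokens[start:end]).lower()
            for entity in dictionary.get(name, []):
                mentions.append((start, end, entity))
    return mentions


# Toy dictionary: an ambiguous name maps to several entities.
dictionary = {
    "apple": ["Apple_Inc.", "Apple_(fruit)"],
    "steve jobs": ["Steve_Jobs"],
}
mentions = detect_entities("Steve Jobs founded Apple".split(), dictionary)
print(mentions)
# [(0, 2, 'Steve_Jobs'), (3, 4, 'Apple_Inc.'), (3, 4, 'Apple_(fruit)')]
```

Keeping both candidates for "Apple" is deliberate: disambiguation is left to the attention mechanism in the second step.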

Entity linking relies on a local model that uses the cosine similarity between the embedding of the target entity and the word-based representation of the document to capture the relevance of an entity given a document.

Embeddings from the KB are computed using [#Wikipedia2Vec](tag:wikipedia2vec), which places similar words and entities close to one another in a unified vector space.

Model using attention, with 2 features:

- cosine similarity between the embedding of the entity and the word-based representation of the document
- the probability that the entity name refers to the entity in the KB
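
The two features above can be combined into attention weights roughly as follows. This sketch assumes a linear score over the two features followed by a softmax; the weights `w` and bias `b` would be learned in a real model and are fixed, hypothetical values here, as are the toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def entity_attention(entity_vecs, doc_vec, link_probs, w=(1.0, 1.0), b=0.0):
    """Softmax over a linear score of (cosine similarity, link probability)."""
    scores = [
        w[0] * cosine(e, doc_vec) + w[1] * p + b
        for e, p in zip(entity_vecs, link_probs)
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def weighted_sum(entity_vecs, weights):
    """Attention-weighted sum of entity embeddings, usable as a classifier input."""
    dim = len(entity_vecs[0])
    return [sum(a * e[i] for a, e in zip(weights, entity_vecs)) for i in range(dim)]


entity_vecs = [[1.0, 0.0], [0.0, 1.0]]  # toy entity embeddings
doc_vec = [0.9, 0.1]                    # toy word-based document vector
link_probs = [0.8, 0.3]                 # P(name refers to entity), assumed given
weights = entity_attention(entity_vecs, doc_vec, link_probs)
doc_repr = weighted_sum(entity_vecs, weights)
```

The first entity is both closer to the document vector and a more likely referent, so it receives the larger weight, which is the "focus on a small number of unambiguous and relevant entities" behavior described in the abstract.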

Somewhat [related](doc:2020/01/investigating_entity_knowledge_)

### Conclusion:

> a neural network model that performs text classification using entities in Wikipedia. We combined simple dictionary-based entity detection with a neural attention mechanism to enable the model to focus on a small number of unambiguous and relevant entities in a document.

2020-09-02 About

[2004.07202] Entities as Experts: Sparse Memory Access with Entity Supervision

2020-07-11 About

[1909.03193] KG-BERT: BERT for Knowledge Graph Completion

2020-03-22 About

[2002.05867] Transformers as Soft Reasoners over Language

2020-02-17 About

[1911.05507] Compressive Transformers for Long-Range Sequence Modelling

2020-02-11 About

[1909.04120] Span Selection Pre-training for Question Answering

> a **new pre-training task inspired by reading
comprehension** and an **effort to avoid encoding general knowledge in the transformer network itself**

Current transformer architectures store general knowledge -> large models, long pre-training time. Better to offload the requirement of general knowledge to a sparsely activated network.

"Span selection" as an additional auxiliary task: the query is a sentence drawn from a corpus
with a term replaced with a special token: [BLANK]. The term replaced by the blank is the answer term. The passage is
relevant as determined by a BM25 search, and answer-bearing (containing the answer
term). Unlike BERT’s cloze task, where the answer must be drawn from the model itself, the answer is found in a passage
using language understanding.
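
Building one span-selection training instance, as described above, might look like this. The sketch assumes the passage has already been retrieved (the BM25 search is not shown) and uses exact string matching for the answer term; the `make_span_selection_instance` helper and its output layout are hypothetical.

```python
def make_span_selection_instance(sentence, answer, passage):
    """Blank the answer term in the sentence and locate it in the passage.

    Returns None when the passage is not answer-bearing, mirroring the
    requirement that the passage contain the answer term.
    """
    if answer not in sentence or answer not in passage:
        return None
    query = sentence.replace(answer, "[BLANK]", 1)
    start = passage.index(answer)
    return {
        "query": query,
        "passage": passage,
        "answer_span": (start, start + len(answer)),  # character offsets
    }


instance = make_span_selection_instance(
    sentence="The capital of France is Paris.",
    answer="Paris",
    passage="Paris has been the seat of French government for centuries.",
)
print(instance["query"])  # The capital of France is [BLANK].
```

The model is then trained to predict `answer_span` from the query and passage, so the answer comes from reading the passage rather than from knowledge stored in the network.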

> **We hope to progress to a model of general purpose language modeling that uses an indexed long
term memory to retrieve world knowledge, rather than holding it in the densely activated transformer encoder layers.**

2019-09-18 About

[1909.03186] On Extractive and Abstractive Neural Document Summarization with Transformer Language Models

2019-09-11 About

[1905.07129] ERNIE: Enhanced Language Representation with Informative Entities

2019-08-05 About

[1907.05242] Large Memory Layers with Product Keys

> **a structured memory which can be easily integrated into a neural network.** The memory is very large by design and therefore significantly increases the capacity of the architecture, by up to a billion parameters with a negligible computational overhead. Its design and access pattern is based on **product keys**, which enable fast and exact nearest neighbor search. The ability to increase the number of parameters while keeping the same computational budget lets the overall system strike a better trade-off between prediction accuracy and computation efficiency both at training and test time.

> a key-value memory layer that can increase model capacity for a negligible computational cost. A 12-layer transformer with a memory outperforms a 24-layer transformer, and is 2x faster!
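
A minimal sketch of the product-key lookup idea follows, assuming dot-product scoring and a query split into two halves. The full key set is the Cartesian product of two small sub-key sets, so a search over `n * n` keys needs only `2 * n` comparisons plus a `k * k` merge. The function names and toy vectors are illustrative, and the value lookup that follows in a real memory layer is not shown.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k(scores, k):
    """Return the k best (index, score) pairs."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


def product_key_search(query, subkeys1, subkeys2, k):
    """Top-k search over all concatenations of a key from each sub-key set.

    The score of key (i, j) is the sum of the two half scores, so the
    overall top-k is guaranteed to lie in the product of the per-half top-k.
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    best1 = top_k([dot(q1, c) for c in subkeys1], k)
    best2 = top_k([dot(q2, c) for c in subkeys2], k)
    candidates = [
        (s1 + s2, i * len(subkeys2) + j)  # flat index into the full key set
        for i, s1 in best1
        for j, s2 in best2
    ]
    return heapq.nlargest(k, candidates)


subkeys1 = [[1, 0], [0, 1], [-1, 0]]
subkeys2 = [[1, 1], [0, -1], [2, 0]]
query = [0.9, 0.2, 0.5, -0.4]
result = product_key_search(query, subkeys1, subkeys2, k=2)
print([index for _, index in result])  # [2, 1]
```

The returned flat indices would select the memory values to combine; because only `k * k` candidates are ever scored in full, the number of addressable slots can grow quadratically in the sub-key count while the search cost grows linearly.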

[Implementation](/doc/2019/08/product_key_memory_pkm_minima)

TODO: compare with [[2007.00849] Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge](doc:2020/07/2007_00849_facts_as_experts_)

2019-07-13 About

[1903.05823] Deep Patent Landscaping Model Using Transformer and Graph Embedding
