About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Yu Gu
- sl:arxiv_num : 2007.15779
- sl:arxiv_published : 2020-07-31T00:04:15Z
- sl:arxiv_summary : Pretraining large neural language models, such as BERT, has led to impressive
gains on many natural language processing (NLP) tasks. However, most
pretraining efforts focus on general domain corpora, such as newswire and Web.
A prevailing assumption is that even domain-specific pretraining can benefit by
starting from general-domain language models. In this paper, we challenge this
assumption by showing that for domains with abundant unlabeled text, such as
biomedicine, pretraining language models from scratch results in substantial
gains over continual pretraining of general-domain language models. To
facilitate this investigation, we compile a comprehensive biomedical NLP
benchmark from publicly-available datasets. Our experiments show that
domain-specific pretraining serves as a solid foundation for a wide range of
biomedical NLP tasks, leading to new state-of-the-art results across the board.
Further, in conducting a thorough evaluation of modeling choices, both for
pretraining and task-specific fine-tuning, we discover that some common
practices are unnecessary with BERT models, such as using complex tagging
schemes in named entity recognition (NER). To help accelerate research in
biomedical NLP, we have released our state-of-the-art pretrained and
task-specific models for the community, and created a leaderboard featuring our
BLURB benchmark (short for Biomedical Language Understanding & Reasoning
Benchmark) at https://aka.ms/BLURB.@en
- sl:arxiv_title : Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing@en
- sl:arxiv_updated : 2021-02-11T19:13:59Z
- sl:bookmarkOf : https://arxiv.org/abs/2007.15779
- sl:creationDate : 2021-04-11
- sl:creationTime : 2021-04-11T16:38:59Z
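
The abstract notes that the authors released their pretrained biomedical models for the community. A minimal sketch of loading such a model with the Hugging Face transformers library, assuming the checkpoint is published under the identifier microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext (that name is an assumption, not stated in this record):

```python
# Minimal sketch: encode a biomedical sentence with the released pretrained model.
# The checkpoint identifier below is an assumption; it is not given in this record.
from transformers import AutoTokenizer, AutoModel

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("EGFR mutations predict response to gefitinib.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, num_tokens, hidden_size)
```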