AI@Google

LlamaIndex 🦙 on X: "Fine-tuning Embedding Models for RAG with LoRA"

2024-04-23T23:20:13Z

[2404.11018] Many-Shot In-Context Learning

2024-04-21T13:25:46Z

[2307.15936] A Theory for Emergence of Complex Skills in Language Models

2024-02-24T00:11:29Z

[New Theory Suggests Chatbots Can Understand Text | Quanta Magazine](doc:2024/02/new_theory_suggests_chatbots_ca)

Jeff Dean (@🏡) on X: "Gemini 1.5 Pro - A highly capable multimodal model with a 10M token context length..."

2024-02-15T22:26:23Z

An efficient long-text semantic retrieval approach via utilizing presentation learning on short-text | Complex & Intelligent Systems (2023)

2024-01-31T17:59:13Z

long-text retrieval model based on BERT (called LTR-BERT)

Rachit Bansal on X: "An LLM can be efficiently composed with specialized (L)LMs to enable new tasks"

2024-01-06T12:07:15Z

[[2401.02412] LLM Augmented LLMs: Expanding Capabilities through Composition](doc:2024/01/2401_02412_llm_augmented_llms)

> CALM (Composition to Augment Language Models):
> 1. Scales up LLMs on new tasks by *re-using* existing (L)LMs w/ very few new parameters & data,
> 2. Keeps existing model weights intact, hence **preserves original capabilities**,
> 3. Applies to diverse domains and settings.

> Rather than a shallow combination, CALM introduces a small set of cross-attention parameters over models’ layer representations.

Use-case example, Multilinguality:

> We reuse an LM trained on a bunch of low-resource languages (LRLs) w/ an LLM that has never seen some of these LRLs.

Maarten Grootendorst on X: "BERTopic + LLMs + DataMapPlot"

2024-01-06T09:57:10Z

UKP Lab on X: "a lightweight solution for few-shot domain-specific sentence classification: AdaSent!..."

2023-12-09T19:40:21Z

AdaSent is an approach to creating domain-specialized sentence encoders for few-shot sentence classification > Reusable general sentence adapter across domains > AdaSent decouples DAPT (Domain-Adaptive Pre-Training) & SEPT (Sentence Embedding Pre-Training) **by storing the sentence encoding abilities into an adapter**, which is trained only once in the general domain and plugged into various DAPT-ed PLMs [Github](https://github.com/UKPLab/AdaSent)

Rethinking Query Expansion for BERT Reranking | Advances in Information Retrieval (2020)

2023-10-29T09:05:11Z

using BERT for Information Retrieval: > We find that traditional word-based query expansion is not entirely applicable

Maarten Grootendorst on X: "Introducing KeyLLM. An extension to KeyBERT that can create, extract, and fine-tune keywords using Large Language Models!"

2023-09-30T14:26:24Z

Getting started with DeepMatcher.ipynb - Colaboratory

2023-09-20T08:37:26Z

[2002.06275] TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval

2023-08-27T11:40:00Z

Modular and Parameter-Efficient Fine-Tuning for NLP Models

2023-08-08T09:16:37Z

SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval

2023-07-26T23:36:33Z

retrieval model that learns sparse lexical representations with contextual embeddings > we **combine the strengths of both the sparse and dense representations** for first-stage retrieval. > > Compared with [SPLADE](tag:splade), our model leverages the contextual embeddings to improve model expressiveness. Compared with [ColBERT](tag:colbert), our sparse representations are trained end-to-end to optimize both efficiency and effectiveness.

[2305.14128] Dr.ICL: Demonstration-Retrieved In-context Learning

2023-07-14T12:25:23Z

> While early studies primarily used a fixed or random set of demonstrations for all test queries, recent research suggests that retrieving semantically similar demonstrations to the input from a pool of available demonstrations results in better performance. This work expands the applicability of retrieval-based ICL approaches by demonstrating that even simple word-overlap similarity measures such as BM25 outperform randomly selected demonstrations.
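
The word-overlap retrieval described above can be sketched as follows. This is a minimal illustration and not the paper's code: the Okapi BM25 parameters (`k1`, `b`), the whitespace tokenizer, the toy demonstration pool, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve_demonstrations(test_input, pool, k=2):
    """Return the k (input, label) pairs whose inputs best match test_input."""
    docs = [inp.lower().split() for inp, _ in pool]
    scores = bm25_scores(test_input.lower().split(), docs)
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]

# Hypothetical demonstration pool.
pool = [
    ("the movie was great fun", "positive"),
    ("the soup was cold and bland", "negative"),
    ("a great movie with a weak ending", "mixed"),
    ("the train arrived late", "negative"),
]
demos = retrieve_demonstrations("was the movie great", pool, k=2)
# The retrieved pairs become the in-context examples of the prompt.
prompt = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
```

The retrieved demonstrations replace a fixed or random set, so the prompt changes with every test query.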

Generative AI support on Vertex AI generally available | Google Cloud Blog

2023-06-09T08:21:29Z

Daniel Daza on Twitter: "BioBLP, a method for learning embeddings on multimodal knowledge graphs...."

2023-06-07T23:35:23Z

[2305.11778] Cross-Lingual Supervision improves Large Language Models Pre-training

2023-05-22T08:13:33Z

> We demonstrate that pre-training Large Language Models on a mixture of a self-supervised Language Modeling objective and the supervised Machine Translation objective, therefore including cross-lingual parallel data during pre-training, yields models with better in-context learning abilities.

Peter J. Liu on Twitter: "RLHF-alternative without RL"

2023-05-18T09:53:46Z

> TL;DR: Works as well as RLHF, but a lot simpler. About as easy and efficient as fine-tuning. Much better than simply fine-tuning on good examples.

[2305.06897] AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

2023-05-15T15:51:16Z

[Twitter](https://twitter.com/j___y_t/status/1657392003666128896)

Google AI PaLM 2 – Google AI

2023-05-15T09:11:10Z

Google teases Project Tailwind — a prototype AI notebook that learns from your documents - The Verge

2023-05-14T10:43:45Z

skeskinen/bert.cpp: ggml implementation of BERT

2023-05-09T00:29:27Z

> ggml inference of BERT neural net architecture with pooling and normalization from SentenceTransformers (sbert.net). High quality sentence embeddings in pure C++ (with C API). > > The main goal of bert.cpp is to run the BERT model using **4-bit integer quantization on CPU**
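
To make "4-bit integer quantization" concrete, here is a minimal sketch of block-wise symmetric quantization. It illustrates the general idea only and does not reproduce ggml's actual storage format; the block size, the rounding scheme, and the function names are assumptions.

```python
def quantize_4bit(weights, block_size=32):
    """Quantize floats to signed 4-bit integers in [-8, 7], one scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to the integer 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        q = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, q))
    return blocks

def dequantize_4bit(blocks):
    """Recover approximate floats from (scale, integers) blocks."""
    return [scale * q for scale, qs in blocks for q in qs]

weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
blocks = quantize_4bit(weights, block_size=4)
restored = dequantize_4bit(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight then needs 4 bits plus a share of its block's scale, and the rounding error per weight is at most half a scale step.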

Document AI | Google for Developers - Software Development Guides, Tools & More | Google Developers

2023-05-09T00:07:28Z

Niels Rogge on Twitter: "Made some new demo notebooks! - fine-tune @MetaAI's SAM and @GoogleAI's Pix2Struct on custom data"

2023-05-09T00:00:25Z

Google "We Have No Moat, And Neither Does OpenAI"

2023-05-04T21:46:16Z

> low-cost public involvement was enabled by a vastly cheaper mechanism for fine tuning called low rank adaptation ([LoRA](tag:lora))

> **Part of what makes LoRA so effective is that ... it’s stackable.**
>
> By contrast, training giant models from scratch not only throws away the pretraining, but also any iterative improvements that have been made on top.

> LoRA updates are very cheap to produce (~$100) for the most popular model sizes.

> Many of these projects are saving time by training on small, highly curated datasets...

> These datasets are built using synthetic methods (e.g. filtering the best responses from an existing model) and scavenging from other projects

> Directly Competing With Open Source Is a Losing Proposition

> Paradoxically, the one clear winner in all of this is Meta. Because the leaked model was theirs ([LLaMA](tag:llama)), they have effectively garnered an entire planet's worth of free labor. Since most open source innovation is happening on top of their architecture, there is nothing stopping them from directly incorporating it into their products.
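
One way to read "stackable" is that each LoRA update is a low-rank matrix product added to a frozen weight, so several updates can be summed on top of the same base. The sketch below assumes that reading; the matrices, the rank, and the helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def apply_lora(weight, adapters):
    """Return weight + sum of B @ A over (B, A) adapter pairs; weight is not modified."""
    merged = weight
    for b, a in adapters:
        merged = add(merged, matmul(b, a))
    return merged

# Frozen 3x3 base weight (identity for readability).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Two rank-1 adapters: B is 3x1 and A is 1x3, so each stores 6 numbers instead of 9.
adapter_1 = ([[1.0], [0.0], [0.0]], [[0.0, 0.5, 0.0]])
adapter_2 = ([[0.0], [0.0], [1.0]], [[0.25, 0.0, 0.0]])
W_merged = apply_lora(W, [adapter_1, adapter_2])
```

The saving is negligible at this toy size but grows quickly with the matrix dimension, since a rank-r adapter for a d x d weight stores 2·d·r numbers instead of d².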

Aran Komatsuzaki on Twitter: "JaxPruner: A concise library for sparsity research. An open-source JAX-based pruning and sparse training library for machine learning research repo"

2023-04-28T07:58:57Z

[2303.16839] MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

2023-04-25T00:33:41Z

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, the common knowledge has it that the two most popular multimodal tasks, the generative and contrastive tasks, tend to conflict with one another, are hard to accommodate in one architecture, and further need complex adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks. Related work: [CLIP: Connecting Text and Images](doc:2021/01/clip_connecting_text_and_images)

Domain Adaptation with Generative Pseudo-Labeling (GPL) | Pinecone

2023-04-09T10:30:34Z

Classifying long textual documents (up to 25 000 tokens) using BERT | by Sinequa | (2020)

2023-04-07T11:37:12Z

> long text + additional textual metadata (such as title, abstract …) and categories (location, authors …).

Diffusion language models – Sander Dieleman

2023-04-06T08:23:59Z

> Diffusion models have completely taken over generative modelling of perceptual signals -- why is autoregression still the name of the game for language modelling? And can we do anything about that?

Daniel Vila Suero on Twitter: "Data quality is key for LLMs, but we're building Open Source LLMs with data of "unknown" quality... Introducing Alpaca GarbageCollector..."

2023-04-05T18:37:29Z

> a cross-lingual SetFit model to identify potential bad instructions in Alpaca-like datasets

[2304.01982] Rethinking the Role of Token Retrieval in Multi-Vector Retrieval

2023-04-05T08:33:18Z

> Multi-vector retrievers like [ColBERT](tag:colbert) are powerful, but they come at the cost of complicated inference. In this paper, we ask: "can token retrieval alone achieve great performance in multi-vector retrieval?"

[tweet](https://twitter.com/leejnhk/status/1643632578824396805?s=20)

> The key insight of XTR is that the token-retrieval in multi-vector models should be **trained to retrieve the most salient and informative document tokens**, so that the score between a query and document can be computed using only the retrieved information, just like how single-vector retrieval models work

> This is an *amazing* way to re-engineer the scoring mechanism of late interaction / ColBERT retrievers!

[src: ColBERT's author Omar Khattab](https://twitter.com/lateinteraction/status/1643439889902637056?s=20)

- scoring using only retrieved document terms
- imputing missing token scores using their upper bound
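
The two scoring points above can be sketched as follows. This is a simplified reading and not the authors' implementation: for each query token it assumes a list of retrieved `(doc_id, similarity)` hits, and it imputes a missing similarity with the weakest retrieved hit, since any document token that was not retrieved can score no higher than that. All names and numbers are hypothetical.

```python
def xtr_scores(retrieved, num_docs):
    """retrieved[i] is the top-k list of (doc_id, similarity) for query token i."""
    scores = [0.0] * num_docs
    for hits in retrieved:
        # A token that was not retrieved scores at most the weakest retrieved hit.
        upper_bound = min(sim for _, sim in hits)
        best = {}
        for doc_id, sim in hits:
            best[doc_id] = max(sim, best.get(doc_id, sim))
        for doc_id in range(num_docs):
            # Use the best retrieved similarity, or the imputed upper bound.
            scores[doc_id] += best.get(doc_id, upper_bound)
    return [s / len(retrieved) for s in scores]

# Two query tokens, three candidate documents, top-3 token retrieval.
retrieved = [
    [(0, 0.9), (1, 0.7), (0, 0.6)],
    [(1, 0.8), (2, 0.5), (1, 0.4)],
]
scores = xtr_scores(retrieved, num_docs=3)
```

No document token embeddings are loaded after retrieval, which is where the simpler inference comes from.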

Niels Rogge on Twitter: "@GoogleAI's Pix2Struct now available in 🤗 Transformers!"

2023-03-27T23:15:25Z

> A Transformer (vision encoder, language decoder). No OCR involved! Pre-trained in a self-supervised fashion by predicting HTML based on masked portions of web page images. > Pix2Struct has been fine tuned on a variety of tasks and datasets, ranging from image captioning, visual question answering (VQA) over different inputs (books, charts, science diagrams), captioning UI components etc. ... We therefore advise you to use these models for the tasks they have been fine tuned on. > very similar to GPT-4's visual abilities, but open-source ;)

Enabling Python VirtualEnv in JupyterLab | My Shitty Code

2023-03-08T13:59:47Z

[2112.05682] Self-attention Does Not Need O(n^2) Memory

2023-02-27T12:58:02Z

[2108.08877] Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

2023-02-17T18:20:47Z

Maarten Grootendorst on Twitter: "The v0.14 release of BERTopic is here. Fine-tune your topic keywords and labels with models from @OpenAI, @huggingface, @CohereAI, @spacy_io, and @LangChainAI... An overview thread"

2023-02-15T13:56:16Z

Jim Fan on Twitter: "Do you know that DeepMind has actually open-sourced the heart of AlphaGo & AlphaZero?..."

2023-02-15T10:20:43Z

Guiding Frozen Language Models with Learned Soft Prompts – Google AI Blog

2023-02-14T10:42:51Z

[2203.14465] STaR: Bootstrapping Reasoning With Reasoning

2023-02-07T16:40:38Z

"Self-Taught Reasoner" (STaR) > (to our knowledge) the first technique to allow a pre-trained large language model to iteratively use its language modeling capacity to improve itself > Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose **a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales**, to bootstrap the ability to perform successively more complex reasoning.

Google announces ChatGPT rival Bard, with wider availability in ‘coming weeks’ - The Verge

2023-02-07T08:03:58Z

Ramsri Goutham Golla on Twitter: "The most practical open-source competitor to @OpenAI's GPT-3 is Google's Flan-T5. Here are 5 Flan-T5 resources to try out easily, deploy, or fine-tune it! 🧵"

2023-02-04T02:04:59Z

The Flan Collection: Advancing open source methods for instruction tuning – Google AI Blog

2023-02-02T09:14:36Z

> The ability to reason on new tasks is mostly credited to training models on a wide variety of unique instructions, known as “instruction tuning”, which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT.

Shayne Longpre on Twitter: "What’s the best completely public competitor to #ChatGPT? Flan-T5 beats all public models we tested..."

2023-02-01T18:29:11Z

> It's promising these results don't use any [#RLHF](tag:reinforcement_learning_from_human_feedback) data, or human "alignment", which is expensive to collect and less publicly available. > Key takeaway: finetuning Flan-T5 is better and more compute-efficient than finetuning T5.[src](https://twitter.com/_jasonwei/status/1620864198262804481?s=20&t=hMXLCdqcOFAEbjsfwc_yog)

Create a Vertex AI JupyterLab notebook | Google Cloud

2023-01-31T01:54:30Z

5 steps to go from a notebook to a deployed model — The TensorFlow Blog

2023-01-31T01:45:52Z

> how to get from notebook experimentation to deployment in the cloud

**Notebook execution feature**: run the notebook cell by cell on the Vertex AI managed training service. When you launch the training job, it runs on a machine you won’t have access to after the job completes, so the outputs have to be saved to a bucket.

Launch the execution: select the Execute button, give your execution a name, **then add a GPU**.

> Now you know how to quickly launch serverless training jobs on Google Cloud

- **Deploy to an endpoint**
- or use the **batch prediction feature** (if your use case does not require low latency predictions)

Get predictions:

> Now that this model is deployed to an endpoint, you can hit it like any other REST endpoint

LaMDA: our breakthrough conversation technology

2023-01-28T15:20:18Z

An empirical analysis of compute-optimal large language model training

2023-01-26T23:33:11Z

> the current large language models are far too large for their compute budget and are not being trained on enough data.
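
A back-of-the-envelope version of this claim, using two approximations that are commonly quoted alongside the paper but do not appear in the note above: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. Both constants are assumptions of this sketch, not exact results.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens), assuming C = 6 * N * D and D = r * N."""
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params

# Budget of a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
params, tokens = compute_optimal(budget)

# The same budget spent on a model four times larger leaves far fewer tokens.
oversized_params = 280e9
oversized_tokens = budget / (6.0 * oversized_params)
```

Under these assumptions, quadrupling the parameter count at fixed compute cuts the training data to a quarter, which is the sense in which a model can be "too large for its compute budget".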

Characterizing Emergent Phenomena in Large Language Models – Google AI Blog

2023-01-26T09:28:43Z

[Tweet](https://twitter.com/_jasonwei/status/1618331876623523844?s=20&t=sMbTCnu16Od8vGBmo0x6ig) > unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.

[2301.08210] Everything is Connected: Graph Neural Networks

2023-01-21T14:01:42Z

> **it is likely that the very cognition processes driving our reasoning and decision-making are, in some sense, graph-structured.** That is, paraphrasing a quote from Forrester (1971), nobody really imagines in their head all the information known to them; rather, they imagine only selected concepts, and relationships between them, and use those to represent the real system. (yep, that's why I made semanlink) > Transformers are themselves a special case of GNNs

Multilingual Sentence Transformers | Pinecone

2023-01-13T01:45:12Z

Focus on **Multilingual Knowledge Distillation** > recent method introduced by Nils Reimers and Iryna Gurevych in 2020 > The teacher model is an already fine-tuned sentence transformer used for creating embeddings in a single language (most likely English). The student model is a transformer that has been pretrained on a multilingual corpus.
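
The teacher/student setup can be written as a loss over parallel sentence pairs: the student should reproduce the teacher's embedding of the source sentence both for the source sentence itself and for its translation. The sketch below assumes a mean-squared-error objective and uses made-up 3-dimensional vectors in place of real model outputs.

```python
def mse(u, v):
    """Mean squared error between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def distillation_loss(teacher_src, student_src, student_tgt):
    """Pull the student's source and translation embeddings toward the teacher's."""
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)

# Hypothetical embeddings for one (English, German) sentence pair.
teacher_en = [0.2, -0.4, 0.9]   # frozen teacher, English sentence
student_en = [0.1, -0.4, 0.7]   # student, same English sentence
student_de = [0.2, -0.1, 0.9]   # student, German translation
loss = distillation_loss(teacher_en, student_en, student_de)
```

Minimizing this over many translation pairs maps translated sentences to the same point in the teacher's vector space.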

AlphaFold’s new rival? Meta AI predicts shape of 600 million proteins

2023-01-11T19:26:06Z

Andrej Karpathy on Twitter: "Great post (5mo ago) "chinchilla's wild implications" giving context to LLM goldrush shifting from model size to dataset size..."

2023-01-05T00:53:48Z

Rohan Anil on Twitter: "Next big jump with Neural Network performance is going to happen when community embraces non-uniformity"

2022-12-18T10:02:21Z

ValueError "invalid literal for int() with base 10" in trainer.evaluate (dataset created from pandas) · Issue #228 · huggingface/setfit

2022-12-13T11:46:14Z

see > Note: some datasets on the Hugging Face Hub don't have a ClassLabel feature for the label column. In these cases, you should compute the candidate labels manually by first computing the id2label mapping as follows:

Few-Shot Text Classification (Cloudera 2020)

2022-11-24T14:16:39Z

> Sentence-BERT has been optimized… well, for sentences! It’s reasonable to suspect that SBERT’s representations of single words or short phrases like “Business” or “Science & Technology” won’t be as semantically relevant as representations derived from a word-level method, like word2vec or GloVe

One of the Biggest Problems in Biology Has Finally Been Solved - Scientific American

2022-11-01T09:45:44Z

Google DeepMind CEO Demis Hassabis explains how its AlphaFold AI program predicted the 3-D structure of every known protein

[2202.06991] Transformer Memory as a Differentiable Search Index

2022-10-25T00:04:06Z

> In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process.

Tutorial on Uncertainty Estimation for NLP

2022-10-18T15:02:39Z

Stephanie Chan on Twitter: "Transformer inductive biases..."

2022-10-14T15:49:40Z

> Transformers generalize differently from information stored in:
>
> - weights - mostly "rule-based"
> - context - mostly "exemplar-based"
>
> This effect depends on (a) the training data (b) the size of the transformer

Lewis Tunstall on Twitter: "The SetFit library for few-shot learning with Sentence Transformers now supports multi-label text classification..."

2022-10-14T15:24:53Z

Multilabel support [github issue](https://github.com/huggingface/setfit/issues/65)

huggingface/setfit: Efficient few-shot learning with Sentence Transformers

2022-10-12T23:41:16Z

Santiago on Twitter: "If you have an Apple M1 or M2 and don't take advantage of its GPU, I'm about to change your life..."

2022-10-07T19:33:41Z

> These instructions allow TensorFlow to use your GPU

MaartenGr/KeyBERT: Minimal keyword extraction with BERT

2022-10-06T14:37:52Z

Yi Tay on Twitter: "Don't retrieve, recite!..."

2022-10-06T01:47:13Z

> Introducing Recitation-Augmented Language models "RECITE" from @GoogleAI

[2205.11498] Domain Adaptation for Memory-Efficient Dense Retrieval

2022-09-26T17:46:39Z

Refers to [Binary Passage Retriever (BPR)](doc:2021/06/2106_00882_efficient_passage_)

[2209.11055] Efficient Few-Shot Learning Without Prompts

2022-09-23T10:26:46Z

[tweet](https://twitter.com/_akhaliq/status/1573109469646561280?s=20&t=RTpK9dh90az0zT1Xg2ohpQ): > So if I have 4 classes and say 2 labels per class, I would first fine tune an ST on these 4 pairs and then vectorize the 8 total examples for fine-tuning the classifier
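
The counting in the quoted tweet can be checked with a small sketch of contrastive pair generation: sentences with the same label form positive pairs, sentences with different labels form negative pairs. Enumerating every pair, as done here, is an assumption made for clarity; the SetFit library may sample pairs differently. The example texts are placeholders.

```python
from itertools import combinations

def make_pairs(examples):
    """Build (text_a, text_b, target) triples: 1.0 for same label, 0.0 otherwise."""
    return [
        (a, b, 1.0 if label_a == label_b else 0.0)
        for (a, label_a), (b, label_b) in combinations(examples, 2)
    ]

# 4 classes with 2 labeled examples each, as in the tweet.
examples = [(f"text {c}{i}", c) for c in "ABCD" for i in range(2)]
pairs = make_pairs(examples)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
```

With 8 examples this yields the 4 same-class pairs mentioned in the tweet plus 24 cross-class pairs; the sentence encoder is fine-tuned on such pairs first, and the 8 examples are then embedded to train the classification head.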

Google AI Blog: TensorStore for High-Performance, Scalable Array Storage

2022-09-23T02:24:43Z

Use Case: 3D Brain Mapping

PromptBERT improving BERT sentence embeddings with prompts - Ethan Kim

2022-09-16T10:31:11Z

[2201.04337] PromptBERT: Improving BERT Sentence Embeddings with Prompts

2022-09-16T10:06:59Z

[PromptBERT improving BERT sentence embeddings with prompts - Ethan Kim](doc:2022/09/promptbert_improving_bert_sente)

Prompt Tuning BERT🎯:CommonLit Readability | Kaggle

2022-09-16T09:49:38Z

> Prompt-tuning is a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.

Active Learning for BERT: An Empirical Study - ACL Anthology

2022-09-02T16:08:49Z

> The use of Active Learning (AL) with deep pre-trained models has so far received little consideration. > > We study the potential of (i) various AL strategies; (ii) in conjunction with BERT, (iii) within a highly challenging – yet common – real-world scenario of class imbalance and scarce labeled data. Focused on binary classification. > AL can boost BERT performance, especially in the most realistic scenario in which the initial set of labeled examples is created using keyword-based queries, resulting in a biased sample of the minority class. [Github](https://github.com/IBM/low-resource-text-classification-framework)

[2106.10199] BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

2022-09-01T17:20:28Z

> BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset of them) are being modified. We show that **with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.** > **these findings support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge** -- ???!!! > The focus on modifying a small group of parameters eases deployment, as the vast majority of the parameters of the model are shared between various NLP tasks [GitHub](https://github.com/benzakenelad/BitFit)
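
To see how small the trained subset is, the sketch below selects bias terms by name from a table of parameter shapes and counts them. The parameter names and shapes are hypothetical (loosely BERT-sized) and cover a single block only; in a framework such as PyTorch the same selection would be applied by leaving `requires_grad` enabled only on the matching parameters.

```python
from math import prod

# Hypothetical parameter names and shapes for one Transformer block.
param_shapes = {
    "attention.query.weight": (768, 768),
    "attention.query.bias": (768,),
    "ffn.dense.weight": (3072, 768),
    "ffn.dense.bias": (3072,),
    "layernorm.weight": (768,),
    "layernorm.bias": (768,),
}

def bitfit_trainable(shapes):
    """Return the names of the parameters BitFit would update: bias terms only."""
    return {name for name in shapes if name.endswith(".bias")}

trainable = bitfit_trainable(param_shapes)
n_trainable = sum(prod(param_shapes[name]) for name in trainable)
n_total = sum(prod(shape) for shape in param_shapes.values())
fraction = n_trainable / n_total
```

In this toy block fewer than 1% of the parameters are trainable, which is why one frozen copy of the model can be shared across tasks with only a small per-task set of biases.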

On Stability of Few-Sample Transformer Fine-Tuning | Kaggle

2022-08-29T01:13:58Z

[[2006.05987] Revisiting Few-sample BERT Fine-tuning](doc:2022/03/2006_05987_revisiting_few_sam)

Unsupervised Learning — Sentence-Transformers documentation

2022-08-20T01:16:16Z

> In our paper TSDAE we compare approaches for sentence embedding tasks, and in GPL we compare them for semantic search tasks (given a query, find relevant passages). While the unsupervised approaches achieve acceptable performances for sentence embedding tasks, they perform poorly for semantic search tasks.

Train and Fine-Tune Sentence Transformers Models

2022-08-13T09:49:57Z

[2205.00820] Entity-aware Transformers for Entity Search

2022-07-12T08:18:56Z

> **Do BERT-based entity retrieval models benefit from additional entity information stored in knowledge graphs?** To address this research question, we map entity embeddings into the same input space as a pre-trained BERT model and inject these entity embeddings into the BERT model. This entity-enriched language model is then employed on the entity retrieval task. > we observe empirically that the entity-enriched BERT models **enable fine-tuning on limited training data**, which otherwise would not be feasible due to the known instabilities of BERT in few-sample fine-tuning. Uses [Wikipedia2Vec](tag:wikipedia2vec) as the graph embedding method.

Leshem Choshen on Twitter: "Computational (Chomskian) hierarchies can predict OOD capabilities..."

2022-07-11T11:10:31Z

About a paper by DeepMind ["Neural Networks and the Chomsky Hierarchy"](https://arxiv.org/abs/2207.02098) > for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks... only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks

[2206.10658] Questions Are All You Need to Train a Dense Passage Retriever

2022-07-06T23:39:29Z

> **approach for training dense retrieval models that does not require any labeled training data**. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples.
>
> ART, in contrast, only requires access to unpaired inputs and outputs (e.g. questions and potential answer documents).
>
> It uses a new document-retrieval autoencoding scheme, where
> 1. an input question is used to retrieve a set of evidence documents, and
> 2. the documents are then used to compute the probability of reconstructing the original question.
>
> Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can be later incorporated into complete Open QA systems without any further finetuning.

[Tweet](doc:2022/07/devendra_singh_sachan_sur_twitt)

> Given an input question, ART first retrieves a small set of possible evidence documents. It then reconstructs the original question by attending to these documents
>
> The key idea in ART is to consider the retrieved documents as a noisy representation of the original question and question reconstruction probability as a way of denoising that provides soft-labels for how likely each document is to have been the correct result

Refers to [[IZACARD 2012.04584] Distilling Knowledge from Reader to Retriever for Question Answering](doc:2020/12/2012_04584_distilling_knowled)
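
The "soft-labels" idea can be sketched as a distribution-matching loss over the retrieved documents: the question-reconstruction log-likelihoods define a target distribution, and the retriever's scores are trained to match it. The sketch below assumes a KL-divergence objective and uses made-up numbers; it is an illustration of the idea rather than the paper's training code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def art_loss(retriever_scores, question_log_likelihoods):
    """KL(target || retriever) over the retrieved documents of one question."""
    # Soft labels: how well each document lets a frozen LM reconstruct the question.
    target = softmax(question_log_likelihoods)
    predicted = softmax(retriever_scores)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted))

# Hypothetical values for three retrieved documents.
retriever_scores = [3.0, 2.0, 1.0]          # query-document similarity
question_log_likelihoods = [-9.0, -4.0, -7.0]  # log p(question | document)
loss = art_loss(retriever_scores, question_log_likelihoods)
```

The loss is zero when the retriever already ranks documents in proportion to how well they explain the question, so no relevance labels are needed.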

Unveiling Transformers with LEGO - YouTube

2022-06-30T14:21:53Z

> To me, what's good about transformers is that they have relative filters. I mean **a standard NN tests an input against a fixed filter w, but here we test part of x against another part of x**. (#[Self-Attention](tag:self_attention)) > > This potentially allows for reasoning to emerge: the network can associate concepts that it encounters, compare them, make analogies > LEGO: Learning Equality and Group Operations. It's a very **basic reasoning task**, where a sentence is made of clauses defining variables as a function of some other variable, and the goal is to **resolve the value of the variables**.
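
The contrast between a fixed filter and a "relative" filter can be made concrete with a stripped-down attention layer. The sketch below uses a single head with identity query/key/value projections, which is a simplification chosen so that the scores are literally dot products between parts of the input; real Transformer layers learn those projections.

```python
import math

def fixed_filter(x, w):
    """Standard layer: every input vector is tested against the same weights w."""
    return [sum(a * b for a, b in zip(xi, w)) for xi in x]

def self_attention(x):
    """Single-head attention with identity projections: scores are x_i . x_j."""
    d = len(x[0])
    out = []
    for xi in x:
        scores = [sum(a * b for a, b in zip(xi, xj)) / math.sqrt(d) for xj in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output is a mixture of the inputs, weighted by mutual similarity.
        out.append([sum(w * xj[k] for w, xj in zip(weights, x)) for k in range(d)])
    return out

x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
projected = fixed_filter(x, w=[1.0, 0.0])
attended = self_attention(x)
```

The first two positions carry the same vector, so they attend mostly to each other, while the third attends mostly to itself: the comparison is between inputs, not against a stored pattern.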

Using BERT For Classifying Documents with Long Texts | by Armand Olivares | Medium

2022-06-29T18:09:51Z

Chris Olah on Twitter: "I'm excited to finally be making progress on understanding the first MLP layer in large transformer LMs. I've tried really hard and prior to SoLU had little success."

2022-06-27T19:48:41Z

Google AI Blog: LIMoE: Learning Multiple Modalities with One Sparse Mixture-of-Experts Model

2022-06-26T01:20:55Z

> Sparse models stand out among the most promising approaches for the future of deep learning. Instead of every part of a model processing every input (“dense” modeling), sparse models employing conditional computation learn to route individual inputs to different “experts” in a potentially huge network
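
Conditional computation can be sketched as a router that sends each input to a single expert, so only that expert's parameters are used for that input. The example below assumes top-1 routing with a softmax gate; the router weights and the two toy "experts" are hypothetical stand-ins for learned networks.

```python
import math

def route_top1(token, router_weights):
    """Return (expert_index, gate) for the expert with the highest router probability."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def sparse_layer(token, router_weights, experts):
    """Run only the selected expert and scale its output by the gate value."""
    index, gate = route_top1(token, router_weights)
    return [gate * y for y in experts[index](token)], index

# Two toy experts and a router with one weight row per expert.
experts = [lambda t: [2 * v for v in t], lambda t: [-v for v in t]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
out_a, expert_a = sparse_layer([3.0, 0.0], router_weights, experts)
out_b, expert_b = sparse_layer([0.0, 3.0], router_weights, experts)
```

Adding experts grows the parameter count while the compute per input stays roughly that of one expert.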

sentence bert model in onnx format · Issue #46 · UKPLab/sentence-transformers

2022-06-13T12:38:47Z

[2205.15952] Knowledge Graph -- Deep Learning: A Case Study in Question Answering in Aviation Safety Domain

2022-06-11T01:48:52Z

Domain transfer with GGPL: German Generative Pseudo Labeling 🥨 | by Matthias Richter | Jun, 2022 | ML6team

2022-06-02T13:55:12Z

Nils Reimers on Twitter: "GPL goes multi-lingual..."

2022-06-01T17:45:24Z

[Domain transfer with GGPL: German Generative Pseudo Labeling](doc:2022/06/domain_transfer_with_ggpl_germ)

[2205.08184] SKILL: Structured Knowledge Infusion for Large Language Models

2022-05-18T23:57:17Z

> a method to infuse structured knowledge into LLMs, by directly training T5 models on factual triples of knowledge graphs > The models pre-trained on factual triples compare competitively with the ones on natural language sentences that contain the same knowledge. > The proposed method has an advantage that no alignment between the knowledge graph and text corpus is required

[2205.05131] Unifying Language Learning Paradigms

2022-05-12T12:12:04Z

BERTopic: The Future of Topic Modeling | Pinecone

2022-05-12T09:01:55Z

[2205.04260] EASE: Entity-Aware Contrastive Learning of Sentence Embedding

2022-05-11T01:25:12Z

> we explore a type of supervision that has been under-explored in the literature: entity hyperlink annotations from Wikipedia. > > entities have been shown to be a strong indicator of text semantics > > a method for mining hard negatives based on the entity type Uses wikipedia2vec > the reliance on Wikipedia for training data may limit the application of the models to specific domains (e.g., general or encyclopedia domains). To apply EASE to other domains, one may need to annotate text from the domain either manually or automatically.

[2203.08913] Memorizing Transformers

2022-05-07T09:01:26Z

[tweet](https://twitter.com/LiamFedus/status/1522605777961119745?s=20&t=Jt9GBjNcFw6TqeqYvz_BRA): Memorizing Transformers, which increase context length up to 262k with an external memory of (keys, values) for that document.

- Matches quality of Transformers 5x larger
- Can fine-tune a previously pre-trained model to use it

> Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus acquiring new knowledge immediately
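
A minimal sketch of the external (key, value) memory idea: pairs are appended as text is read, and at query time the nearest keys are looked up and their values mixed with attention-style weights. The exact nearest-neighbour search by dot product, the class name, and the toy vectors are assumptions; how the paper combines this lookup with ordinary local attention is left out.

```python
import math

class KNNMemory:
    """Append-only store of (key, value) vectors with top-k attention lookup."""

    def __init__(self):
        self.keys = []
        self.values = []

    def add(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def lookup(self, query, k=2):
        # Similarity of the query to every stored key.
        sims = [sum(q * x for q, x in zip(query, key)) for key in self.keys]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        # Softmax over the k retrieved similarities only.
        m = max(sims[i] for i in top)
        exps = [math.exp(sims[i] - m) for i in top]
        total = sum(exps)
        dim = len(self.values[0])
        return [
            sum((e / total) * self.values[i][d] for e, i in zip(exps, top))
            for d in range(dim)
        ]

memory = KNNMemory()
memory.add([1.0, 0.0], [10.0, 0.0])
memory.add([0.0, 1.0], [0.0, 10.0])
memory.add([-1.0, 0.0], [5.0, 5.0])
nearest = memory.lookup([1.0, 0.0], k=1)
blended = memory.lookup([1.0, 0.0], k=2)
```

New information is "memorized" by calling `add`, with no weight update, which matches the inference-time reading described in the quote.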

Ramsri Goutham Golla on Twitter: "Hi @Nils_Reimers For GPL you used "msmarco-distilbert-base-tas-b" model and ..."

2022-04-27T22:17:10Z

[1909.00426] Global Entity Disambiguation with BERT

2022-04-18T19:49:22Z

[2110.08151] mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models

2022-04-17T23:20:52Z

[Ikuya Yamada on Twitter: "Is entity representation effective to improve multilingual language models?..."](doc:2022/04/ikuya_yamada_sur_twitter_is_)

> Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual alignment information from Wikipedia entities. However, **existing methods only exploit entity information in pretraining and do not explicitly use entities in downstream tasks**. In this study, we explore the **effectiveness of leveraging entity representations for downstream cross-lingual tasks**.
>
> the key insight is that incorporating entity representations into the input allows us to extract more language-agnostic features.

[Github](https://github.com/studio-ousia/luke)

> Entity representations are known to enhance language models in mono-lingual settings (Zhang et al., 2019: [ERNIE](tag:ernie.html); Peters et al., 2019: [[1909.04164] Knowledge Enhanced Contextual Word Representations](doc:2020/05/1909_04164_knowledge_enhanced); Wang et al., 2021 [[1911.06136] KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation](doc:2020/11/1911_06136_kepler_a_unified_); Xiong et al., 2020; Yamada et al., 2020: [[2010.01057] LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](doc:2020/11/2010_01057_luke_deep_context)) presumably by introducing real-world knowledge. We show that using entity representations facilitates cross-lingual transfer by providing language-independent features.
>
> Multilingual extension of LUKE. The model is trained with the multilingual masked language modeling (MLM) task as well as the masked entity prediction (MEP) task with Wikipedia entity embeddings

> We investigate two ways of using the entity representations in cross-lingual transfer tasks:
> 1. perform entity linking for the input text, and append the detected entity tokens to the input sequence. The entity tokens are expected to provide language-independent features to the model
> 2. use the entity [MASK] token from the MEP task as a language-independent feature extractor.

Ikuya Yamada sur Twitter : "Is entity representation effective to improve multilingual language models?..."

2022-04-13T15:46:06Z

[[2110.08151] mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](doc:2022/04/2110_08151_mluke_the_power_o) > mLUKE, an extension of [LUKE](tag:luke) based on 1M Wikidata entity embeddings shared across languages > mLUKE solves downstream tasks by using its language-agnostic entity embeddings as inputs. > entity representations are shared across languages during pretraining -> they are much more language-agnostic than word representations

Tu Vu sur Twitter : "Enormous LMs like GPT-3 exhibit impressive few-shot performance, but w/ self-training a BERT base sized model can achieve much better results!"

2022-04-13T13:37:58Z

> [[2109.06270] STraTA: Self-Training with Task Augmentation for Better Few-shot Learning](doc:2022/04/2109_06270_strata_self_train) [Github](https://github.com/google-research/google-research/tree/master/STraTA) [at HuggingFace](https://github.com/huggingface/transformers/tree/main/examples/research_projects/self-training-text-classification) -- Remark: Like [[2203.10581] Cluster & Tune: Boost Cold Start Performance in Text Classification](doc:2022/04/2203_10581_cluster_tune_bo), adds an intermediate fine-tuning step // TODO compare

Google AI Blog: Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

2022-04-05T22:16:07Z

[2008.11228] A simple method for domain adaptation of sentence embeddings

2022-04-01T14:07:28Z

[2004.05119] Beyond Fine-tuning: Few-Sample Sentence Embedding Transfer

2022-03-31T21:04:02Z

> Fine-tuning (FT) pre-trained sentence embedding models on small datasets has been shown to have limitations. In this paper we show that concatenating the embeddings from the pre-trained model with those from a simple sentence embedding model trained only on the target data, can improve over the performance of FT for few-sample tasks

Sentence Transformer Fine-Tuning (SetFit): Outperforming GPT-3 on few-shot Text-Classification while being 1600 times smaller | by Moshe Wasserblat (2021-12)

2022-03-31T10:49:48Z

Fine-tuning an SBERT on a classification task (in the end, this produces an SBERT) > **Few-shot text classification based on fine-tuning a Sentence Transformer with task-specific data** that can easily be implemented with the sentence-transformers library > Surprisingly, we did not find any work that performed an end-to-end ST fine-tuning for text classification in a Siamese manner. [COLAB](https://colab.research.google.com/github/MosheWasserb/SetFit/blob/main/SetFit_SST_2.ipynb) [Nils Reimers sur Twitter](doc:2022/03/nils_reimers_sur_twitter_gre)
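
A minimal sketch of how a few labeled examples could be turned into training pairs for Siamese fine-tuning. It assumes that two texts from the same class form a positive pair and two texts from different classes form a negative pair; the function name `make_pairs` and the toy data are illustrative and do not come from the post.

```python
import itertools

def make_pairs(examples):
    """Build (text_a, text_b, label) pairs from (text, class) examples.

    The label is 1.0 when both texts share a class and 0.0 otherwise
    (an assumed labeling scheme for Siamese fine-tuning).
    """
    pairs = []
    for (text_a, cls_a), (text_b, cls_b) in itertools.combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if cls_a == cls_b else 0.0))
    return pairs

# Hypothetical few-shot training set: two examples per class.
few_shot = [
    ("great movie", "pos"),
    ("loved it", "pos"),
    ("terrible plot", "neg"),
    ("boring and slow", "neg"),
]
pairs = make_pairs(few_shot)
for text_a, text_b, label in pairs:
    print(f"{label:.0f}  {text_a!r} / {text_b!r}")
```

With n labeled examples this yields n(n-1)/2 pairs, which is one reason pair-based training can get more signal out of a very small labeled set.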

Nils Reimers sur Twitter : "Great post on SetFit"

2022-03-31T10:48:50Z

About [Sentence Transformer Fine-Tuning (SetFit): Outperforming GPT-3 on few-shot Text-Classification while being 1600 times smaller | by Moshe Wasserblat](doc:2022/03/sentence_transformer_fine_tunin) > - Outperforms GPT-3 in few-shot text-classification (50 labeled examples, secret test set) > - 1600 times smaller > - Can be run on your CPU > - No limitation on the number of training examples > - Just few lines of code needed

Sentence Embedding Fine-tuning for the French Language | by La Javaness R&D | Feb, 2022 | Medium

2022-03-31T10:06:14Z

Domain Adaptation — Sentence-Transformers documentation

2022-03-31T08:59:25Z

[2203.14655] Few-Shot Learning with Siamese Networks and Label Tuning

2022-03-30T16:14:44Z

> the problem of building text classifiers with little or no training data. > > In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. (cf. #[NLI](tag:nli), using the input text as the premise and the text representing the label as the hypothesis) > In this work, we show that **with proper pre-training, Siamese Networks that embed texts and labels** offer a competitive alternative. > > We introduce **label tuning: fine-tuning the label embeddings only**. While giving lower performance than model fine-tuning (which updates all params of the model), this approach has the architectural advantage that a single encoder can be shared by many different tasks (we only fine-tune the label embeddings) > The drop in quality can be compensated by using a variant of **[Knowledge distillation](tag:knowledge_distillation)** [Github](https://tinyurl.com/label-tuning), [Tweet](doc:2022/03/thomas_muller_sur_twitter_pa)

[2006.05987] Revisiting Few-sample BERT Fine-tuning

2022-03-21T10:46:15Z

> A study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. > The most commonly used optimizer for fine-tuning BERT is BERTADAM, a modified version of the ADAM first-order stochastic optimization method. It differs from the original ADAM algorithm (Kingma & Ba, 2014) in omitting a bias correction step. > > ... We observe that the bias correction omission influences the learning rate, especially early in the fine-tuning process, and is one of the primary reasons for instability in fine-tuning BERT; this is harmful when fine-tuning with fewer than 10K samples. The problem is present in many > open source libraries, including the official implementation huggingface’s Transformers. How to solve the problem in HuggingFace? > HuggingFace Transformers AdamW has correct_bias parameter set to True by default. Still it's worth noting the importance this parameter serves. [src](doc:2022/08/on_stability_of_few_sample_tran)
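
To see why the missing bias correction matters early in training, here is a small sketch of the Adam update for a single scalar parameter, with and without the correction. The hyperparameter values and the constant toy gradient are assumptions chosen for illustration; this is not the BERTADAM source code.

```python
import math

def adam_updates(grads, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-6, correct_bias=True):
    """Return the update applied to a scalar parameter at each step."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        if correct_bias:
            # Standard Adam: undo the bias caused by zero initialization.
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            # Variant without bias correction.
            m_hat, v_hat = m, v
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

grads = [1.0] * 5  # toy constant gradient
corrected = adam_updates(grads, correct_bias=True)
uncorrected = adam_updates(grads, correct_bias=False)
for t, (c, u) in enumerate(zip(corrected, uncorrected), start=1):
    print(f"step {t}: uncorrected / corrected = {u / c:.2f}")
```

With these defaults the very first uncorrected step is roughly sqrt(10) ≈ 3.16 times larger than the corrected one, so the effective learning rate is inflated exactly when the model is most fragile.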

NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece | by Pierre Guillou | Medium

2022-03-18T17:41:40Z

Studio Ousia sur Twitter : "Now using LUKE is easier than ever!" / Twitter

2022-03-15T20:47:39Z

Andrew Trask about large language models: "The 'bigness' is a temporary flaw, not a permanent feature of progress"

2022-03-13T09:16:01Z

MaartenGr/BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.

2022-03-10T09:41:50Z

> topic modeling technique that leverages 🤗 transformers and [c-TF-IDF](https://github.com/MaartenGr/cTFIDF) to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. refers to [Top2Vec](doc:2022/03/ddangelov_top2vec_top2vec_lear) [youtube](https://www.youtube.com/watch?v=Qub3PrFvauI) [tweet](https://twitter.com/JayAlammar/status/1594681648121102336?s=20&t=R0G_LrajK9WBtzypwXtD7Q)
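
A rough sketch of the class-based TF-IDF idea: all documents of a cluster are merged into one bag of words, and each term is weighted by its in-cluster frequency times an inverse frequency across clusters. It assumes the weighting `tf * log(1 + A / f)`, where `A` is the average number of words per cluster and `f` the term's total frequency; check the BERTopic documentation for the exact formula it uses. The function name and toy clusters are hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_cluster):
    """docs_per_cluster: dict mapping cluster id -> list of tokenized documents."""
    # Term frequencies per cluster (all documents of a cluster are merged).
    tf = {
        cluster: Counter(token for doc in docs for token in doc)
        for cluster, docs in docs_per_cluster.items()
    }
    # Term frequencies across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)
    return {
        cluster: {
            term: count * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        for cluster, counts in tf.items()
    }

clusters = {
    0: [["cat", "dog", "pet"], ["dog", "pet"]],
    1: [["stock", "market", "pet"]],
}
scores = c_tf_idf(clusters)
for cluster, term_scores in scores.items():
    top = sorted(term_scores, key=term_scores.get, reverse=True)
    print(cluster, top)
```

Terms that appear in every cluster (here "pet") are pushed down, while terms concentrated in one cluster rise to the top of its topic description.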

Document Matching for Job Descriptions | Semantic Scholar (2021)

2022-03-09T18:18:50Z

> We train a document encoder to match online job descriptions to one of many standardized job roles from Singapore’s Skills Framework. The encoder generates semantically meaningful document encodings from textual descriptions of job roles, which are then compared using Cosine Similarity to determine matching. During training, we implement the methodology used by Sentence-BERT, fine tuning pre-trained BERT models using a siamese network architecture on labelled document pairs.

NAVER LABS Europe : "@Nils_Reimers of @huggingface on 'Unsupervised domain adaptation for neural search'"

2022-03-09T10:53:24Z

[2109.06304] Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

2022-02-25T17:19:37Z

Nils Reimers sur Twitter : "Creating intent classes for chatbots is challenging This tutorial shows how to use sentence-transformers to find potentially overlapping intent classes and how to improve your data annotation work." / Twitter

2022-02-19T22:55:07Z

Nils Reimers sur Twitter : "how to use the fast clustering algorithm from sentence-transformers..."

2022-02-19T10:37:15Z

Clustering millions of sentences to optimize the ML-workflow

sentence-transformers/fast_clustering.py at master · UKPLab/sentence-transformers

2022-02-18T14:45:22Z

> This is a more complex example of performing clustering on a large-scale dataset. This example finds local communities in a large set of sentences, i.e., groups of sentences that are highly similar. You can freely configure the threshold for what is considered similar. A high threshold will only find extremely similar sentences, a lower threshold will find more sentences that are less similar. A second parameter is 'min_community_size': Only communities with at least a certain number of sentences will be returned. The method for finding the communities is extremely fast, for clustering 50k sentences it requires only 5 seconds (plus embedding computation). In this example, we download a large set of questions from Quora and then find similar questions in this set.
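
The two parameters described above (similarity threshold and `min_community_size`) can be illustrated with a simplified, pure-Python sketch. This is not the sentence-transformers implementation, which works on batched tensors; the function name `find_communities` and the 2-d toy vectors are made up for the example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def find_communities(embeddings, threshold=0.75, min_community_size=2):
    """Greedy, threshold-based community detection over embedding vectors."""
    n = len(embeddings)
    candidates = []
    for i in range(n):
        # All items whose similarity to item i reaches the threshold.
        members = [j for j in range(n) if cosine(embeddings[i], embeddings[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)
    # Largest candidates first; keep only members not already assigned.
    candidates.sort(key=len, reverse=True)
    communities, taken = [], set()
    for members in candidates:
        free = [j for j in members if j not in taken]
        if len(free) >= min_community_size:
            communities.append(free)
            taken.update(free)
    return communities

# Toy "sentence embeddings": two tight groups and one outlier.
vectors = [[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1], [0.1, 0.9], [-1, -1]]
communities = find_communities(vectors, threshold=0.9, min_community_size=2)
print(communities)
```

The outlier is dropped because its only neighbor above the threshold is itself, which falls below `min_community_size`.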

gsarti/scibert-nli · Hugging Face

2022-01-29T15:52:08Z

SciBERT fine-tuned on the SNLI and the MultiNLI datasets using the sentence-transformers library to produce universal sentence embeddings

Semantic Search — Sentence-Transformers documentation

2022-01-29T15:28:25Z

**symmetric** semantic search vs **asymmetric** semantic search > - Suitable models for symmetric semantic search: Pre-Trained Sentence Embedding > - Suitable models for asymmetric semantic search: Pre-Trained MS MARCO Models

[1906.00300] Latent Retrieval for Weakly Supervised Open Domain Question Answering

2022-01-11T11:06:38Z

> The key insight of this work is that end-to-end learning is possible if we pre-train the retriever with an unsupervised Inverse Cloze Task (ICT). In ICT, a sentence is treated as a pseudo-question, and its context is treated as pseudo-evidence
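
A small sketch of how ICT training pairs could be built from raw passages. It assumes passages are already split into sentences and that the selected sentence is always removed from its context (a simplification); the function name `ict_examples` and the toy passage are illustrative.

```python
import random

def ict_examples(passages, seed=0):
    """Build (pseudo_question, pseudo_evidence) pairs.

    passages: list of passages, each given as a list of sentences.
    """
    rng = random.Random(seed)
    examples = []
    for sentences in passages:
        if len(sentences) < 2:
            continue  # no context would be left
        i = rng.randrange(len(sentences))
        question = sentences[i]
        # The remaining sentences of the passage act as the evidence.
        evidence = " ".join(s for j, s in enumerate(sentences) if j != i)
        examples.append((question, evidence))
    return examples

passages = [
    ["Zebras have four gaits.", "They live in Africa.", "Their stripes are unique."],
    ["Too short."],
]
examples = ict_examples(passages)
for question, evidence in examples:
    print("Q:", question)
    print("E:", evidence)
```

The retriever is then trained to pick the right evidence for each pseudo-question among the other passages, which requires no labeled question-answer data.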

Domain Transfer with BERT | Pinecone

2022-01-04T21:00:34Z

Anthropic sur Twitter : "a mathematical framework for trying to reverse engineer transformer language models..."

2021-12-23T00:41:38Z

Making the Most of Data: Augmentation with BERT | Pinecone

2021-12-18T10:05:41Z

Using pretrained SBERT model in cross-encoder · Issue #726 · UKPLab/sentence-transformers

2021-12-17T00:41:33Z

> so would it be a good idea to finetune a SBERT model on a cross-encoder task? > > The SBERT models are regular transformers model and hence can be used as base for cross encoders. Sometimes it could be helpful, otherwise it is better to use the original models. ([Nils Reimers](tag:nils_reimers))

Advance BERT model via transferring knowledge from Cross-Encoders to Bi-Encoders | by Chien Vu | Towards Data Science

2021-12-17T00:26:39Z

Data Augmentation Method to improve SBERT Bi-Encoders for Pairwise Sentence Scoring Tasks (Semantic sentence tasks)

[2112.07577] GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

2021-12-15T18:23:28Z

An unsupervised domain adaptation technique for dense retrieval models 1. synthetic queries are generated for each passage from the target corpus (using an existing pre-trained [T5](tag:text_to_text_transfer_transformer) encoder-decoder) 2. the generated queries are used for mining negative passages (retrieving the most similar paragraphs using an existing dense retrieval model == hard negatives!) 3. the query-passage pairs are labeled by a cross-encoder and used to train the domain-adapted dense retriever (using method described in [Hofstätter et al., 2020](doc:2021/12/2010_02666_improving_efficien)) [Nils Reimers sur Twitter](doc:2021/12/nils_reimers_sur_twitter_do_), [GitHub](https://github.com/UKPLab/gpl), by the author of [TSDAE](doc:2021/09/2104_06979_tsdae_using_trans) Claims to improve "Doc2Query" [Document Expansion by Query Prediction](doc:2022/01/1904_08375_document_expansion): ([src](https://twitter.com/KexinWang2049/status/1471435779415150598)) > - GPL: Uses doc2query to construct synthetic data and does knowledge distillation (i.e. training) on that data. > - Doc2query: Generates queries to extend the documents and use BM25 on top of them w/o training.
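
Step 3 can be sketched as a regression on score margins. The snippet below assumes the training objective is a margin-MSE loss in the spirit of the cited Hofstätter et al. paper: the dense retriever (student) is trained so that its score gap between a positive and a mined negative passage matches the cross-encoder's (teacher's) gap. The function name and the toy scores are illustrative.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list of scores, one entry per (query, positive, negative) triple.
    """
    errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(errors) / len(errors)

# Toy scores for two (query, positive, negative) triples.
loss = margin_mse(
    student_pos=[2.0, 3.0],
    student_neg=[1.0, 1.0],
    teacher_pos=[3.0, 3.0],
    teacher_neg=[0.0, 1.0],
)
print(loss)
```

Because only the margin is matched, the cross-encoder can give a low margin to a mined "negative" that is actually relevant, which softens the effect of noisy generated queries and false negatives.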

Improving Language Models by Retrieving from Trillions of Tokens | DeepMind

2021-12-09T10:11:10Z

> Retrieval-Enhanced Transformer (Retro)

Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

2021-12-05T10:48:53Z

Unsupervised_Extractive_Summarization - a Hugging Face Space by Hellisotherpeople

2021-12-03T09:28:38Z

Unsupervised Extractive Text Summarization and Semantic Search [Github](https://github.com/Hellisotherpeople/CX_DB8)

Unsupervised Training for Sentence Transformers | Pinecone

2021-11-24T21:03:44Z

Blog post about [[2104.06979] TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](doc:2021/09/2104_06979_tsdae_using_trans) > Fine-tuning with TSDAE simply cannot compete in terms of performance against supervised methods. However, **the point and value of TSDAE is that it allows us to fine-tune models for use-cases where we have no data**. Specific domains with unique terminology or low resource languages.

How to Fine-Tune Sentence-BERT for Question Answering | Capital One

2021-11-21T12:38:13Z

> tutorial on using the sentence-transformers library to fine-tune Sentence-BERT for question matching

Multilingual Sentence Transformers | Pinecone

2021-11-04T23:09:34Z

How to make a text encoder multilingual using sentence transformers and multilingual knowledge distillation.

Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations – Google Research (WWW 2020)

2021-11-04T17:31:42Z

> a novel negative sampling approach called **Mixed Negative Sampling (MNS)**. In particular, different from commonly used batch or unigram sampling methods, MNS uses a mixture of batch and uniformly sampled negatives to tackle the selection bias of implicit user feedback (check whether this is related to [Multiple Negatives Ranking Loss](doc:2021/10/next_gen_sentence_embeddings_wi))
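
A toy sketch of the mixing idea: for one query, the negatives are the other candidates in the batch plus a few items drawn uniformly from the whole corpus. The function name, the item ids and the choice to sample per query are assumptions made for illustration; the paper's exact sampling procedure may differ.

```python
import random

def mixed_negatives(batch_candidates, corpus, positive_index, num_uniform, seed=0):
    """Return in-batch negatives followed by uniformly sampled negatives."""
    rng = random.Random(seed)
    positive = batch_candidates[positive_index]
    # In-batch negatives: the positives of the other queries in the batch.
    batch_negatives = [c for i, c in enumerate(batch_candidates) if i != positive_index]
    # Uniform negatives: drawn from the full corpus, excluding the positive itself.
    pool = [c for c in corpus if c != positive]
    uniform_negatives = rng.sample(pool, num_uniform)
    return batch_negatives + uniform_negatives

corpus = [f"item_{i}" for i in range(100)]
batch = ["item_3", "item_7", "item_42"]
negatives = mixed_negatives(batch, corpus, positive_index=0, num_uniform=4)
print(negatives)
```

In-batch negatives over-represent popular items, since they come from observed interactions; the uniform part gives rarely seen items a chance to appear as negatives.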

Train embeddings by using the Two-Tower built-in algorithm | Vertex AI

2021-11-04T17:23:31Z

> The Two-Tower model pairs similar types of objects, such as user profiles, search queries, web documents, answer passages, or images, in the same vector space, so that related items are close to each other. **The Two-Tower model consists of two encoder towers: the query tower and the candidate tower**. These towers embed independent items into a shared embedding space, which lets Matching Engine retrieve similarly matched items. > > To train a Two-Tower model, Google uses **pairs of relevant items**. Each pair consists of a query document and a candidate document. Documents contain arbitrary customer-defined features including text, numeric, and categorical features. After training, the Two-Tower built-in algorithm exports two TensorFlow SavedModels—a query encoder and a candidate encoder... Given a query item, Matching Engine uses the query encoder to generate a query embedding, and uses the index to find similar candidate embeddings. Matching Engine uses the candidate encoder to index all the items and serve them by using an approximate nearest neighbor solution.

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines (2021)

2021-10-30T09:14:09Z

> **an analysis of the fine-tuning instability of BERT-based models and a simple method to fix it** > > Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. > > 2 potential reasons identified in (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) : > - catastrophic forgetting > - small size of the fine-tuning datasets. > > we show that both hypotheses fail to explain the fine-tuning instability, which is caused by optimization difficulties (**vanishing gradients**). > > A simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches. > > [Github](https://github.com/uds-lsv/bert-stable-fine-tuning)

Next-Gen Sentence Embeddings with Multiple Negatives Ranking Loss | Pinecone

2021-10-27T01:24:49Z

> the world of sentence embeddings was ignited with the introduction of SBERT in 2019. Since then, many more sentence transformers have been introduced. These models quickly made the original SBERT obsolete. How did these newer sentence transformers manage to outperform SBERT so quickly? The answer is **multiple negatives ranking (MNR) loss**. > In short; **fine-tune your models with MNR loss, and do it with the [sentence-transformers](tag:sbert) library**. (mentioned in a [tweet](https://twitter.com/Nils_Reimers/status/1453001422400856086) by [Nils Reimers](tag:nils_reimers))
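
A minimal pure-Python sketch of the MNR idea: for each anchor in a batch, its paired text is the positive and the paired texts of all other anchors serve as negatives, and the loss is a cross-entropy over scaled cosine similarities. The `scale=20.0` value and the 2-d toy embeddings are assumptions for illustration; this is not the sentence-transformers implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch similarities; positives[i] is the target for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each anchor matches its own positive
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # positives swapped
print(mnr_loss(anchors, aligned), mnr_loss(anchors, misaligned))
```

Only (anchor, positive) pairs are needed as training data; larger batches supply more negatives per anchor for free.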

Sentence Embeddings and Transformers | Pinecone

2021-10-23T01:04:37Z

AlphaFold 2 is here: what’s behind the structure prediction miracle | Oxford Protein Informatics Group

2021-10-20T00:31:53Z

> to recap: AlphaFold 2 finds similar sequences to the input, extracts the information using an especial neural network architecture (a transformer), and then passes that information to another neural network that produces a structure.

L’intelligence artificielle, génie de la biologie moléculaire

2021-10-20T00:26:36Z

Sahajtomar/french_semantic · Hugging Face

2021-10-14T16:08:39Z

Omer Levy sur Twitter : "What if I told you that fine-tuning T5-Large (0.8B params) on a couple hundred examples could outperform GPT-3 (175B params) on a bunch of tasks?"

2021-10-13T12:53:20Z

Google AI Blog: Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer (2020)

2021-10-13T12:49:44Z

> With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks

[2106.04647] Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

2021-09-29T02:05:29Z

> Compacter (Compact Adapter) layers, a method to adapt large-scale language models, which only trains around 0.05% of a model's parameters and performs on par with fine-tuning. [twitter](https://twitter.com/KarimiRabeeh/status/1404774464441794560)

[2010.12566] DICT-MLM: Improved Multilingual Pre-Training using Bilingual Dictionaries

2021-09-06T18:27:44Z

> Despite the strong representation learning capability enabled by MLM, we demonstrate an inherent limitation of MLM for multilingual representation learning. In particular, by requiring the model to predict the language-specific token, the MLM objective disincentivizes learning a language-agnostic representation -- which is a key goal of multilingual pre-training > > DICT-MLM works by incentivizing the model to be able to predict not just the original masked word, but potentially any of its crosslingual synonyms as well.

Google AI Blog: From Vision to Language: Semi-supervised Learning in Action…at Scale

2021-07-14T23:34:40Z

Semi-Supervised Distillation (SSD). First, the teacher model infers pseudo-labels on the unlabeled dataset from which we then train a new teacher model (T’) that is of equal-or-larger size than the original teacher model. This step, which is essentially self-training, is then followed by knowledge distillation to produce a smaller student model for production.

[2102.07043] Reasoning Over Virtual Knowledge Bases With Open Predicate Relations

2021-06-20T08:30:31Z

> a method for constructing **a virtual KB (VKB) trained entirely from text** Open Predicate Query Language (OPQL): constructing a virtual knowledge base (VKB) that supports KB reasoning & open-domain QA, tackling the incompleteness of knowledge bases by constructing a virtual KB only from text > OPQL constructs a VKB by **encoding and indexing a set of relation mentions** in a way that naturally enables reasoning and can be trained without any structured supervision. > can be used as an **external memory integrated into a language model** cf. this earlier paper [[2002.10640] Differentiable Reasoning over a Virtual Knowledge Base](doc:2020/07/2002_10640_differentiable_rea). But does not require an initial structured KB for distant supervision. > The key idea in constructing the OPQL VKB is to use a dual-encoder pre-training process, similar to [[1906.03158] Matching the Blanks: Distributional Similarity for Relation Learning](doc:2021/05/1906_03158_matching_the_blank) Related work section refers to [[1909.04164] Knowledge Enhanced Contextual Word Representations](doc:2020/05/1909_04164_knowledge_enhanced). Also refers to [[2007.00849] Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge](doc:2020/07/2007_00849_facts_as_experts_) (some authors in common)

Semantic Search with S-BERT is all you need

2021-06-05T16:02:26Z

> SentenceTransformers is designed in such way that fine-tuning your own sentence / text embeddings models is easy.

Making sense of raw input

2021-05-21T12:09:43Z

>... this way we are able to **jointly learn** how to perceive (**mapping raw sensory information to concepts**) and apperceive (**combining concepts into declarative rules**) cf. [Making sense of sensory input](doc:2021/04/1910_02227_making_sense_of_se)

[1906.03158] Matching the Blanks: Distributional Similarity for Relation Learning

2021-05-13T00:39:03Z

> a new method of learning relation representations directly from text > > First, we study the **ability of the Transformer neural network architecture (Vaswani et al., 2017) to encode relations between entity pairs**, and we identify a method of representation that outperforms previous work in supervised relation extraction. Then, we present a method of training this relation representation **without any supervision from a knowledge graph or human annotators** from widely available distant supervision in the form of entity linked text > > **we assume** access to a corpus of text in which entities have been linked to unique identifiers and we define a relation statement to be a block of text containing two marked entities.

[1909.10506] Learning Dense Representations for Entity Retrieval

2021-05-01T09:11:15Z

> We show that it is feasible to perform **entity linking by training a dual encoder (two-tower) model that encodes mentions and entities in the same dense vector space**, where candidate entities are retrieved by approximate nearest neighbor search. Unlike prior work, **this setup does not rely on an alias table followed by a re-ranker, and is thus the first fully learned entity retrieval model**. Contributions: > - a dual encoder architecture for learning entity and mention encodings suitable for retrieval. A key feature of the architecture is that it employs a modular **hierarchy of sub-encoders that capture different aspects of mentions and entities** > - a simple, fully unsupervised **hard negative mining** strategy that produces massive gains in retrieval performance, compared to using only random negatives > - high quality candidate entities very efficiently using approximate nearest neighbor search > - outperforms discrete retrieval baselines like an alias table or BM25 > strong retrieval performance across all 5.7 million Wikipedia entities in around 3ms per mention > since we are using a two-tower or dual encoder architecture, **our model cannot use any kind of attention over both mentions and entities at once**, nor feature-wise comparisons as done by Francis-Landau et al. (2016). This is a fairly severe constraint – for example, **we cannot directly compare the mention span to the entity title** – but it permits retrieval with nearest neighbor search for the entire context against a single, all encompassing representation for each entity

Nils Reimers sur Twitter : "SBERT Release v1.1.0"

2021-04-22T19:35:49Z

[2011.05864] On the Sentence Embeddings from Pre-trained Language Models

2021-04-19T01:13:25Z

> **the sentence embeddings from the pre-trained language models without fine-tuning have been found to poorly capture semantic meaning of sentences.** > > We find that **BERT always induces a non-smooth anisotropic semantic space of sentences**, which harms its performance of semantic similarity. To address this issue, we propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective > normalizing flows (Dinh et al., 2015): invertible function parameterized by neural networks. > **During training, only the flow network is optimized while the BERT parameters remain unchanged** > When combined with external supervision from natural language inference tasks (Bowman et al., 2015; Williams et al., 2018), our method outperforms the [Sentence-BERT](tag:sbert) embeddings [GitHub](https://github.com/bohanli/BERT-flow)

SimCSE: Simple Contrastive Learning of Sentence Embeddings

2021-04-18T18:28:29Z

(by one of the authors of [KEPLER](doc:2020/11/1911_06136_kepler_a_unified_)) a contrastive sentence embedding framework, which can be used to produce sentence embeddings, from either unlabeled or labeled data. > 1. **an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout** used as noise > 2. we draw inspiration from the recent success of learning sentence embeddings from natural language inference (NLI) datasets and incorporate annotated pairs from NLI datasets into contrastive learning by using “entailment” pairs as positives and “contradiction” pairs as hard negatives Cites [[2011.05864] On the Sentence Embeddings from Pre-trained Language Models](doc:2021/04/2011_05864_on_the_sentence_em) (question of the anisotropic semantic space of BERT's sentences)

Nils Reimers sur Twitter : "New models for Neural Information Retrieval..."

2021-04-17T10:07:14Z

[2007.12603] IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles

2021-04-12T18:27:34Z

[2007.15779] Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

2021-04-11T16:38:59Z

> A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that **for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models**

[1902.00751] Parameter-Efficient Transfer Learning for NLP

2021-04-11T13:13:13Z

**Adapter tuning for NLP**. A strategy for tuning a large text model on several downstream tasks, that permits training on tasks sequentially, and that adds only a small number of additional parameters per task. New modules added between layers of a pre-trained network. Parameters of the original network are frozen and therefore may be shared by many tasks. [GitHub google-research/adapter-bert](https://github.com/google-research/adapter-bert)
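
A toy sketch of what such an added module can look like: a bottleneck that projects the hidden state down, applies a nonlinearity, projects back up, and adds the result to its input. ReLU, the absence of bias terms and the tiny dimensions are simplifying assumptions; the function name `adapter_forward` is hypothetical.

```python
def adapter_forward(h, w_down, w_up):
    """Apply a bottleneck adapter to one hidden state.

    h: hidden state of size d (list of floats)
    w_down: m x d matrix (list of rows), with m much smaller than d
    w_up: d x m matrix (list of rows)
    """
    # Down-projection followed by ReLU.
    z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_down]
    # Up-projection back to the hidden size.
    out = [sum(w * x for w, x in zip(row, z)) for row in w_up]
    # Skip connection: with w_up at zero, the adapter is the identity.
    return [x + o for x, o in zip(h, out)]

h = [1.0, -2.0, 0.5, 3.0]
w_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
zero_up = [[0.0, 0.0] for _ in range(4)]
trained_up = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]

print(adapter_forward(h, w_down, zero_up))     # unchanged hidden state
print(adapter_forward(h, w_down, trained_up))  # small task-specific shift
```

Only `w_down` and `w_up` would be trained per task (2·m·d weights here), while the frozen base network is shared across tasks.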

exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources - ACL Anthology

2021-04-11T10:13:43Z

**Focus on the Embedding of Domain-specific Vocabulary.** > exBERT adds a new domain-specific vocabulary and the corresponding embedding layer, as well as a small extension module to the original unmodified model > a pretraining method allowing **low-cost embedding of domain-specific vocabulary in the context of an existing large pre-trained model such as BERT** > exBERT... explicitly incorporates the new domain’s vocabulary, while being able to **reuse the original pre-trained model’s weights as is** to reduce required computation and training data. Specifically, exBERT extends BERT by augmenting its embeddings for the original vocabulary with new embeddings for the domain-specific vocabulary via **a learned small “extension” module**. **The output of the original and extension modules are combined via a trainable weighted sum operation** Somewhat similar to the concept developed in [[1902.00751] Parameter-Efficient Transfer Learning for NLP](doc:2021/04/1902_00751_parameter_efficien), but not in the fine-tuning paradigm. [Github](https://github.com/cgmhaicenter/exBERT)

[1910.02227] Making sense of sensory input

2021-04-10T19:09:06Z

> what does it mean to “make sense” of a sensory sequence? Our answer is that making sense means constructing a symbolic theory containing a set of objects that persist over time, with attributes that change over time, according to general laws. This theory must both explain the sensory input, and satisfy unity conditions [the constituents of our theory – objects, properties, and atoms – must be integrated into a coherent whole] Sequel: [Making sense of raw input](doc:2021/05/making_sense_of_raw_input)

[1901.04085] Passage Re-ranking with BERT

2021-03-26T01:49:42Z

a simple re-implementation of BERT for query-based passage re-ranking ["Slides of our WSDM 2021 tutorial "Pretrained Transformers for Text Ranking: BERT and Beyond"](doc:2021/03/rodrigo_nogueira_sur_twitter_)

SentenceTransformers Documentation

2021-03-25T19:05:01Z

Rodrigo Nogueira sur Twitter : "Slides of our WSDM 2021 tutorial "Pretrained Transformers for Text Ranking: BERT and Beyond"

2021-03-09T08:09:28Z

Zero-Shot Learning in Modern NLP | Joe Davison Blog (2020-05)

2021-02-23T13:44:34Z

> state-of-the-art NLP models for sequence classification without large annotated training sets. Simple idea: use a single model (e.g. [Sentence-BERT](tag:sbert)) to embed both the text data and the class names into the same space. Problem: Sentence-BERT is designed to learn effective sentence-level, not single- or multi-word representations like our class names -> the label embeddings may not be as semantically salient as word-level embedding methods (i.e. word2vec). Solution 1: Learn a projection from sentence-level embeddings of words to word2vec embeddings, use it for encoding when learning the classifier. Can be adapted to few-shot learning. Solution 2: "Classification as [#Natural Language Inference](tag:nli)". > A method which not only embeds sequences and labels into the same latent space where their distance can be measured, but that can actually tell us something about the compatibility of two distinct sequences out of the box.

kamalkraj/BERT-NER: Pytorch-Named-Entity-Recognition-with-BERT

2021-02-07T11:37:39Z

Use Google BERT to do CoNLL-2003 NER!

[1911.03681] E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT

2021-01-12T18:31:21Z

> way of **injecting factual knowledge about entities into the pretrained BERT model**. (Feeding entity vectors into BERT as if they were wordpiece vectors without additional encoder pretraining) > > **We align [Wikipedia2Vec](tag:wikipedia2vec) entity vectors (Yamada et al., 2016) with BERT's native wordpiece vector space and use the aligned entity vectors as if they were wordpiece vectors**. The resulting entity-enhanced version of BERT (called E-BERT) is similar in spirit to [ERNIE](tag:ernie) (Zhang et al., 2019) and [KnowBert](tag:knowbert) (Peters et al., 2019), but it **requires no expensive further pretraining of the BERT encoder**. > > Our vector space alignment strategy is inspired by cross-lingual word vector alignment Related work on Entity-enhanced BERT: > ([ERNIE](doc:2019/08/_1905_07129_ernie_enhanced_la) and [Knowbert](doc:2020/05/1909_04164_knowledge_enhanced)) are based on the design principle that BERT be adapted to entity vectors. They introduce new encoder layers to feed pretrained entity vectors into the Transformer, and they require additional pretraining to integrate the new parameters. In contrast, E-BERT’s design principle is that entity vectors be adapted to BERT. > > Two other knowledge-enhanced MLMs are [KEPLER](doc:2020/11/1911_06136_kepler_a_unified_) (Wang et al., 2019c) and K-Adapter (Wang et al., 2020)... Their factual knowledge does not stem from entity vectors – instead, they are trained in a multi-task setting on relation classification and knowledge base completion. Not to be confused with [[2009.02835] E-BERT: A Phrase and Product Knowledge Enhanced Language Model for E-commerce](doc:2020/12/2009_02835_e_bert_a_phrase_a)

google/tapas-base-finetuned-wtq · Hugging Face

2020-12-17T22:40:56Z

> a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion

[2009.02835] E-BERT: A Phrase and Product Knowledge Enhanced Language Model for E-commerce

2020-12-14T11:10:29Z

E-BERT, a pre-training framework for product data. 1. to benefit from phrase-level knowledge: Adaptive Hybrid Masking, a new masking strategy, which allows the model to adaptively switch from learning preliminary word knowledge to learning complex phrases 2. leveraging product-level knowledge: training E-BERT to predict a product’s associated neighbors (product association) Resources used: - description of millions of products from the amazon dataset (title, description, reviews) - e-commerce phrases: extracted from above dataset using [AutoPhrase](doc:2020/12/autophrase_automated_phrase_mi) - product association graph: pairs of substitutable and complementary products extracted from amazon dataset Not to be confused with [[1911.03681] E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT](doc:2021/01/1911_03681_e_bert_efficient_)

[2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training

2020-12-12T02:30:25Z

**Augment language model pre-training with a retriever module**, which is trained using the masked language modeling objective. > To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. **For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner**, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents Hum, #TODO: parallel to be drawn with techniques in [KG-augmented Language Models](tag:knowledge_graph_augmented_language_models) which focus "on the problem of capturing declarative knowledge in the learned parameters of a language model." [Google AI Blog Post](doc:2020/08/google_ai_blog_realm_integrat) [Summary](https://joeddav.github.io/blog/2020/03/03/REALM.html) for the [Hugging Face awesome-papers reading group](doc:2021/03/huggingface_awesome_papers_pap)

Supporting content decision makers with machine learning | Dec, 2020 | Netflix TechBlog

2020-12-11T13:34:30Z

Google AI Blog: Reformer: The Efficient Transformer

2020-12-09T12:07:13Z

Keyword Extraction with BERT | Towards Data Science

2020-12-06T10:07:17Z

A minimal method for extracting keywords and keyphrases. [GitHub](https://github.com/MaartenGr/KeyBERT/) > uses BERT-embeddings and simple cosine similarity to find the sub-phrases in a document that are the most similar to the document itself.
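
A rough sketch of the ranking step described above: score each candidate phrase by its cosine similarity to the whole document and keep the top ones. The `embed` function here is a bag-of-words stand-in chosen so the example runs without any model; it is an assumption for illustration, not how KeyBERT embeds text, and `top_keyphrases` is a hypothetical name.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: bag-of-words counts (a real setup would use BERT)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_keyphrases(document, candidates, k=2):
    """Rank candidate phrases by similarity to the document embedding."""
    doc_vec = embed(document)
    ranked = sorted(candidates, key=lambda c: cosine(embed(c), doc_vec), reverse=True)
    return ranked[:k]

document = "supervised learning maps inputs to outputs using labeled training data"
candidates = ["supervised learning", "labeled training data", "weather forecast"]
keyphrases = top_keyphrases(document, candidates)
```

Swapping `embed` for a contextual sentence encoder keeps the rest of the logic unchanged.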

Salmon Run: Word Sense Disambiguation using BERT as a Language Model

2020-12-01T15:45:06Z

Domain-Specific BERT Models · Chris McCormick

2020-12-01T15:08:22Z

Chances are you won’t be able to pre-train BERT on your own dataset, for the following reasons: 1. Pre-training BERT requires a huge corpus 2. Huge Model + Huge Corpus = Lots of GPUs

[2010.01057] LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

2020-11-26T16:21:30Z

> LUKE is based on bidirectional Transformer, treats words and entities in a text as independent tokens, and outputs contextualized representations of them. The representations can be used to address downstream tasks similarly to BERT. [src](https://twitter.com/ikuyamada/status/1312947499141750786) > LUKE is trained using a novel pretraining task that involves predicting randomly masked words (equivalent to BERT’s masked language model) and entities in an entity-annotated corpus obtained from Wikipedia. (Hmm, this reminds me of something) > LUKE also uses a new *entity-aware* self-attention mechanism that considers the types of tokens (words or entities) when computing attention scores. [github](https://github.com/studio-ousia/luke), [at Hugging Face](https://twitter.com/AkariAsai/status/1389428550298525696), [doc](https://huggingface.co/transformers/model_doc/luke.html), [tweet](https://twitter.com/ikuyamada/status/1392742990586683392?s=20)

raphaelsty/ckb: Contextual knowledge bases

2020-11-09T16:10:42Z

An implementation of [BLP](tag:blp) [[2010.03496] Inductive Entity Representations from Text via Link Prediction](doc:2020/11/2010_03496_inductive_entity_r)

[2010.03496] Inductive Entity Representations from Text via Link Prediction

2020-11-03T16:38:59Z

BLP "BERT for Link Prediction". Central idea: **training an entity encoder with a link prediction objective** (using the textual descriptions of entities when computing entity representations - hence not failing with entities unknown in training) > a method for **learning representations of entities**, that uses a **pre-trained Transformer** based architecture as an entity encoder, and **link prediction training on a knowledge graph with textual entity descriptions**. > using entity descriptions, an entity encoder is trained for link prediction in a knowledge graph. The encoder can then be used without fine-tuning to obtain features for entity classification and information retrieval Cites [Xie et al](doc:2020/10/representation_learning_of_know) and [Kepler](doc:2020/11/1911_06136_kepler_a_unified_). They claim that their objective targeted exclusively for link prediction (and not an objective that combines language modeling and link prediction as Kepler) performs better than Kepler's more complex one.

Which flavor of BERT should you use for your QA task? | by Olesya Bondarenko | Towards Data Science

2020-10-04T23:31:57Z

A guide to choosing and benchmarking BERT models for question answering

[2010.00402] From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

2020-10-03T14:46:20Z

> The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), **allowing us to search the space of discrete binary trees with continuous optimization**. Cites [Dasgupta: A cost function for similarity-based hierarchical clustering](https://arxiv.org/abs/1510.05043)

Google AI Blog: REALM: Integrating Retrieval into Language Representation Models

2020-08-13T10:09:38Z

> a new open-source method for language model pre-training that uses a supplemental knowledge retriever that enables it to perform well on knowledge-intensive tasks without billions of parameters. > > The key intuition of REALM is that a retrieval system should improve the model's ability to fill in missing words [Paper: REALM: Retrieval-Augmented Language Model Pre-Training](doc:2020/12/2002_08909_realm_retrieval_a)

UKPLab/sentence-transformers: Sentence Embeddings with BERT & XLNet

2020-07-14T19:08:40Z

[paper](doc:2019/08/_1908_10084_sentence_bert_sen)

How to use BERT for finding similar sentences or similar news? · Issue #876 · huggingface/transformers

2020-07-12T15:26:41Z

links to [UKPLab/sentence-transformers](doc:2020/07/ukplab_sentence_transformers_s) [Another answer](https://github.com/huggingface/transformers/issues/2986)

[2004.07202] Entities as Experts: Sparse Memory Access with Entity Supervision

2020-07-11T15:09:10Z

> We focus on the problem of **capturing declarative knowledge in the learned parameters of a language model**... > Entities as Experts (EaE) can access distinct memories of the entities mentioned in a piece of text; > To understand the motivation for distinct and independent entity representations: A traditional Transformer would need to build an internal representation of Charles Darwin from the words “Charles” and “Darwin”... Conversely, EAE can access a dedicated representation of “Charles Darwin”, which is a memory of all of the contexts in which this entity has previously been mentioned.... Having retrieved and re-integrated this memory it is much easier for EAE to relate the question to the answer > EaE's entity representations are learned directly from text. Correct identification, and representation, of entities is essential to EaE's performance Based on transformer architecture Extension: [Facts as Experts](doc:2020/07/2007_00849_facts_as_experts_)

[2007.00849] Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge

2020-07-09T23:54:59Z

> a neural language model that includes **an explicit interface between symbolically interpretable factual information and subsymbolic neural knowledge.**... **The model can be updated without re-training by manipulating its symbolic representations**. In particular this model allows us to add new facts and overwrite existing ones. > a **neural language model which learns to access information in a symbolic knowledge graph.** > This model builds on the recently-proposed [Entities as Experts](doc:2020/07/2004_07202_entities_as_expert) (EaE) language model (Févry et al., 2020), which extends the same transformer (Vaswani et al., 2017) architecture of BERT (Devlin et al., 2019) with an additional external memory for entities. > > After training EaE, the embedding associated with an entity will (ideally) capture information about the textual context in which that entity appears, and by inference, the entity’s semantic properties > > we include an additional memory called a fact memory, which encodes triples from a symbolic KB. > > This combination results in a neural language model which learns to access information in a the symbolic knowledge graph. TODO: - read again IBM's [Span Selection Pre-training for Question Answering](doc:2019/09/_1909_04120_span_selection_pre) ("an effort to avoid encoding general knowledge in the transformer network itself") - compare with [[1907.05242] Large Memory Layers with Product Keys](doc:2019/07/_1907_05242_large_memory_layer) - how does it relate with [[2002.08909] REALM: Retrieval-Augmented Language Model Pre-Training](doc:2020/12/2002_08909_realm_retrieval_a)?

BERT Word Embeddings Tutorial · Chris McCormick

2020-07-06T14:51:33Z

Learning to Tag OOV Tokens by Integrating Contextual Representation and Background Knowledge (ACL Anthology 2020)

2020-07-04T11:34:35Z

Aim to leverage both contextual representation of input text (deep LMs) and knowledge derived from curated KBs ([Wordnet](tag:wordnet)) to improve [slot tagging](tag:slot_tagging) in the presence of [out-of-vocab](tag:oov) words ([few-shot scenario](tag:few_shot_learning)) Method: 1. retrieve potentially relevant KB entities and encode them into distributed representations that describe global graph-structured information 2. BERT encoder layer to capture context-aware representations of the sequence and attend to the KB embeddings using multi-level graph attention 3. integrate BERT embeddings and the KB embeddings to predict the slot type Contributions: 1. feasibility of applying lexical ontology to facilitate recognizing OOV words. First to consider the large-scale background knowledge for enhancing context-aware slot tagging models. 2. a knowledge integration mechanism that uses multi-level graph attention to model explicit lexical relations. 3.experiments on two benchmark datasets > our method makes a notable difference in a scenario where samples are linguistically diverse, and large vocab exists. (Better improvements when using RNN than BERT, because BERT already contains a lot of background knowledge)

Patrick von Platen on Twitter: "Today, @huggingface is the start of our Reformer series..."

2020-06-29T19:07:30Z

[2001.04451] Reformer: The Efficient Transformer

2020-06-29T19:04:03Z

Representation Learning for Information Extraction from Form-like Documents – Google Research

2020-06-15T22:58:48Z

> a novel approach using representation learning for tackling the problem of **extracting structured information from form-like document images**. We propose an **extraction system that uses knowledge of the types of the target fields to generate extraction candidates**, and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. [Blog post](doc:2020/06/google_ai_blog_extracting_stru)

Google AI Blog: Extracting Structured Data from Templatic Documents (2020)

2020-06-15T22:51:23Z

[About this paper](doc:2020/06/representation_learning_for_inf) Templatic documents (eg. invoices): such documents do not contain “natural language” but instead resemble forms, with data often presented in tables > an approach that **uses knowledge of target field types to identify candidate fields**. These are then scored using **a neural network that learns a dense representation of each candidate using the words in its neighborhood**. Experiments on two corpora (invoices and receipts) show that we’re able to generalize well to unseen layouts. > > An understanding of the **two-dimensional layout of text** on the page is key to understanding such documents. On the other hand, treating this purely as an image segmentation problem makes it difficult to take advantage of the semantics of the text. > > Our approach to this problem allows developers to train and deploy an extraction system for a given domain (like invoices) using **two inputs — a target schema (i.e., a list of fields to extract and their corresponding types) and a small collection of documents labeled with the ground truth for use as a training set** - The input document is first run through an [OCR service](doc:2020/06/detecter_le_texte_dans_les_fich). - a candidate generator identifies spans of text in the OCR output that might correspond to an instance of a given field (uses pre-existing libraries associated with each field type) - Each candidate is then scored using a neural network (that is trained as a binary classifier)
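
The three pipeline steps above (OCR tokens, type-based candidate generation, candidate scoring) can be mimicked in a few lines. Everything here is a simplification made for illustration: the regex patterns, the cue-word scorer (standing in for the trained neural scorer) and the left-only 1-D neighborhood (the post stresses the 2-D layout) are assumptions, and all function names are hypothetical.

```python
import re

# Hypothetical per-type candidate detectors (the post mentions pre-existing
# libraries per field type; plain regexes are used here instead).
CANDIDATE_PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"\d+\.\d{2}"),
}

def generate_candidates(tokens, field_type):
    """Return indices of tokens that look like an instance of the field type."""
    pattern = CANDIDATE_PATTERNS[field_type]
    return [i for i, tok in enumerate(tokens) if pattern.fullmatch(tok)]

def score_candidate(tokens, index, cue_words, window=2):
    """Toy scorer: count cue words among the tokens just before the candidate."""
    neighbors = tokens[max(0, index - window):index]
    return sum(1 for w in neighbors if w.lower().strip(":") in cue_words)

def extract(tokens, field_type, cue_words):
    """Pick the best-scoring candidate for a field, or None if there is none."""
    candidates = generate_candidates(tokens, field_type)
    if not candidates:
        return None
    best = max(candidates, key=lambda i: score_candidate(tokens, i, cue_words))
    return tokens[best]

# Pretend OCR output for a small invoice.
tokens = "Invoice date: 2020-06-01 Due date: 2020-07-01 Total: 120.00".split()
due_date = extract(tokens, "date", {"due"})
invoice_date = extract(tokens, "date", {"invoice"})
total = extract(tokens, "amount", {"total"})
```

The point of the split is that the candidate generator needs only the schema's field types, while the scorer is the only part that has to be learned from labeled documents.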

[1804.03235] Large scale distributed neural network training through online distillation

2020-06-06T16:51:26Z

> we use *codistillation* to refer to distillation performed: > 1. using the same architecture for all the models; > 2. using the same dataset to train all the models; and > 3. using the distillation loss during training before any model has fully converged. > In general, we believe the quality gains of codistillation over well-tuned offline distillation will be minor in practice and the more interesting research direction is exploring codistillation as a distributed training algorithm > Codistillation with the same data seems to be slightly better than the baseline, but codistillation using different data gets much better results. These results show that the codistilling models are indeed successfully transmitting useful information about different parts of the training data to each other. Related to ["Deep mutual learning"](doc:2020/05/1706_00384_deep_mutual_learni) paper
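
A minimal sketch of what a per-example codistillation objective could look like: the usual cross-entropy against the label plus a term pulling the model towards a peer's current predictions. The cross-entropy form of the distillation term and the `distill_weight` value are assumptions for illustration; the paper's exact choices may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy of predicted probabilities against a target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))

def codistillation_loss(own_logits, peer_logits, label, distill_weight=0.5):
    """Label loss plus a weighted term matching the peer model's predictions."""
    own = softmax(own_logits)
    peer = softmax(peer_logits)  # treated as a fixed target, not trained through
    one_hot = [1.0 if i == label else 0.0 for i in range(len(own))]
    return cross_entropy(one_hot, own) + distill_weight * cross_entropy(peer, own)

agreeing = codistillation_loss([2.0, 0.0], [2.0, 0.0], label=0)
disagreeing = codistillation_loss([2.0, 0.0], [0.0, 2.0], label=0)
```

Each model in the group would compute this with the other models' (possibly stale) predictions as `peer_logits`, which is what makes it usable as a distributed training scheme.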

On word embeddings

2020-06-05T01:31:14Z

History of word embeddings in the context of language modelling. [Next post in the series](doc:2020/06/approximating_the_softmax_for_l)

[1909.04164] Knowledge Enhanced Contextual Word Representations

2020-05-13T01:44:51Z

General method to **embed multiple knowledge bases into pre-trained language models** (KB in the sense of a fixed collection of entity nodes) > The key idea is to explicitly model entity spans in the input text and use an **entity linker** to retrieve relevant entity embeddings from a KB to form knowledge enhanced entity-span representations. > Then, update contextual word representations via a form of **word-to-entity attention**. > In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that **combines a small amount of entity linking supervision with a large amount of raw text**.

Thomas Kipf's PhD thesis: "Deep Learning with Graph-Structured Representations"

2020-05-05T15:47:55Z

Covers a range of emerging topics in Deep Learning: from graph neural nets (and graph convolutions) to structure discovery (objects, relations, events)

[1911.03814] Scalable Zero-shot Entity Linking with Dense Entity Retrieval

2020-05-02T11:43:47Z

> a two stage approach, based on fine-tuned BERT architectures. In the first stage, we do retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions (Humeau et al., 2019; Gillick et al., 2019). Each retrieved candidate is then examined more carefully with a cross-encoder that concatenates the mention and entity text,
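
The retrieve-then-rerank flow quoted above, as a toy sketch. The 2-d vectors stand in for bi-encoder outputs and `toy_cross_score` (plain word overlap) stands in for the cross-encoder; both are assumptions made so the example runs without a model, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(mention_vec, entity_vecs, k):
    """Stage 1: rank independently embedded entities by dot product, keep top k."""
    ranked = sorted(entity_vecs, key=lambda e: dot(mention_vec, entity_vecs[e]), reverse=True)
    return ranked[:k]

def toy_cross_score(mention_text, description):
    """Stand-in for a cross-encoder: word overlap between mention and description."""
    return len(set(mention_text.lower().split()) & set(description.lower().split()))

def rerank(mention_text, candidates, descriptions, cross_score):
    """Stage 2: score each (mention, description) pair jointly and pick the best."""
    return max(candidates, key=lambda e: cross_score(mention_text, descriptions[e]))

entity_vecs = {
    "Paris_France": [0.9, 0.1],
    "Paris_Texas": [0.8, 0.3],
    "Berlin": [0.1, 0.9],
}
descriptions = {
    "Paris_France": "capital city of france",
    "Paris_Texas": "city in texas united states",
    "Berlin": "capital city of germany",
}
mention_text = "she moved to paris in texas"
mention_vec = [1.0, 0.2]  # pretend bi-encoder embedding of the mention context

candidates = retrieve(mention_vec, entity_vecs, k=2)
linked = rerank(mention_text, candidates, descriptions, toy_cross_score)
```

Entity vectors can be precomputed once for stage 1, while the more expensive joint scoring only runs on the short candidate list.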

BERT, ELMo, & GPT-2: How Contextual are Contextualized Word Representations? | SAIL Blog

2020-03-28T10:33:17Z

[1909.03193] KG-BERT: BERT for Knowledge Graph Completion

2020-03-22T18:56:43Z

Pre-trained language models for knowledge graph completion. **Triples are treated as textual sequences**. (Hmm, I've seen this somewhere before. Ah, maybe [RDF2VEC](tag:rdf2vec)? // TODO: check) Takes entity and relation descriptions of a triple as input and computes the scoring function of the triple with the KG-BERT language model > we first treat entities, relations and triples as textual sequences and turn knowledge graph completion into a sequence classification problem. We then fine-tune BERT model on these sequences for predicting the plausibility of a triple or a relation. The method [GitHub](https://github.com/yao8839836/kg-bert)
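
To make "triples as textual sequences" concrete, here is a small sketch that flattens a triple into one classifier input and builds a negative example by corrupting the tail. The `[CLS]`/`[SEP]` layout follows the usual BERT convention and is assumed here, as are the helper names and the toy entities.

```python
import random

def triple_to_sequence(head_text, relation_text, tail_text):
    """Flatten a triple into one BERT-style input string."""
    return f"[CLS] {head_text} [SEP] {relation_text} [SEP] {tail_text} [SEP]"

def corrupt_tail(triple, entities, rng):
    """Build a negative example by swapping the tail for another entity."""
    head, relation, tail = triple
    replacement = rng.choice([e for e in entities if e != tail])
    return (head, relation, replacement)

entities = ["Steve Jobs", "Apple Inc.", "Paris"]
positive = ("Steve Jobs", "founded", "Apple Inc.")
negative = corrupt_tail(positive, entities, random.Random(0))

# (sequence, label) pairs for a binary plausibility classifier.
training_pairs = [
    (triple_to_sequence(*positive), 1),
    (triple_to_sequence(*negative), 0),
]
```

With entity descriptions in place of the short names, the same pairs would be what a fine-tuned sequence classifier scores for plausibility.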

[1909.07606] K-BERT: Enabling Language Representation with Knowledge Graph

2020-03-08T22:54:15Z

a knowledge-enabled language representation model (K-BERT) with knowledge graphs (KGs), in which triples are injected into the sentences as domain knowledge (Summarized in [Domain adaptation of word embeddings through the exploitation of in-domain corpora and knowledge bases (PhD Thesis 2021)](doc:2022/03/domain_adaptation_of_word_embed), p43)

Unsupervised NER using BERT - Hands-on NLP model review - Quora

2020-03-06T00:12:06Z

[GitHub](https://github.com/ajitrajasekharan/unsupervised_NER)

[2002.12327] A Primer in BERTology: What we know about how BERT works

2020-02-28T13:25:30Z

(article praised on [twitter](https://twitter.com/dennybritz/status/1233343170596917248?s=20) by D Britz and Y. Goldberg)

[2002.11402] Detecting Potential Topics In News Using BERT, CRF and Wikipedia

2020-02-27T23:36:54Z

Distilling BERT models with spaCy - Towards Data Science (2019)

2020-02-15T11:15:11Z

Hugging Face on Twitter: DistilBERT-cased for Question Answering w/ just 3 lines of javascript

2020-02-14T00:23:36Z

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

2020-02-11T22:56:31Z

> It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. indeed, cf. Facebook's paper [Language Models as Knowledge Bases?](/doc/2019/09/_1909_01066_language_models_as) > In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. > we show that a large language model pre-trained on unstructured text can attain competitive results on open-domain question answering benchmarks without any access to external knowledge BUT: >1. state-of-the-art results only with the largest model which had 11 billion parameters. >1. “open-book” models typically provide some indication of what information they accessed when answering a question that provides a useful form of interpretability. In contrast, our model distributes knowledge in its parameters in an inexplicable way, which precludes this form of interpretability. >1. **the maximum-likelihood objective provides no guarantees as to whether a model will learn a fact or not.** So, what's the point? To be compared with this [IBM's paper](/doc/2019/09/_1909_04120_span_selection_pre): "a new pre-training task inspired by reading comprehension and an effort to avoid encoding general knowledge in the transformer network itself"

Adam Roberts on Twitter: "New preprint: How Much Knowledge Can You Pack into the Parameters of a Language Model?..."

2020-02-11T12:24:21Z

[paper](/doc/2020/02/how_much_knowledge_can_you_pack)

[1911.05507] Compressive Transformers for Long-Range Sequence Modelling

2020-02-11T08:48:20Z

> the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. [Blog post](/doc/2020/02/a_new_model_and_dataset_for_lon)

A new model and dataset for long-range memory | DeepMind

2020-02-11T08:40:48Z

the use of memory in deep learning, and how modelling language may be an ideal task for developing better memory architectures [paper](/doc/2020/02/_1911_05507_compressive_transf)

[2002.02925] BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

2020-02-10T21:50:03Z

approach to compress BERT by progressive module replacing. > Compared to the previous knowledge distillation approaches for BERT compression, our approach leverages only one loss function and one hyper-parameter [Github](https://github.com/JetRunner/BERT-of-Theseus)

Canwen Xu on Twitter: "WTF? We brutally dismember BERT and replace all his organs?"

2020-02-10T09:21:44Z

[paper](/doc/2020/02/_2002_02925_bert_of_theseus_c)

[1503.03832] FaceNet: A Unified Embedding for Face Recognition and Clustering

2020-01-25T01:03:31Z

Learns a Euclidean embedding per image > Uses a deep CNN trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. > state-of-the-art face recognition performance using only **128-bytes per face**.
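
The triplet objective mentioned above can be written in a few lines: push the anchor-positive squared distance below the anchor-negative one by at least a margin. The margin value and the 2-d vectors are illustrative assumptions; the online triplet mining from the paper is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is farther than the positive by `margin`."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)

anchor = [0.0, 0.0]
same_person = [0.1, 0.0]
other_person = [1.0, 0.0]

easy = triplet_loss(anchor, same_person, other_person)   # already well separated
hard = triplet_loss(anchor, other_person, same_person)   # roles swapped: violated
```

Triplets like `easy` contribute no gradient, which is why the choice of triplets during training matters.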

Paris NLP Season 4 Meetup #3 – Paris NLP (2020)

2020-01-23T22:26:20Z

- Siamese CNN for jobs-candidate matching: learning document embeddings with triplet loss. - Sesame street-based naming schemes must fade out, long live CamemBERT et le French fromage!

Semantic Text Matching for Long-Form Documents (2019)

2020-01-23T10:21:17Z

**A document can be represented as a hierarchy of paragraph, sentence and word sequences.** Different paragraphs and sentences can have different semantic meaning and importance. A multi-depth attention-based hierarchical RNN derives representations for each level of document structure, which are then aggregated to build a representation of the entire document. Uses a Siamese structure for semantic text matching.

Building a Search Engine with BERT and TensorFlow - Towards Data Science

2020-01-12T17:13:45Z

[somewhat related](/doc/2020/01/elasticsearch_meets_bert_build)

Elasticsearch meets BERT: Building Search Engine with Elasticsearch and BERT

2020-01-10T17:23:50Z

- Links to [this ES blog post](/doc/2020/01/text_similarity_search_in_elast) - [somewhat related](/doc/2020/01/building_a_search_engine_with_b)

NLP's Clever Hans Moment has Arrived

2020-01-10T16:33:27Z

Do neural networks learn what we think they learn? @benbenhh reviews research that suggests that they often instead fall prey to the so-called Clever Hans effect and discusses its implications for NLP.

[2003.05473] Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking (CoNLL 2019)

2020-01-09T10:36:17Z

Training BERT-base-uncased on English Wikipedia, then fine-tuning and evaluating it on an entity linking (EL) benchmark (EL implemented as a token classification over the entity vocabulary) > BERT+Entity is a straightforward extension on top of BERT, i.e. we initialize BERT with the publicly available weights from the BERT-base-uncased model and add an output classification layer on top of the architecture. Given a contextualized token, the classifier computes the probability of an entity link for each entry in the entity vocabulary. Can BERT’s architecture learn all entity linking steps jointly? To answer: > an extreme simplification of the **entity linking setup that works surprisingly well**: simply cast it as **a per token classification over the entire entity vocabulary** (over 700K classes in our case). > the model is the first that performs entity linking without any pipeline or any heuristics, compared to all prior approaches. We found that with our approach we can learn additional entity knowledge in BERT that helps in entity linking. **However, we also found that almost none of the downstream tasks really required entity knowledge**. ### Related work - > [Durrett and Klein (2014)](/doc/2020/01/a_joint_model_for_entity_analys) were the first to propose jointly modelling Mention detection, Candidate generation and Entity disambiguation in a graphical model and could show that each of those steps are interdependent and benefit from a joint objective This paper uses neural techniques instead of CRF. - > [Yamada](/showprop.do?pptyuri=http%3A%2F%2Fwww.semanlink.net%2F2001%2F00%2Fsemanlink-schema%23arxiv_author&pptyval=Ikuya%2BYamada) (2016, 2017) was the first to investigate neural text representations and entity linking, but their approach is limited to ED. cf. [#Wikipedia2Vec](tag:wikipedia2vec). Compare with [newer work by Yamada](doc:2020/09/1909_01259_neural_attentive_b)
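
A toy version of "per token classification over the entity vocabulary": one linear layer followed by a softmax, applied to each contextualized token vector. The three-entry vocabulary, the `<no_entity>` class, the 2-d token vectors and the weights are all made up for illustration; the paper's vocabulary has over 700K classes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def link_tokens(token_vecs, entity_weights, entity_names):
    """For each token vector, return the most probable entity and its probability."""
    links = []
    for vec in token_vecs:
        scores = [sum(w * x for w, x in zip(row, vec)) for row in entity_weights]
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        links.append((entity_names[best], probs[best]))
    return links

entity_names = ["<no_entity>", "Charles_Darwin", "Darwin,_Australia"]
entity_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one row per vocabulary entry
token_vecs = [[2.0, 0.0], [0.0, 3.0]]  # pretend contextualized vectors for two tokens

links = link_tokens(token_vecs, entity_weights, entity_names)
```

Mention detection, candidate generation and disambiguation all collapse into this single classification step, which is the simplification the note highlights.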

Named Entity Recognition with Bert – Depends on the definition

2020-01-09T02:01:52Z

> how you can finetune the Bert model to do state-of-the art named entity recognition Same author: [NER with Lime](/doc/2020/01/interpretable_named_entity_reco)

[1902.10909] BERT for Joint Intent Classification and Slot Filling

2020-01-09T01:13:39Z

> Experimental results show that our proposed joint BERT model outperforms BERT models modeling intent classification and slot filling separately, demonstrating the efficacy of exploiting the relationship between the two tasks. Adding a CRF on top of the model doesn't improve the results.

Richer Sentence Embeddings using Sentence-BERT — Part I

2020-01-06T01:48:12Z

Simple (and often used) methods for sentence embeddings with BERT are too simplistic to be good (averaging the word vectors, or using the \[CLS\] special vector at the start of the sequence). [About this paper](/doc/2019/08/_1908_10084_sentence_bert_sen)

Lecture 14 – Contextual Vectors | Stanford CS224U: Natural Language Understanding | Spring 2019

2020-01-05T18:17:47Z

Artificial Human Intelligence: The Programmer’s Apprentice - Tom Dean and Rishabh Singh - Google Research

2019-11-16T20:16:43Z

> Our primary objective is to build an end-to-end system for an individualized personal assistant that focuses on a specific area of expertise, namely software engineering, that learns from experience, works collaboratively with an expert programmer and that provides value from day one. > Our **goal in developing systems that incorporate characteristics of human intelligence** is two fold: humans provide a complete solution that we can use as a basic blueprint and then improve upon, and **the resulting AI systems are likely to be well suited to developing assistants** that complement and extend human intelligence while **operating in a manner comprehensible to our understanding**

[1807.00082] Amanuensis: The Programmer's Apprentice

2019-11-12T16:25:10Z

**The use of natural language to facilitate communication between the expert programmer and apprentice AI system.** > an overview of the material covered in a course taught at Stanford in the spring quarter of 2018. The course draws upon **insight from cognitive and systems neuroscience to implement hybrid connectionist and symbolic reasoning systems** that leverage and extend the state of the art in machine learning **by integrating human and machine intelligence**. As a concrete example we focus on digital assistants that learn from continuous dialog with an expert software engineer while providing initial value as powerful analytical, computational and mathematical savants. > [#Dehaene](/tag/stanislas_dehaene)'s work extends the [#Global Workspace Theory](/tag/global_workspace_theory) of Bernard Baars. Dehaene’s version of the theory combined with Yoshua Bengio’s concept of a [#consciousness prior](/tag/consciousness_prior.html) and deep reinforcement learning suggest a model for constructing and maintaining the cognitive states that arise and persist during complex problem solving.

CamemBERT

2019-11-10T18:08:18Z

language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the OSCAR multilingual corpus

[1911.01464] Emerging Cross-lingual Structure in Pretrained Language Models

2019-11-06T13:09:03Z

BERT is now part of Google Search, so let’s understand how it reasons

2019-10-31T08:28:40Z

Restoring ancient text using deep learning: a case study on Greek epigraphy | DeepMind

2019-10-18T00:50:20Z

Language and Perception in Deep Learning - Florian Strub DeepMind, Univ. Lille, Inria

2019-10-07T23:08:40Z

A [Related paper](/doc/2019/10/feature_wise_transformations)

Meet ALBERT: a new ‘Lite BERT’ from Google & Toyota with State of the Art NLP performance and 18x fewer parameters.

2019-10-01T15:21:13Z

Evolution of Representations in the Transformer (2019)

2019-09-16T22:02:56Z

Blog post about [this paper](http://127.0.0.1:8080/semanlink/doc/2019/09/_1909_01380_the_bottom_up_evol)

Introducing Neural Structured Learning in TensorFlow

2019-09-03T19:01:32Z

Neural Structured Learning (NSL) is an open source framework for training deep neural networks with structured signals. It implements Neural Graph Learning, which enables developers to train neural networks using graphs.

Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT

2019-08-28T22:47:20Z

[1908.10084] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

2019-08-28T22:41:55Z

> Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive **semantically meaningful sentence embeddings** that can be compared using cosine-similarity. Important because - BERT is unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. - simple methods such as using the CLS token give low quality sentence embeddings However, the purpose of SBERT sentence embeddings is **not to be used for transfer learning for other tasks**. [Related blog post](/doc/2020/01/richer_sentence_embeddings_usin); [Github](https://github.com/UKPLab/sentence-transformers)

[1808.02590] A Tutorial on Network Embeddings

2019-08-25T02:02:16Z

Watch Your Step: Learning Node Embeddings via Graph Attention

2019-08-23T00:32:38Z

[1905.07129] ERNIE: Enhanced Language Representation with Informative Entities

2019-08-05T15:40:17Z

> We argue that informative entities in **KGs can enhance language representation with external knowledge**. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. > ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks [GitHub](https://github.com/thunlp/ERNIE) WARNING, there is another ERNIE (by [NLP@Baidu](tag:nlp_baidu)): Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced representation through knowledge integration. This doesn't happen when you choose François-Paul as the name for your child.

What is XLNet and why it outperforms BERT - Towards Data Science

2019-08-02T17:46:14Z

A2N: Attending to Neighbors for Knowledge Graph Inference - ACL 2019

2019-07-31T19:37:20Z

> State-of-the-art models for knowledge graph completion aim at learning a fixed embedding representation of entities in a multi-relational graph which can generalize to infer unseen entity relationships at test time. This can be sub-optimal as it requires memorizing and generalizing to all possible entity relationships using these fixed representations. We thus propose a novel **attention-based method to learn query-dependent representation of entities** which adaptively combines the relevant graph neighborhood of an entity leading to more accurate KG completion.

BERT's success in some benchmark tests may be simply due to the exploitation of spurious statistical cues in the dataset. Without them it is no better than random. : MachineLearning

2019-07-24T01:35:24Z

[1907.07355] Probing Neural Network Comprehension of Natural Language Arguments

2019-07-24T01:34:54Z

what has BERT learned about argument comprehension? [Comments](/doc/2019/07/bert_s_success_in_some_benchmar)

Google AI Blog: Harnessing Organizational Knowledge for Machine Learning (2019)

2019-06-28T02:00:39Z

How existing knowledge in an organization can be used as noisier, higher-level supervision (often termed weak supervision) to quickly label large training datasets. Snorkel Drybell is an experimental internal system that adapts the open-source Snorkel framework to **use diverse organizational knowledge resources (internal models, ontologies, legacy rules, knowledge graphs and more) in order to generate training data** for machine learning models at web scale. It enables writing **labeling functions** that label training data programmatically. [paper](/doc/2019/06/_1812_00417_snorkel_drybell_a)
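
To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use the Snorkel API: the function names, the keyword rules, and the majority-vote combiner are all assumptions made for illustration (Snorkel itself combines votes with a learned label model rather than a simple majority).

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote


def lf_mentions_refund(text):
    # Hypothetical heuristic: refund requests are complaints.
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text):
    # Hypothetical heuristic: thanking is praise.
    return "praise" if "thank" in text.lower() else ABSTAIN


def lf_legacy_rule(text):
    # Stand-in for a legacy rule: an exclamation mark plus the word "never".
    return "complaint" if "!" in text and "never" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_legacy_rule]


def weak_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine noisy votes by simple majority; return ABSTAIN if nobody votes."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

Calling `weak_label` on each unlabeled text yields a noisy training set; texts on which every function abstains are typically left out.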

[1906.04341] What Does BERT Look At? An Analysis of BERT's Attention

2019-06-21T21:49:32Z

NLP: Contextualized word embeddings from BERT – Towards Data Science

2019-06-12T08:24:42Z

Hamiltonian Neural Networks

2019-06-11T11:51:14Z

> Even though neural networks enjoy widespread use, they still struggle to learn the basic laws of physics. How might we endow them with better inductive biases? In this paper, we draw inspiration from Hamiltonian mechanics to train models that learn and respect exact conservation laws in an unsupervised manner.

[1906.02715] Visualizing and Measuring the Geometry of BERT

2019-06-07T23:33:36Z

> At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations

"I made a bet that a Naive Bayes classifier would work as well on humor recognition as a neural net with fine-tuned Bert embeddings. I won"

2019-06-06T22:48:05Z

[Jeremy Howard's answer](https://forums.fast.ai/t/nlp-challenge-project/44153)

Introducing FastBert — A simple Deep Learning library for BERT Models

2019-05-23T08:23:28Z

Robust Language Representation Learning via Multi-task Knowledge Distillation - Microsoft Research

2019-05-19T23:16:17Z

Related to [this](/doc/?uri=https%3A%2F%2Farxiv.org%2Fabs%2F1901.11504).

[1905.05950] BERT Rediscovers the Classical NLP Pipeline

2019-05-18T17:50:08Z

> We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.

[1904.08398] DocBERT: BERT for Document Classification

2019-04-18T17:26:35Z

[1803.02893] An efficient framework for learning sentence representations

2019-03-20T17:47:59Z

"**Quick Thoughts**". Framework for learning sentence representations from unlabelled data. > we reformulate the problem of predicting the context in which a sentence appears as a classification problem.

Contrastive Unsupervised Learning of Semantic Representations: A Theoretical Framework – Off the convex path (2019-03)

2019-03-20T16:15:33Z

[paper](/doc/?uri=https%3A%2F%2Farxiv.org%2Fabs%2F1902.09229). Why do objectives similar to the one used by word2vec succeed in such diverse settings? ("Contrastive Unsupervised Representation Learning" (CURL): **methods that leverage similar pairs of data points**) > In contrastive learning the objective used at test time is very different from the training objective: generalization error is not the right way to think about this. -> a framework that formalizes the notion of semantic similarity that is implicitly used by these algorithms > **if the unsupervised loss happens to be small at the end of contrastive learning then the resulting representations perform well on downstream classification**

huggingface/pytorch-pretrained-BERT: The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL.

2019-03-15T22:38:21Z

François Chollet sur Twitter : a crash course on everything you need to know to use TensorFlow 2.0 + Keras

2019-03-12T22:48:43Z

[1901.11504] Multi-Task Deep Neural Networks for Natural Language Understanding

2019-02-17T12:30:18Z

outperforms BERT in nine of eleven benchmark NLP tasks

Using BERT for state-of-the-art pre-training for natural language processing

2019-02-14T16:45:56Z

Jacob Devlin talks about BERT at the Stanford NLP seminar

2019-02-11T11:20:39Z

Includes new results such as the effect of the masking strategy, using synthetic training data,...

Google explores AI's mysterious polytope | ZDNet

2019-02-09T01:52:31Z

Keywords2vec

2019-02-09T01:43:55Z

To generate a word2vec model, but using keywords instead of single words. Tokenize on stopwords + non-word characters. (This reminds me of the author of the [FlashText algorithm](tag:flashtext_algorithm.html) saying he had developed it to create word2vec models.)
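
The tokenization step can be sketched as follows. This is an illustration rather than the Keywords2vec implementation: the tiny stopword set and the function name `extract_keywords` are assumptions. Runs of consecutive content words are kept together as one keyword, and any stopword or punctuation mark ends the current run.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
             "is", "are", "to", "for", "with"}


def extract_keywords(text, stopwords=STOPWORDS):
    """Split text into multi-word keywords, cutting at stopwords and punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    keywords, current = [], []
    for token in tokens:
        is_word = re.fullmatch(r"\w+", token) is not None
        if token in stopwords or not is_word:
            # A separator closes the keyword being built, if any.
            if current:
                keywords.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        keywords.append(" ".join(current))
    return keywords
```

The resulting keyword sequences (for example with spaces replaced by underscores) could then be fed to a word2vec trainer in place of single-word tokens.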

Intelligence artificielle : DeepMind s’intéresse au jeu de cartes français Hanabi

2019-02-07T01:39:52Z

Romain Vial (Hyperlex) at Paris NLP meetup, slides

2019-01-24T17:21:48Z

> Hyperlex is a contract analytics and management solution powered by artificial intelligence. Hyperlex helps companies manage and make the most of their contract portfolio by identifying relevant information and data to manage key contractual commitments. > Take-home message: > > - Sentence representation starts to be well understood empirically > - Large document representation is still an open (and interesting) problem!

Training Cutting-Edge Neural Networks with Tensor2Tensor and 10 lines of code

2019-01-21T10:58:18Z

Durk Kingma sur Twitter : about likelihood-based generative models

2018-12-07T08:38:44Z

Durk Kingma on Twitter: > "It is my personal belief that sufficiently powerful likelihood-based generative models will usher in a new era of machine learning, allowing us to tackle important limitations of current machine learning, such as lacking data efficiency and generalization. [7/8]"

The Kanerva Machine: A Generative Distributed Memory | OpenReview (2018)

2018-12-06T12:50:01Z

A generative memory model that combines slow-learning neural networks and a fast-adapting linear Gaussian model as memory

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar

2018-12-03T15:08:17Z

Google AI Blog: Google at EMNLP 2018

2018-11-25T15:14:25Z

Google AI Blog: Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

2018-11-05T15:13:01Z

GitHub - google-research/bert: TensorFlow code and pre-trained models for BERT

2018-11-05T15:04:06Z

Code and pretrained weights for BERT. Includes scripts to reproduce results. BERT-Base can be fine-tuned on a standard GPU; for BERT-Large, a Cloud TPU is required

Self-Governing Neural Networks for On-Device Short Text Classification - Sujith Ravi | Zornitsa Kozareva (2018)

2018-11-02T23:20:31Z

[same paper](https://aclweb.org/anthology/papers/D/D18/D18-1092/)

Time-Contrastive Networks: Self-Supervised Learning from Video (2017)

2018-10-27T14:59:43Z

Self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. > We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. > This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.

TensorFlow: how to load and save models at every epoch so you never lose time or data.

2018-10-26T16:31:02Z

[1703.03129] Learning to Remember Rare Events

2018-10-23T12:36:58Z

> a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the module is fully differentiable and trained end-to-end with no extra supervision. It operates in a life-long manner, i.e., without the need to reset it during training. > Our memory module can be easily added to any part of a supervised neural network

[1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

2018-10-12T14:36:01Z

**The "Devlin et al 2019" paper** [Paper Dissected](https://datasciencetoday.net/index.php/en-us/nlp/211-paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained)

TensorFlow.js

2018-10-10T11:27:13Z

Training on TPU

2018-10-05T08:19:26Z

For what tasks is Pytorch preferable to Tensorflow? - Quora

2018-08-28T09:23:39Z

Simple guide to Neural Arithmetic Logic Units (NALU): Explanation, Intuition and Code

2018-08-21T17:25:23Z

a neural network model that can learn simple to complex numerical functions with great extrapolation (generalisation) ability

[1807.03748] Representation Learning with Contrastive Predictive Coding

2018-07-21T10:05:02Z

> a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding. The key insight of our model is to learn such representations by predicting the future in latent space by using powerful [autoregressive models](/tag/autoregressive_model). We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples. It also makes the model tractable by using [negative sampling](/tag/negative_sampling). a contrastive method that can be applied to any form of data that can be expressed in an ordered sequence: text, speech, video...
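
As a rough illustration of the probabilistic contrastive loss, the sketch below scores one positive sample against several negatives and applies a softmax cross-entropy on the positive. It assumes a plain dot product as the score (the paper uses a learned bilinear score), and the names `info_nce_loss`, `context`, `positive`, and `negatives` are chosen here for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(context, positive, negatives):
    """Cross-entropy of picking the positive among positive + negatives."""
    scores = [dot(context, positive)] + [dot(context, n) for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(scores)
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_denominator - scores[0]
```

With uninformative (all-zero) vectors the loss equals the log of the number of candidates; it decreases as the context vector aligns with the positive sample and away from the negatives.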

What is Candidate Sampling

2018-07-07T15:04:54Z

How sampling works in Word2vec? Can someone please make me understand NCE and negative sampling? - Cross Validated

2018-07-07T15:02:59Z

> In order to deal with the issue of the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation... **The basic idea is to convert a multinomial classification problem (as it is the problem of predicting the next word) to a binary classification problem.**
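
The conversion from a multinomial problem to binary ones can be sketched with the negative-sampling objective, a simplified relative of noise-contrastive estimation. The code below is a minimal illustration, not word2vec's implementation; vectors are plain Python lists and the function names are assumptions. Each (center, context) pair becomes one "real pair" example, and each sampled noise word becomes one "fake pair" example, so no softmax over the whole vocabulary is needed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center, positive, negatives):
    """Sum of binary logistic losses: one true pair plus k sampled noise pairs."""
    # The observed context word should be classified as "real".
    loss = -math.log(sigmoid(dot(center, positive)))
    # Each sampled noise word should be classified as "not real".
    for negative in negatives:
        loss -= math.log(sigmoid(-dot(center, negative)))
    return loss
```

The cost per training pair is proportional to the number of negatives (typically a handful) rather than to the vocabulary size.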

[1806.01261] Relational inductive biases, deep learning, and graph networks

2018-06-13T13:34:03Z

> generalizing beyond one's experiences--a hallmark of human intelligence from infancy--remains a formidable challenge for modern AI > A key signature of human intelligence is the ability to make "infinite use of finite means" (Humboldt, 1836; Chomsky, 1965) (ex: words / sentences) > Here we explore how to improve modern AI's capacity for **combinatorial generalization** by biasing learning towards structured representations and computations, and in particular, systems that operate on graphs. (paper recommended by [Peter Bloem](tag:peter_bloem))

Cloud TPU – Accélérateurs de ML pour TensorFlow | Google Cloud

2018-05-31T16:23:57Z

[1803.11175] Universal Sentence Encoder

2018-05-29T16:50:18Z

Models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. > With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. Training mixes an unsupervised task using a large corpus with the supervised SNLI task, leveraging the [#Transformer](/tag/attention_is_all_you_need) architecture

Module google/universal-sentence-encoder | TensorFlow

2018-05-23T16:35:31Z

[Paper presented at EMNLP 2018](https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029)

Testing Tensorflow code

2018-05-21T12:04:22Z

How to use Dataset in TensorFlow – Towards Data Science

2018-04-21T11:41:38Z

Text Classification with TensorFlow Estimators

2018-04-17T14:19:22Z

Learning to write programs that generate images | DeepMind

2018-03-28T12:11:42Z

This ability to interpret objects through the tools that created them gives us a richer understanding of the world and is an important aspect of our intelligence.

[1803.05651] Word2Bits - Quantized Word Vectors

2018-03-20T17:36:21Z

We show that high quality quantized word vectors using 1-2 bits per parameter can be learned by introducing a quantization function into Word2Vec. We furthermore show that training with the quantization function acts as a regularizer

GitHub - anvaka/word2vec-graph: Exploring word2vec embeddings as a graph of nearest neighbors

2018-03-12T11:22:58Z

Codes of Interest: Using Bottleneck Features for Multi-Class Classification in Keras and TensorFlow

2018-03-04T16:49:06Z

GitHub - tensorflow/models: Models and examples built with TensorFlow

2018-02-28T23:55:28Z

A gentle introduction to Doc2Vec – ScaleAbout – Medium

2018-02-14T01:34:05Z

Explanation for Doc2Vec - Quora

2018-02-14T01:19:08Z

[1710.04099] Wembedder: Wikidata entity embedding web service

2018-02-13T19:14:37Z

web service for querying an embedding of entities in the Wikidata knowledge graph. The embedding is trained on the Wikidata dump using Gensim's Word2Vec implementation and a simple graph walk

[1712.09405] Advances in Pre-Training Distributed Word Representations

2017-12-29T20:52:48Z

> we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks

gensim/WMD_tutorial.ipynb

2017-12-23T14:12:41Z

Finding similar documents with Word2Vec and WMD (Word Mover’s Distance)

Machine Learning for Systems and Systems for Machine Learning (NIPS 2017)

2017-12-12T10:57:13Z

[1712.01208] The Case for Learned Index Structures

2017-12-11T19:25:09Z

> we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs > > Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes.
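
The "indexes are models" premise can be illustrated with a toy learned index: a least-squares line predicts a key's position in a sorted array, and a bounded local search corrects the prediction. This is a sketch under simplifying assumptions (a single linear model, numeric keys, a non-empty array), not the recursive model index from the paper; the class name `LinearLearnedIndex` is hypothetical.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: linear model from key to position, plus local search."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)  # must be sorted and non-empty
        n = len(self.keys)
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (p - mean_pos)
                  for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst-case prediction error over the stored keys bounds the search.
        self.max_error = max(abs(self._predict(k) - p)
                             for p, k in enumerate(self.keys))

    def _predict(self, key):
        guess = int(round(self.slope * key + self.intercept))
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_error, 0)
        hi = min(guess + self.max_error + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The better the model fits the key distribution, the smaller `max_error` becomes and the less searching each lookup needs, which is the trade-off the paper explores.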

Implementing a CNN for Text Classification in TensorFlow – WildML

2017-11-06T18:56:50Z

Spell Checker using Word2vec | Kaggle

2017-11-03T10:46:08Z

LSTM with word2vec embeddings | Kaggle

2017-10-25T15:50:14Z

Using Gensim Word2Vec Embeddings in Keras | Ben Bolte's Blog

2017-10-23T09:05:11Z

Recurrent neural networks and LSTM tutorial in Python and TensorFlow - Adventures in Machine Learning

2017-10-23T08:53:16Z

A Word2Vec Keras tutorial

2017-10-23T01:22:35Z

Installing TensorFlow on Mac OS X | TensorFlow

2017-10-23T00:19:06Z

AlphaGo Zero: Learning from scratch | DeepMind

2017-10-18T22:43:19Z

Intelligence artificielle : toujours plus puissant, AlphaGo apprend désormais sans données humaines

2017-10-18T22:38:12Z

Tensorflow sucks

2017-10-16T14:34:28Z

see [What do people think of the TensorFlow sucks article? on Quora](https://www.quora.com/What-do-people-think-of-the-TensorFlow-sucks-article)

Distributed Word Representations for Information Retrieval

2017-10-01T19:10:39Z

includes description of word2vec

TensorFlow Neural Machine Translation (seq2seq) Tutorial

2017-09-18T14:14:51Z

Word2Vec Resources · Chris McCormick

2017-09-12T12:21:25Z

Word2Vec Tutorial Part 2 - Negative Sampling · Chris McCormick

2017-09-10T17:23:52Z

the tweaks to make training feasible

Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick

2017-09-10T17:16:26Z

skip-gram

How does word2vec work? Can someone walk through a specific example? - Quora

2017-08-28T16:26:41Z

Vector Representations of Words | TensorFlow

2017-08-28T15:41:07Z

[1507.07998] Document Embedding with Paragraph Vectors

2017-08-20T23:29:27Z

An overview of word embeddings and their connection to distributional semantic models - AYLIEN (2016)

2017-07-20T15:43:09Z

> While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words... > These results are in contrast to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes no difference whatsoever whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you utilize the appropriate pre-processing and post-processing steps.

More Fun With Word Vectors - Bag of Words Meets Bags of Popcorn | Kaggle

2017-07-20T14:56:22Z

> We found that the code above gives about the same (or slightly worse) results compared to the Bag of Words

Can I use word2vec representation to train a weka classifier? - Quora

2017-07-20T13:45:20Z

Can I use word2vec to train a machine learning classifier? - Quora

2017-07-20T13:42:49Z

Some pre-trained word2vec models for French

2017-07-20T13:00:27Z

[1405.4053] Distributed Representations of Sentences and Documents

2017-07-10T16:20:03Z

Paragraph Vector: an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Represents each document by a dense vector which is trained to predict words in the document. Overcomes the weaknesses of the [Bag Of Words](/tag/bag_of_words) model (word order and word semantics are ignored)

An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec

2017-06-09T17:48:39Z

Types of word embeddings:
- Frequency-based embedding
  - Count vector
  - TF-IDF vector
  - Co-occurrence vector
    - Co-occurrence matrix (with a fixed context window), of size V*V or V*N (vocabulary size * size of a subset of V)
    - PCA or SVD: keeping the k most important eigenvalues
- Prediction-based embedding
  - CBOW (Continuous Bag Of Words): one hidden layer, one output layer. Predicts the probability of a word given a context
  - Skip-gram: predicts the probability of the context given a word
Sample code using gensim
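
As an illustration of the co-occurrence matrix with a fixed context window, here is a minimal sketch in plain Python. It is not the article's code: the function name, the whitespace tokenization, and the sparse dictionary representation (instead of a dense V*V matrix) are assumptions.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts
```

Each row of the resulting (symmetric) matrix is a count-based vector for a word; applying PCA or SVD to it and keeping the top k components gives dense embeddings.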

word2vec-api

2017-06-09T17:24:25Z

Simple web service providing a word embedding API. The methods are based on Gensim Word2Vec implementation.
List of word2vec datasets

gensim: models.word2vec – Deep learning with word2vec

2017-06-01T13:05:30Z

Word2vec in gensim Tutorial | RaRe Technologies

2017-06-01T02:22:33Z

alternatives to word2vec? - Quora

2017-05-23T15:06:24Z

Improving Topic Models with Latent Feature Word Representations | Nguyen | Transactions of the Association for Computational Linguistics

2017-05-20T14:05:12Z

Using Word2Vec for topic modeling - Stack Overflow

2017-05-19T00:22:06Z

Text Classification With Word2Vec - DS lore (2016)

2017-05-18T23:42:46Z

> Overall, we won’t be throwing away our SVMs any time soon in favor of word2vec but it has its place in text classification. > > 1. SVMs are pretty great at text classification tasks > 2. Models based on simple averaging of word-vectors can be surprisingly good too (given how much information is lost in taking the average) > 3. but they only seem to have a clear advantage when there is ridiculously little labeled training data > > Update 2017: actually, the best way to utilise the pretrained embeddings would probably be this [using keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) Sample code to benchmark a few text categorization models to test whether word embeddings like word2vec can improve text classification accuracy. Sample code (based on scikit-learn) includes an embedding vectorizer that is given an embedding dataset and vectorizes texts by taking the mean of all the vectors corresponding to individual words.
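
The mean-of-word-vectors idea can be sketched as follows. This is a dependency-free illustration, not the blog post's scikit-learn code: the class name, the whitespace tokenization, and the zero vector for texts with no known words are assumptions.

```python
class MeanEmbeddingVectorizer:
    """Represent a text by the average of the vectors of its known words."""

    def __init__(self, word_vectors):
        # word_vectors: dict mapping a word to a list of floats (all same length).
        self.word_vectors = word_vectors
        self.dim = len(next(iter(word_vectors.values())))

    def transform(self, texts):
        features = []
        for text in texts:
            vectors = [self.word_vectors[word]
                       for word in text.lower().split()
                       if word in self.word_vectors]
            if not vectors:
                # No known word: fall back to the zero vector.
                features.append([0.0] * self.dim)
            else:
                features.append([sum(column) / len(vectors)
                                 for column in zip(*vectors)])
        return features
```

The resulting fixed-length features can be passed to any standard classifier, such as an SVM, for comparison against bag-of-words baselines.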

Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors (2014)

2017-05-18T23:30:46Z

(good presentation in the intro of context-counting vs. context-predicting vectors)

How is GloVe different from word2vec? - Quora

2017-05-18T23:20:04Z

Both learn geometrical encodings (vectors) of words from their co-occurrence information. Word2vec is a "predictive" model, whereas GloVe is a "count-based" model.

Google's neural networks invent their own encryption | New Scientist

2016-11-06T01:56:28Z

Using Cognonto to Generate Domain Specific word2vec Models | Frederick Giasson

2016-09-29T08:43:15Z

Creating domain-specific training corpora for word2vec can have a dramatic impact on the results, which become much more meaningful within the scope of that domain. Another advantage of domain-specific training corpora is that they produce much smaller models.

Understanding neural networks with TensorFlow Playground | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform

2016-07-27T10:05:31Z

Go humans: Lee Sedol scores first victory against supercomputer | World news | The Guardian

2016-03-13T20:28:38Z

The Sadness and Beauty of Watching Google’s AI Play Go | WIRED

2016-03-11T21:01:33Z

2Vec or Not 2Vec?

2016-03-05T14:37:01Z

Word2vec: Neural Word Embeddings in Java - Deeplearning4j: Open-source, distributed deep learning for the JVM

2016-02-26T13:01:35Z

Mini AI app using TensorFlow and Shiny – Opiate for the masses

2016-01-15T01:15:01Z

[1301.3781] Efficient Estimation of Word Representations in Vector Space

2016-01-13T23:07:45Z

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

Game-playing software holds lessons for neuroscience : Nature News & Comment

2016-01-12T18:33:23Z

How friendly is your AI? It depends on the rewards | Robohub

2016-01-09T00:50:37Z

TensorFlow is Terrific – A Sober Take on Deep Learning Acceleration

2016-01-07T00:43:58Z

Simple end-to-end TensorFlow examples | Bcomposes

2015-12-21T19:05:46Z

Research Blog: TensorFlow - Google’s latest machine learning system, open sourced for everyone

2015-11-09T18:52:15Z

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

2015-11-09T18:49:56Z

Artificial General Intelligence that plays Atari video games: How did DeepMind do it? | Robohub

2014-09-26T16:38:02Z

Google Introduces New Self Driving Car at the Code Conference | Re/code

2014-05-28T13:24:24Z

Google’s Self-Driving Cars Are Going to Change Everything (Vancouver Data Blog by Neil McGuigan)

2013-09-05T11:47:43Z

Google investit dans le service de taxis Uber

2013-08-26T23:28:51Z

Google’s Driver-less Car and Morality : The New Yorker

2012-11-30T22:26:18Z

“Ethical subroutines” may sound like science fiction, but once upon a time, so did self-driving cars.

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

2012-05-22T12:04:25Z

Google Translator: The Universal Language

2005-05-31

AI@Google

LlamaIndex 🦙 sur X : "Fine-tuning Embedding Models for RAG with LoRA"

[2404.11018] Many-Shot In-Context Learning

[2307.15936] A Theory for Emergence of Complex Skills in Language Models

Jeff Dean (@🏡) sur X : "Gemini 1.5 Pro - A highly capable multimodal model with a 10M token context length..."

An efficient long-text semantic retrieval approach via utilizing presentation learning on short-text | Complex & Intelligent Systems (2023)

Rachit Bansal sur X : "An LLM can be efficiently *composed* with specialized (L)LMs to enable new tasks"

Maarten Grootendorst sur X : "BERTopic + LLMs + DataMapPlot"

UKP Lab sur X : "a lightweight solution for few-shot domain-specific sentence classification: AdaSent!..."

Rethinking Query Expansion for BERT Reranking | Advances in Information Retrieval (2020)

Maarten Grootendorst sur X : "Introducing KeyLLM. An extension to KeyBERT that can create, extract, and fine-tune keywords using Large Language Models!"

Getting started with DeepMatcher.ipynb - Colaboratory

[2002.06275] TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval

Modular and Parameter-Efficient Fine-Tuning for NLP Models

SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval

[2305.14128] Dr.ICL: Demonstration-Retrieved In-context Learning

Generative AI support on Vertex AI generally available | Google Cloud Blog

Daniel Daza sur Twitter : "BioBLP, a method for learning embeddings on multimodal knowledge graphs...."

[2305.11778] Cross-Lingual Supervision improves Large Language Models Pre-training

Peter J. Liu sur Twitter : "RLHF-alternative without RL"

[2305.06897] AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Google AI PaLM 2 – Google AI

Google teases Project Tailwind — a prototype AI notebook that learns from your documents - The Verge

skeskinen/bert.cpp: ggml implementation of BERT

Document AI | Google for Developers - Software Development Guides, Tools & More | Google Developers

Niels Rogge sur Twitter : "Made some new demo notebooks! - fine-tune @MetaAI's SAM and @GoogleAI's Pix2Struct on custom data"

Google "We Have No Moat, And Neither Does OpenAI"

Aran Komatsuzaki sur Twitter : "JaxPruner: A concise library for sparsity research An open-source JAX-based pruning and sparse training library for machine learning research repo"

[2303.16839] MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Domain Adaptation with Generative Pseudo-Labeling (GPL) | Pinecone

Classifying long textual documents (up to 25 000 tokens) using BERT | by Sinequa | (2020)

Diffusion language models – Sander Dieleman

Daniel Vila Suero sur Twitter : "Data quality is key for LLMs, but we're building Open Source LLMs with data of "unknown" quality... Introducing Alpaca GarbageCollector..."

[2304.01982] Rethinking the Role of Token Retrieval in Multi-Vector Retrieval

Niels Rogge sur Twitter : "@GoogleAI's Pix2Struct now available in 🤗 Transformers!"

Enabling Python VirtualEnv in JupyterLab | My Shitty Code

[2112.05682] Self-attention Does Not Need O(n^2) Memory

[2108.08877] Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

Maarten Grootendorst sur Twitter : "The v0.14 release of BERTopic is here. Fine-tune your topic keywords and labels with models from @OpenAI, @huggingface, @CohereAI, @spacy_io, and @LangChainAI... An overview thread"

Jim Fan sur Twitter : "Do you know that DeepMind has actually open-sourced the heart of AlphaGo & AlphaZero?... "

Guiding Frozen Language Models with Learned Soft Prompts – Google AI Blog

[2203.14465] STaR: Bootstrapping Reasoning With Reasoning

Google announces ChatGPT rival Bard, with wider availability in ‘coming weeks’ - The Verge

Ramsri Goutham Golla sur Twitter : "The most practical open-source competitor to @OpenAI 's GPT-3 is Google's Flan-T5 Here are 5 Flan-T5 resources to try out easily, deploy, or fine-tune it! 🧵"

The Flan Collection: Advancing open source methods for instruction tuning – Google AI Blog

Shayne Longpre sur Twitter : "What’s the best completely public competitor to #ChatGPT? Flan-T5 beats all public models we tested..."

Créer un notebook JupyterLab Vertex AI | Google Cloud

5 steps to go from a notebook to a deployed model — The TensorFlow Blog

LaMDA: our breakthrough conversation technology

An empirical analysis of compute-optimal large language model training

Characterizing Emergent Phenomena in Large Language Models – Google AI Blog

[2301.08210] Everything is Connected: Graph Neural Networks

Multilingual Sentence Transformers | Pinecone

AlphaFold’s new rival? Meta AI predicts shape of 600 million proteins

Andrej Karpathy sur Twitter : "Great post (5mo ago) "chinchilla's wild implications" giving context to LLM goldrush shifting from model size to dataset size..."

Rohan Anil sur Twitter : "Next big jump with Neural Network performance is going to happen when community embraces non-uniformity"

ValueError "invalid literal for int() with base 10" in trainer.evaluate (dataset created from pandas) · Issue #228 · huggingface/setfit

Few-Shot Text Classification (Cloudera 2020)

One of the Biggest Problems in Biology Has Finally Been Solved - Scientific American

[2202.06991] Transformer Memory as a Differentiable Search Index

Tutorial on Uncertainty Estimation for NLP

Stephanie Chan sur Twitter : "Transformer inductive biases..."

Lewis Tunstall sur Twitter : "The SetFit library for few-shot learning with Sentence Transformers now supports *multi-label text classification*..."

huggingface/setfit: Efficient few-shot learning with Sentence Transformers

Santiago sur Twitter : "If you have an Apple M1 or M2 and don't take advantage of its GPU, I'm about to change your life..."

MaartenGr/KeyBERT: Minimal keyword extraction with BERT

Yi Tay sur Twitter : "Don't retrieve, recite!..."

[2205.11498] Domain Adaptation for Memory-Efficient Dense Retrieval

[2209.11055] Efficient Few-Shot Learning Without Prompts

Google AI Blog: TensorStore for High-Performance, Scalable Array Storage

PromptBERT improving BERT sentence embeddings with prompts - Ethan Kim

[2201.04337] PromptBERT: Improving BERT Sentence Embeddings with Prompts

Prompt Tuning BERT🎯:CommonLit Readability | Kaggle

Active Learning for BERT: An Empirical Study - ACL Anthology

[2106.10199] BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

On Stability of Few-Sample Transformer Fine-Tuning | Kaggle

Unsupervised Learning — Sentence-Transformers documentation

Train and Fine-Tune Sentence Transformers Models

[2205.00820] Entity-aware Transformers for Entity Search

Leshem Choshen sur Twitter : "Computational (Chomskian) hierarchies can predict OOD capabilities..."

Chris Olah sur Twitter : "I'm excited to finally be making progress on understanding the first MLP layer in large transformer LMs. I've tried really hard and prior to SoLU had little success."