About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Shauli Ravfogel
- sl:arxiv_num : 2305.12517
- sl:arxiv_published : 2023-05-21T17:14:31Z
- sl:arxiv_summary : In this work, we aim to connect two research areas: instruction models and
retrieval-based models. While instruction-tuned Large Language Models (LLMs)
excel at extracting information from text, they are not suitable for semantic
retrieval. Similarity search over embedding vectors allows indexing and querying
vectors, but the similarity reflected in the embedding is sub-optimal for many
use cases. We identify the task of retrieving sentences based on abstract
descriptions of their content. We demonstrate the inadequacy of current text
embeddings and propose an alternative model that significantly improves when
used in standard nearest neighbor search. The model is trained using positive
and negative pairs sourced by prompting a large language model (LLM).
While it is easy to source the training material from an LLM, the retrieval
task cannot be performed by the LLM directly. This demonstrates that data from
LLMs can be used not only for distilling more efficient specialized models than
the original LLM, but also for creating new capabilities not immediately
possible using the original model.@en
- sl:arxiv_title : Retrieving Texts based on Abstract Descriptions@en
- sl:arxiv_updated : 2023-05-21T17:14:31Z
- sl:bookmarkOf : https://arxiv.org/abs/2305.12517
- sl:creationDate : 2023-06-15
- sl:creationTime : 2023-06-15T19:09:12Z
- sl:relatedDoc : http://www.semanlink.net/doc/2023/05/ل_ل_yoav_👾_sur_twit