« L’écart de PIB est désormais de 80 % entre l’Europe et les Etats-Unis » 2023-09-06T08:06:56Z 2023-09-06 Comparing Files in Visual Studio Code 2023-09-08 2023-09-08T18:03:05Z 2023-09-13 2023-09-13T09:45:14Z Fine-Tuning Your Embedding Model to Maximize Relevance Retrieval in RAG Pipeline | by Wenqi Glantz | Sep, 2023 | Better Programming see also [Jerry Liu sur X : "One major way to improve your RAG system is to fine-tune your embedding model"](doc:2023/08/jerry_liu_sur_x_one_major_wa) 2023-09-22T12:58:30Z 2023-09-22 SPLADE for Sparse Vector Search Explained | Pinecone 2023-09-10 Les dernières heures d'Allende > "Why will there never be a coup d'état in the United States? Because there is no American embassy there." 2023-09-10T13:10:05Z 2023-09-06 2023-09-06T08:18:00Z Paul Graham sur X : "If you go to chrome://settings/adPrivacy you can turn off the spyware that got inserted into the latest version of Chrome." > modules that **use LLMs for decision making capabilities**. They can be used for the following use cases and more: > - Selecting the right data source among a diverse range of data sources > - Deciding whether to do summarization (e.g. using summary index query engine) or semantic search (e.g. using vector index query engine) > - etc. 2023-09-18 2023-09-18T22:17:17Z Routers - LlamaIndex 🦙 0.8.29.post1 Jeremy Howard sur X : "I just uploaded a 90 minute tutorial, which is designed to be the one place I point coders at when they ask "hey, tell me everything I need to know about LLMs!" 2023-09-24 2023-09-24T12:49:04Z 2023-09-23T07:52:07Z 2023-09-23 Andrew Trask sur X : (about "Does a language model trained on “A is B” generalize to “B is A”?") Why the European rule on regulatory agencies needs to change. It is not the scientific method that is at fault, but the law under which agencies "base their opinions" on tests supplied by the manufacturers. [tweet](https://x.com/hyperfp/status/1705894256437530982?s=20) « Le dossier glyphosate illustre jusqu’à la caricature le conflit entre agences réglementaires et institutions scientifiques » 2023-09-24 2023-09-24T12:03:34Z anhaidgroup/deepmatcher: Python package for performing Entity and Text Matching using Deep Learning. 2023-09-20T08:39:07Z 2023-09-20
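A minimal sketch of using deepmatcher (the package bookmarked just above), following the pattern from the project README; the `data` directory and CSV file names are placeholders for labeled record pairs:

```python
# Entity matching with deepmatcher: the CSVs are expected to hold labeled
# record pairs (id, label, left_* / right_* attribute columns).
import deepmatcher as dm

# Preprocess the labeled pairs into train / validation / test sets
train, validation, test = dm.data.process(
    path='data',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

# 'hybrid' combines RNN-based and attention-based attribute summarization
model = dm.MatchingModel(attr_summarizer='hybrid')

# Train, keeping the checkpoint with the best validation score
model.run_train(train, validation, best_save_path='best_model.pth')

# Evaluate on the held-out pairs
model.run_eval(test)
```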
Shengju Qian 2023-09-21T17:59:11Z 2309.12307 2023-09-21T17:59:11Z Haotian Tang 2023-09-26T22:59:12Z Jiaya Jia Song Han [2309.12307] LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models Xin Lai [github](https://github.com/dvlab-research/LongLoRA) Yukang Chen We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. LongLoRA adopts LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long context question-answer pairs. 2023-09-26 Zhijian Liu Yukang Chen 2023-09-23T13:05:45Z 2023-09-23 Le pape et les migrants, un rappel bienvenu à l’humanité Long live the Pope! Avec la pénurie d’eau, Mayotte s’enfonce dans une crise « hors-norme » : « Ce n’est plus vivable. Les nerfs vont lâcher » 2023-09-30T19:51:35Z 2023-09-30 > Health, schools, agriculture… The unprecedented drought on the French archipelago in the Indian Ocean is having repercussions that are exasperating the population. Restrictions will get even tougher in the coming weeks, with one scenario under consideration being water distribution only one day in four. Hum: "I think Phi-1.5 trained on the benchmarks" [src](https://x.com/suchenzang/status/1701615026648605095?s=20) 2023-09-12 2023-09-12T08:29:11Z Sebastien Bubeck sur X : "How far does one billion parameters take you? ... Releasing phi-1.5, a 1.3B parameter LLM exhibiting emergent behaviors surprisingly close to much larger LLMs" Rohan sur X : "smaller chunks are good for capturing semantic meaning and larger ones are good for providing better context. @llama_index AutoMergingRetriever takes it one step further..." 2023-09-30T10:39:57Z 2023-09-30 Jerry Liu sur X : "evaluating RAG: purely evaluating retrieval metrics (MRR, precision) isn’t the whole picture - you need end-to-end response evals..." 2023-09-26T23:49:41Z 2023-09-26 2023-09-28 2023-09-28T08:20:42Z Yam Peleg sur X : "Qwen-14B (Alibaba) The most powerful open-source model for its size. And the longest trained: 3T tokens..." 2023-09-11 Climat : « le futur dystopique est déjà là », alerte l’ONU 2023-09-11T15:13:47Z Yann LeCun sur X : "...Future AI assistants will mediate everyone's interaction with the digital world. This is way too foundational and powerful to be proprietary..." 2023-09-21T23:47:34Z 2023-09-21 Publikationen der UdS: Natural language processing for African languages 2023-09-02 2023-09-02T15:53:39Z 2023-09-30 2023-09-30T10:45:40Z Percer les secrets des génomes marins | CNRS Le journal 2023-09-14T01:27:03Z 2023-09-14 Bacuri Jerry Liu sur X : "seven full ways to query knowledge graphs with LLMs..." 2023-09-30 2023-09-30T09:42:17Z seven full ways to query knowledge graphs with LLMs 2023-09-30 Babi Yar, 1941. Le massacre des Juifs de Kiev restitué dans un documentaire exceptionnel | CNRS Le journal 2023-09-30T10:55:50Z Jerry Liu sur X : "A simple trick to improve retrieval for RAG 💡: Embed “references” to each text chunk instead of the chunk itself (e.g. smaller chunks, summaries)..." (sketched below) 2023-09-06 2023-09-06T08:31:35Z > instruct-tuned models are better at generalizing the task to new data 2023-09-26 2023-09-26T22:49:46Z Bindu Reddy sur X : "The Ongoing Case For Open Source LLMs..."
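A library-agnostic sketch of the "embed references, return the parent chunk" trick from the Jerry Liu note above, assuming sentence-transformers for the embeddings; the corpus and the sentence-level "references" are toy placeholders (in practice the references would be summaries or smaller chunks, possibly produced by an LLM):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Parent chunks: what we eventually want to hand to the LLM
chunks = [
    "LongLoRA extends the context window of LLaMA2 by fine-tuning with sparse local attention. ...",
    "SPLADE builds sparse lexical vectors that can be searched with an inverted index. ...",
]

# Reference index: each small piece keeps the id of its parent chunk
references, parent_ids = [], []
for chunk_id, chunk in enumerate(chunks):
    for sentence in chunk.split(". "):      # stand-in for summaries / smaller chunks
        references.append(sentence)
        parent_ids.append(chunk_id)

ref_embeddings = model.encode(references, normalize_embeddings=True)

def retrieve(query, top_k=3):
    """Rank the references, then return the deduplicated parent chunks."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = ref_embeddings @ q             # cosine similarity (embeddings are normalized)
    seen, parents = set(), []
    for i in np.argsort(-scores)[:top_k]:
        if parent_ids[i] not in seen:
            seen.add(parent_ids[i])
            parents.append(chunks[parent_ids[i]])
    return parents

print(retrieve("how to get a 100k context window?"))
```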
Same [small] improvement as in [openai-cookbook/examples/Customizing_embeddings.ipynb](doc:2023/09/openai_cookbook_examples_custom) > The linear adapter is simply a linear transformation that specifically transforms the query embedding while keeping document embeddings fixed. > - Generate a synthetic question-context dataset for both training and evaluation. > - Fine-tuning our linear adapter on top of an existing model (e.g. SBERT) (a minimal PyTorch sketch of this idea is given below) 2023-09-18T23:17:30Z 2023-09-18 openai-cookbook/examples/Customizing_embeddings.ipynb 2023-09-17 2023-09-17T00:58:05Z > This notebook demonstrates **one way to customize OpenAI embeddings to a particular task**. > > The input is training data in the form of [text_1, text_2, label] where label is +1 if the pairs are similar and -1 if the pairs are dissimilar. > > The output is a matrix that you can use to multiply your embeddings. The product of this multiplication is a 'custom embedding' that will better emphasize aspects of the text relevant to your use case. [Comment](https://twitter.com/yoavgo/status/1702992422345621566) by [Yoav Goldberg](tag:yoav_goldberg): > there were a bunch of papers like this (using word embeddings) in xACL some years ago. one possible reaction: oh why dont they cite the previous work? another possible reaction: **maybe we shouldnt publish so many papers about obvious things**. Fine-Tuning a Linear Adapter for Any Embedding Model | LlamaIndex Blog | Sep, 2023 mentions [Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection](doc:2023/09/2309_06131_annotating_data_fo) > why is ColBERT so data-efficient? > Answer: > > - ColBERT neither needs to learn how to condense each document (unlike DPR) > - nor how to do matching (unlike MonoBERT). > - Just needs to learn contextual term representations—a much lower burden on the encoders. Omar Khattab sur X : "This isn't the main point of this great new paper by @sophiaalthammer et al. But it's incredible how ColBERT at 1000 training queries is better than DPR trained at *50,000* queries!" 2023-09-14T17:48:29Z 2023-09-14 Sebastian Hofstätter Suzan Verberne Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection [2309.06131] Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection Allan Hanbury 2309.06131 compares Sentence Transformers, cross-encoders and ColBERT in a low-resource setting > "optimal'' subsets of training data that provide high effectiveness at low annotation cost do exist, but current mainstream AL strategies applied to PLM rankers are not capable of identifying them. 2023-09-12T11:17:42Z Sophia Althammer Search methods based on Pretrained Language Models (PLM) have demonstrated great effectiveness gains compared to statistical and early neural ranking models. However, fine-tuning PLM-based rankers requires a great amount of annotated training data. Annotating data involves a large manual effort and thus is expensive, especially in domain specific tasks. In this paper we investigate fine-tuning PLM-based rankers under limited training data and budget. We investigate two scenarios: fine-tuning a ranker from scratch, and domain adaptation starting with a ranker already fine-tuned on general data, and continuing fine-tuning on a target dataset. We observe a great variability in effectiveness when fine-tuning on different randomly selected subsets of training data. This suggests that it is possible to achieve effectiveness gains by actively selecting a subset of the training data that has the most positive effect on the rankers. This way, it would be possible to fine-tune effective PLM rankers at a reduced annotation budget. To investigate this, we adapt existing Active Learning (AL) strategies to the task of fine-tuning PLM rankers and investigate their effectiveness, also considering annotation and computational costs. Our extensive analysis shows that AL strategies do not significantly outperform random selection of training subsets in terms of effectiveness. We further find that gains provided by AL strategies come at the expense of more assessments (thus higher annotation costs) and AL strategies underperform random selection when comparing effectiveness given a fixed annotation cost. Our results highlight that ``optimal'' subsets of training data that provide high effectiveness at low annotation cost do exist, but current mainstream AL strategies applied to PLM rankers are not capable of identifying them. 2023-09-14 Guido Zuccon 2023-09-14T00:47:05Z Sophia Althammer 2023-09-12T11:17:42Z
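Relating to the linear query-adapter / embedding-customization notes above (LlamaIndex blog, OpenAI cookbook): a minimal PyTorch sketch, assuming a frozen base embedding model and a synthetic (question, context) dataset. The adapter is a single matrix applied to query embeddings only, trained here with an in-batch-negatives contrastive loss (the cookbook variant uses +1/-1 labeled pairs instead):

```python
import torch
import torch.nn.functional as F

dim = 384                                   # embedding dim of the frozen base model
W = torch.eye(dim, requires_grad=True)      # adapter, initialized as the identity
optimizer = torch.optim.Adam([W], lr=1e-3)

def training_step(query_emb, pos_doc_emb):
    """query_emb, pos_doc_emb: (batch, dim) tensors from the frozen embedding model
    over a synthetic (question, context) dataset -- placeholders here."""
    q = F.normalize(query_emb @ W.T, dim=-1)    # transform queries only
    d = F.normalize(pos_doc_emb, dim=-1)        # document embeddings stay fixed
    logits = q @ d.T / 0.05                     # in-batch negatives, temperature 0.05
    labels = torch.arange(q.size(0))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At query time: embed the query with the frozen model, multiply by W, and run the
# usual nearest-neighbour search against the unchanged document index.
```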
Dans les terres rouges de la savane du Cerrado, au Brésil, l’agro-industrie dévore et détruit tout 2023-09-15 2023-09-15T18:16:46Z Nitesh V. Chawla Yijun Tian Large Language Models (LLMs) have shown remarkable generalization capability with exceptional performance in various language modeling tasks. However, they still exhibit inherent limitations in precisely capturing and returning grounded knowledge. While existing work has explored utilizing knowledge graphs to enhance language modeling via joint training and customized model architectures, applying this to LLMs is problematic owing to their large number of parameters and high computational cost. In addition, how to leverage the pre-trained LLMs and avoid training a customized model from scratch remains an open question. In this work, we propose Graph Neural Prompting (GNP), a novel plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs. GNP encompasses various designs, including a standard graph neural network encoder, a cross-modality pooling module, a domain projector, and a self-supervised link prediction objective. Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks across different LLM sizes and settings. Yijun Tian > Can we learn beneficial knowledge from KGs and integrate them into pre-trained LLMs? > we propose to leverage the factual knowledge from KGs to enhance LLMs, while still benefiting from circumventing the burdensome training expenses by using pre-trained LLMs > Graph Neural Prompting (GNP), a plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs > > GNP encodes the pertinent grounded knowledge and complex structural information to derive Graph Neural Prompt, an embedding vector that can be sent into LLMs to provide guidance and instructions > - GNP first utilizes a GNN to capture and encode the intricate graph knowledge into **entity/node embeddings**. > - Then, a cross-modality pooling module is present to determine the **most relevant node embeddings in relation to the text input**, and consolidate these node embeddings into **a holistic graph-level embedding**. > - After that, GNP encompasses a **domain projector** to bridge the inherent disparities between the graph and text domains. > - Finally, a **self-supervised link prediction objective** is introduced to enhance the model comprehension of relationships between entities and capture graph knowledge in a self-supervised manner. (a rough sketch of this pipeline follows below) 2023-09-28 Haozhu Wang 2023-09-27T06:33:29Z 2309.15427 Zichen Wang Graph Neural Prompting with Large Language Models 2023-09-28T08:52:07Z [2309.15427] Graph Neural Prompting with Large Language Models Ziqing Hu Huan Song 2023-09-27T06:33:29Z Panpan Xu Fang Wang
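A rough, pure-PyTorch paraphrase of the GNP pipeline described in the bullets above (GNN encoder, then cross-modality pooling against the text embedding, then a domain projector producing one soft-prompt vector for the frozen LLM). Layer choices and sizes are illustrative assumptions, not the authors' implementation, and the link-prediction objective is omitted:

```python
import torch
import torch.nn as nn

class GraphNeuralPrompt(nn.Module):
    """Produce one soft-prompt embedding from a retrieved KG subgraph."""
    def __init__(self, node_dim, text_dim, llm_dim):
        super().__init__()
        # (1) GNN encoder -> entity/node embeddings (a one-hop message-passing stand-in)
        self.gnn = nn.Linear(node_dim, text_dim)
        # (2) cross-modality pooling: attend over nodes, conditioned on the text input
        #     (text_dim must be divisible by num_heads)
        self.attn = nn.MultiheadAttention(text_dim, num_heads=4, batch_first=True)
        # (3) domain projector: bridge the graph space and the LLM embedding space
        self.projector = nn.Sequential(nn.Linear(text_dim, llm_dim), nn.GELU(),
                                       nn.Linear(llm_dim, llm_dim))

    def forward(self, node_feats, adj, text_emb):
        # node_feats: (n_nodes, node_dim), adj: (n_nodes, n_nodes), text_emb: (text_dim,)
        h = torch.relu(self.gnn(adj @ node_feats))                 # node embeddings
        q = text_emb.view(1, 1, -1)                                # text as the attention query
        pooled, _ = self.attn(q, h.unsqueeze(0), h.unsqueeze(0))   # graph-level embedding
        return self.projector(pooled.squeeze(0))                   # (1, llm_dim) soft prompt

# Usage: prepend the returned vector to the LLM's input token embeddings
# (inputs_embeds) while keeping the pre-trained LLM itself frozen.
```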
> The "Boolformer" takes as input a set of N (x,y) pairs in {0,1}^D x {0,1}, and **tries to predict a Boolean formula which approximates these observations**. 2023-09-26T23:02:51Z 2023-09-26 Stéphane d'Ascoli sur X : "Think Transformers are terrible at logical reasoning? Think again. Transformers trained with Boolean inputs and symbolic outputs..." *privacy not included | It’s Official: Cars Are the Worst Product Category We Have Ever Reviewed for Privacy | Mozilla Foundation 2023-09-07 2023-09-07T00:20:01Z « Tout était blanc dans ma tête, comme une grenade assourdissante » : le témoignage d’un blessé grave, à Paris, pendant les émeutes 2023-09-11 2023-09-11T15:10:50Z How to Optimize Retrieval-Augmented Generation > We all know that RAG is the killer application for LLMs but did you know that it doesn't work (out of the box)? 2023-09-08T01:03:03Z 2023-09-08 2023-09-30T14:26:24Z Maarten Grootendorst sur X : "Introducing KeyLLM. An extension to KeyBERT that can create, extract, and fine-tune keywords using Large Language Models! 2023-09-30 2023-09-05T21:41:17Z 2023-09-05 Des zones entières du Nigeria sous la coupe des « bandits » et de chefs de guerre Philipp Schmid sur X : “YaRN” allows you to scale LLMs like llama 2 to over 100k context!... 2023-09-01 2023-09-01T09:18:26Z 2023-09-28 2023-09-28T09:01:50Z Guillaume Lample sur X : "Mistral 7B is out. It outperforms Llama 2 13B on every benchmark we tried..." 2023-09-27T00:03:55Z 2023-09-27 Evaluation - LlamaIndex ModuleFormer: Modularity Emerges from Mixture-of-Experts Yikang Shen Zheyu Zhang Shawn Tan 2023-09-16 Tianyou Cao Chuang Gan [2306.04640] ModuleFormer: Modularity Emerges from Mixture-of-Experts 2306.04640 2023-09-16T00:15:56Z Large Language Models (LLMs) have achieved remarkable results. However, existing models are expensive to train and deploy, and it is also difficult to expand their knowledge beyond pre-training data without forgetting previous knowledge. This paper proposes a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficiency and flexibility of large language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE). Unlike the previous SMoE-based modular language model, which requires domain-labeled data to learn domain-specific experts, ModuleFormer can induce modularity from uncurated data with its new load balancing and concentration losses. ModuleFormer is a modular architecture that includes two different types of modules: new stick-breaking attention heads and feedforward experts. Different modules are sparsely activated conditioned on the input token during training and inference. In our experiment, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency, since ModuleFormer only activates a subset of its modules for each input token, thus it could achieve the same performance as dense LLMs with more than two times throughput; 2) Extendability, ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn new knowledge that is not included in the training data; 3) Specialisation, finetuning ModuleFormer could specialize a subset of modules to the finetuning task and the task-unrelated modules could be easily pruned for a lightweight deployment. 2023-06-07T17:59:57Z Yikang Shen > a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficiency and flexibility of large language models. [GitHub](https://github.com/IBM/ModuleFormer) 2023-09-11T19:31:26Z Zhenfang Chen
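To make "sparsely activated, conditioned on the input token" concrete, here is a generic top-k mixture-of-experts feed-forward layer in plain PyTorch. This is the standard SMoE pattern, not ModuleFormer's stick-breaking attention heads nor its load-balancing/concentration losses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse mixture-of-experts FFN: each token only runs through k experts."""
    def __init__(self, d_model, d_hidden, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts))

    def forward(self, x):                        # x: (n_tokens, d_model)
        logits = self.router(x)                  # routing scores per token and expert
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for t, (expert_ids, w) in enumerate(zip(idx, weights)):
            for e, wi in zip(expert_ids.tolist(), w):
                out[t] += wi * self.experts[e](x[t])
        return out

moe = TopKMoE(d_model=64, d_hidden=256)
tokens = torch.randn(5, 64)
print(moe(tokens).shape)    # torch.Size([5, 64]); only 2 of the 8 experts ran per token
```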
"a cousin of semanlink", as Raphaël put it > Welcome to my knowledge base (GM-RKB): A semantic wiki with approximately ~38,118 pages that largely focus on concepts (~25,000), publications (~5,000) or peoples (~1,000) (compare with the 14,475 documents and 7,165 tags in my private version of semanlink, as of today 2023-09-08) [RAG](https://www.gabormelli.com/RKB/Retrieval-Augmented_Natural_Language_Generation_(RAG)_Algorithm) vs sl's [RAG (Retrieval-Augmented Generation)](tag:retrieval_augmented_generation) Search for page containing ["RAG finetuning"](https://www.gabormelli.com/RKB/index.php?search=RAG+finetuning&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1) vs Search tag "RAG finetuning" returning [LLM Fine-tuning vs RAG](tag:llm_fine_tuning_vs_rag) Publishes (hum!) the text of arXiv articles converted to HTML 2023-09-08T01:41:28Z 2023-09-08 Gabor Melli's Knowledge Base (GM-RKB) 2023-09-06T08:27:04Z 2023-09-06 Jeremy Howard sur X : "It looks like @johnowhitaker & I may have found something crazy: LLMs can nearly perfectly memorise from just 1-2 examples!" 2023-09-24T17:39:52Z 2023-09-24 Osiris-Rex : la capsule contenant les échantillons d’astéroïde prélevés par la sonde a atterri Crise des subprimes : en septembre 2008, le capitalisme perdait pied 2023-09-15T08:55:58Z 2023-09-15 Ecriture inclusive : malaise à l’Académie française (2017) 2023-09-15 2023-09-15T16:32:51Z 2023-09-02T17:22:11Z 2023-09-02 « A la perte de temps passé à un travail vide de sens se substitue celle consacrée à des loisirs numériques eux-mêmes vides de sens » > Above all, stop thinking! 2023-08-25T15:03:36Z [2308.13418] Nougat: Neural Optical Understanding for Academic Documents Lukas Blecher 2023-09-17T18:36:48Z Thomas Scialom new generative model from @MetaAI trained to extract text from academic PDFs without needing traditional OCR engines. [Tweet](https://twitter.com/_philschmid/status/1703321340504166494) 2308.13418 Guillem Cucurull Lukas Blecher Nougat: Neural Optical Understanding for Academic Documents Robert Stojnic 2023-08-25T15:03:36Z Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition. 2023-09-17 2023-09-28T09:10:22Z 2023-09-28
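One way to try the released Nougat model, assuming the Hugging Face transformers integration (NougatProcessor / VisionEncoderDecoderModel) and the facebook/nougat-base checkpoint; the page image is a placeholder for a PDF page rasterized beforehand (e.g. with pdf2image):

```python
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

# One rasterized PDF page (placeholder path)
image = Image.open("paper_page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

outputs = model.generate(
    pixel_values,
    max_new_tokens=1024,
    bad_words_ids=[[processor.tokenizer.unk_token_id]])

markup = processor.batch_decode(outputs, skip_special_tokens=True)[0]
markup = processor.post_process_generation(markup, fix_markdown=False)
print(markup)   # Mathpix-flavoured markdown for the page
```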
Finetuning LLaMa + Text-to-SQL 2023-09-06T13:28:30Z Inside DSPy: The New Language Model Programming Framework You Need… – Towards AI 2023-09-06 Getting started with DeepMatcher.ipynb - Colaboratory 2023-09-20 2023-09-20T08:37:26Z Why transformative artificial intelligence is really, really hard to achieve 2023-09-23 2023-09-23T01:00:56Z > We think AI can be “transformative” in the same way the internet was, raising productivity and changing habits. But many daunting hurdles lie on the way to the accelerating growth rates predicted by some. Outline: > 1. The transformational potential of AI is constrained by its hardest problems > 2. Despite rapid progress in some AI subfields, major technical hurdles remain > 3. Even if technical AI progress continues, social and economic hurdles may limit its impact