Low-Resource NLP

http://www.semanlink.net/tag/nlp_low_resource_scenarios
Documents tagged with Low-Resource NLP

[2309.06131] Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection
http://www.semanlink.net/doc/2023/09/2309_06131_annotating_data_fo
Compares Sentence Transformers, cross-encoders, and ColBERT in the low-resource setting.
> "optimal" subsets of training data that provide high effectiveness at low annotation cost do exist, but current mainstream AL strategies applied to PLM rankers are not capable of identifying them.
2023-09-14T00:47:05Z

Do large language models work on Tagalog?
http://www.semanlink.net/doc/2023/08/do_large_language_models_work_o
How do LLMs perform on Tagalog data in structured prediction tasks?
> tl;dr: you might get more bang for your buck training a supervised model!
2023-08-07T09:16:16Z

Comparing Africa-centric Models to OpenAI's GPT3.5 - Lelapa
http://www.semanlink.net/doc/2023/02/comparing_africa_centric_models
2023-02-10T21:13:07Z

Towards a Tagalog NLP pipeline
http://www.semanlink.net/doc/2023/02/towards_a_tagalog_nlp_pipeline
2023-02-04T16:41:56Z

Colin Leong on Twitter: "This book is about the only "dataset" I ever found for Hani. My first ever foray into the field, I found an electronic copy and munged it into a Hani/English parallel corpus, and trained a JoeyNMT model with the help of @MasakhaneNLP and @KreutzerJulia in particular."
http://www.semanlink.net/doc/2023/01/colin_leong_sur_twitter_this
[joeynmt/joeynmt: Minimalist NMT for educational purposes](doc:2023/01/joeynmt_joeynmt_minimalist_nmt)
2023-01-05T13:34:03Z

How to Train an mT5 Model for Translation With Simple Transformers | by Thilina Rajapakse | Towards Data Science
http://www.semanlink.net/doc/2022/09/how_to_train_an_mt5_model_for_t
2022-09-25T15:02:31Z

[2203.09435] Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation
http://www.semanlink.net/doc/2022/09/2203_09435_expanding_pretrain
2022-09-08T11:17:10Z

[2209.00099] Efficient Methods for Natural Language Processing: A Survey
http://www.semanlink.net/doc/2022/09/2209_00099_efficient_methods_
> We thus structure this survey by following the typical NLP model pipeline and present the existing methods that aim to make the respective stage more efficient.
2022-09-04T11:26:48Z

AllenNLP on Twitter: "Dataset: training data for @MetaAI's No Language Left Behind NLLB-200 models!..."
http://www.semanlink.net/doc/2022/08/allennlp_sur_twitter_dataset
[No Language Left Behind](doc:2022/07/no_language_left_behind)
2022-08-25T21:26:55Z

[1807.00745] Training a Neural Network in a Low-Resource Setting on Automatically Annotated Noisy Data
http://www.semanlink.net/doc/2022/07/1807_00745_training_a_neural_
Automatically created labels can deteriorate a classifier’s performance.
> approach to training a neural network with **a combination of a small amount of clean data and a larger set of automatically annotated, noisy instances**
>
> We model the noise explicitly using a **noise layer** that is added to the network architecture. This allows us to directly optimize the network weights using standard techniques. After training, the noise layer is not needed anymore, removing any added complexity.
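
The noise layer quoted above can be pictured with a small sketch. It assumes the layer is a row-stochastic transition matrix (softmax-normalized weights) applied to the base model's clean-label distribution; the paper's exact parameterization and initialization may differ. The names `noise_layer`, `predict`, `base_output`, and `weights` are illustrative and do not come from the paper.

```python
import math


def softmax(logits):
    """Turn a list of real-valued scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noise_layer(clean_probs, noise_logits):
    """Map a distribution over clean labels to one over noisy labels.

    noise_logits[i][j] is an unnormalized score for "clean label i was
    annotated as noisy label j"; each row is softmax-normalized so the
    layer stays a valid transition matrix while its weights are trained.
    """
    transition = [softmax(row) for row in noise_logits]
    num_labels = len(transition[0])
    return [
        sum(p * transition[i][j] for i, p in enumerate(clean_probs))
        for j in range(num_labels)
    ]


def predict(clean_probs, noise_logits=None):
    """Apply the noise layer only to automatically annotated training data."""
    if noise_logits is None:  # clean data, or inference after training
        return clean_probs
    return noise_layer(clean_probs, noise_logits)


# Hypothetical base-model output for one token over two labels.
base_output = [0.9, 0.1]
# Hypothetical noise weights: label 0 is sometimes annotated as 1, label 1 rarely as 0.
weights = [[1.0, 0.0], [-2.0, 2.0]]
noisy_view = predict(base_output, weights)  # compared against noisy labels
clean_view = predict(base_output)           # compared against clean labels
```

In this sketch, noisy instances are scored through `noise_layer` while clean instances bypass it, so dropping the layer after training leaves the base model unchanged.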
[related blog post](https://www.roxanne-euproject.org/news/blog/making-natural-language-processing-work-for-little-training-data)
2022-07-18T11:39:48Z

Dealing with Data Scarcity in Natural Language Processing | by Yves Peirsman | NLPTown | Medium (2019)
http://www.semanlink.net/doc/2022/07/dealing_with_data_scarcity_in_n
> Snorkel’s process is as follows. First, a developer writes labelling functions and evaluates them on a small set of labelled training data. Snorkel allows us to evaluate the accuracy and coverage of all our labelling functions, and their overlaps and conflicts with each other. Next, it trains a generative label model over these labelling functions that learns how best to combine them. Finally, this label model outputs probabilistic labels that we can use to train an end model.
2022-07-18T11:06:41Z

No Language Left Behind
http://www.semanlink.net/doc/2022/07/no_language_left_behind
[tweet](https://twitter.com/vedanujg/status/1544925973635690497?s=20&t=ZunLNurhmN7aHDmnzPO5yQ)
2022-07-06T20:57:57Z

ACL 2022 Highlights
http://www.semanlink.net/doc/2022/06/acl_2022_highlights
2022-06-07T17:58:34Z

Isaac R Caswell on Twitter: "How many languages can we support with Machine Translation?..."
http://www.semanlink.net/doc/2022/05/isaac_r_caswell_sur_twitter_
> We train a translation model on 1000+ languages, using it to launch 24 new languages on Google Translate without any parallel data for these languages...
2022-05-18T16:12:44Z

[2205.03983] Building Machine Translation Systems for the Next Thousand Languages
http://www.semanlink.net/doc/2022/05/2205_03983_building_machine_t
2022-05-10T08:00:10Z

[1910.06294] Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models
http://www.semanlink.net/doc/2022/03/1910_06294_training_compact_m
2022-03-31T21:06:23Z

[2004.05119] Beyond Fine-tuning: Few-Sample Sentence Embedding Transfer
http://www.semanlink.net/doc/2022/03/2004_05119_beyond_fine_tuning
> Fine-tuning (FT) pre-trained sentence embedding models on small datasets has been shown to have limitations. In this paper we show that concatenating the embeddings from the pre-trained model with those from a simple sentence embedding model trained only on the target data, can improve over the performance of FT for few-sample tasks
2022-03-31T21:04:02Z

Domain adaptation of word embeddings through the exploitation of in-domain corpora and knowledge bases (PhD Thesis 2021)
http://www.semanlink.net/doc/2022/03/domain_adaptation_of_word_embed
PhD thesis by Hicham El Boukkouri, Université Paris-Saclay. [Github](https://github.com/helboukkouri/phd-code)

### Goal
Given a target specialized domain, improve the quality of general-domain word representations using in-domain corpora and/or knowledge bases.

### Contributions
#### a method for specializing general-domain embeddings in a [Low-Resource](tag:nlp_low_resource_scenarios) context.
> - train static representations on the task corpus,
> - resume the pre-training of general-domain contextual embeddings on the same task corpus,
> - finally, combine both static and contextual representations into one final model

#### we tackle the issue of using a general-domain vocabulary in a specialized domain
#### Evaluation of re-training vs training from scratch on specialized corpora using a specialized vocabulary
Training from scratch is better, but not by much: re-training from a general model is still appropriate, as it is less expensive and leads to comparable, although slightly lower, performance.
#### Regarding subword-based tokenization systems
> we argue that they are inconvenient in practice

-> CharacterBERT, a variant of BERT that uses ELMo’s character-based system instead of WordPieces. More convenient to use, and more robust to misspellings.
#### Ways to specialize general-domain representations using knowledge bases
A strong baseline using a simple method relying on graph embeddings and concatenation, using only the is_a relation.
> both static and contextual embeddings may effectively be specialized using this simple approach

#### Knowledge Injection Modules (KIM) that inject the knowledge representations directly within the BERT-like models' architecture

### Notes
> our experiments focused on a single setting (i.e.
the medical domain and the English language)

> meta-embeddings, an approach that consists in combining different sets of representations for achieving improved performance

2022-03-23T16:32:44Z

Sebastian Ruder on Twitter: "Modular and Parameter-Efficient Fine-Tuning for NLP Models"
http://www.semanlink.net/doc/2021/12/sebastian_ruder_sur_twitter_
2021-12-17T11:45:32Z

[2112.01488] ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
http://www.semanlink.net/doc/2021/12/2112_01488_colbertv2_effecti
2021-12-05T10:33:54Z

[1911.02655] Towards Domain Adaptation from Limited Data for Question Answering Using Deep Neural Networks
http://www.semanlink.net/doc/2021/11/1911_02655
Domain adaptation for enabling QA systems to answer questions posed against documents in new specialized domains.
> In experiments on question answering in the **automobile manual domain** we demonstrate that **standard DNN transfer learning techniques work surprisingly well** in adapting DNN models to a new domain **using limited amounts of annotated training data** in the new domain.

> **unsupervised domain adaption techniques to a base model could provide some improvement in the absence of in-domain labeled training data**, but there may be **no advantage to these methods once standard transfer learning methods are able to use even limited amounts of annotated training data** in a new domain.

2021-11-19T00:31:23Z

[2108.13854] Contrastive Domain Adaptation for Question Answering using Limited Text Corpora
http://www.semanlink.net/doc/2021/11/2108_13854_contrastive_domain_1
> a framework for answering out-of-domain questions in QA settings with limited text corpora

> combines techniques from question generation and domain-invariant learning to answer out-of-domain questions in settings with limited text corpora.
Here, we train a QA system on both source data and generated data from the target domain with a contrastive adaptation loss that is incorporated in the training objective.

2021-11-19T00:18:40Z

[1706.03610] Neural Domain Adaptation for Biomedical Question Answering
http://www.semanlink.net/doc/2021/11/1706_03610_neural_domain_adap
Datasets are generally too small to train a DL system for QA from scratch.
> we adapt a neural QA system trained on a large open-domain dataset (SQuAD) to a biomedical dataset (BioASQ) by employing various transfer learning techniques. Our network architecture is based on a state-of-the-art QA system, extended with biomedical word embeddings and a novel mechanism to answer list questions. In contrast to existing biomedical QA systems, our system does not rely on domain-specific ontologies, parsers or entity taggers, which are expensive to create.

2021-11-19T00:09:38Z

Nils Reimers on Twitter: "Neural Search for Low Resource Scenarios..."
http://www.semanlink.net/doc/2021/10/nils_reimers_sur_twitter_neu
1. Is low resource actually realistic?
    - No
    - Important research questions:
        - how to learn unsupervised
        - how to exploit structure (e.g. title and body)
        - how to learn a concept from a single sentence
2. How good are our benchmarks?
3. Domain-Adaptation for Dense Embeddings
    - first unsupervised training, then supervised
    - TSDAE > ICT > MLM
    - unclear how to adapt an existing model to a new model

> TSDAE differs in that the decoder in MLM has access to full-length word embeddings for every single token. The TSDAE decoder only has access to the sentence vector produced by the encoder.
2021-10-27T01:48:22Z

neubig/lowresource-nlp-bootcamp-2020: The website for the CMU Language Technologies Institute low resource NLP bootcamp 2020
http://www.semanlink.net/doc/2021/10/neubig_lowresource_nlp_bootcamp
8 lectures (plus exercises) focused on NLP in data-scarce languages
2021-10-16T14:54:17Z

VaLaR NMT: Vastly Lacking Resources Neural Machine Translation (2019)
http://www.semanlink.net/doc/2021/10/valar_nmt_vastly_lacking_resou
> We focus on extremely low-resource setting, where we are **limited to less than 10k parallel data and no mono-lingual corpora**... we create a character-decoder-based seq2seq NMT model as a baseline and compare its performance on various levels of data scarcity. Then, we explore the performance benefit of transfer learning by training a model on a different language... Lastly, we use **language models and a noisy dictionary to augment our training data**. Utilizing both transfer learning and data augmentation, we see a 1.5 BLEU score improvement over the baseline

2021-10-14T15:46:04Z

BigScience Research Workshop on Twitter: "Come help us improve language resource visibility over the next week..."
http://www.semanlink.net/doc/2021/10/bigscience_research_workshop_su
2021-10-07T12:05:24Z

Linguistic Diversity
http://www.semanlink.net/doc/2021/10/linguistic_diversity
> We create a consistent data model to complement the existing ACL Anthology Corpus with data from later years and of non-ACL conferences. We do this by augmenting the corpus using Semantic Scholar’s API and scraping ACL Anthology itself. This is a consolidated dataset for 11 conferences with different attributes.
Stay tuned :)

[[2004.09095] The State and Fate of Linguistic Diversity and Inclusion in the NLP World](doc:2021/10/2004_09095_the_state_and_fate)
2021-10-03T12:39:09Z

[2004.09095] The State and Fate of Linguistic Diversity and Inclusion in the NLP World
http://www.semanlink.net/doc/2021/10/2004_09095_the_state_and_fate
2021-10-03T11:50:06Z

[2109.04513] Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach
http://www.semanlink.net/doc/2021/09/2109_04513_filling_the_gaps_i
[tweet](doc:2021/09/koren_lazar_sur_twitter_m)
> Akkadian language, the lingua franca of the time.

> despite data scarcity (1M tokens) we can achieve state of the art performance on missing tokens prediction (89% hit@5) using a greedy decoding scheme and **pretraining on data from other languages and different time periods**.

2021-09-23T10:56:10Z

Koren Lazar on Twitter: "...Modern pre-trained language models are applicable even in extreme low-resource settings as the case of the ancient Akkadian language."
http://www.semanlink.net/doc/2021/09/koren_lazar_sur_twitter_m
[[2109.04513] Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach](doc:2021/09/2109_04513_filling_the_gaps_i)
2021-09-23T10:42:17Z

The 4 Biggest Open Problems in NLP (2019)
http://www.semanlink.net/doc/2021/08/the_4_biggest_open_problems_in_
2021-08-26T15:23:03Z

[2010.02353] Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
http://www.semanlink.net/doc/2021/08/2010_02353_participatory_rese
About machine translation using parallel corpora only.
2021-08-25T17:01:12Z

CC-100: Monolingual Datasets from Web Crawl Data
http://www.semanlink.net/doc/2021/07/cc_100_monolingual_datasets_fr
Attempt to recreate the dataset used for training XLM-R ([[1911.02116] Unsupervised Cross-lingual Representation Learning at Scale](doc:2021/07/1911_02116_unsupervised_cross))
2021-07-29T00:20:28Z

Davlan (David Adelani) @Huggingface
http://www.semanlink.net/doc/2021/07/davlan_david_adelani_hugging
Includes an [xlm-roberta-base-finetuned-hausa](https://huggingface.co/Davlan/xlm-roberta-base-finetuned-hausa) model (using data from [CC-100: Monolingual Datasets from Web Crawl Data](doc:2021/07/cc_100_monolingual_datasets_fr))
2021-07-29T00:01:52Z

[2107.00676] A Primer on Pretrained Multilingual Language Models
http://www.semanlink.net/doc/2021/07/2107_00676_a_primer_on_pretra
> MLLMs are useful for bilingual tasks, particularly in low resource scenarios.
>
> The surprisingly good performance of MLLMs in crosslingual transfer as well as bilingual tasks motivates the hypothesis that MLLMs are learning universal patterns. However, our survey of the studies in this space indicates that there is no consensus yet.
2021-07-13T13:33:29Z

[2010.12309] A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios
http://www.semanlink.net/doc/2021/07/2010_12309_a_survey_on_recent
Low-resource scenarios: low-resource languages, but also non-standard domains and tasks. One key goal of this survey is to highlight the underlying assumptions. [Blog post](https://towardsdatascience.com/a-visual-guide-to-low-resource-nlp-d7b4c7b1a4bc)
2021-07-06T13:08:01Z

[2006.07264] Low-resource Languages: A Review of Past Work and Future Challenges
http://www.semanlink.net/doc/2021/07/2006_07264_low_resource_langu
Meh.
2021-07-06T13:07:39Z

Practical Natural Language Processing for Low-Resource Languages
http://www.semanlink.net/doc/2021/07/practical_natural_language_proc
2021-07-06T12:51:20Z

Why You Should Do NLP Beyond English
http://www.semanlink.net/doc/2020/08/why_you_should_do_nlp_beyond_en
> Only a few hundred languages are represented on the web and speakers of minority languages are severely limited in the information available to them.

2020-08-01T18:50:35Z