Datasets
http://www.semanlink.net/tag/datasets
Documents tagged with Datasets
-
MSMARCO | MSMARCO-Question-Answering
http://www.semanlink.net/doc/2023/07/msmarco_%7C_msmarco_question_answ
> MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies, or what the community thinks would be useful.
1 million unique real queries that were generated by sampling and anonymizing [Bing](tag:bing) usage logs.
2023-07-14T10:28:08Z
-
ibiscp/LLM-IMDB: Proof of concept app using LangChain and LLMs to retrieve information from graphs, built with the IMDB dataset
http://www.semanlink.net/doc/2023/04/ibiscp_llm_imdb_proof_of_conce
> IMDB-LLM, a proof of concept app that demonstrates the power of LangChain and LLMs in extracting information from graphs!
2023-04-10T23:00:57Z
-
Daniel Vila Suero on Twitter: "Data quality is key for LLMs, but we're building Open Source LLMs with data of "unknown" quality... Introducing Alpaca GarbageCollector..."
http://www.semanlink.net/doc/2023/04/daniel_vila_suero_sur_twitter_
> a cross-lingual SetFit model to identify potential bad instructions in Alpaca-like datasets
2023-04-05T18:37:29Z
-
Support of very large dataset? - 🤗Datasets - Hugging Face Forums
http://www.semanlink.net/doc/2023/03/support_of_very_large_dataset_
[Big data? 🤗 Datasets to the rescue! - Hugging Face Course](doc:2023/03/big_data_🤗_datasets_to_the_re)
2023-03-12T12:14:56Z
-
Guillaume Lample on Twitter: "Today we release LLaMA, 4 foundation models ranging from 7B to 65B parameters..."
http://www.semanlink.net/doc/2023/02/guillaume_lample_sur_twitter_
> LLaMA-13B outperforms OPT and GPT-3 175B on most benchmarks. LLaMA-65B is competitive with Chinchilla 70B and PaLM 540B.
>
> The weights for all models are open and available
>
> trained on at least 1T tokens,
>
> Unlike Chinchilla, PaLM, or GPT-3, we only use datasets publicly available,
>
> We also briefly tried instruction finetuning
LLaMA-13B is competitive with GPT-3, despite being 10x smaller.
But that's not really open-source
[github](https://github.com/facebookresearch/llama)
"The license prohibits using the models or any data produced by the models for any type of commercial or production purpose."
2023-02-25T00:59:01Z
-
Aran Komatsuzaki on Twitter: "Poisoning Web-Scale Training Datasets is Practical. Shows how to effectively poison 0.01% of datasets like LAION-400M for just $60 USD"
http://www.semanlink.net/doc/2023/02/aran_komatsuzaki_sur_twitter_
2023-02-21T17:14:22Z
-
Open Graph Benchmark | A collection of benchmark datasets, data-loaders and evaluators for graph machine learning in PyTorch.
http://www.semanlink.net/doc/2023/02/open_graph_benchmark_%7C_a_collec
2023-02-07T14:02:45Z
-
[2208.11857] Shortcut Learning of Large Language Models in Natural Language Understanding: A Survey
http://www.semanlink.net/doc/2022/08/2208_11857_shortcut_learning_
2022-08-27T10:39:46Z
-
AllenNLP on Twitter: "Dataset: training data for @MetaAI's No Language Left Behind NLLB-200 models!..."
http://www.semanlink.net/doc/2022/08/allennlp_sur_twitter_dataset
[No Language Left Behind](doc:2022/07/no_language_left_behind)
2022-08-25T21:26:55Z
-
The Quick Guide to SQuAD
http://www.semanlink.net/doc/2022/02/the_quick_guide_to_squad
2022-02-03T18:22:21Z
-
facebookresearch/UnsupervisedQA: Unsupervised Question answering via Cloze Translation
http://www.semanlink.net/doc/2021/12/facebookresearch_unsupervisedqa
> This repository provides code to run pre-trained models to generate synthetic question answering question data. We also make a very large synthetic training dataset for extractive question answering available.
[Paper](doc:2021/12/1906_04980_unsupervised_quest)
2021-12-07T23:54:24Z
-
Detecting Duplicate Questions (2019)
http://www.semanlink.net/doc/2021/10/detecting_duplicate_questions_
2021-10-14T11:47:03Z
-
[2107.12708] QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
http://www.semanlink.net/doc/2021/08/2107_12708_qa_dataset_explosi
Recommended by [Sebastian Ruder](tag:sebastian_ruder)
2021-08-06T22:01:16Z
-
[2104.08663] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
http://www.semanlink.net/doc/2021/07/2104_08663_beir_a_heterogeno
[GitHub](doc:2021/07/ukplab_beir_a_heterogeneous_be)
> Our results show **BM25 is a robust baseline** and **Reranking-based models overall achieve the best zero-shot performances**, however, at high computational costs. In contrast, **Dense retrieval models are computationally more efficient but often underperform other approaches**.
17 English evaluation datasets, 9 heterogeneous tasks (Non-English left for future work)
2021-07-09T12:36:38Z
-
UKPLab/beir: A Heterogeneous Benchmark for Information Retrieval.
http://www.semanlink.net/doc/2021/07/ukplab_beir_a_heterogeneous_be
> BEIR is a heterogeneous benchmark containing diverse IR tasks.
> Easy to use, evaluate your NLP-based retrieval models across 15+ diverse IR datasets.
[Paper](doc:2021/07/2104_08663_beir_a_heterogeno)
2021-07-09T12:19:50Z
-
The Extreme Classification Repository
http://www.semanlink.net/doc/2020/08/the_extreme_classification_repo
Benchmark datasets, metrics, results and code that can be used for evaluating the performance of extreme multi-label algorithms.
[Related blog post](doc:2020/08/everything_you_always_wanted_to)
2020-08-12T01:10:51Z
-
huggingface/nlp: nlp: datasets and evaluation metrics for NLP in NumPy, Pandas, PyTorch and TensorFlow
http://www.semanlink.net/doc/2020/05/huggingface_nlp_nlp_datasets_
2020-05-27T02:24:06Z
-
Siamese Network for Image and Text similarity using Keras
http://www.semanlink.net/doc/2020/01/siamese_network_keras_for_image
2020-01-22T16:50:08Z
-
Is That a Duplicate Quora Question? | LinkedIn
http://www.semanlink.net/doc/2019/07/is_that_a_duplicate_quora_quest
2019-07-03T01:33:30Z
-
Classifying duplicate questions from Quora with Keras | R-bloggers
http://www.semanlink.net/doc/2019/07/classifying_duplicate_questions
2019-07-03T01:32:20Z
-
Finding Similar Quora Questions with BOW, TFIDF and Xgboost
http://www.semanlink.net/doc/2019/07/finding_similar_quora_questions
[Part 2](/doc/?uri=https%3A%2F%2Ftowardsdatascience.com%2Ffinding-similar-quora-questions-with-word2vec-and-xgboost-1a19ad272c0d)
2019-07-02T01:26:01Z
-
Semantic textual similarity | NLP-progress
http://www.semanlink.net/doc/2019/07/semantic_textual_similarity_%7C_n
2019-07-02T01:11:27Z
-
Quora Question Pairs | Kaggle
http://www.semanlink.net/doc/2019/07/quora_question_pairs_%7C_kaggle
2019-07-02T01:07:48Z
-
sebastianruder/NLP-progress: Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
https://github.com/sebastianruder/NLP-progress
2018-06-23T01:04:30Z