Reinforcement learning

Reinforcement learning http://www.semanlink.net/tag/reinforcement_learning Documents tagged with Reinforcement learning

Archit Sharma on Twitter: "Direct Preference Optimization (DPO) allows you to fine-tune LMs directly from preferences via a simple classification loss, no RL required" http://www.semanlink.net/doc/2023/05/archit_sharma_sur_twitter_ev 2023-05-31T18:30:01Z

Eric on Twitter: "...Introducing Direct Preference Optimization (DPO), a simple classification loss provably equivalent to RLHF" http://www.semanlink.net/doc/2023/05/eric_sur_twitter_rlhf_is_the 2023-05-31T18:16:59Z

Peter J. Liu on Twitter: "RLHF-alternative without RL" http://www.semanlink.net/doc/2023/05/peter_j_liu_sur_twitter_her
> TL;DR: Works as well as RLHF, but a lot simpler. About as easy and efficient as fine-tuning. Much better than simply fine-tuning on good examples.
2023-05-18T09:53:46Z

Hyung Won Chung on Twitter: "RLHF as an instance of using a learned objective function" http://www.semanlink.net/doc/2023/05/hyung_won_chung_sur_twitter_ 2023-05-18T09:47:49Z

Aran Komatsuzaki on Twitter: "Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning" http://www.semanlink.net/doc/2023/04/aran_komatsuzaki_sur_twitter__1 2023-04-27T08:13:24Z

Reinforcement Learning for Language Models http://www.semanlink.net/doc/2023/04/rl_for_llms_md
> I was puzzled for a while as to why we need RL for LM training, rather than just using supervised instruct tuning. I now have a convincing argument, which is also reflected in a recent talk by @johnschulman2.
1st convincing argument:
> supervised learning allows only positive feedback (we show the model a series of questions and their correct answers) while **RL allows also for negative feedback** (the model is allowed to generate an answer and get feedback saying "this is not correct")...if you as a learner are allowed to form your own hypotheses and ask the teacher if they are correct (as in the RL setting), even an adversarial teacher can no longer trick you into latching on to a wrong hypothesis.
2nd convincing argument, about knowledge-seeking queries:
> we want to encourage the model to answer based on its internal knowledge, but we don't know what this internal knowledge contains. In supervised training, we present the model with a question and its correct answer, and train the model to replicate the provided answer... But if we succeed in training the model to generalize in [the cases it doesn't know], then we essentially teach the model to make stuff up! It actively encourages the model to "lie".
2023-04-23T11:35:38Z

Shayne Longpre on Twitter: "A 🧵 on @OpenAI LLM "Alignment" (e.g. #ChatGPT)..." http://www.semanlink.net/doc/2023/02/shayne_longpre_sur_twitter_ 2023-02-27T23:18:48Z

Prompting, Instruction Finetuning, and RLHF (CS224N) http://www.semanlink.net/doc/2023/02/prompting_instruction_finetuni 2023-02-16T23:12:04Z

Some remarks on Large Language Models http://www.semanlink.net/doc/2023/01/some_remarks_on_large_language_
> There turned out to be a phase shift somewhere between 60B parameters and 175B parameters, that made language models super impressive.
> **The performance of current-day language models is not obtained by language modeling**
>
> - [Traditional] LMs are not [grounded](tag:grounded_language_learning)
>
> **3 conceptual steps between GPT-3 and ChatGPT: Instructions, code, RLHF.** The last one is, I think, the least interesting despite getting the most attention
>
> Instruction tuning: For example, the human annotators would write something like "please summarize this text", followed by some text they got, followed by a summary they produced of this text. -> Some symbols ("summarize", "translate", "formal") are used in a consistent way together with the concept/task they denote. And they always appear in the beginning of the text. -> The act of producing a summary is grounded to the human concept of "summary"
>
> code: programming language code data, and specifically data that contains both natural language instructions or descriptions (in the form of code comments) and the corresponding programming language code. This produced another very direct form of grounding. The human language describes concepts (or intents), which are then realized in the form of the corresponding programs.
>
> "[RL with Human Feedback](tag:reinforcement_learning_from_human_feedback)". This is a fancy way of saying that the model now observes two humans in a conversation, one playing the role of a user, and another playing the role of "the AI", demonstrating how the AI should respond in different situations. This clearly helps the model learn how dialogs work, and how to keep track of information across dialog states (something that is very hard to learn from just "found" data).
2023-01-03T09:15:16Z

Tanishq Mathew Abraham on Twitter: "Are you wondering how large language models like ChatGPT and InstructGPT actually work? One of the secret ingredients is RLHF... Let's dive into how RLHF works in 8 tweets!"
/ Twitter http://www.semanlink.net/doc/2022/12/tanishq_mathew_abraham_sur_twit 2022-12-28T17:44:47Z

Illustrating Reinforcement Learning from Human Feedback (RLHF) http://www.semanlink.net/doc/2022/12/illustrating_reinforcement_lear 2022-12-10T11:51:09Z

Prithviraj (Raj) Ammanabrolu on Twitter: "The secret to aligning LMs to human preferences is reinforcement learning. ..." http://www.semanlink.net/doc/2022/10/prithviraj_raj_ammanabrolu_su 2022-10-06T01:56:53Z

Meta Reinforcement Learning http://www.semanlink.net/doc/2019/12/meta_reinforcement_learning 2019-12-07T11:26:22Z

Machine Learning for Humans, Part 5: Reinforcement Learning http://www.semanlink.net/doc/2019/09/machine_learning_for_humans_pa 2019-09-23T23:36:26Z

Key Papers in Deep RL — OpenAI - Spinning Up documentation https://spinningup.openai.com/en/latest/spinningup/keypapers.html 2018-11-09T13:56:17Z

Time-Contrastive Networks: Self-Supervised Learning from Video (2017) https://sermanet.github.io/imitate/
A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, with a study of how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses.
> We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images.
> This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background.
We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.
2018-10-27T14:59:43Z

Reinforcement Learning from scratch – Insight Data https://blog.insightdatascience.com/reinforcement-learning-from-scratch-819b65f074d8 2018-06-09T09:26:53Z

Lessons Learned Reproducing a Deep Reinforcement Learning Paper http://amid.fish/reproducing-deep-rl 2018-04-10T13:33:49Z

Learning to write programs that generate images | DeepMind https://deepmind.com/blog/learning-to-generate-images/
This ability to interpret objects through the tools that created them gives us a richer understanding of the world and is an important aspect of our intelligence.
2018-03-28T12:11:42Z

Introduction to Learning to Trade with Reinforcement Learning – WildML http://www.wildml.com/2018/02/introduction-to-learning-to-trade-with-reinforcement-learning/ 2018-02-11T12:20:30Z

Evolution Strategies as a Scalable Alternative to Reinforcement Learning https://blog.openai.com/evolution-strategies/ 2018-01-06T15:11:28Z

Welcoming the Era of Deep Neuroevolution - Uber Engineering Blog https://eng.uber.com/deep-neuroevolution/
> a suite of five papers that support the emerging realization that neuroevolution, where neural networks are optimized through evolutionary algorithms, is also an effective method to train deep neural networks for reinforcement learning (RL) problems.
2017-12-19T09:26:01Z

AlphaGo Zero: Learning from scratch | DeepMind https://deepmind.com/blog/alphago-zero-learning-scratch/ 2017-10-18T22:43:19Z
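
Illustration for the Time-Contrastive Networks entry above: a minimal sketch of a triplet-style metric learning loss of the kind that excerpt describes, where the embedding of a simultaneous frame from another viewpoint is attracted to the anchor and the embedding of a temporal neighbor is repelled. The function names, the squared Euclidean distance, the margin value, and the toy 2-D embeddings are all assumptions made for illustration; they are not taken from the linked page.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (illustrative, hypothetical helper).

    anchor:   embedding of a frame from one viewpoint at time t
    positive: embedding of the simultaneous frame from another viewpoint
              (should end up close to the anchor)
    negative: embedding of a temporally nearby frame from the anchor's viewpoint
              (should end up at least `margin` farther away than the positive)
    margin:   assumed value, chosen only for this example
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy embeddings (made up): the negative is already far away, so no penalty.
easy = time_contrastive_triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])

# Here the temporal neighbor sits almost as close as the positive, so the
# loss is positive and would push the two apart during training.
hard = time_contrastive_triplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0])

print(easy, hard)
```

In a real setup the embeddings would come from a trained network and the loss would be averaged over many sampled triplets; the sketch only shows the attract/repel structure of a single triplet.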