<?xml version='1.0' encoding='UTF-8'  ?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">	<channel rdf:about="http://www.semanlink.net/tag/reinforcement_learning">		<title>Reinforcement learning</title>		<link>http://www.semanlink.net/tag/reinforcement_learning</link>		<description>Documents tagged with Reinforcement learning</description>		<items>			<rdf:Seq>							<rdf:li resource="http://www.semanlink.net/doc/2025/02/deepseek_r1_model_by_deepseek_a"/>				<rdf:li resource="http://www.semanlink.net/doc/2025/02/diffuse_one"/>				<rdf:li resource="http://www.semanlink.net/doc/2025/02/autotelic_agents_with_intrinsic"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/05/archit_sharma_sur_twitter_ev"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/05/eric_sur_twitter_rlhf_is_the"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/05/peter_j_liu_sur_twitter_her"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/05/hyung_won_chung_sur_twitter_"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/04/aran_komatsuzaki_sur_twitter__1"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/04/rl_for_llms_md"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/02/shayne_longpre_sur_twitter_"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/02/prompting_instruction_finetuni"/>				<rdf:li resource="http://www.semanlink.net/doc/2023/01/some_remarks_on_large_language_"/>				<rdf:li resource="http://www.semanlink.net/doc/2022/12/tanishq_mathew_abraham_sur_twit"/>				<rdf:li resource="http://www.semanlink.net/doc/2022/12/illustrating_reinforcement_lear"/>				<rdf:li resource="http://www.semanlink.net/doc/2022/10/prithviraj_raj_ammanabrolu_su"/>				<rdf:li resource="http://www.semanlink.net/doc/2019/12/meta_reinforcement_learning"/>				<rdf:li 
resource="http://www.semanlink.net/doc/2019/09/machine_learning_for_humans_pa"/>				<rdf:li resource="https://spinningup.openai.com/en/latest/spinningup/keypapers.html"/>				<rdf:li resource="https://sermanet.github.io/imitate/"/>				<rdf:li resource="https://blog.insightdatascience.com/reinforcement-learning-from-scratch-819b65f074d8"/>				<rdf:li resource="http://amid.fish/reproducing-deep-rl"/>				<rdf:li resource="https://deepmind.com/blog/learning-to-generate-images/"/>				<rdf:li resource="http://www.wildml.com/2018/02/introduction-to-learning-to-trade-with-reinforcement-learning/"/>				<rdf:li resource="https://blog.openai.com/evolution-strategies/"/>				<rdf:li resource="https://eng.uber.com/deep-neuroevolution/"/>				<rdf:li resource="https://deepmind.com/blog/alphago-zero-learning-scratch/"/>			</rdf:Seq>		</items>	</channel>		<item rdf:about="http://www.semanlink.net/doc/2025/02/deepseek_r1_model_by_deepseek_a">		<title>deepseek-r1 Model by Deepseek-ai | NVIDIA NIM</title>		<link>http://www.semanlink.net/doc/2025/02/deepseek_r1_model_by_deepseek_a</link>		<description>&gt; DeepSeek-R1 is a first-generation **reasoning model trained using large-scale reinforcement learning** (RL) to solve complex reasoning tasks across domains such as math, code, and language. The model leverages RL to develop reasoning capabilities, which are further enhanced through supervised fine-tuning (SFT) to improve readability and coherence.		</description>		<dc:date>2025-02-24T13:34:19Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2025/02/diffuse_one">		<title>diffuse.one/reasoning_update_0</title>		<link>http://www.semanlink.net/doc/2025/02/diffuse_one</link>		<description>&gt; There is an emerging pattern of fine-tuning a small language model followed by reinforcement learning.

&gt; A reasoning model is a large language model that is trained to output both a chain of thought and a response. The chain of thought should be relatively long (&gt; 1,000 tokens) and the reasoning should improve its performance relative to similar-sized non-reasoning models. This is sometimes called &quot;test-time&quot; or &quot;inference-time&quot; scaling because reasoning models emit more tokens per completion and gain some performance as a result.		</description>		<dc:date>2025-02-24T13:21:09Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2025/02/autotelic_agents_with_intrinsic">		<title>Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: A Short Survey</title>		<link>http://www.semanlink.net/doc/2025/02/autotelic_agents_with_intrinsic</link>		<description>&gt; Building autonomous machines that can explore open-ended environments, discover possible interactions and build repertoires of skills is a general objective of artificial intelligence. Developmental approaches argue that this can only be achieved by autotelic agents: intrinsically motivated learning agents that can learn to represent, generate, select and solve their own problems. In recent years, the convergence of developmental approaches with deep reinforcement learning (RL) methods has been leading to the emergence of a new field: developmental reinforcement learning.		
</description>		<dc:date>2025-02-09T15:56:41Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/05/archit_sharma_sur_twitter_ev">		<title>Archit Sharma sur Twitter : &quot;Direct Preference Optimization (DPO) allows you to fine-tune LMs directly from preferences via a simple classification loss, no RL required&quot;</title>		<link>http://www.semanlink.net/doc/2023/05/archit_sharma_sur_twitter_ev</link>		<dc:date>2023-05-31T18:30:01Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/05/eric_sur_twitter_rlhf_is_the">		<title>Eric sur Twitter : &quot;...Introducing Direct Preference Optimization (DPO), a simple classification loss provably equivalent to RLHF&quot;</title>		<link>http://www.semanlink.net/doc/2023/05/eric_sur_twitter_rlhf_is_the</link>		<dc:date>2023-05-31T18:16:59Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/05/peter_j_liu_sur_twitter_her">		<title>Peter J. Liu sur Twitter : &quot;RLHF-alternative without RL&quot; </title>		<link>http://www.semanlink.net/doc/2023/05/peter_j_liu_sur_twitter_her</link>		<description>&gt; TL;DR: Works as well as RLHF, but a lot simpler. About as easy and efficient as fine-tuning. Much better than simply fine-tuning on good examples.		
</description>		<dc:date>2023-05-18T09:53:46Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/05/hyung_won_chung_sur_twitter_">		<title>Hyung Won Chung sur Twitter : &quot;RLHF as an instance of using a learned objective function&quot;</title>		<link>http://www.semanlink.net/doc/2023/05/hyung_won_chung_sur_twitter_</link>		<dc:date>2023-05-18T09:47:49Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/04/aran_komatsuzaki_sur_twitter__1">		<title>Aran Komatsuzaki sur Twitter : &quot;Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning&quot;</title>		<link>http://www.semanlink.net/doc/2023/04/aran_komatsuzaki_sur_twitter__1</link>		<dc:date>2023-04-27T08:13:24Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/04/rl_for_llms_md">		<title>Reinforcement Learning for Language Models</title>		<link>http://www.semanlink.net/doc/2023/04/rl_for_llms_md</link>		<description>&gt; I was puzzled for a while as to why we need RL for LM training, rather than just using supervised instruct tuning. I now have a convincing argument, which is also reflected in a recent talk by @johnschulman2.

1st convincing argument:

&gt; supervised learning allows only positive feedback (we show the model a series of questions and their correct answers) while **RL also allows for negative feedback** (the model is allowed to generate an answer and get feedback saying &quot;this is not correct&quot;)...if you as a learner are allowed to form your own hypotheses and ask the teacher if they are correct (as in the RL setting), even an adversarial teacher can no longer trick you into latching on to a wrong hypothesis.

2nd convincing argument is about knowledge-seeking queries

&gt; we want to encourage the model to answer based on its internal knowledge, but we don&apos;t know what this internal knowledge contains. In supervised training, we present the model with a question and its correct answer, and train the model to replicate the provided answer... But if we succeed in training the model to generalize in [the cases it doesn&apos;t know&#93;, then we essentially teach the model to make stuff up! It actively encourages the model to &quot;lie&quot;.		</description>		<dc:date>2023-04-23T11:35:38Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/02/shayne_longpre_sur_twitter_">		<title>Shayne Longpre sur Twitter :  &quot;A 🧵 on @OpenAI LLM &quot;Alignment&quot; (e.g. #ChatGPT)...&quot;</title>		<link>http://www.semanlink.net/doc/2023/02/shayne_longpre_sur_twitter_</link>		<dc:date>2023-02-27T23:18:48Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/02/prompting_instruction_finetuni">		<title>Prompting, Instruction Finetuning, and RLHF (CS224N)</title>		<link>http://www.semanlink.net/doc/2023/02/prompting_instruction_finetuni</link>		<dc:date>2023-02-16T23:12:04Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2023/01/some_remarks_on_large_language_">		<title>Some remarks on Large Language Models</title>		<link>http://www.semanlink.net/doc/2023/01/some_remarks_on_large_language_</link>		<description>&gt; There turned out to be a phase shift somewhere between 60B parameters and 175B parameters, that made language models super impressive.

&gt; **The performance of current-day language models is not obtained by language modeling**
&gt;
&gt;    - [Traditional&#93; LMs are not [grounded&#93;(tag:grounded_language_learning)
&gt; 
&gt; **3 conceptual steps between GPT-3 and chatGPT: Instructions, code, RLHF.** The last one is, I think, the least interesting despite getting the most attention
&gt;
&gt; Instruction tuning: For example, the human annotators would write something like &quot;please summarize this text&quot;, followed by some text they got, followed by a summary they produced of this text. -&gt; Some symbols (&quot;summarize&quot;, &quot;translate&quot;, &quot;formal&quot;) are used in a consistent way together with the concept/task they denote. And they always appear in the beginning of the text. -&gt; the act of producing a summary grounded to the human concept of &quot;summary&quot;
&gt;
&gt; Code: programming language code data, and specifically data that contains both natural language instructions or descriptions (in the form of code comments) and the corresponding programming language code. This produced another very direct form of grounding. The human language describes concepts (or intents), which are then realized in the form of the corresponding programs.
&gt;
&gt; &quot;[RL with Human Feedback&#93;(tag:reinforcement_learning_from_human_feedback)&quot;. This is a fancy way of saying that the model now observes two humans in a conversation, one playing the role of a user, and another playing the role of &quot;the AI&quot;, demonstrating how the AI should respond in different situations. This clearly helps the model learn how dialogs work, and how to keep track of information across dialog states (something that is very hard to learn from just &quot;found&quot; data).		</description>		<dc:date>2023-01-03T09:15:16Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2022/12/tanishq_mathew_abraham_sur_twit">		<title>Tanishq Mathew Abraham sur Twitter : &quot;Are you wondering how large language models like ChatGPT and InstructGPT actually work? One of the secret ingredients is RLHF... Let&apos;s dive into how RLHF works in 8 tweets!&quot; / Twitter</title>		<link>http://www.semanlink.net/doc/2022/12/tanishq_mathew_abraham_sur_twit</link>		<dc:date>2022-12-28T17:44:47Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2022/12/illustrating_reinforcement_lear">		<title>Illustrating Reinforcement Learning from Human Feedback (RLHF)</title>		<link>http://www.semanlink.net/doc/2022/12/illustrating_reinforcement_lear</link>		<dc:date>2022-12-10T11:51:09Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2022/10/prithviraj_raj_ammanabrolu_su">		<title>Prithviraj (Raj) Ammanabrolu sur Twitter : &quot;The secret to aligning LMs to human preferences is reinforcement learning. 
...&quot;</title>		<link>http://www.semanlink.net/doc/2022/10/prithviraj_raj_ammanabrolu_su</link>		<dc:date>2022-10-06T01:56:53Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2019/12/meta_reinforcement_learning">		<title>Meta Reinforcement Learning</title>		<link>http://www.semanlink.net/doc/2019/12/meta_reinforcement_learning</link>		<dc:date>2019-12-07T11:26:22Z</dc:date>	</item>	<item rdf:about="http://www.semanlink.net/doc/2019/09/machine_learning_for_humans_pa">		<title>Machine Learning for Humans, Part 5: Reinforcement Learning</title>		<link>http://www.semanlink.net/doc/2019/09/machine_learning_for_humans_pa</link>		<dc:date>2019-09-23T23:36:26Z</dc:date>	</item>	<item rdf:about="https://spinningup.openai.com/en/latest/spinningup/keypapers.html">		<title>Key Papers in Deep RL — OpenAI - Spinning Up documentation</title>		<link>https://spinningup.openai.com/en/latest/spinningup/keypapers.html</link>		<dc:date>2018-11-09T13:56:17Z</dc:date>	</item>	<item rdf:about="https://sermanet.github.io/imitate/">		<title>Time-Contrastive Networks: Self-Supervised Learning from Video (2017)</title>		<link>https://sermanet.github.io/imitate/</link>		<description>Self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. 

&gt; We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images.
&gt; This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm.		</description>		<dc:date>2018-10-27T14:59:43Z</dc:date>	</item>	<item rdf:about="https://blog.insightdatascience.com/reinforcement-learning-from-scratch-819b65f074d8">		<title>Reinforcement Learning from scratch – Insight Data</title>		<link>https://blog.insightdatascience.com/reinforcement-learning-from-scratch-819b65f074d8</link>		<dc:date>2018-06-09T09:26:53Z</dc:date>	</item>	<item rdf:about="http://amid.fish/reproducing-deep-rl">		<title>Lessons Learned Reproducing a Deep Reinforcement Learning Paper</title>		<link>http://amid.fish/reproducing-deep-rl</link>		<dc:date>2018-04-10T13:33:49Z</dc:date>	</item>	<item rdf:about="https://deepmind.com/blog/learning-to-generate-images/">		<title>Learning to write programs that generate images | DeepMind</title>		<link>https://deepmind.com/blog/learning-to-generate-images/</link>		<description>This ability to interpret objects through the tools that created them gives us a richer understanding of the world and is an important aspect of our intelligence.		
</description>		<dc:date>2018-03-28T12:11:42Z</dc:date>	</item>	<item rdf:about="http://www.wildml.com/2018/02/introduction-to-learning-to-trade-with-reinforcement-learning/">		<title>Introduction to Learning to Trade with Reinforcement Learning – WildML</title>		<link>http://www.wildml.com/2018/02/introduction-to-learning-to-trade-with-reinforcement-learning/</link>		<dc:date>2018-02-11T12:20:30Z</dc:date>	</item>	<item rdf:about="https://blog.openai.com/evolution-strategies/">		<title>Evolution Strategies as a Scalable Alternative to Reinforcement Learning</title>		<link>https://blog.openai.com/evolution-strategies/</link>		<dc:date>2018-01-06T15:11:28Z</dc:date>	</item>	<item rdf:about="https://eng.uber.com/deep-neuroevolution/">		<title>Welcoming the Era of Deep Neuroevolution - Uber Engineering Blog</title>		<link>https://eng.uber.com/deep-neuroevolution/</link>		<description>&gt; a suite of five papers that support the emerging realization that neuroevolution, where neural networks are optimized through evolutionary algorithms, is also an effective method to train deep neural networks for reinforcement learning (RL) problems.		</description>		<dc:date>2017-12-19T09:26:01Z</dc:date>	</item>	<item rdf:about="https://deepmind.com/blog/alphago-zero-learning-scratch/">		<title>AlphaGo Zero: Learning from scratch | DeepMind</title>		<link>https://deepmind.com/blog/alphago-zero-learning-scratch/</link>		<dc:date>2017-10-18T22:43:19Z</dc:date>	</item></rdf:RDF>