Archit Sharma on Twitter: "Direct Preference Optimization (DPO) allows you to fine-tune LMs directly from preferences via a simple classification loss, no RL required"
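
For context, a minimal sketch of the classification-style loss the tweet refers to (the DPO objective of Rafailov et al., 2023). This assumes PyTorch; the function name, the `beta` default, and the toy inputs are illustrative, and the summed per-sequence log-probabilities are assumed to be computed elsewhere:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss on summed per-sequence log-probabilities."""
    # Log-ratio of policy vs. frozen reference model for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Logistic (binary-classification) loss on the scaled margin:
    # push the policy to prefer the chosen response over the rejected one
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage: made-up log-probs for a batch of two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.4]))
```

No reward model or RL loop is needed: the preference pair itself supplies the binary label, which is what makes this a plain classification loss.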