Direct Preference Optimization: Your Language Model is Secretly a Reward Model
RLHF is used to align language models with human preferences, but it is a complex and often unstable procedure. The proposed algorithm, DPO, is stable,...
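For reference (this is not part of the excerpt, just the objective as given in the DPO paper), DPO trains the policy directly on preference pairs instead of fitting a separate reward model:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the preferred and dispreferred responses for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) policy, $\beta$ controls how far the policy may drift from the reference, and $\sigma$ is the logistic function.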
Contributions:
As of November 2023, JupyterLab still does not support working with virtual environments. Using the global environment causes all sorts of trouble. It is possib...
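Since the excerpt is cut off, here is a minimal sketch of the common workaround I assume it refers to: registering the virtual environment's interpreter as a named Jupyter kernel via ipykernel, so a globally installed JupyterLab can still run notebooks inside the venv. The kernel name and display name below are placeholders.

```python
# Run this with the virtual environment's Python interpreter,
# after installing ipykernel inside that environment.
# Equivalent shell command:
#   python -m ipykernel install --user --name my-venv --display-name "Python (my-venv)"
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "-m", "ipykernel", "install",
        "--user",                       # install into the user's kernel directory
        "--name", "my-venv",            # placeholder kernel name
        "--display-name", "Python (my-venv)",  # placeholder display name
    ],
    check=True,  # raise if kernel registration fails
)
```

After this, the environment shows up in JupyterLab's kernel picker under the chosen display name, even when JupyterLab itself lives in the global environment.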
The paper provides details about the data used in the pre-training phase.
This article includes my notes on the Llama 2 paper. All images, if not stated otherwise, are from the paper.