The key ingredient of Implicit PRM is the reward representation, as demonstrated below:

<aside>
✨

***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e.,

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$

Define

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
$$

$q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
$$

Hence, **$q_\phi^t$** represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value.

The proposition indicates that when modeling

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
$$

to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, the process reward $r_\phi^t$ can be obtained by:

$$
r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
$$

Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
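
As an illustration, below is a minimal sketch of how such per-token process rewards could be scored with `transformers`, assuming access to the trained implicit PRM $\pi_\phi$ and its reference model $\pi_\text{ref}$. The checkpoint paths, the `beta` value, and the helper functions are placeholders for illustration, not this repository's actual scoring code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths; substitute the actual implicit PRM and its reference model.
PRM_PATH = "path/to/implicit-prm"
REF_PATH = "path/to/reference-model"

tokenizer = AutoTokenizer.from_pretrained(PRM_PATH)
prm = AutoModelForCausalLM.from_pretrained(PRM_PATH).eval()
ref = AutoModelForCausalLM.from_pretrained(REF_PATH).eval()

@torch.no_grad()
def token_logprobs(model, input_ids):
    """log pi(y_i | y_<i) for every token after the first one."""
    logits = model(input_ids).logits[:, :-1]  # predictions for tokens 2..T
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

@torch.no_grad()
def implicit_process_rewards(prompt: str, response: str, beta: float = 0.05):
    """Per-token rewards r_phi^t = beta * log [ pi_phi(y_t|y_<t) / pi_ref(y_t|y_<t) ]."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    rewards = beta * (token_logprobs(prm, ids) - token_logprobs(ref, ids))
    # Keep only response tokens; q_phi^t is the running sum of these values, and the
    # reward of a coarser "step" (e.g. one line of a solution) is the sum over its tokens.
    return rewards[:, prompt_len - 1 :]
```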

The proposition is **agnostic to specific choices of the training objective of ORMs**. It can be instantiated with different objectives just as in vanilla ORM training, with the only difference being substituting

$$
r_\phi \left( \mathbf{y} \right)
$$

with

$$
\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$
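
For example, a binary cross-entropy ORM objective over response-level correctness labels is one such instantiation. The sketch below is an assumption for illustration (the objective, function names, and `beta` value are not taken from this repository); it only shows where the substitution happens:

```python
import torch.nn.functional as F

def implicit_reward(policy_logprob_sum, ref_logprob_sum, beta=0.05):
    # r_phi(y) = beta * log [ pi_phi(y) / pi_ref(y) ], where each argument is the
    # summed log-likelihood of the full response under the corresponding causal LM.
    return beta * (policy_logprob_sum - ref_logprob_sum)

def orm_bce_loss(policy_logprob_sum, ref_logprob_sum, labels, beta=0.05):
    # Vanilla binary cross-entropy ORM objective, with r_phi(y) substituted by the
    # log-likelihood ratio above; `labels` is 1.0 for a correct response, 0.0 otherwise.
    rewards = implicit_reward(policy_logprob_sum, ref_logprob_sum, beta)
    return F.binary_cross_entropy_with_logits(rewards, labels)
```

Here `policy_logprob_sum` comes from the trainable model $\pi_\phi$ (with gradients) and `ref_logprob_sum` from the frozen reference $\pi_\text{ref}$, both summed over response tokens only.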