The key ingredient of Implicit PRM is the reward representation, as demonstrated below:

<aside>
✨

***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e.,

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$

Define

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
$$

$q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
$$

Hence, **$q_\phi^t$** represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value.

The proposition indicates that when modeling

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
$$

to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, the process reward $r_\phi^t$ can be obtained by:

$$
r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
$$

Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
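
As an illustration, below is a minimal sketch of how such per-token process rewards could be scored with `transformers`, assuming access to the trained implicit PRM $\pi_\phi$ and its reference model $\pi_\text{ref}$. The checkpoint paths, the `beta` value, and the helper functions are placeholders for illustration, not this repository's actual scoring code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths; substitute the actual implicit PRM and its reference model.
PRM_PATH = "path/to/implicit-prm"
REF_PATH = "path/to/reference-model"

tokenizer = AutoTokenizer.from_pretrained(PRM_PATH)
prm = AutoModelForCausalLM.from_pretrained(PRM_PATH).eval()
ref = AutoModelForCausalLM.from_pretrained(REF_PATH).eval()

@torch.no_grad()
def token_logprobs(model, input_ids):
    """log pi(y_i | y_<i) for every token after the first one."""
    logits = model(input_ids).logits[:, :-1]  # predictions for tokens 2..T
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

@torch.no_grad()
def implicit_process_rewards(prompt: str, response: str, beta: float = 0.05):
    """Per-token rewards r_phi^t = beta * log [ pi_phi(y_t|y_<t) / pi_ref(y_t|y_<t) ]."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    rewards = beta * (token_logprobs(prm, ids) - token_logprobs(ref, ids))
    # Keep only response tokens; q_phi^t is the running sum of these values, and the
    # reward of a coarser "step" (e.g. one line of a solution) is the sum over its tokens.
    return rewards[:, prompt_len - 1 :]
```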

The proposition is **agnostic to specific choices of the training objective of ORMs**. It can be instantiated with different objectives just as in vanilla ORM training, with the only difference being substituting

$$
r_\phi \left( \mathbf{y} \right)
$$

with

$$
\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$
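
For example, a binary cross-entropy ORM objective over response-level correctness labels is one such instantiation. The sketch below is an assumption for illustration (the objective, function names, and `beta` value are not taken from this repository); it only shows where the substitution happens:

```python
import torch.nn.functional as F

def implicit_reward(policy_logprob_sum, ref_logprob_sum, beta=0.05):
    # r_phi(y) = beta * log [ pi_phi(y) / pi_ref(y) ], where each argument is the
    # summed log-likelihood of the full response under the corresponding causal LM.
    return beta * (policy_logprob_sum - ref_logprob_sum)

def orm_bce_loss(policy_logprob_sum, ref_logprob_sum, labels, beta=0.05):
    # Vanilla binary cross-entropy ORM objective, with r_phi(y) substituted by the
    # log-likelihood ratio above; `labels` is 1.0 for a correct response, 0.0 otherwise.
    rewards = implicit_reward(policy_logprob_sum, ref_logprob_sum, beta)
    return F.binary_cross_entropy_with_logits(rewards, labels)
```

Here `policy_logprob_sum` comes from the trainable model $\pi_\phi$ (with gradients) and `ref_logprob_sum` from the frozen reference $\pi_\text{ref}$, both summed over response tokens only.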