In Section 3, this problem is generalized to the one where the gambler does not insist
on the very largest value, but rather is satisfied provided that his fortune upon stopping
is one of the $m$ largest values attained. More specifically, when the allowance measure is $m$
($m$ is a given positive integer), the gambler seeks an optimal stopping time $\sigma_{m}^{*}$ such that
$\Pr\{X_{\sigma_{m}^{*}}\geq\max_{0\leq n\leq T}X_{n}-(m-1)\mid X_{0}=i\}=\max_{\sigma\in C}\Pr\{X_{\sigma}\geq\max_{0\leq n\leq T}X_{n}-(m-1)\mid X_{0}=i\}$ . $(1.3)$
Obviously when $m=1$ , this problem is reduced to the one considered
in Section 2. It
can be shown that there exists an optimal stopping time $\sigma_{m}^{*}$ of the following form:
$\sigma_{m}^{*}=\min\{n : X_{n}=\max_{0\leq j\leq n}X_{j}-(m-1)$ , $X_{n}<k_{m}^{*}\}$ ,
where $k_{m}^{*}$ is a positive integer defined as
$k_{m}^{*}=\min\{k : \frac{1-\theta^{k}}{1-\theta^{N-m+1}}\geq\frac{\theta^{k}(1-\theta^{m})}{1-\theta^{k+m}}\}$ .
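As a rough numerical illustration (not part of the original text), the threshold $k_{m}^{*}$ can be evaluated directly from this definition; the sketch below assumes $\theta=q/p$ with $p\neq 1/2$, as in the calculations of Section 3, and uses made-up sample parameters.

p, N, m = 0.45, 30, 2            # sample values only
theta = (1 - p) / p              # theta = q/p, as assumed from Section 1
k_star = min(k for k in range(1, N - m + 2)
             if (1 - theta**k) / (1 - theta**(N - m + 1))
             >= theta**k * (1 - theta**m) / (1 - theta**(k + m)))
print(k_star)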
2 Stopping on the largest
Suppose that we enter a casino having an initial fortune of $x$ units
and bet one unit each
time. Then we either win one unit with probability $p$ or lose one
unit with probability
$q=1-p$, where $0<p<1$. The gambling process $\{X_{n};n\geq 0\}$ , a
Markov chain,
continues until it reaches one of its absorbing states $0$ or $N$ , but we can stop the process before
absorption, if we like. If we decide to stop at time $n$ , then we are said to succeed if
$X_{n}=\max_{0\leq j\leq T}X_{j}$ . The objective of the problem is to
find a stopping policy that will
maximize the probability of success.
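To make the success criterion concrete, here is a minimal simulation sketch (illustrative only; the parameters and the naive rule of stopping at time $0$ are not from the paper). The walk is run until absorption, and a stopping rule succeeds if the fortune at the chosen time equals the overall maximum of the path.

import random

def play(x, p, N):
    # run the walk until absorption at 0 or N and return the whole path
    path = [x]
    while 0 < path[-1] < N:
        path.append(path[-1] + (1 if random.random() < p else -1))
    return path

p, N, x0, runs = 0.48, 20, 5, 50_000     # made-up sample values
# the naive rule "stop at time 0": success means the path never exceeds x0
hits = sum(max(play(x0, p, N)) == x0 for _ in range(runs))
print(hits / runs)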
Suppose that we have observed the values of $X_{0},$ $X_{1},$ $\ldots,$ $X_{n}$ . Then a serious decision of
whether to stop or to continue takes place at time $n$ when $X_{n}$ is the maximum value observed so
far, that is, $X_{n}=\max(X_{0}, X_{1},\ldots, X_{n})$ . This decision
epoch is simply referred to as state
$x$ if $X_{n}=x$ , because, as a bit of consideration shows, the
decision depends neither on the
values of $X_{0},$ $X_{1},$ $\cdots$ , $X_{n-1}$ nor on the number of
plays already made.
Let $v(x)$ be the probability of success starting from state $x$ , and
let $s(x)$ and $c(x)$ be
respectively the probabilities of success when we stop in state $x$ and when we continue gambling
in an optimal manner in state $x$ ; then from the principle of optimality,
$v(x)=\max\{s(x), c(x)\}$ , $0<x<N$ (2.1)
where
$s(x)=1-P_{x}(1, x)$ (2.2)
$c(x)=pv(x+1)+q(1-s(x-1))v(x)$ , $0<x<N$ (2.3)
with the boundary condition $v(0)=v(N)=1$ . (2.2) is immediate from (1.2). (2.3)
follows because the next possible state visited (after leaving state $x
$ ) is either $x+1$ or
$x$ depending on whether we win one unit at the next play or lose one
unit at the next
play but attain state $x$ again before going broke. To solve the optimality equation (2.1)-(2.3),
we invoke a result of positive dynamic programming (see, for details, Chapter IV
of Ross [2]). The main result
(Theorem 1.2, p.75 of Ross) is that if the problem fits the
framework of the positive dynamic programming, then a given stationary
policy is optimal
if its value
function satisfies the optimality equation. This gives us a method of
checking
whether or not a guessed policy can be optimal and is particularly
useful in cases in which
there exists an obvious optimal policy. Our problem in fact fits the
framework of positive
dynamic programming with the state being the gambler’s fortune, since
if we suppose
that a reward of 1 is earned if we stop at the overall largest value and all
other rewards are zero,
then the expected total reward equals the probability of success.
Theorem 2.1 Let $f$ be a stationary policy which, when the decision
process is in
state $x$ , chooses to stop if and only if $x<x^{*}$ , where
$x^{*}= \min\{x$ : $\frac{1-\theta^{x}}{1-\theta^{N}}\geq
\frac{\theta^{x}-\theta^{x+1}}{1-\theta^{x+1}}\}$ . (2.4)
Then $f$ is optimal.
Proof Let $v_{f}(x)$ denote the probability of success starting from
state $x$ when policy $f$ is
employed. Once the decision process leaves state $x(\geq x^{*})$ , it
never stops until absorption
takes place under $f$ . Thus we soon obtain
$v_{f}(x)=\begin{cases} s(x), & \text{if } x<x^{*}\\ P_{x}(N-x, x), & \text{if } x\geq x^{*}, \end{cases}$ (2.5)
and it is easy to check that $s(x)$ is decreasing in $x$ , while $P_{x}
(N-x, x)$ is increasing in $x$
and that $x^{*}$ can be described as
$x^{*}= \min\{x : s(x)\leq P_{x}(N-x, x)\}$ .
Therefore, to prove $f$ is optimal, it suffices to show that, from (2.1)-(2.3), $v_{f}(x)$ given in
(2.5) satisfies
$v_{f}(x)=\max\{s(x),\ pv_{f}(x+1)+q(1-s(x-1))v_{f}(x)\}$ , $0<x<N$. (2.6)
We now show this by distinguishing three cases.
Case(i): $x<x^{*}-1$
Since $v_{f}(x)=s(x)$ , to show the validity of (2.6), it suffices to
show that
$v_{f}(x)\geq pv_{f}(x+1)+q(1-s(x-1))v_{f}(x)$ ,
or
$s(x)\geq ps(x+1)+q(1-s(x-1))s(x)$ ,
or equivalently
$\{s(x)-s(x+1)\}+\theta s(x-1)s(x)\geq 0$ ,
which is immediate since $s(x)$ is non-negative and decreasing in $x
$ .
Case(ii): $x\geq x^{*}$
Since $v_{f}(x)=P_{x}(N-x, x)\geq s(x)$ , it suffices to show that
$v_{f}(x)\geq pv_{f}(x+1)+q(1-s(x-1))v_{f}(x)$ ,
or equivalently
$P_{x}(N-x, x)\geq pP_{x+1}(N-x-1, x+1)+qP_{x-1}(1, x-1)P_{x}(N-x, x)$ ,
which in fact holds with equality, because a straightforward calculation from (1.2) leads to
the following identity:
$P_{x}(N-x, x)=pP_{x+1}(N-x-1, x+1)+qP_{x-1}(1, x-1)P_{x}(N-x, x)$ . (2.7)
It is easy to interpret this identity probabilistically by
conditioning on the result of the
first play after leaving state $x$ .
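For completeness, here is a short verification (not in the original), assuming that (1.2) is the standard gambler's-ruin formula $P_{x}(a, b)=(1-\theta^{b})/(1-\theta^{a+b})$ with $\theta=q/p$ : the right-hand side of (2.7) equals
$p\frac{1-\theta^{x+1}}{1-\theta^{N}}+q\frac{1-\theta^{x-1}}{1-\theta^{x}}\cdot\frac{1-\theta^{x}}{1-\theta^{N}}=\frac{1-\theta^{x-1}(p\theta^{2}+q)}{1-\theta^{N}}=\frac{1-\theta^{x}}{1-\theta^{N}}=P_{x}(N-x, x)$ ,
since $p\theta^{2}+q=q(q+p)/p=\theta$ .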
Case(iii): $x=x^{*}-1$
Since $v_{f}(x^{*}-1)=s(x^{*}-1)$ , it suffices to show that
$v_{f}(x^{*}-1)\geq pv_{f}(x^{*})+q(1-s(x^{*}-2))v_{f}(x^{*}-1)$ ,
or
$s(x^{*}-1)\geq pP_{x^{*}}(N-x^{*}, x^{*})+qP_{x^{*}-2}(1, x^{*}-2)s(x^{*}-1)$ ,
or
$s(x^{*}-1)\geq\frac{pP_{x^{*}}(N-x^{*}, x^{*})}{1-qP_{x^{*}-2}(1, x^{*}-2)}$ ,
or equivalently from (2.7)
$s(x^{*}-1)\geq P_{x^{*}-1}(N-x^{*}+1, x^{*}-1)$ ,
which is obvious from the definition of $x^{*}$ . Thus the proof is
complete.
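As an illustrative check of Theorem 2.1 (not part of the original text), the sketch below computes $x^{*}$ from (2.4) for sample parameters, builds $v_{f}$ as in (2.5), and verifies numerically that it satisfies the optimality equation (2.6), in the spirit of the positive dynamic programming criterion cited above; it assumes the gambler's-ruin formula $P_{x}(a, b)=(1-\theta^{b})/(1-\theta^{a+b})$ with $\theta=q/p$, presumed to be the content of (1.2).

p, N = 0.45, 20                  # sample values only; p != 1/2 so that theta != 1
q = 1.0 - p
theta = q / p

def hit(a, b):                   # P_x(a, b) = (1 - theta**b)/(1 - theta**(a+b)), assumed form of (1.2)
    return (1 - theta**b) / (1 - theta**(a + b))

def s(x):                        # stopping value (2.2)
    return 1 - hit(1, x)

x_star = min(x for x in range(1, N)
             if (1 - theta**x) / (1 - theta**N)
             >= (theta**x - theta**(x + 1)) / (1 - theta**(x + 1)))      # threshold (2.4)

v_f = {x: (s(x) if x < x_star else hit(N - x, x)) for x in range(1, N)}  # (2.5)
v_f[0] = v_f[N] = 1.0

for x in range(1, N):            # check the optimality equation (2.6) numerically
    cont = p * v_f[x + 1] + q * (1 - s(x - 1)) * v_f[x]
    assert abs(v_f[x] - max(s(x), cont)) < 1e-12
print("x* =", x_star)            # prints 13 for these sample values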
3 Stopping on one of $m$ largest
In this section, we are concerned with stopping on one of the $m$ largest values of the gambling
process. That is, we wish to find an optimal stopping time $\sigma_{m}
^{*}$ which satisfies (1.3).
Suppose that we have observed the values of $X_{0},$ $X_{1},$ $\ldots,$ $X_{n}$ . Then a serious decision of
whether to stop or to continue takes place at time $n$ if $X_{n}\geq\max_{0\leq j\leq n}X_{j}-(m-1)$ . We
denote this state by $(x, i)$ if $X_{n}=x$ and $X_{n}=\max_{0\leq j\leq n}X_{j}-(i-1)$ , $0<x<N-m$ ,
$i=1,2,\ldots,m$ . Our trial is now regarded as successful if we stop at time $n$ and
$X_{n}\geq\max_{0\leq j\leq T}X_{j}-(m-1)$ . Let $v_{i}(x)$ , $1\leq i\leq m$ , be the probability of success starting
from state $(x, i)$ , and let $s_{i}(x)$ and $c_{i}(x)$ be respectively the probabilities of success when
we decide to stop and when we decide to continue in an optimal manner in state $(x, i)$ ;
then from the principle of optimality,
$v_{i}(x)=\max\{s_{i}(x), c_{i}(x)\}$ , $1\leq i\leq m$ , $0<x<N-m$
(3.1)
where
$s_{i}(x)\equiv s(x)=1-P_{x}(m, x)$ , $1\leq i\leq m$ (3.2)
and
$c_{1}(x)=pv_{1}(x+1)+qv_{2}(x-1)$ (3.3)
$c_{i}(x)=pv_{i-1}(x+1)+qv_{i+1}(x-1)$ , $1<i<m$ (3.4)
$c_{m}(x)=pv_{m-1}(x+1)+qP_{x-1}(1, x-1)v_{m}(x)$ . (3.5)
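As a small illustration (made-up trajectory, not from the paper), the following sketch tracks the gap between the running maximum and the current fortune and records the decision states $(x, i)$ appearing in (3.1)-(3.5):

m = 3
path = [5, 6, 7, 6, 5, 4, 5, 6]          # a made-up trajectory
run_max, states = path[0], []
for x in path:
    run_max = max(run_max, x)
    gap = run_max - x
    if gap <= m - 1:                      # only the "serious" decision epochs
        states.append((x, gap + 1))
print(states)   # [(5, 1), (6, 1), (7, 1), (6, 2), (5, 3), (5, 3), (6, 2)]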
Assume that we choose to continue in state $(x, i)$ , $1\leq i<m$ , if $s_{i}(x)=c_{i}(x)$ . Then, as
the next lemma shows, we never stop in state $(x, i),$ $1\leq i<m$ .
Lemma 3.1 For $1\leq i<m$ ,
$c_{i}(x)\geq s_{i}(x)$ .
Proof. Since, from the definition,
$c_{i}(x)\geq ps(x+1)+qs(x-1)$ , $1\leq i<m$ ,
to show $c_{i}(x)\geq s_{i}(x)$ , it suffices to show that
$ps(x+1)+qs(x-1)\geq s(x)$ ,
or
$\frac{1-\theta^{x}}{1-\theta^{x+m}}-\{p\frac{1-\theta^{x+1}}{1-\theta^{x+m+1}}+q\frac{1-\theta^{x-1}}{1-\theta^{x+m-1}}\}\geq 0$ .
(3.6)
Because $p=1/(1+\theta),$ $q=\theta/(1+\theta)$ , after a bit of
calculation, the left-hand side of (3.6)
becomes
$\frac{\theta^{2x+m-1}\left(\frac{1-\theta^{m}}{1-\theta}\right)}{\left(\frac{1-\theta^{x+m-1}}{1-\theta}\right)\left(\frac{1-\theta^{x+m}}{1-\theta}\right)\left(\frac{1-\theta^{x+m+1}}{1-\theta}\right)}$ ,
which is obviously non-negative and proves (3.6).
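A quick numerical spot-check (not in the original) of this closed form, for arbitrary sample values of $\theta$ , $x$ and $m$ :

theta, x, m = 1.3, 5, 3                   # arbitrary sample values
p, q = 1 / (1 + theta), theta / (1 + theta)
lhs = ((1 - theta**x) / (1 - theta**(x + m))
       - p * (1 - theta**(x + 1)) / (1 - theta**(x + m + 1))
       - q * (1 - theta**(x - 1)) / (1 - theta**(x + m - 1)))
rhs = (theta**(2*x + m - 1) * (1 - theta**m) / (1 - theta)) / (
    ((1 - theta**(x + m - 1)) / (1 - theta))
    * ((1 - theta**(x + m)) / (1 - theta))
    * ((1 - theta**(x + m + 1)) / (1 - theta)))
assert abs(lhs - rhs) < 1e-12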
From Lemma 3.1, we are only concerned with the optimal decision in state $(x, m)$ . The
following lemma expresses $c_{m}(x)$ in terms of $v_{m}(\cdot)$ .
Lemma 3.2 Define $P=(1-\theta^{m-1})/(1-\theta^{m})$ . Then, for $x
\leq N-2m$ ,
$c_{m}(x) = \{p\theta P+qP_{x-1}(1, x-1)\}v_{m}(x)$
$+p(1-\theta P)\sum_{y=x+1}^{N-2m+1}P^{y-x-1}(1-P)v_{m}(y)$
$+p(1-\theta P)P^{N-x-2m+1}$ .
Proof. Let $p(x, y)$ be the transition probability from state $(x, m)$
to state $(y, m),$ $y\geq x$ .
Then straightforward calculation yields
$p(x, y) = \begin{cases} qP_{x-1}(1, x-1)+p[1-P_{x+1}(m-1, 1)], & \text{if } y=x\\ pP_{x+1}(m-1, 1)[\prod_{i=0}^{y-x-2}P_{x+m+i}(1, m-1)][1-P_{y+m-1}(1, m-1)], & \text{if } x<y\leq N-2m+1\end{cases}$
$= \begin{cases} qP_{x-1}(1, x-1)+p\theta P, & \text{if } y=x\\ p(1-\theta P)P^{y-x-1}(1-P), & \text{if } x<y\leq N-2m+1. \end{cases}$
The remaining probability
$1-\sum_{y=x}^{N-2m+1}p(x, y)$
consists of absorption at $0$ , which contributes nothing to $c_{m}(x)$ , together with the term
$p(1-\theta P)P^{N-x-2m+1}$ , which corresponds to the probability that the gambling process $\{X_{n}\}$
reaches the level $N-m+1$ without visiting another state $(y, m)$ ; stopping upon first reaching that
level guarantees success, since the overall maximum can never exceed $N$ . Thus the proof is complete.
We now have the following identity.
Lemma 3.3 For $m\geq 1$ and $x\leq N-2m$,
$P_{x}(N-m+1-x, x) = \{p\theta P+qP_{x-1}(1, x-1)\}P_{x}(N-m+1-x, x)$
$+p(1-\theta P)\sum_{y=x+1}^{N-2m+1}P^{y-x-1}(1-P)P_{y}(N-m+1-y, y)$
$+p(1-\theta P)P^{N-x-2m+1}$ .
Proof. We can prove this by straightforward calculation from (1.2),
but this identity is
quite intuitive from Lemma 3.2.
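The identity can also be spot-checked numerically (an illustrative sketch, again assuming the gambler's-ruin formula $P_{x}(a, b)=(1-\theta^{b})/(1-\theta^{a+b})$ for (1.2), with made-up parameters):

theta, N, m = 1.4, 30, 3                  # made-up sample values
p, q = 1 / (1 + theta), theta / (1 + theta)
P = (1 - theta**(m - 1)) / (1 - theta**m)

def hit(a, b):                            # P_x(a, b) = (1 - theta**b)/(1 - theta**(a+b))
    return (1 - theta**b) / (1 - theta**(a + b))

for x in range(1, N - 2*m + 1):
    lhs = hit(N - m + 1 - x, x)
    rhs = ((p*theta*P + q*hit(1, x - 1)) * lhs
           + p*(1 - theta*P) * sum(P**(y - x - 1) * (1 - P) * hit(N - m + 1 - y, y)
                                   for y in range(x + 1, N - 2*m + 2))
           + p*(1 - theta*P) * P**(N - x - 2*m + 1))
    assert abs(lhs - rhs) < 1e-9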
We are now ready to prove the main result.
Theorem 3.4 Let $f$ be a stationary policy which, when the decision
process is in
state $(x, m)$ , chooses to stop if and only if $x<x^{*}$ , where
$x^{*}=\min\{x : \frac{1-\theta^{x}}{1-\theta^{N-m+1}}\geq\frac{\theta^{x}-\theta^{x+m}}{1-\theta^{x+m}}\}$ .
Then $f$ is optimal.
Proof Let $v_{f}(x)$ denote the probability of success starting from
state $(x, m)$ when policy
$f$ is employed. Once the decision process leaves state $(x, m)$ for $x
\geq x^{*}$ , it never stops
until absorption takes place under $f$ . Thus we have
$v_{f}(x)=\begin{cases} s(x), & \text{if } x<x^{*}\\ P_{x}(N-m+1-x, x), & \text{if } x\geq x^{*}, \end{cases}$ (3.7)
and it is easy to check that $s(x)$ is decreasing in $x$ , while $P_{x}
(N-m+1-x, x)$ is increasing
in $x$ and that $x^{*}$ can be described as
$x^{*}=\min\{x : s(x)\leq P_{x}(N-m+1-x, x)\}$ .
Therefore, to prove $f$ is optimal, it suffices to show that $v_{f}(x)$ given in (3.7) satisfies
$v_{f}(x)=\max\{s(x), c_{f}(x)\}$ , $0<x<N-m+1$ ,
where
$c_{f}(x) = \{p\theta P+qP_{x-1}(1, x-1)\}v_{f}(x)$
$+p(1-\theta P)\sum_{y=x+1}^{N-2m+1}P^{y-x-1}(1-P)v_{f}(y)$
$+p(1-\theta P)P^{N-x-2m+1}$ .
We can show this in the same way as in the proof of Theorem 2.1, appealing to Lemma 3.3.
Finally we give $V(x)$ , the probability of success starting from $x$ units of fortune.
Lemma 3.5 Let $P$ be as defined in Lemma 3.2. Then, for $m-1\leq x\leq
N-m$,
$V(x)=\sum_{y=x-m+1}^{N-2m+1}P^{y-x+m-1}(1-P)v_{f}(y)+P^{N-m-x+1}$ .
Proof This can be derived easily by conditioning on the first state $
(y, m)$ visited.
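As a final illustrative sketch (sample parameters, not from the paper), $V(x)$ can be evaluated from Lemma 3.5 once $x^{*}$ and $v_{f}$ of Theorem 3.4 have been computed; the gambler's-ruin formula $P_{x}(a, b)=(1-\theta^{b})/(1-\theta^{a+b})$ is again assumed for (1.2).

theta, N, m = 1.2, 40, 3                  # made-up sample values
P = (1 - theta**(m - 1)) / (1 - theta**m)

def s(x):                                 # stopping value (3.2): 1 - P_x(m, x)
    return 1 - (1 - theta**x) / (1 - theta**(x + m))

x_star = min(x for x in range(1, N - m + 1)
             if (1 - theta**x) / (1 - theta**(N - m + 1))
             >= (theta**x - theta**(x + m)) / (1 - theta**(x + m)))

def v_f(y):                               # value of the threshold policy, as in (3.7)
    return s(y) if y < x_star else (1 - theta**y) / (1 - theta**(N - m + 1))

def V(x):                                 # Lemma 3.5
    return sum(P**(y - x + m - 1) * (1 - P) * v_f(y)
               for y in range(x - m + 1, N - 2*m + 2)) + P**(N - m - x + 1)

print([round(V(x), 4) for x in range(m - 1, N - m + 1)])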
References
[1] Ross, S. M., "Dynamic programming and gambling models", Adv. Appl. Probab. 6, 593-606, 1974.
[2] Ross, S. M., Introduction to Stochastic Dynamic Programming, Academic Press, New York, 1983.