
4.2 Action State

Let $A$ be the set of actions, with $A = R$, where $R$ is the set of resources in the workflow model. Let $a(s) = candidates(s.t)$ be the set of optional actions in state $s$, where $s.t$ denotes the task $t$ being allocated in state $s$.
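As a minimal sketch of this definition, the Python fragment below derives the action set $a(s)$ from the candidate resources of the task allocated in state $s$. The class `State`, the mapping `CANDIDATES`, and the task and resource names are illustrative assumptions, not identifiers from the original model.

```python
# Sketch of the action set a(s) = candidates(s.t).
# State, CANDIDATES, and all task/resource names are illustrative
# assumptions, not part of the original workflow model.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    t: str  # the task being allocated in this state

# Hypothetical mapping from each task to its qualified resources.
CANDIDATES = {
    "review_claim": {"r1", "r2"},
    "approve_claim": {"r2", "r3"},
}

def actions(s: State) -> set[str]:
    """Return a(s): the optional actions (candidate resources) in state s."""
    return CANDIDATES[s.t]
```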

4.3 Reward Function

In this paper, the objective is to minimize the flow time of business process cases, so the reward function is defined as:

$$
r =
\begin{cases}
\dfrac{1}{\sigma.fl}, & (t \text{ is None})\\[4pt]
0, & (\text{else})
\end{cases}
\tag{3}
$$

where $t$ is the task to be allocated, $\sigma$ is the current determinative case, and $\sigma.fl$ is the flow time of that case. The shorter the flow time $\sigma.fl$, the higher the immediate payoff $r$. The agent receives this reward when the last work item of the case is completed; otherwise the reward is zero.
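As a minimal sketch, the reward in Eq. (3) can be computed as follows. The attribute name `fl` and the convention that $t$ is `None` once the case completes follow the notation above; everything else is an illustrative assumption.

```python
# Sketch of the reward function in Eq. (3): the agent earns
# 1 / sigma.fl only when the case's last work item has completed
# (signalled here by t being None); otherwise the reward is zero.
def reward(t, sigma) -> float:
    if t is None:                 # last work item of the case completed
        return 1.0 / sigma.fl     # shorter flow time -> higher payoff
    return 0.0                    # intermediate allocations earn nothing
```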

5 Q-learning Algorithm for MDPs

Q-learning is a suitable algorithm for solving problems modeled as MDPs [15]. This section describes the proposed Q-learning algorithm in detail. In order to verify the influence of previously allocated resources on the candidate resources, the Q-learning algorithm without SR is used as the baseline experiment. The first part describes this baseline, and the second part introduces the Q-learning algorithm with SR.

5.1 The Q-learning Algorithm without SR

In order to realize the Q-learning algorithm, the evaluation function $Q(s, a)$ and its estimate $\hat{Q}(s, a)$ are defined [15]. According to the MDP formulation, the evaluation function $Q(s, a)$ can be written as:

$$
Q(s, a) \equiv r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')
\tag{4}
$$

We can use $\hat{Q}$ to approximate the $Q$ function; the agent then updates $\hat{Q}(s, a)$ according to the following rule:

$$
\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')
\tag{5}
$$
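A minimal sketch of this update, assuming a dictionary-backed Q-table keyed by $(s, a)$ pairs; the data structure, the default value for unvisited pairs, and the discount value are assumptions, not specified here.

```python
# Deterministic Q-learning update of Eq. (5):
#   Q(s, a) <- r + gamma * max_a' Q(s', a')
GAMMA = 0.9  # discount factor; an assumed value, not fixed by the paper

def update_q(q_table: dict, s, a, r: float, s_next, next_actions) -> None:
    # Unvisited pairs default to 0.0, a common initialization choice.
    best_next = max((q_table.get((s_next, a2), 0.0) for a2 in next_actions),
                    default=0.0)
    q_table[(s, a)] = r + GAMMA * best_next
```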

For the situation of nondeterministic rewards and actions, formula (5) is replaced by the following formula:

$$
\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]
\tag{6}
$$

where

$$
\alpha_n = \frac{1}{1 + visits_n(s, a)}
$$

and $visits_n(s, a)$ is the number of times the state-action pair $(s, a)$ has been visited up to and including the $n$-th iteration.
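Putting Eq. (6) and the learning rate $\alpha_n$ together, a minimal sketch of the nondeterministic update could look as follows; the visit counter is kept in a dictionary alongside the Q-table, and both containers, as well as the discount value, are illustrative assumptions.

```python
# Nondeterministic Q-learning update of Eq. (6), using the
# visit-count learning rate alpha_n = 1 / (1 + visits_n(s, a)).
GAMMA = 0.9  # discount factor; an assumed value, not fixed by the paper

def update_q(q_table, visits, s, a, r, s_next, next_actions):
    visits[(s, a)] = visits.get((s, a), 0) + 1   # count this visit
    alpha = 1.0 / (1.0 + visits[(s, a)])
    best_next = max((q_table.get((s_next, a2), 0.0) for a2 in next_actions),
                    default=0.0)
    old = q_table.get((s, a), 0.0)
    # Blend the previous estimate with the new sample; alpha decays as
    # (s, a) is visited more often, so the estimates stabilize.
    q_table[(s, a)] = (1.0 - alpha) * old + alpha * (r + GAMMA * best_next)
```

The decaying $\alpha_n$ gives later samples progressively less weight, which is what allows the estimates to converge despite nondeterministic rewards and transitions.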