Q-learning
We will now look at a popular reinforcement learning algorithm, called Q-learning. Q-learning is used to determine an optimal action-selection policy for a given finite Markov decision process. A Markov decision process is defined by a state space, S; an action space, A; a set of immediate rewards, R; a transition probability for the next state, s(t+1), given the current state, s(t), and the current action, a(t), written P(s(t+1) | s(t), a(t)); and a discount factor, γ. The following diagram illustrates a Markov decision process, where the next state depends on the current state and the action taken in the current state:
*(Figure: a Markov decision process — states, actions, and rewards, with the next state depending on the current state and the action taken in it.)*
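To make these components concrete, here is a minimal sketch of a toy Markov decision process in Python; the two states, two actions, transition probabilities, and rewards are purely illustrative assumptions, not values from this chapter:

```python
import numpy as np

# Toy MDP with 2 states and 2 actions (illustrative values only).
n_states, n_actions = 2, 2

# P[s, a, s_next] = probability of moving to s_next from state s under action a.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])

# R[s, a] = immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

gamma = 0.9  # discount factor applied to future rewards
```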
Let's suppose that we have a sequence of states, actions, and corresponding rewards, as follows:
$$s(0), a(0), r(0), \; s(1), a(1), r(1), \; s(2), a(2), r(2), \; \ldots, \; s(T), a(T), r(T)$$
If we consider the long-term reward, R(t), at step t, it is equal to the sum of the immediate rewards at each step, from t until the end of the episode, as follows:
$$R(t) = r(t) + r(t+1) + r(t+2) + \ldots + r(T)$$
Now, a Markov decision process is a random process, and it is not possible to get the same next state, s(t+1), from s(t) and a(t) every time; so, we apply a discount factor, γ, to future rewards. This means that the long-term reward can be better represented as follows:
$$R(t) = r(t) + \gamma \, r(t+1) + \gamma^{2} \, r(t+2) + \ldots + \gamma^{T-t} \, r(T) \qquad (1)$$
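As a quick numerical illustration of the discounted long-term reward, the following sketch sums γ^k · r(t+k) over a short, made-up reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of immediate rewards from step t to the end, each discounted by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Made-up immediate rewards r(t), r(t+1), r(t+2)
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```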
Since the immediate reward at time step t is already realized, to maximize the long-term reward we need to maximize the long-term reward from time step t+1 onwards (that is, R(t+1)) by choosing an optimal action. The maximum long-term reward expected at a state s(t) by taking an action a(t) is represented by the following Q-function:
$$Q(s(t), a(t)) = E\big[\, r(t) + \gamma \max_{a} Q(s(t+1), a) \,\big]$$
At each state, s ∈ S, the agent in Q-learning tries to take the action that maximizes its long-term reward. The Q-learning algorithm is an iterative process, whose update rule is as follows:
$$Q(s(t), a(t)) \leftarrow (1 - \alpha)\, Q(s(t), a(t)) + \alpha \big[\, r(t) + \gamma \max_{a} Q(s(t+1), a) \,\big]$$
As you can see, the algorithm is inspired by the notion of a long-term reward, as expressed in (1).
The overall cumulative reward, Q(s(t), a(t)), of taking action a(t) in state s(t) depends on the immediate reward, r(t), and the maximum long-term reward that we can hope for in the new state, s(t+1). In a Markov decision process, the new state s(t+1) is stochastically dependent on the current state, s(t), and the action taken, a(t), through a probability mass/density function of the form P(s(t+1) | s(t), a(t)).
The algorithm keeps updating the expected long-term cumulative reward by taking a weighted average of the old expectation and the new long-term reward, based on the value of the learning rate, α.
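A minimal tabular sketch of this update rule is shown below; the Q-table sizes, the hyperparameter values, and the sample transition are assumptions for illustration, not the book's implementation:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: weighted average of the old estimate and the
    new target r + gamma * max_a' Q(s_next, a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Q-table for a toy problem with 5 states and 2 actions (illustrative sizes).
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```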
Once we have built the Q(s, a) function through this iterative algorithm, while playing the game we can, for a given state s, take the best action, a*, as the one that maximizes the Q-function:
$$a^{*} = \arg\max_{a} Q(s, a)$$
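With a learned Q-table, this greedy policy can be read off with an argmax over the actions; a small sketch, assuming the `Q` array from the previous snippet:

```python
import numpy as np

def greedy_action(Q, s):
    """Pick the action with the highest estimated long-term reward in state s."""
    return int(np.argmax(Q[s]))

best_a = greedy_action(Q, s=0)
```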