3-6: Q-learning

Q-learning is a model-free approach. It does not build or use models of the transition function T or the reward function R. Instead, Q-learning builds a table of utility values, or Q-values, updating it at each step and using it to select the best action based on what it has learned so far. A key strength of Q-learning is that, given enough exploration, it is guaranteed to converge to an optimal policy.

What is Q?

Q is a function like T and R; however, it represents the value of taking a specific action in a specific state: the immediate reward plus the discounted future rewards. In other words, Q is not greedy; it accounts both for what happens today and for what happens tomorrow.

We use Q to generate a policy, pi, by choosing, in each state s, the action a that maximizes the Q-value: pi(s) = argmax over a of Q[s, a]. After enough iterations, pi converges to the optimal policy, pi*.
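The argmax step above can be sketched as follows. The Q table here is a made-up example (rows are states, columns are actions), not values from the lecture:

```python
import numpy as np

# Hypothetical learned Q table: rows index states, columns index actions.
Q = np.array([
    [0.1, 0.5, 0.2],   # state 0
    [0.9, 0.3, 0.4],   # state 1
    [0.0, 0.2, 0.8],   # state 2
])

def policy(state):
    """pi(s) = argmax over actions a of Q[s, a]."""
    return int(np.argmax(Q[state]))

print([policy(s) for s in range(3)])  # → [1, 0, 2]
```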


Q-learning procedure

A series of things occurs when training our Q-learning model and generating our Q table:

  • First, we select our training data
  • We iterate over this time-series data
    • We set a start time and initialize our Q table
    • We compute the starting state S given the inputs
    • We select our action A
    • We observe our end state S' and our reward
    • We update Q based on the outcomes
  • Once we've generated our Q table, we test the policy, pi, against testing data
  • Finally, we repeat until convergence, i.e., until performance no longer improves
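The steps above can be sketched as a training loop. Everything concrete here is a stand-in: the toy environment, the reward values, and the hyperparameters are assumptions for illustration (exploration is also omitted for brevity):

```python
import numpy as np

n_states, n_actions = 10, 3
alpha, gamma = 0.2, 0.9          # learning rate and discount rate (assumed values)

Q = np.zeros((n_states, n_actions))   # initialize the Q table

def step(s, a):
    """Stand-in environment: returns the next state s' and a reward."""
    s_prime = (s + a) % n_states
    r = 1.0 if s_prime == 0 else -0.1
    return s_prime, r

for episode in range(100):            # iterate over the (toy) data
    s = 0                             # compute/initialize the start state
    for t in range(50):
        a = int(np.argmax(Q[s]))      # select action A (purely greedy here)
        s_prime, r = step(s, a)       # observe end state S' and reward
        # update Q based on the outcome
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime].max())
        s = s_prime
```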

A high-level overview of these concepts is provided below:


Update rule

One of the most important parts of Q-learning is the update rule. We can break it down into simpler terms: the new Q-value, Q'[s, a], equals the old Q-value times 1 minus alpha (the learning rate), plus alpha times the improved estimate, which is the immediate reward r plus gamma (the discount rate) times the Q-value of the best action a' available from the next state s'. Written out: Q'[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s', argmax over a' of Q[s', a']]). The writeup for the update rule equation is provided in the lecture slide below:
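A worked single update makes the blending concrete. All numbers here are made up for illustration:

```python
# One application of the Q-learning update rule (all values assumed).
alpha, gamma = 0.2, 0.9
q_old = 0.0          # current estimate Q[s, a]
r = 1.0              # observed immediate reward
q_next_best = 0.5    # max over a' of Q[s', a']

# Blend the old estimate with the improved estimate.
q_new = (1 - alpha) * q_old + alpha * (r + gamma * q_next_best)
print(q_new)  # ≈ 0.29
```

With a small alpha, each update nudges the old estimate only slightly toward the new evidence, which keeps learning stable on noisy rewards.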


Two finer points

Q-learning is successful when we explore, and we can do that by taking random actions. We choose a random action with probability c. When we begin learning, we set c fairly high, and as we learn more, we decay it so that our actions become less random over time.
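This exploration scheme can be sketched as follows; the starting probability and decay factor are assumed values, not from the lecture:

```python
import random

def choose_action(q_row, c, n_actions, rng=random):
    """With probability c take a random action; otherwise act greedily."""
    if rng.random() < c:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_row[a])

c = 0.9                      # start out mostly random (assumed value)
for episode in range(100):
    # ... inside training, select actions with choose_action(Q[s], c, n_actions) ...
    c *= 0.95                # decay c so later actions become less random
```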



For our stock trading Q-learner, the following are some actions that can be taken to transition to different states:

  • Buy
  • Sell
  • Nothing

Below is an example from the lectures:



Using daily returns as the reward signal allows the Q-learner to converge much faster than using cumulative return, because the learner gets feedback at every step rather than only at the end.
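One way to compute such a per-step reward is sketched below; the price series and the sign convention for positions are assumptions for illustration:

```python
# Hypothetical reward: the daily return realized while holding a position.
prices = [100.0, 101.0, 99.0, 102.0]   # made-up adjusted closes
position = +1                          # +1 long, -1 short, 0 flat (assumed convention)

def daily_return(t):
    """Fractional change in price from day t-1 to day t."""
    return prices[t] / prices[t - 1] - 1.0

# The learner would receive one of these each day as its reward.
rewards = [position * daily_return(t) for t in range(1, len(prices))]
print(rewards)
```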



Below are some good factors for state to be used during Q-learning:


Creating the state

In order to create the state, we have to translate each of our factors into a discrete integer. Some discretization algorithm handles this, and then we combine the factors into one number. Below is a slide from the lecture:



Below is a quick breakdown from the lecture on how to discretize our data:
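As one illustrative sketch of this idea: sort the historical values of a factor, record bin boundaries at regular intervals, map each new reading to a digit 0-9, and stack the digits into a single state number. The bin count, the factor readings, and the digit-stacking scheme here are assumptions, not taken from the slides:

```python
import numpy as np

def compute_thresholds(values, steps=10):
    """Sort the data and record a boundary every len(values)/steps elements."""
    values = np.sort(values)
    stepsize = len(values) // steps
    return np.array([values[(i + 1) * stepsize - 1] for i in range(steps - 1)])

def discretize(value, thresholds):
    """Map a real-valued factor reading to an integer bin 0..steps-1."""
    return int(np.searchsorted(thresholds, value))

# Made-up historical data for one factor, used to fit the thresholds.
rng = np.random.default_rng(1)
history = rng.normal(size=1000)
thr = compute_thresholds(history)

# Three made-up factor readings, each mapped to one digit of the state.
factors = [0.3, -1.2, 0.8]
digits = [discretize(f, thr) for f in factors]
state = sum(d * 10 ** i for i, d in enumerate(digits))   # e.g. a number 0..999
```

Stacking one digit per factor keeps the state space small (here at most 1,000 states for three factors), which is what makes a tabular Q-learner feasible.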