Reinforcement Q-Learning Car


I. Motivation

I consider data science one of the sexiest and most fascinating subjects: not only for the possibility of telling a story from data, but also for the ability to bring academic research into a corporate culture. For example, with the help of data we can build recommendation engines, predictive models, risk evaluation systems, and so forth, and apply the results to real-world demands. There is so much that people can learn from data once they acquire data-analytics skills. I started to ask myself: since human beings can learn from data, can non-human agents learn from data as well? Beginning with this idea, I decided to build a car that can learn a driving path by itself.

II. Pipeline

The whole project comprises three components (Figure 1):

  • Training Field
  • Car Model
  • Learning Algorithm


Figure 1. Components of the project


III. Dependencies & CodeBase Overview

Python Dependencies

The project was developed in a Python environment. Five key Python modules were imported:

  • numpy
  • matplotlib
  • sklearn
  • keras
  • pygame

The first three modules were used for fundamental mathematical calculations, such as distance metrics, collision detection, and angular displacement. The "keras" module, running on the "Tensorflow" backend, supported the neural network algorithm. The "pygame" module was used to create the training field and the car object, providing the data for learning.

CodeBase Overview

The project comprises 9 codebases:

Name  Type              Description
      Execution Script  Main entry point into the program from a terminal
      Execution Script  Creates charts and statistics
      Class             A helper file containing basic geometry functions to compute distances, detect collisions, and so forth
      Class             Simulates 4 square obstacles that can be instantiated to follow a fixed circular path
      Class             Simulates multiple square obstacles that can be instantiated to move in random directions at non-stationary speeds
      Class             A base implementation of table-based Q-learning
      Class             An implementation of table-based Q-learning that adds epsilon to perform random selections
      Class             An implementation of neural-net-based Q-learning
      Class             Creates a car with a set of sensors

Chart 1. Overview of codebases

IV. Key Components


a. Training Field

The final design of the training field has two different types of obstacles.


Figure 1. Display of the training field

  • Obstacles with fixed moving direction

There are 4 larger squares orbiting around the center of the training field. These squares move at a constant angular speed in a fixed pattern.


  • Obstacles with random moving direction

There are 20 smaller squares moving in random directions at varying speeds. They move back toward the middle of the field after getting too close to its edge.


At first, the training field had only stationary obstacles. The car learned very quickly, and its survival time exceeded 3 hours after 30,000 trials. However, it could not be claimed at that point that the car had successfully learned to drive by itself, because its driving pattern resembled a Möbius strip, looping around the stationary obstacles again and again. This driving pattern was considered overfitted: even though the survival time was impressive, such a monotonous pattern was far from qualifying the car for a more complex training field. Therefore, obstacles with a different type of movement were added to the training field as one of the methods to overcome overfitting.


b. Car

The simulated car has a circular shape, as shown in Figure 2(a). To transform the environment near the car into numbers and store them in a Python dictionary, ten sensors were used to detect the distance from the car to any object; five of them were front sensors and the rest were rear sensors, mimicking a driver's view of road conditions in a real-world car. The sensor points provide a visual indication when the sensors receive a signal from nearby obstacles. The car can act in five patterns, as shown in Figure 2(b):

  1. Forward
  2. Right Forward
  3. Left Forward
  4. Right Rotation
  5. Left Rotation


(a) Sensor Input  

(b) Motor Output

Figure 2. The simulated car

In a nutshell, the learning rules of the car can be summarized as follows:

  • The car moves around the training field frame by frame.
  • Whenever the car hits the periphery or an obstacle, it "crashes" and returns to its starting point.
  • For learning purposes, the car has three actions: Turn Left, Turn Right, and Go Straight.
  • 10 sensors stretch out to measure the distance between the car and obstacles, making up the state space.


c. Learning Algorithm

The method I used here to train the car is reinforcement learning. Reinforcement learning is a term borrowed from the study of animal learning, and is also related to the theory of learning automata in control engineering. It is a learning technique based on trial and error. The core of reinforcement learning is that a complex task can be solved through repeated interaction with the environment (Figure 3).


Figure 3. Reinforcement learning system

Generally speaking, a basic reinforcement learning model is comprised of:

  1. a set of environment and agent states S
  2. a set of actions A of the agent
  3. policies of transitioning from states to actions
  4. rules that determine the scalar immediate reward of a transition
  5. rules that describe what the agent observes

The learning system receives state information about the environment by means of the sensors on the car, and a reasoning process uses that state information to determine the action to take in the given state. After the action is executed, the learning system receives a reinforcement signal from the environment indicating the consequences of its action. The goal of a reinforcement learning agent is to collect as much reward as possible, so as to maximize a numerical cumulative reward signal.
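The perceive-act-reinforce cycle just described can be sketched as a generic loop; the `env` and `agent` interfaces below are hypothetical placeholders, not the project's actual classes:

```python
# Generic reinforcement learning episode (hypothetical env/agent interfaces).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                 # initial sensor reading
    total_reward = 0
    for _ in range(max_steps):
        action = agent.choose_action(state)             # reasoning process
        next_state, reward, done = env.step(action)     # consequence of the action
        agent.learn(state, action, reward, next_state)  # reinforcement signal
        total_reward += reward
        state = next_state
        if done:                                        # e.g. the car crashed
            break
    return total_reward
```

The loop ends either when the episode terminates (a crash) or after a step budget, and the cumulative reward is what the agent tries to maximize.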

V. Q-Learning

Q-learning is an example of a reinforcement learning technique used to train robots to develop an optimal strategy for solving a task. It uses an action-value function that calculates the expected utility of performing a specific action in a discrete state and following the optimal policy thereafter. A table is used to store the utility data as the car drives in the training field.


Figure 4. Structure of Q-learning

Q-learning is a model-free reinforcement learning technique, consisting of an agent, a set of states S, and a set of actions per state A. Executing an action in a specific state provides the agent with a reward. Each time the agent performs an action A in state S at time t, it receives a reward r, which represents how close it is to solving the task. The utility of performing this action is then updated according to the update rule:

Q(S, A)    ←    Q(S, A) + α * [r + γ * max_A' Q(S', A') – Q(S, A)]

or, equivalently,

Q(S, A)    ←    (1 – α) * Q(S, A) + α * [r + γ * max_A' Q(S', A')]


  • S is the current state
  • A is the action taken in S
  • S' is the next state
  • A' is the action that maximizes Q in S'
  • r is the reward
  • α is the learning rate
  • γ is the discount factor.
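As a minimal sketch, the update rule translates directly into code; here the Q-table is a plain dictionary keyed by (state, action) pairs, and all names are illustrative:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q(S,A) <- Q(S,A) + alpha * [r + gamma * max_A' Q(S',A') - Q(S,A)].
    Missing table entries default to 0.0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
```

Note that the two forms of the rule above are algebraically identical; this function implements the first.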

Descriptive Information

  • State (S and S')

The sensors on the car were designed to obtain a representation of the current state the car is in. The state is therefore a summary of the environment the car is perceiving, represented as a 4-tuple (w, x, y, z). The variable w is associated with the sensor signals: if the sum of the readings from the left sensors is the same as that from the right sensors, w is given a value of 0; if the right sensors receive a greater sum of readings, w is set to 1; and w is set to 2 if the left sensors receive a greater sum. The variable x is a binary variable that monitors proximity to an obstacle in front of the car: if the front sensors receive any signal, x is set to 1. The same rule applies to y, except that it corresponds to the rear sensors. The final variable, z, is a binary variable that is activated if the sensors around the car sense nothing during a time step. The entire state space is therefore made of 3 x 2 x 2 x 2 combinations of the percepts, or 24 unique states (Chart 2).

Percept                     0              1             2
Sensor readings (w)         Right == Left  Right > Left  Right < Left
Obstacle in front (x)       No             Yes
Obstacle at rear (y)        No             Yes
Sensors sense nothing (z)   No             Yes

Chart 2. The different percepts that are used to define a state
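A sketch of how the 4-tuple (w, x, y, z) could be computed from raw sensor distances; the near-distance threshold and the exact grouping of sensors into left/right/rear lists are assumptions rather than the project's actual code:

```python
def encode_state(front_left, front_right, rear, near_threshold=30):
    """Map raw sensor distances to the discrete state (w, x, y, z).

    front_left / front_right: distance readings from the left/right front
    sensors; rear: readings from the rear sensors.  A reading below
    near_threshold counts as a 'signal' (an obstacle is close).
    """
    left_sum, right_sum = sum(front_left), sum(front_right)
    if right_sum == left_sum:
        w = 0                  # both sides read the same
    elif right_sum > left_sum:
        w = 1                  # right sensors read a greater sum
    else:
        w = 2                  # left sensors read a greater sum
    x = int(any(d < near_threshold for d in front_left + front_right))
    y = int(any(d < near_threshold for d in rear))
    z = int(x == 0 and y == 0)  # sensors sense nothing this time step
    return (w, x, y, z)
```

With this encoding, the returned tuple can index the Q-table directly.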

  • Action (A and A')

At every time step, the car can observe the reinforcement value (r) it is awarded for performing a specific action in its current state. To follow a completely optimal strategy, the car needs to choose the action with the maximum Q-value in any state. Once the Q-value function has converged to the true payoffs for each state-action pair (Q(S, A)), the policy the car follows is optimal.

The actions the car can take in any given state are represented as a single variable (P). The total action space is shown in Chart 3 below.

Action Car Commands (P)
0 Forward
1 Left
2 Right

Chart 3. The entire action space for the car

To maintain a balance between exploration and exploitation, and to allow the car to learn effectively during the early learning stage, an "epsilon" parameter is introduced, known as the epsilon-greedy policy. Every time the car makes a decision, a random number between 0 and 1 is generated and compared with ε: if it is less than ε, the car makes a random decision (exploration); otherwise, it chooses the action with the maximum Q-value (exploitation). The value of ε ranges from 0 to 1, and the greater it is, the more randomness is injected into the Q-learning agent's decision-making.
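The epsilon-greedy rule can be sketched as follows; Q is assumed to be a dictionary keyed by (state, action) pairs, with missing entries treated as 0:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick an action: explore with probability epsilon, else exploit max-Q."""
    if random.random() < epsilon:
        return random.choice(actions)                              # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploitation
```

In practice ε is often decayed over trials, so the agent explores heavily at first and exploits more as the table converges.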

  • Reward (r)

The reward scheme is the core of the entire Q-learning algorithm. To realize reinforcement learning, it is crucial that the car can determine the approximate value of an action immediately after performing it in a given state. Since the goal of the car is to survive in the training field as long as possible, it makes sense to reward the car for receiving no signal from its sensors. If the sensors sense nothing, the car is given a reward of 100; if the sensors sense anything, the car is given a reward of -20; and if the car hits anything, it is given a reward of -500 as a punishment.
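The reward scheme maps directly onto a small function (a sketch of the scheme described above):

```python
def reward(sensed_anything, crashed):
    """Reward scheme: +100 when clear, -20 when sensing, -500 on a crash."""
    if crashed:
        return -500
    return -20 if sensed_anything else 100
```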

  • Learning rate (α)

The learning rate determines to what extent newly acquired information overrides old information. A value of 0 makes the agent learn nothing, while a value of 1 makes the agent consider only the most recent information.

  • Discount factor (γ)

The discount factor determines the importance of future rewards. A value of 0 makes the agent short-sighted by considering only current rewards, while a value approaching 1 makes it strive for long-term high reward.


5.1 Table Based Q-Learning

The table-based Q-learning uses a Q-table to store the Q-values. For the environment of our training field, 24 states x 3 actions result in a table with 72 entries. The learning procedure for the car when using the Q-table is shown below.
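The 72-entry table can be initialised as a Python dictionary keyed by (state, action) pairs; this is a sketch consistent with the state and action encodings above, not the project's exact code:

```python
from itertools import product

# 24 discrete states (w, x, y, z): w in {0, 1, 2}; x, y, z in {0, 1}.
states = list(product(range(3), range(2), range(2), range(2)))
actions = [0, 1, 2]          # 0 = Forward, 1 = Left, 2 = Right

# 24 states x 3 actions = 72 zero-initialised Q-table entries.
Q = {(s, a): 0.0 for s in states for a in actions}
```

Each simulation frame then looks up the current state, selects an action, and applies the update rule to the corresponding entry.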


Figure 5. The action-perception loop for table based Q-learning


5.2 Neural Network Based Q-Learning

The Q-table implementation of Q-learning is simple to apply. However, as the state-action space becomes more complicated, training the agent with a table-based approach becomes very time-consuming, since the memory needed to store and update the table grows tremendously. To overcome the complexity of the state-action space, Q-learning can instead use a neural network as a function approximator for the Q-values.

The neural net uses interconnected nodes to relay signals across the structure, from an input source through to an output. Each connection between nodes has a weight parameter that affects what values the nodes receive. The network is trained by comparing its output to a target for a given set of inputs and generating an error value. This error is then used to update the weights of the connections in the net. The learning procedure for the car when using the neural network is shown below.


Figure 6. The action-perception loop for ANN based Q-learning

A neural network with 8 input nodes (6 for the state representation and 2 for the motor actions), 4 hidden nodes, and 1 output node was built, forming a network structure with one hidden layer. The network accepts a state and an action as input and outputs the approximated Q-value for that state-action pair (Q(S, A)).
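As an illustration of such a function approximator, here is a plain-numpy stand-in for the 8-4-1 network (the project used Keras; the sigmoid activation, random initialisation, and learning rate here are assumptions, not the project's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# 8 inputs -> 4 hidden (sigmoid) -> 1 linear output, matching the 8-4-1 layout.
W1 = rng.normal(scale=0.1, size=(8, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.1, size=(4, 1))
b2 = np.zeros(1)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x):
    """Approximate Q(S, A) for an 8-dim state-action input vector."""
    h = sigmoid(x @ W1 + b1)
    return (h @ W2 + b2)[0], h

def train_step(x, q_target, lr=0.1):
    """One gradient step on the squared error between output and Q-target."""
    global W1, b1, W2, b2
    q, h = forward(x)
    err = q - q_target                      # dLoss/dq for 0.5 * (q - target)^2
    dh = err * W2[:, 0] * h * (1.0 - h)     # backprop through the sigmoid layer
    W2 = W2 - lr * np.outer(h, [err])
    b2 = b2 - lr * err
    W1 = W1 - lr * np.outer(x, dh)
    b1 = b1 - lr * dh
    return 0.5 * err ** 2                   # current loss, for monitoring
```

Repeatedly calling `train_step` with inputs and their Q-targets plays the role of training the Keras model on (state-action, target) pairs.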

Qtarget(S, A) is generated according to the update rule above and used to train the net, as shown in Figure 7.


Figure 7. Neural net layout for approximating the desired Q-values

VI. Results & Conclusions

The goal of this project is for the car to learn a driving strategy that avoids obstacles and survives as long as possible. To achieve this, the car was allowed to drive around the training field until it appeared to have learned a successful approach to the goal.

To compare the learning performance of table-based and neural-network-based Q-learning, I used survival time as the metric for how well the car had learned to drive in the training field. The pygame demo was limited to 60 frames per second, so surviving one second corresponds to sixty frames (1 sec = 60 frames).



(a) Q-Table based           10000 trials [Left]          30000 trials [Right]


(b) Neural Network based          10000 trials [Left]          30000 trials [Right]

Figure 8. Survival time of different Q-learning

Q-Learning                    Table Based  Table Based  Neural Network Based  Neural Network Based
Num. of Trials                10000        30000        10000                 30000
Mean                          201.9742     758.1608     148.2468              158.3816
Standard Dev.                 168.4427     887.1238     98.7804               94.5821
Max. Num. of Frames           2105         11693        1358                  1386
Max. Survival Time (in sec.)  ~35          ~195         ~23                   ~23

Chart 4. Descriptive statistics information

It can be observed from Figure 8 that Q-learning provides an effective technique for training the car: a positive correlation between the number of trials and survival time can be inferred. However, the neural-network-based Q-learning, though it improved over time, did not improve at the same rate as the table-based Q-learning. On the other hand, although the table-based approach achieves a much longer survival time than the neural-network-based approach, the car's performance under it is far less stable. Therefore, table-based Q-learning is acceptable as a learning approach for a self-learning car for demonstration purposes, while from a real-world perspective the neural-network-based approach is more reliable.

VII. Demonstration



VIII. Limitations & Future Directions

There are many limitations in this project, due to either technical difficulties or computational constraints. The limitations and corresponding future directions are the following:

  • Limitation: The driving speed of the car was constant, which prevented any emergency response once the car sensed it was surrounded by obstacles.
    Future direction: Enable the car to drive at different speeds depending on the distance between the car and the obstacles, the number of obstacles around the car, and the direction of the obstacles.

  • Limitation: The reward for sensing obstacles was a constant value, so the same state value was passed no matter how close the car was to the obstacles, leading to the same action value.
    Future direction: Same as above.

  • Limitation: The sensors were categorized into only two groups, front and rear, even though there were 10 sensors around the car.
    Future direction: Categorize the sensors into 10 groups in order to better represent the rich environment.

  • Limitation: The structure of the neural network is too naive to be competent in learning the rich environment.
    Future direction: Build more complex neural networks and compare their results using a uniform metric (i.e., loss).

  • Limitation: This is a 2-D self-learning car.
    Future direction: From a 3-D perspective, an image-processing framework is needed; OpenCV and CNNs, along with hardware such as a Raspberry Pi, can be used to develop a 3-D self-learning car.







Boyang Dai