Saturday, November 26, 2022

Reinforcement Learning and its Application

What is Reinforcement Learning?

Reinforcement learning is a machine learning technique that permits software agents and machines to automatically determine the ideal behaviour within a specific context in order to maximize performance. An agent takes actions in an environment, learns from that interaction, and adjusts its behaviour to maximize the total reward. RL is all about making decisions sequentially: the output depends on the state of the present input, and the next input depends on the output of the previous input.

Reinforcement learning (RL) differs from other machine learning approaches in that the algorithm is not explicitly told how to perform a task; instead, it works through the problem on its own.

In supervised machine learning, we have input and output in the form of labeled data that we use to train our system. In reinforcement learning, there is no such training data; the agent itself decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.

Reinforcement Learning in Artificial Intelligence

Reinforcement learning in AI consists of a collection of computational approaches that are primarily motivated by their potential for solving practical problems. RL is inspired by behaviorist psychology and is similar to how a child learns to perform a new task. An AI agent, which could be a self-driving car or a program playing chess, interacts with its environment and receives a reward depending on how it performs, such as driving to a destination safely or winning a game. Conversely, the agent receives a penalty for performing badly, such as going off the road or being checkmated.

Reinforcement Learning involves these simple steps:

  • Observe the given environment
  • Decide how to act using some strategy
  • Act accordingly
  • Receive a reward or penalty
  • Learn from past experience and refine the strategy
  • Iterate until an optimal strategy is found
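The loop above can be sketched in Python. `ToyEnv` here is a hypothetical stand-in environment with `reset`/`step` methods in the style of the Gym API; a real learner would update a policy in step 5 instead of acting at random:

```python
import random

class ToyEnv:
    """Hypothetical 1-D environment: the agent starts at position 0
    and the episode ends when it reaches position 3."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):             # action: -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        done = self.pos == 3
        reward = 1 if done else -1      # small penalty per step, reward at goal
        return self.pos, reward, done

env = ToyEnv()
state = env.reset()                     # 1. observe the environment
done, total_reward = False, 0
while not done:
    action = random.choice([-1, 1])     # 2-3. decide and act
    state, reward, done = env.step(action)  # 4. receive a reward or penalty
    total_reward += reward              # 5-6. a learner would refine its strategy here
```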

Advantages in AI

Over time, the AI agent learns to make decisions that maximize its reward and minimize its penalty, using dynamic programming. The advantage of this approach to artificial intelligence is that it allows an AI program to learn without a programmer spelling out how the agent should perform the task.

Reinforcement Learning Algorithm

There are several algorithms used in reinforcement learning. Here we discuss the Q-learning algorithm as an example.

Q-Learning Algorithm

Q-learning is one of the simplest forms of reinforcement learning. It uses action values, called Q-values, to improve the behaviour of the learning agent. Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state. The main elements of the Q-learning algorithm are:

Action values (Q-values)

Q(S, A) is an estimate of how good it is to take action A in state S, i.e. of how useful that action is for reaching the goal state.

Rewards and episodes

The agent makes a number of transitions from its current state to reach the goal state, based on the actions it takes while interacting with the environment. A reward or penalty is added after each action. When the agent reaches the goal state, it completes one episode.
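The rewards collected over an episode are usually combined into a single discounted return. A minimal sketch, with made-up per-step rewards and a discount factor of 0.9:

```python
# Illustrative only: per-step penalties followed by the goal reward
gamma = 0.9
rewards = [-1, -1, -1, 10]

# Discounted return: r0 + gamma*r1 + gamma^2*r2 + ...
episode_return = sum(gamma ** t * r for t, r in enumerate(rewards))
# -1 - 0.9 - 0.81 + 7.29 = 4.58
```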

Temporal difference

This rule is used to update the estimated Q-value after each action of the agent:

Q(S, A) ← Q(S, A) + α [R + γ max Q(S′, A′) − Q(S, A)]

where:

S: current state of the agent

A: action the agent takes

S′: next state the agent moves to

A′: next action, picked using the current Q-value estimates

R: current reward received by the agent

α: step size (learning rate) used to update the estimates

γ: discount factor for future rewards
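The temporal-difference rule can be sketched as a small function (a minimal sketch; the Q-table here is assumed to be a NumPy array indexed by state and action):

```python
import numpy as np

def td_update(q_table, s, a, r, s_next, alpha, gamma):
    """One Q-learning temporal-difference update:
    Q(S, A) <- (1 - alpha) * Q(S, A) + alpha * (R + gamma * max_a Q(S', a))."""
    old_value = q_table[s, a]
    next_max = np.max(q_table[s_next])
    q_table[s, a] = (1 - alpha) * old_value + alpha * (r + gamma * next_max)
    return q_table[s, a]

q = np.zeros((2, 2))   # toy table: 2 states, 2 actions
td_update(q, 0, 1, r=5.0, s_next=1, alpha=0.5, gamma=0.9)
```

With an all-zero table, the update moves Q(0, 1) halfway toward the reward of 5.0, giving 2.5.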

Choosing a suitable action using the ε-greedy approach

This approach chooses an action using the current Q-value estimates: most of the time it exploits what it has learned, and occasionally (with probability ε) it explores.

  • Exploit: choose the action with the highest Q-value
  • Explore: choose an action at random
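The ε-greedy choice above amounts to a few lines of Python (a sketch; `q_values` is the row of the Q-table for the current state):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action),
    otherwise exploit (action with the highest Q-value)."""
    if random.uniform(0, 1) < epsilon:
        return random.randrange(len(q_values))   # explore
    return int(np.argmax(q_values))              # exploit

action = epsilon_greedy(np.array([0.1, 0.8, 0.3]), epsilon=0.1)
```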

Reinforcement Learning Example

We have a robot agent and a goal state, with many obstacles in between. The agent is supposed to discover the best possible route to reach the goal.

The above figure shows the robot agent, a diamond, and fire. The goal of the robot is to get the prize, the diamond, while avoiding the obstacles, the fire. The robot learns by trying all the possible paths and then picking the route that yields the reward with the fewest hurdles. Each correct step gives the robot agent a reward, and each wrong step deducts from its reward. The total reward is calculated when it reaches the final state, the diamond.

Main points in Reinforcement learning

  • The input is an initial state from which the agent starts.
  • There are many possible outputs, as there are multiple solutions to a particular problem.
  • Training is based on the input: the model returns a state, and the user decides whether to reward or punish the agent based on its output. The agent keeps learning.
  • The best solution is decided based on the maximum reward.

Reinforcement learning Applications

Reinforcement learning has been applied successfully in a number of areas, producing some notable practical applications:

  • In industrial automation, robot control systems are developed using reinforcement learning.
  • RL is used to solve combinatorial search problems such as computer game playing.
  • Machine learning and data processing.
  • RL can be used to create training systems that provide custom instruction and materials tailored to students' needs.
  • Energy consumption optimization.
  • Video games.

Reinforcement learning Python implementation using Q-learning

In this implementation, we teach a bot to reach its goal using the Q-learning technique. The main steps in Python are given below.

Step 1: Importing the necessary libraries in Python

import numpy as np
import pylab as pl
import networkx as nx
# The steps below also assume a Gym-style environment named `env`, e.g.:
# import gym
# env = gym.make("Taxi-v3")

Step 2: Initialize the Q-table

q_table = np.zeros([env.observation_space.n, env.action_space.n])

Step 3: Create the training algorithm that will update this Q-table 

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1      # learning rate
gamma = 0.6      # discount factor
epsilon = 0.1    # exploration rate

all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()   # Explore action space
        else:
            action = np.argmax(q_table[state])   # Exploit learned values
        next_state, reward, done, info = env.step(action)
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        # Temporal-difference update of the Q-value
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value
        if reward == -10:
            penalties += 1
        state = next_state
        epochs += 1
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Output:

Episode: 100000
Training finished.
Wall time: 30.6 s

Step 4: The Q-table has been built over 100,000 episodes; the learned Q-values for a single state look like this:

 array([ -2.30108105,  -1.97092096,  -2.30357004,  -2.20591839,
        -10.3607344 ,  -8.5583017 ]) 

Step 5: Evaluating the agent

In this step we evaluate the performance of our trained agent. It now always exploits the learned Q-values, so no exploration is needed.

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)
        if reward == -10:
            penalties += 1
        epochs += 1
    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")


Results after 100 episodes:

Average timesteps per episode: 12.3

Average penalties per episode: 0.0

Using this code, we can apply reinforcement learning and measure the performance of our agent.
