What is Reinforcement Learning?
Reinforcement learning is a machine learning technique in which a software agent learns, by trial and error, which actions to take in a specific context in order to maximize a cumulative reward. The agent takes actions, interacts with an environment, learns from that interaction, and adjusts its behavior to maximize the total reward it collects. RL is about making decisions sequentially: in simple words, the output depends on the state of the present input, and the next input depends on the output of the previous one.
Reinforcement learning (RL) differs from other machine learning approaches in that the algorithm is not explicitly told how to perform a task; it has to work through the problem on its own. RL has been applied successfully in a number of areas and has produced practical applications.
In the supervised machine learning setting, we have input and output in the form of labeled data that we use to train our system. In reinforcement learning there is no such training data; the agent itself decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
Reinforcement Learning and its Application in Artificial Intelligence
Reinforcement learning in AI consists of a collection of computational approaches that are primarily motivated by their potential for solving practical problems. RL is inspired by behaviorist psychology and resembles the way a child learns to perform a new task. An AI agent, which could be a self-driving car or a program playing chess, interacts with its environment and receives a reward when it performs well, such as driving to its destination safely or winning a game. Conversely, the agent receives a penalty for performing badly, such as going off the road or being checkmated.
Reinforcement Learning involves these simple steps:
- Observe the given environment
- Decide how to act using some strategy
- Act accordingly
- Receive a reward or penalty
- Learn from past experience and refine the strategy
- Iterate until an optimal strategy is found
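The steps above can be sketched as a minimal interaction loop. The `ToyEnv` below is a made-up environment invented purely for illustration, and the agent acts at random rather than learning, so only the observe–act–reward cycle is shown:

```python
import random

# A made-up toy environment: the agent walks on the integer line and the
# goal is position 5. Reaching the goal pays off; every other step costs a little.
class ToyEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):            # action is -1 or +1
        self.state += action
        done = (self.state == 5)
        reward = 1 if done else -0.1   # reward on success, small penalty otherwise
        return self.state, reward, done

env = ToyEnv()
total_reward = 0.0
for _ in range(1000):                  # cap the episode length
    # 1. observe the environment
    state = env.state
    # 2-3. decide how to act (here: purely at random) and act
    action = random.choice([-1, 1])
    # 4. receive a reward or penalty
    state, reward, done = env.step(action)
    total_reward += reward
    # 5-6. a learning agent would now refine its strategy and iterate
    if done:
        break
```

A real RL agent would replace the random choice with a learned policy that is updated from the observed rewards.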

https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/
Advantages in AI
Over time, the AI agent makes decisions that maximize its reward and minimize its penalty, using techniques such as dynamic programming. The advantage of this approach to artificial intelligence is that it allows an AI program to learn without a programmer spelling out how the agent should perform the task.
Reinforcement Learning Algorithm
Several different algorithms are used in reinforcement learning. Here we discuss the Q-learning algorithm to build understanding.
Q-Learning Algorithm
Q-learning is one of the simplest forms of reinforcement learning. It uses action values (Q-values) to improve the behavior of the learning agent. Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state. The main components of the Q-learning algorithm are:
Action values (Q-values)
The value Q(S, A), defined over states S and actions A, estimates how good it is to take action A in state S, i.e. the total reward the agent can expect to collect from there on its way to the goal state.
Rewards and episodes:
The agent makes a number of transitions from its current state toward the goal state. These transitions depend on the actions the agent takes while interacting with the environment. A reward or penalty is added after each action. When the agent reaches the goal state, it completes one episode.
Temporal difference
The temporal-difference rule is used to update the estimate of the Q-value after each action the agent takes:

Q(S, A) ← Q(S, A) + α [R + γ · max over A' of Q(S', A') − Q(S, A)]

where:
S: the current state of the agent
A: the action the agent takes
S': the next state the agent moves to
A': the next action, picked using the current Q-values
R: the current reward received by the agent
α: the learning rate (step size) used to update the estimates
γ: the discount factor for future rewards
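The temporal-difference update can be sketched as a small function. The table sizes, state indices, and reward values below are toy numbers chosen only for illustration:

```python
import numpy as np

def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.6):
    """One temporal-difference (Q-learning) update:
    Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_a' Q(S', a') - Q(S, A))."""
    td_target = r + gamma * np.max(q_table[s_next])  # best value reachable from S'
    td_error = td_target - q_table[s, a]             # how wrong the current estimate is
    q_table[s, a] += alpha * td_error                # move the estimate by a step of size alpha
    return q_table

q = np.zeros((3, 2))                        # 3 states, 2 actions (toy sizes)
q = q_update(q, s=0, a=1, r=1.0, s_next=1)  # q[0, 1] becomes alpha * r = 0.1
```

Because Q(S') is still zero here, the update reduces to α·R, which makes the step size easy to verify by hand.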
Choosing a suitable action with the ε-greedy approach
This approach chooses the action using the Q-value estimates:
- With probability 1 − ε, choose the action with the highest Q-value
- With probability ε, choose an action at random
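The ε-greedy selection rule can be sketched as follows (the example Q-values are made up for illustration):

```python
import random
import numpy as np

def epsilon_greedy(q_row, epsilon=0.1):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest Q-value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))  # explore
    return int(np.argmax(q_row))             # exploit

q_row = np.array([0.2, 0.9, 0.1])            # Q-values for one state (toy numbers)
action = epsilon_greedy(q_row, epsilon=0.0)  # epsilon = 0 -> always greedy -> action 1
```

A small ε keeps the agent occasionally trying non-greedy actions, which prevents it from locking onto an early, suboptimal estimate.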
https://www.geeksforgeeks.org/q-learning-in-python/
Reinforcement Learning Example:
We have a robot agent and a goal state, with many obstacles in between. The agent is supposed to discover the best possible route to the goal.

In this example there is a robot, a diamond, and fire. The robot's goal is to get the prize, the diamond, while avoiding the obstacles, the fire. The robot learns by trying all the possible paths and then picking the route that yields the reward with the fewest hurdles. Each correct step gives the robot a reward, and each incorrect step subtracts from its reward. The total reward is calculated when the robot reaches the final prize, the diamond.
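This reward structure can be sketched on a tiny made-up grid; the coordinates, reward values, and candidate paths below are all invented for illustration:

```python
# A hypothetical 2x3 grid: the robot starts at (0, 0), the diamond is at
# (1, 2), and there is fire at (0, 2). Each step costs -1, stepping into
# the fire costs an extra -10, and reaching the diamond pays +10.
FIRE, DIAMOND = (0, 2), (1, 2)

def path_reward(path):
    """Total reward collected along a sequence of visited cells."""
    total = 0
    for cell in path:
        total -= 1               # every step has a small cost
        if cell == FIRE:
            total -= 10          # penalty for the obstacle
        if cell == DIAMOND:
            total += 10          # prize for the goal
    return total

risky = [(0, 1), (0, 2), (1, 2)]   # goes straight through the fire
safe = [(1, 0), (1, 1), (1, 2)]    # goes around the fire
# path_reward(risky) = -3, path_reward(safe) = 7, so the safe route wins
```

Comparing total rewards like this is exactly what the learning agent does implicitly: routes through the fire accumulate penalties, so the Q-values along them stay low.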
Main points in Reinforcement learning
Input:
The input is an initial state from which the agent will start.
Output:
There are many possible outputs, as there are multiple solutions to a particular problem.
Training:
The training of the agent is based on the input: the model returns a state, and the user decides whether to reward or punish the agent based purely on its output. The agent keeps learning continuously.
Solution:
The best result is decided based on the maximum reward.
https://www.geeksforgeeks.org/what-is-reinforcement-learning/
Reinforcement Learning and its Application
Reinforcement learning has been applied successfully in a number of areas, producing practical applications:
- In industrial automation, robotic control systems are developed using reinforcement learning.
- RL is used to solve combinatorial search problems such as computer game playing.
- Machine learning and data processing.
- RL can be used to create training systems that provide custom instruction and materials according to students' requirements.
- Energy consumption optimization.
- Video games.
Reinforcement learning python implementation using Q-learning
In this implementation, we try to teach a bot to reach its goal using the Q-learning technique. The main steps, in Python, are given below.
Step 1: Importing the required libraries
import numpy as np
import gym  # the code below assumes an OpenAI Gym environment, as in the tutorial linked above

env = gym.make("Taxi-v3")  # e.g. the Taxi environment used in that tutorial
Step 2: Initialize the Q-table
q_table = np.zeros([env.observation_space.n, env.action_space.n])
Step 3: Create the training algorithm that will update this Q-table
import random
from IPython.display import clear_output

alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration rate

all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        # Temporal-difference update of Q(state, action)
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
Output:
Episode: 100000
Training finished.

Wall time: 30.6 s
Step 4: The Q-table after 100,000 training episodes
q_table[328]
array([ -2.30108105,  -1.97092096,  -2.30357004,  -2.20591839,
       -10.3607344 ,  -8.5583017 ])
Step 5: Evaluating the agent
In this step we evaluate the performance of our agent.
total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")
Results:
Results after 100 episodes:
Average timesteps per episode: 12.3
Average penalties per episode: 0.0
Using this code, we can apply reinforcement learning and measure the performance of our agent.
https://www.geeksforgeeks.org/what-is-reinforcement-learning/
https://www.techopedia.com/definition/32055/reinforcement-learning