State-action-reward-state-action (SARSA) is an on-policy reinforcement learning algorithm that teaches a machine learning model a Markov decision process policy. In the current state, S, the agent takes an action, A, receives a reward, R, lands in the next state, S1, and then takes action A1 in S1. The tuple (S, A, R, S1, A1) gives the algorithm its name.

What Is SARSA?

SARSA is an on-policy algorithm used in reinforcement learning to train a Markov decision process model on a new policy. In the current state, S, the agent takes an action, A, receives a reward, R, ends up in the next state, S1, and takes action A1 there; in other words, it works through the tuple (S, A, R, S1, A1).

It’s called an on-policy algorithm because it updates its Q-values using the action the current policy actually takes in the next state, rather than a hypothetical greedy action.

 

SARSA vs Q-learning

The difference between the two algorithms is that SARSA chooses an action following its current policy and updates its Q-values using that action, whereas Q-learning updates using the greedy action. A greedy action is the one that gives the maximum Q-value for the state, that is, it is greedy with respect to the current Q-value estimates.
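
To make the contrast concrete, here is a minimal sketch of the two update rules side by side. The table size and the values of alpha (learning rate) and gamma (discount factor) are illustrative assumptions, not values either algorithm prescribes:

import numpy as np

# Hypothetical Q-table and hyperparameters, for illustration only.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.81, 0.96

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the policy actually took in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap from the greedy (highest-value) action in s_next.
    target = r + gamma * np.max(Q[s_next, :])
    Q[s, a] += alpha * (target - Q[s, a])

The only difference is the bootstrap term: SARSA needs to know a_next, the action its own policy chose in the next state, while Q-learning only needs the next state.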

More on Machine Learning: Markov Chain Explained

 

SARSA Algorithm 

The algorithm for SARSA is a little bit different from Q-learning.

In SARSA, the Q-value is updated taking into account the action, A1, actually performed in the next state, S1. In Q-learning, the highest Q-value available in the next state, S1, is used to update the Q-table, regardless of which action is taken next.
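
For example, with the hyperparameters used in the code later in this article (learning rate 0.81, discount factor 0.96), suppose Q(S, A) = 0.5, the reward is 0 and Q(S1, A1) = 0.6 (hypothetical numbers chosen only for illustration). The SARSA target is 0 + 0.96 × 0.6 = 0.576, so the updated value is 0.5 + 0.81 × (0.576 − 0.5) ≈ 0.562. Q-learning would instead build its target from the highest Q-value available in S1, whichever action that corresponds to.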

A video tutorial on how SARSA works in machine learning. | Video: Pankaj Porwal.

More on Machine Learning:  How Does Backpropagation in a Neural Network Work?

 

How to Use SARSA 

Now, let’s look at the code for SARSA to solve the FrozenLake environment:

import gym
import numpy as np
import time, pickle, os

# Note: this uses the classic Gym API (gym < 0.26), in which 'FrozenLake-v0'
# exists, env.reset() returns only the state and env.step() returns four values.
env = gym.make('FrozenLake-v0')

epsilon = 0.9            # exploration rate for the epsilon-greedy policy
# min_epsilon = 0.1
# max_epsilon = 1.0
# decay_rate = 0.01

total_episodes = 10000
max_steps = 100

lr_rate = 0.81           # learning rate (alpha)
gamma = 0.96             # discount factor

# Q-table: one row per state, one column per action.
Q = np.zeros((env.observation_space.n, env.action_space.n))

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state, :])
    return action

def learn(state, state2, reward, action, action2):
    # SARSA update: bootstrap from the action actually taken in the next state.
    predict = Q[state, action]
    target = reward + gamma * Q[state2, action2]
    Q[state, action] = Q[state, action] + lr_rate * (target - predict)

# Start
rewards = 0

for episode in range(total_episodes):
    t = 0
    state = env.reset()
    action = choose_action(state)

    while t < max_steps:
        env.render()

        state2, reward, done, info = env.step(action)

        action2 = choose_action(state2)

        learn(state, state2, reward, action, action2)

        state = state2
        action = action2

        t += 1
        rewards += reward  # accumulate the reward earned on this step

        if done:
            break

        time.sleep(0.1)

    # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    # os.system('clear')

print("Score over time: ", rewards / total_episodes)
print(Q)

with open("frozenLake_qTable_sarsa.pkl", 'wb') as f:
    pickle.dump(Q, f)

You’ll see that the code is similar to what’s used in Q-learning to solve the FrozenLake environment. 

Now, let’s dissect it.

At the top of the episode loop, the environment is reset and the choose_action(…) function picks an action for the initial state.

In the SARSA tuple, we now have:

(State, Action)

Then, this action is taken in the environment with env.step(action), and the reward and next state are observed.

Now, the tuple has:

(State, Action, Reward, State1)

Next, an action is chosen for the next state using the choose_action(…) function.

The choose_action(…) function selects that action using the epsilon-greedy approach.
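
With epsilon set to 0.9, the agent explores a random action 90 percent of the time. The commented-out constants in the script hint at a common refinement: decaying epsilon after every episode so exploration shrinks as the Q-table improves. A minimal sketch of that schedule, built from those commented-out lines, looks like this:

import numpy as np

min_epsilon = 0.1
max_epsilon = 1.0
decay_rate = 0.01

def decayed_epsilon(episode):
    # Exponentially anneal epsilon from max_epsilon toward min_epsilon as
    # the episode index grows: explore heavily early on, exploit later.
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

Calling decayed_epsilon(episode) at the end of each episode and assigning the result back to epsilon gives exactly the schedule the commented lines describe.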

Now, the tuple is complete as:

           (State, Action, Reward, State1, Action1)

Finally, the learn(…) function updates the Q-table using the following equation:

Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]

Update equation for the Q-value in SARSA. | Image: Adesh Gautam

In the SARSA update equation, the bootstrap Q-value is taken at S’ and A’, the next state and the action actually chosen in that state. This stands in contrast to Q-learning, whose update takes the maximum of Q(S’, a) over all actions a.

The rest of the code is similar to the Q-learning code.

Agent in action. | Gif: Adesh Gautam

Try to tweak the different parameters to get better results. 
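
One way to do that systematically is a small grid search. The sketch below assumes you have refactored the training loop above into a hypothetical train(lr_rate, gamma, epsilon) function that runs all the episodes and returns the average score; that wrapper is not part of the original script:

best = None
for lr in (0.1, 0.5, 0.81):
    for g in (0.9, 0.96, 0.99):
        for eps in (0.1, 0.5, 0.9):
            score = train(lr, g, eps)  # hypothetical wrapper around the loop above
            if best is None or score > best[0]:
                best = (score, lr, g, eps)

print("Best (score, lr_rate, gamma, epsilon):", best)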
