State-action-reward-state-action (SARSA) is an on-policy algorithm that learns a Markov decision process policy and is used to solve reinforcement learning problems. In the current state, S, the agent takes an action, A, receives a reward, R, and lands in the next state, S1, where it then takes an action, A1. The tuple (S, A, R, S1, A1) gives the algorithm its acronym: SARSA.

## What Is SARSA?

SARSA is an on-policy algorithm used in reinforcement learning to train a Markov decision process model on a new policy. In the current state, S, the agent takes an action, A, receives a reward, R, and ends up in the next state, S1, where it takes an action, A1. In other words, each step produces the tuple (S, A, R, S1, A1).

It’s called an on-policy algorithm because it updates its Q-values using the action the current policy actually takes in the next state, so it evaluates and improves the same policy it uses to choose actions.

## SARSA vs Q-learning

The difference between the two algorithms is that SARSA updates its Q-values using the action chosen by the current policy in the next state, whereas Q-learning updates using the greedy action. A greedy action is the one that gives the maximum Q-value for the state, that is, the action an optimal policy would take.
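To make the contrast concrete, the two update rules can be written side by side in standard textbook form, with α as the learning rate, γ as the discount factor, and S' and A' as the next state and next action:

```latex
% SARSA (on-policy): the target uses the action A' actually selected in S'
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \, Q(S', A') - Q(S, A) \right]

% Q-learning (off-policy): the target uses the greedy action in S'
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]
```

The only difference is the bootstrapping term: SARSA plugs in the Q-value of the action the policy actually takes next, while Q-learning plugs in the highest Q-value available in the next state, regardless of what the agent actually does.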


## SARSA Algorithm

The algorithm for SARSA is a little bit different from Q-learning.

In SARSA, the Q-value is updated taking into account the action, A1, performed in the state, S1. In Q-learning, the action with the highest Q-value in the next state, S1, is used to update the Q-table.
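As a minimal sketch of that difference in code (the Q-table shape and the `state2`, `action2` and `reward` values below are made up purely for illustration, not taken from the FrozenLake program that follows):

```python
import numpy as np

gamma = 0.96                          # discount factor
Q = np.random.rand(16, 4)             # toy Q-table: 16 states x 4 actions
state2, action2, reward = 5, 2, 0.0   # hypothetical next state, next action and reward

# SARSA target: bootstrap from the action actually chosen in the next state
sarsa_target = reward + gamma * Q[state2, action2]

# Q-learning target: bootstrap from the greedy (highest-value) action in the next state
q_learning_target = reward + gamma * np.max(Q[state2, :])

print(sarsa_target, q_learning_target)
```

Unless `action2` happens to be the greedy action in `state2`, the two targets differ, which is why the two algorithms can learn different Q-tables on the same environment.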


## How to Use SARSA

Now, let’s look at the code for SARSA to solve the FrozenLake environment:

```python
import gym
import numpy as np
import time, pickle, os

env = gym.make('FrozenLake-v0')

epsilon = 0.9        # exploration rate for the epsilon-greedy policy
# min_epsilon = 0.1
# max_epsilon = 1.0
# decay_rate = 0.01

total_episodes = 10000
max_steps = 100

lr_rate = 0.81       # learning rate (alpha)
gamma = 0.96         # discount factor

# Q-table: one row per state, one column per action
Q = np.zeros((env.observation_space.n, env.action_space.n))

def choose_action(state):
    # Epsilon-greedy action selection
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()   # explore: random action
    else:
        action = np.argmax(Q[state, :])      # exploit: best-known action
    return action

def learn(state, state2, reward, action, action2):
    # SARSA update: the target bootstraps from the action actually taken in the next state
    predict = Q[state, action]
    target = reward + gamma * Q[state2, action2]
    Q[state, action] = Q[state, action] + lr_rate * (target - predict)

# Start
rewards = 0

for episode in range(total_episodes):
    t = 0
    state = env.reset()
    action = choose_action(state)

    while t < max_steps:
        env.render()

        state2, reward, done, info = env.step(action)

        action2 = choose_action(state2)

        learn(state, state2, reward, action, action2)

        state = state2
        action = action2

        t += 1
        rewards += reward   # accumulate the reward received this step

        if done:
            break

    # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    # os.system('clear')
    time.sleep(0.1)

print("Score over time: ", rewards / total_episodes)
print(Q)

with open("frozenLake_qTable_sarsa.pkl", 'wb') as f:
    pickle.dump(Q, f)
```

You’ll see that the code is similar to what’s used in Q-learning to solve the FrozenLake environment.

Now, let’s dissect it.

At the start of each episode, the environment is reset and an action is chosen for the initial state with `choose_action(state)`.

In the SARSA tuple, we now have:

``(State, Action)``

Then, this action is taken in the environment with `env.step(action)`, and the reward and next state are observed.

Now, the tuple has:

``(State, Action, Reward, State1)``

Next, an action is chosen for the next state using the `choose_action(…)` function.

The `choose_action(…)` function uses the epsilon-greedy approach: with probability epsilon it samples a random action to explore, and otherwise it picks the action with the highest Q-value for the current state.

Now, the tuple is complete as:

``(State, Action, Reward, State1, Action1)``

Finally, the `learn(…)` function updates the Q-table using the following equation:

``Q(S, A) ← Q(S, A) + α [R + γ Q(S1, A1) − Q(S, A)]``

Here, α is the learning rate (`lr_rate`) and γ is the discount factor (`gamma`). In the SARSA update, the bootstrapped Q-value is taken at S1 and A1, the next state and the action actually chosen in that state. This stands in contrast to Q-learning, whose update uses the maximum of Q(S1, a) over all actions a.

The rest of the code is similar to the Q-learning code.
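As a quick follow-up, here is a minimal sketch of how the saved Q-table could be played back greedily after training. It assumes the `frozenLake_qTable_sarsa.pkl` file written by the script above and the same classic Gym API in which `env.step(…)` returns four values, so it may need small adjustments for newer Gym or Gymnasium releases:

```python
import gym
import numpy as np
import pickle

# Load the Q-table saved by the SARSA training script above
with open("frozenLake_qTable_sarsa.pkl", "rb") as f:
    Q = pickle.load(f)

env = gym.make('FrozenLake-v0')

state = env.reset()
done = False
total_reward = 0

while not done:
    # Act greedily with respect to the learned Q-values (no exploration)
    action = np.argmax(Q[state, :])
    state, reward, done, info = env.step(action)
    total_reward += reward

print("Episode reward:", total_reward)
```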