State-action-reward-state-action (SARSA) is an on-policy algorithm for solving reinforcement learning problems by learning a Markov decision process policy. In the current state, S, the agent takes an action, A, receives a reward, R, lands in the next state, S1, and then takes an action, A1, in S1. The tuple (S, A, R, S1, A1) is what gives the algorithm its acronym, SARSA.
What Is SARSA?
SARSA is an on-policy reinforcement learning algorithm used to learn a policy for a Markov decision process. At each step, the agent in the current state, S, takes an action, A, receives a reward, R, ends up in the next state, S1, and takes the next action, A1, in S1; in other words, every update uses the tuple (S, A, R, S1, A1).
It’s called an on-policy algorithm because it updates its Q-values using the actions the current policy actually takes, so the policy being learned is the same policy that generates the agent’s behavior.
SARSA vs Q-learning
The difference between the two algorithms is that SARSA chooses an action following the current policy and updates its Q-values using that action, whereas Q-learning updates its Q-values using the greedy action. A greedy action is the one that gives the maximum Q-value for the state, that is, the action the current greedy (estimated optimal) policy would take.
SARSA Algorithm
The algorithm for SARSA is a little bit different from Q-learning.
In SARSA, the Q-value is updated using the action, A1, actually performed in the next state, S1. In Q-learning, the action with the highest Q-value in the next state, S1, is used to update the Q-table, whether or not the agent actually takes that action.
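To make the difference concrete, here is a minimal sketch of the two update rules. It is not taken from the article’s own code; the Q-table size, the learning rate alpha and the discount factor gamma are illustrative placeholders:

import numpy as np

# Illustrative placeholders, not values from the article
Q = np.zeros((16, 4))      # Q-table: 16 states x 4 actions (a FrozenLake-sized grid)
alpha, gamma = 0.81, 0.96  # learning rate and discount factor

def sarsa_update(s, a, r, s2, a2):
    # SARSA: the target uses the action a2 actually taken in the next state s2
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(s, a, r, s2):
    # Q-learning: the target uses the greedy (maximum) Q-value in the next state s2
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2, :]) - Q[s, a])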
How to Use SARSA
Now, let’s look at the code for SARSA to solve the FrozenLake environment:
import gym
import numpy as np
import time, pickle, os

env = gym.make('FrozenLake-v0')

epsilon = 0.9
# min_epsilon = 0.1
# max_epsilon = 1.0
# decay_rate = 0.01

total_episodes = 10000
max_steps = 100

lr_rate = 0.81
gamma = 0.96

# Q-table: one row per state, one column per action
Q = np.zeros((env.observation_space.n, env.action_space.n))

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table
    action = 0
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state, :])
    return action

def learn(state, state2, reward, action, action2):
    # SARSA update: the target uses the action actually chosen in the next state
    predict = Q[state, action]
    target = reward + gamma * Q[state2, action2]
    Q[state, action] = Q[state, action] + lr_rate * (target - predict)

# Start
rewards = 0

for episode in range(total_episodes):
    t = 0
    state = env.reset()
    action = choose_action(state)

    while t < max_steps:
        env.render()

        state2, reward, done, info = env.step(action)
        action2 = choose_action(state2)

        learn(state, state2, reward, action, action2)

        state = state2
        action = action2

        t += 1
        rewards += reward

        if done:
            break

    # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    # os.system('clear')
    time.sleep(0.1)

print("Score over time: ", rewards / total_episodes)
print(Q)

with open("frozenLake_qTable_sarsa.pkl", 'wb') as f:
    pickle.dump(Q, f)
You’ll see that the code is similar to what’s used in Q-learning to solve the FrozenLake environment.
Now, let’s dissect it.
At the start of each episode, the environment is reset with env.reset() and an action is chosen for the initial state with choose_action(state).
In the SARSA tuple, we now have:
(State, Action)
Then, this action is taken in the environment with env.step(action), and the reward and next state are observed.
Now, the tuple has:
(State, Action, Reward, State1)
Next, an action is chosen for the next state using the choose_action(…) function. The choose_action(…) function uses the epsilon-greedy approach: with probability epsilon it samples a random action, and otherwise it takes the action with the highest Q-value for that state.
Now, the tuple is complete as:
(State, Action, Reward, State1, Action1)
Finally, the learn(…) function updates the Q-table using the following equation:

Q(S, A) ← Q(S, A) + α[R + γ · Q(S1, A1) − Q(S, A)]

Here α is the learning rate (lr_rate in the code) and γ is the discount factor (gamma).
In the SARSA update, the target uses Q(S1, A1): the next state, S1, and the action, A1, actually chosen in that next state. This stands in contrast to Q-learning, whose target instead uses the maximum Q-value over all actions in the next state, max over a of Q(S1, a).
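For comparison, a Q-learning version of the learn(…) function would need only the next state, not the next action. The sketch below is not part of the original script; it reuses the Q, gamma and lr_rate variables defined above:

def learn_q_learning(state, state2, reward, action):
    # Q-learning target: the greedy (maximum) Q-value in the next state
    predict = Q[state, action]
    target = reward + gamma * np.max(Q[state2, :])
    Q[state, action] = Q[state, action] + lr_rate * (target - predict)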
The rest of the code is similar to the Q-learning code.
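Once training finishes, the pickled Q-table can be reloaded and the learned policy run greedily, with no exploration. The snippet below is a sketch rather than part of the original article; it assumes the same FrozenLake-v0 environment and the file name used above:

with open("frozenLake_qTable_sarsa.pkl", 'rb') as f:
    Q = pickle.load(f)

state = env.reset()
done = False
while not done:
    action = np.argmax(Q[state, :])  # always take the greedy action
    state, reward, done, info = env.step(action)
    env.render()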

Try to tweak the different parameters to get better results.
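For example, enabling the commented-out epsilon-decay constants lets exploration shrink as training progresses, which often helps the final score. The schedule below is a common choice, not one prescribed by the article; it reuses the constants from the comments in the listing:

min_epsilon = 0.1
max_epsilon = 1.0
decay_rate = 0.01
epsilon = max_epsilon

for episode in range(total_episodes):
    # ... run the training episode exactly as before ...
    # then decay epsilon toward min_epsilon after each episode
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)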