Adapting Soft Actor‑Critic for Discrete Action Spaces in Deep Reinforcement Learning
This article explains how to modify the Soft Actor‑Critic (SAC) algorithm—originally designed for continuous actions—to work with discrete action environments, presents the required changes to the actor and critic loss functions, provides a full PyTorch implementation, and evaluates the method on the CartPole‑v1 benchmark.
Introduction
Since its introduction in 2018, Soft Actor‑Critic (SAC) [2] has become one of the most popular deep reinforcement learning (DRL) algorithms, yet most publications assume a continuous action space. This article describes the adjustments needed to apply SAC to environments with discrete actions, following the 2019 paper "Soft Actor‑Critic for Discrete Action Settings" [1].
SAC Overview
SAC is an actor‑critic method that combines policy optimization with Q‑learning. The critic is trained by minimizing a Bellman‑based loss, while the actor is trained by minimizing a loss whose optimum corresponds to maximizing expected return. The key innovation of SAC is entropy regularization, which encourages stochastic policies and helps the agent avoid poor local optima.
Entropy measures the randomness of the policy distribution; adding it to the reward balances exploration against exploitation. Because the SAC policy is stochastic by construction, its entropy can be computed directly from the action distribution.
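As a minimal sketch (not taken from the original article), this is how the entropy of a categorical policy produced by a softmax actor can be computed in PyTorch; the tensor values and shapes are illustrative:

import torch

# Illustrative example: a batch of action probabilities from a softmax policy,
# shape (batch_size, num_actions)
action_probabilities = torch.tensor([[0.7, 0.3],
                                     [0.5, 0.5]])

# Entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s), one value per state
log_probabilities = torch.log(action_probabilities)
entropy = -(action_probabilities * log_probabilities).sum(dim=1)
print(entropy)  # the uniform distribution (second row) has the higher entropy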
Key Equations with Entropy Regularization
The actor loss for a continuous policy is

J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ \mathbb{E}_{a_t \sim \pi_\phi}\left[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \right] \right]

where D is the replay buffer and \alpha is the temperature that weights the entropy bonus.
The critic loss uses a Bellman target that includes the entropy term:

J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2} \left( Q_\theta(s_t, a_t) - \left( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\theta}}(s_{t+1}) \right] \right) \right)^2 \right]

with the soft state value V_{\bar{\theta}}(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi_\phi}\left[ Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \right].
Adapting SAC to Discrete Actions
Two main changes are required:
The Q‑function must output a vector of Q‑values, one per discrete action, instead of a scalar.
The policy now outputs a probability vector (a categorical distribution) rather than mean and variance parameters.
Consequently the actor loss becomes

J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ \pi_\phi(s_t)^{T} \left[ \alpha \log \pi_\phi(s_t) - Q_\theta(s_t) \right] \right]

where \pi_\phi(s_t) and Q_\theta(s_t) are now vectors with one entry per action, so the expectation over actions can be computed exactly rather than estimated by sampling,
and the critic loss keeps the same Bellman form but uses the discrete soft‑state value, again computed exactly over the actions:

V(s_t) = \pi_\phi(s_t)^{T} \left[ Q_{\bar{\theta}}(s_t) - \alpha \log \pi_\phi(s_t) \right]
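As a rough sketch of how these two expressions translate into PyTorch (function and tensor names below are illustrative, not copied from the original implementation), both expectations reduce to row‑wise dot products over the action dimension:

import torch

def discrete_actor_loss(action_probabilities, q_values, alpha):
    # action_probabilities, q_values: tensors of shape (batch_size, num_actions)
    log_probabilities = torch.log(action_probabilities + 1e-8)
    # Exact expectation over actions: pi(s)^T [ alpha * log pi(s) - Q(s) ]
    inside_term = alpha * log_probabilities - q_values
    return (action_probabilities * inside_term).sum(dim=1).mean()

def discrete_soft_state_value(action_probabilities, q_values, alpha):
    # V(s) = pi(s)^T [ Q(s) - alpha * log pi(s) ], computed per state
    log_probabilities = torch.log(action_probabilities + 1e-8)
    return (action_probabilities * (q_values - alpha * log_probabilities)).sum(dim=1)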
The temperature (α) loss is adjusted to the discrete case in the same way, with the expectation over actions computed directly from the probability vector:

J(\alpha) = \mathbb{E}_{s_t \sim D}\left[ \pi_t(s_t)^{T} \left[ -\alpha \left( \log \pi_t(s_t) + \bar{H} \right) \right] \right]

where \bar{H} is the target entropy.
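A minimal sketch of this temperature update, assuming a learnable log_alpha parameter and a fixed target entropy (names chosen for illustration, not the article's exact code):

import torch

def temperature_loss(action_probabilities, log_alpha, target_entropy):
    # Exact expectation over the categorical distribution:
    # J(alpha) = pi(s)^T [ -alpha * (log pi(s) + target_entropy) ]
    log_probabilities = torch.log(action_probabilities + 1e-8)
    inside_term = -log_alpha.exp() * (log_probabilities + target_entropy)
    return (action_probabilities * inside_term).sum(dim=1).mean()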
Python Implementation
The implementation uses PyTorch and OpenAI Gym. The main class Discrete_SAC_Agent creates two critic networks, two target networks, an actor network with a softmax output, and a learnable temperature parameter.
import numpy as np
import torch

# Hyperparameters
ALPHA_INITIAL = 1.0
REPLAY_BUFFER_BATCH_SIZE = 100
DISCOUNT_RATE = 0.99
LEARNING_RATE = 1e-4
SOFT_UPDATE_INTERPOLATION_FACTOR = 0.01

# Network and ReplayBuffer are helper classes defined elsewhere in the implementation
class Discrete_SAC_Agent:
    def __init__(self, environment):
        self.environment = environment
        self.state_dim = environment.observation_space.shape[0]
        self.action_dim = environment.action_space.n
        # Initialise networks: two critics (and their targets) that output one
        # Q-value per action, and an actor that outputs action probabilities
        self.critic_local = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.critic_local2 = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.critic_target = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.critic_target2 = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.actor_local = Network(input_dimension=self.state_dim, output_dimension=self.action_dim,
                                   output_activation=torch.nn.Softmax(dim=1))
        # Optimisers
        self.critic_optimiser = torch.optim.Adam(self.critic_local.parameters(), lr=LEARNING_RATE)
        self.critic_optimiser2 = torch.optim.Adam(self.critic_local2.parameters(), lr=LEARNING_RATE)
        self.actor_optimiser = torch.optim.Adam(self.actor_local.parameters(), lr=LEARNING_RATE)
        # Replay buffer
        self.replay_buffer = ReplayBuffer(self.environment)
        # Temperature: the target entropy is a fraction of the maximum entropy
        # of a uniform distribution over the discrete actions
        self.target_entropy = 0.98 * -np.log(1.0 / self.environment.action_space.n)
        self.log_alpha = torch.tensor(np.log(ALPHA_INITIAL), requires_grad=True)
        self.alpha = self.log_alpha.exp()
        self.alpha_optimiser = torch.optim.Adam([self.log_alpha], lr=LEARNING_RATE)

Key helper methods include soft_update_target_networks (Polyak averaging), train_on_transition (stores a transition and triggers a training step once enough samples are available), critic_loss, actor_loss, and temperature_loss. The agent selects actions deterministically (argmax) during evaluation and stochastically (sampling from the softmax distribution) during training.
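The following is a rough sketch of what the Polyak‑averaging update and the action‑selection logic described above might look like as methods of Discrete_SAC_Agent; it is written to match the class layout shown here rather than copied from the original source:

    def soft_update_target_networks(self, tau=SOFT_UPDATE_INTERPOLATION_FACTOR):
        # Polyak averaging: the target critics slowly track the local critics
        for target_net, local_net in [(self.critic_target, self.critic_local),
                                      (self.critic_target2, self.critic_local2)]:
            for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
                target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

    def get_next_action(self, state, evaluation_episode=False):
        # The actor outputs a probability vector over the discrete actions
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        action_probabilities = self.actor_local(state_tensor).squeeze(0)
        if evaluation_episode:
            # Deterministic: pick the most probable action
            return torch.argmax(action_probabilities).item()
        # Stochastic: sample from the categorical distribution
        return torch.distributions.Categorical(action_probabilities).sample().item()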
Training and Evaluation on CartPole‑v1
The algorithm is tested on the OpenAI Gym CartPole‑v1 environment, which has two discrete actions (move left or right). The training script runs five independent runs, each with 400 episodes (maximum 200 steps per episode). Every fourth episode is an evaluation episode where the agent acts deterministically and the episode reward is recorded.
# Training loop (simplified)
RUNS = 5
EPISODES_PER_RUN = 400
STEPS_PER_EPISODE = 200
TRAINING_EVALUATION_RATIO = 4

agent_results = []
for run in range(RUNS):
    agent = Discrete_SAC_Agent(env)
    run_results = []
    for episode_number in range(EPISODES_PER_RUN):
        # Every fourth episode is a deterministic evaluation episode
        evaluation_episode = (episode_number % TRAINING_EVALUATION_RATIO == 0)
        state = env.reset()
        done = False
        episode_reward = 0
        step = 0
        while not done and step < STEPS_PER_EPISODE:
            action = agent.get_next_action(state, evaluation_episode)
            next_state, reward, done, _ = env.step(action)
            if not evaluation_episode:
                agent.train_on_transition(state, action, next_state, reward, done)
            else:
                episode_reward += reward
            state = next_state
            step += 1
        if evaluation_episode:
            run_results.append(episode_reward)
    agent_results.append(run_results)

Results show that after roughly 50 episodes the agent begins to reach the maximum score of 200, but training remains somewhat unstable, with occasional score drops until the end of training.
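As an illustrative follow‑up (not part of the original listing), the recorded evaluation scores could be averaged across the five runs and plotted as follows:

import numpy as np
import matplotlib.pyplot as plt

results = np.array(agent_results)              # shape: (RUNS, number of evaluation episodes)
mean_scores = results.mean(axis=0)             # average evaluation reward across runs
evaluation_episodes = np.arange(len(mean_scores)) * TRAINING_EVALUATION_RATIO

plt.plot(evaluation_episodes, mean_scores)
plt.xlabel("Training episode")
plt.ylabel("Mean evaluation reward (5 runs)")
plt.title("Discrete SAC on CartPole-v1")
plt.show()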
References
[1] Petros Christodoulou, "Soft Actor‑Critic for Discrete Action Settings", arXiv:1910.07207, 2019. https://arxiv.org/abs/1910.07207
[2] Tuomas Haarnoja et al., "Soft Actor‑Critic: Off‑Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", arXiv:1801.01290, 2018. https://arxiv.org/abs/1801.01290