Adapting Soft Actor‑Critic for Discrete Action Spaces in Deep Reinforcement Learning
This article explains how to modify the Soft Actor‑Critic (SAC) algorithm—originally designed for continuous actions—to work with discrete action environments, presents the required changes to the actor and critic loss functions, provides a full PyTorch implementation, and evaluates the method on the CartPole‑v1 benchmark.
Introduction
Since its introduction in 2018, Soft Actor‑Critic (SAC) [2] has become one of the most popular deep reinforcement learning (DRL) algorithms, yet most publications assume a continuous action space. This article describes the adjustments needed to apply SAC to environments with discrete actions, following the 2019 paper "Soft Actor‑Critic for Discrete Action Settings" [1].
SAC Overview
SAC is an actor‑critic method that combines policy optimization with Q‑learning. The critic is trained by minimizing a Bellman‑based loss, while the actor is trained by minimizing a loss whose optimum corresponds to maximizing expected return. The key innovation of SAC is entropy regularization, which encourages stochastic policies and helps the agent avoid poor local optima.
Entropy measures the randomness of the policy distribution; adding it to the reward balances exploration against exploitation. Because the SAC policy is stochastic by construction, its entropy can be computed directly from the action distribution.
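As a minimal sketch (not taken from the original article), this is how the entropy of a categorical policy produced by a softmax actor can be computed in PyTorch; the tensor values and shapes are illustrative:

import torch

# Illustrative example: a batch of action probabilities from a softmax policy,
# shape (batch_size, num_actions)
action_probabilities = torch.tensor([[0.7, 0.3],
                                     [0.5, 0.5]])

# Entropy H(pi(.|s)) = -sum_a pi(a|s) * log pi(a|s), one value per state
log_probabilities = torch.log(action_probabilities)
entropy = -(action_probabilities * log_probabilities).sum(dim=1)
print(entropy)  # the uniform distribution (second row) has the higher entropy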
Key Equations with Entropy Regularization
The actor loss for a continuous policy is

J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ \mathbb{E}_{a_t \sim \pi_\phi}\left[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \right] \right]

where D is the replay buffer and \alpha is the temperature that weights the entropy bonus.
The critic loss uses a Bellman target that includes the entropy term:

J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2} \left( Q_\theta(s_t, a_t) - \left( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\theta}}(s_{t+1}) \right] \right) \right)^2 \right]

with the soft state value V_{\bar{\theta}}(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim \pi_\phi}\left[ Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \right].
Adapting SAC to Discrete Actions
Two main changes are required:
The Q‑function must output a vector of Q‑values, one per discrete action, instead of a scalar.
The policy now outputs a probability vector (a categorical distribution) rather than mean and variance parameters.
Consequently the actor loss becomes

J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ \pi_\phi(s_t)^{T} \left[ \alpha \log \pi_\phi(s_t) - Q_\theta(s_t) \right] \right]

where \pi_\phi(s_t) and Q_\theta(s_t) are now vectors with one entry per action, so the expectation over actions can be computed exactly rather than estimated by sampling,
and the critic loss keeps the same Bellman form but uses the discrete soft‑state value, again computed exactly over the actions:

V(s_t) = \pi_\phi(s_t)^{T} \left[ Q_{\bar{\theta}}(s_t) - \alpha \log \pi_\phi(s_t) \right]
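As a rough sketch of how these two expressions translate into PyTorch (function and tensor names below are illustrative, not copied from the original implementation), both expectations reduce to row‑wise dot products over the action dimension:

import torch

def discrete_actor_loss(action_probabilities, q_values, alpha):
    # action_probabilities, q_values: tensors of shape (batch_size, num_actions)
    log_probabilities = torch.log(action_probabilities + 1e-8)
    # Exact expectation over actions: pi(s)^T [ alpha * log pi(s) - Q(s) ]
    inside_term = alpha * log_probabilities - q_values
    return (action_probabilities * inside_term).sum(dim=1).mean()

def discrete_soft_state_value(action_probabilities, q_values, alpha):
    # V(s) = pi(s)^T [ Q(s) - alpha * log pi(s) ], computed per state
    log_probabilities = torch.log(action_probabilities + 1e-8)
    return (action_probabilities * (q_values - alpha * log_probabilities)).sum(dim=1)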
The temperature (α) loss is adjusted to the discrete case in the same way, with the expectation over actions computed directly from the probability vector:

J(\alpha) = \mathbb{E}_{s_t \sim D}\left[ \pi_t(s_t)^{T} \left[ -\alpha \left( \log \pi_t(s_t) + \bar{H} \right) \right] \right]

where \bar{H} is the target entropy.
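A minimal sketch of this temperature update, assuming a learnable log_alpha parameter and a fixed target entropy (names chosen for illustration, not the article's exact code):

import torch

def temperature_loss(action_probabilities, log_alpha, target_entropy):
    # Exact expectation over the categorical distribution:
    # J(alpha) = pi(s)^T [ -alpha * (log pi(s) + target_entropy) ]
    log_probabilities = torch.log(action_probabilities + 1e-8)
    inside_term = -log_alpha.exp() * (log_probabilities + target_entropy)
    return (action_probabilities * inside_term).sum(dim=1).mean()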
Python Implementation
The implementation uses PyTorch and OpenAI Gym. The main class Discrete_SAC_Agent creates two critic networks, two target networks, an actor network with a softmax output, and a learnable temperature parameter.
import numpy as np
import torch

# Hyperparameters
ALPHA_INITIAL = 1.0
REPLAY_BUFFER_BATCH_SIZE = 100
DISCOUNT_RATE = 0.99
LEARNING_RATE = 1e-4
SOFT_UPDATE_INTERPOLATION_FACTOR = 0.01

# Network and ReplayBuffer are helper classes defined elsewhere in the implementation
class Discrete_SAC_Agent:
    def __init__(self, environment):
        self.environment = environment
        self.state_dim = environment.observation_space.shape[0]
        self.action_dim = environment.action_space.n
        # Initialise networks: two critics (and their targets) that output one
        # Q-value per action, and an actor that outputs action probabilities
        self.critic_local = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.critic_local2 = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.critic_target = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.critic_target2 = Network(input_dimension=self.state_dim, output_dimension=self.action_dim)
        self.actor_local = Network(input_dimension=self.state_dim, output_dimension=self.action_dim,
                                   output_activation=torch.nn.Softmax(dim=1))
        # Optimisers
        self.critic_optimiser = torch.optim.Adam(self.critic_local.parameters(), lr=LEARNING_RATE)
        self.critic_optimiser2 = torch.optim.Adam(self.critic_local2.parameters(), lr=LEARNING_RATE)
        self.actor_optimiser = torch.optim.Adam(self.actor_local.parameters(), lr=LEARNING_RATE)
        # Replay buffer
        self.replay_buffer = ReplayBuffer(self.environment)
        # Temperature: the target entropy is a fraction of the maximum entropy
        # of a uniform distribution over the discrete actions
        self.target_entropy = 0.98 * -np.log(1.0 / self.environment.action_space.n)
        self.log_alpha = torch.tensor(np.log(ALPHA_INITIAL), requires_grad=True)
        self.alpha = self.log_alpha.exp()
        self.alpha_optimiser = torch.optim.Adam([self.log_alpha], lr=LEARNING_RATE)

Key helper methods include soft_update_target_networks (Polyak averaging), train_on_transition (stores a transition and triggers a training step once enough samples are available), critic_loss, actor_loss, and temperature_loss. The agent selects actions deterministically (argmax) during evaluation and stochastically (sampling from the softmax distribution) during training.
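The following is a rough sketch of what the Polyak‑averaging update and the action‑selection logic described above might look like as methods of Discrete_SAC_Agent; it is written to match the class layout shown here rather than copied from the original source:

    def soft_update_target_networks(self, tau=SOFT_UPDATE_INTERPOLATION_FACTOR):
        # Polyak averaging: the target critics slowly track the local critics
        for target_net, local_net in [(self.critic_target, self.critic_local),
                                      (self.critic_target2, self.critic_local2)]:
            for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
                target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

    def get_next_action(self, state, evaluation_episode=False):
        # The actor outputs a probability vector over the discrete actions
        state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        action_probabilities = self.actor_local(state_tensor).squeeze(0)
        if evaluation_episode:
            # Deterministic: pick the most probable action
            return torch.argmax(action_probabilities).item()
        # Stochastic: sample from the categorical distribution
        return torch.distributions.Categorical(action_probabilities).sample().item()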
Training and Evaluation on CartPole‑v1
The algorithm is tested on the OpenAI Gym CartPole‑v1 environment, which has two discrete actions (move left or right). The training script runs five independent runs, each with 400 episodes (maximum 200 steps per episode). Every fourth episode is an evaluation episode where the agent acts deterministically and the episode reward is recorded.
# Training loop (simplified)
RUNS = 5
EPISODES_PER_RUN = 400
STEPS_PER_EPISODE = 200
TRAINING_EVALUATION_RATIO = 4

agent_results = []
for run in range(RUNS):
    agent = Discrete_SAC_Agent(env)
    run_results = []
    for episode_number in range(EPISODES_PER_RUN):
        # Every fourth episode is a deterministic evaluation episode
        evaluation_episode = (episode_number % TRAINING_EVALUATION_RATIO == 0)
        state = env.reset()
        done = False
        episode_reward = 0
        step = 0
        while not done and step < STEPS_PER_EPISODE:
            action = agent.get_next_action(state, evaluation_episode)
            next_state, reward, done, _ = env.step(action)
            if not evaluation_episode:
                agent.train_on_transition(state, action, next_state, reward, done)
            else:
                episode_reward += reward
            state = next_state
            step += 1
        if evaluation_episode:
            run_results.append(episode_reward)
    agent_results.append(run_results)

Results show that after roughly 50 episodes the agent begins to reach the maximum score of 200, but training remains somewhat unstable, with occasional score drops until the end of training.
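As an illustrative follow‑up (not part of the original listing), the recorded evaluation scores could be averaged across the five runs and plotted as follows:

import numpy as np
import matplotlib.pyplot as plt

results = np.array(agent_results)              # shape: (RUNS, number of evaluation episodes)
mean_scores = results.mean(axis=0)             # average evaluation reward across runs
evaluation_episodes = np.arange(len(mean_scores)) * TRAINING_EVALUATION_RATIO

plt.plot(evaluation_episodes, mean_scores)
plt.xlabel("Training episode")
plt.ylabel("Mean evaluation reward (5 runs)")
plt.title("Discrete SAC on CartPole-v1")
plt.show()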
References
[1] Petros Christodoulou, "Soft Actor‑Critic for Discrete Action Settings", arXiv:1910.07207, 2019. https://arxiv.org/abs/1910.07207
[2] Tuomas Haarnoja et al., "Soft Actor‑Critic: Off‑Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", arXiv:1801.01290, 2018. https://arxiv.org/abs/1801.01290