Reinforcement Learning with highway‑env and Gym: DQN for Autonomous Driving
This tutorial explains how to install the gym and highway‑env packages, configure a highway simulation environment, process observations and actions, build a DQN network in PyTorch, train the agent, and analyze training results for autonomous driving scenarios.
1. Install the Environment
The gym library provides a unified interface for reinforcement‑learning environments; install it with pip install gym. Then install the highway‑env package (a collection of driving scenarios) via pip install --user git+https://github.com/eleurent/highway-env. The package includes six scenarios: highway‑v0, merge‑v0, roundabout‑v0, parking‑v0, intersection‑v0, and racetrack‑v0. Documentation is available at https://highway-env.readthedocs.io/en/latest/.
2. Configure the Environment
After installation, the environment can be created in Python. The example below shows how to load the highway‑v0 scenario and reset it:
<code>import gym
import highway_env
%matplotlib inline
env = gym.make('highway-v0')
env.reset()
for _ in range(3):
    action = env.action_type.actions_indexes["IDLE"]
    obs, reward, done, info = env.step(action)
    env.render()
</code>The rendered window displays the ego vehicle (green) and surrounding traffic.
3. Data Processing
3.1 Observation (state)
highway‑env does not model sensors; all observations are generated directly from the simulator. Three observation types are supported:
Kinematics : a V × F matrix where V is the number of observed vehicles (including the ego vehicle) and F is the number of features (e.g., presence, x, y, vx, vy, cos_h, sin_h). Values are normalized to ranges such as [-100, 100] for positions and [-20, 20] for velocities.
Grayscale Image : a W × H grayscale image representing the scene.
Occupancy Grid : a W × H × F tensor describing the occupancy of each cell with multiple features.
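The normalization applied to Kinematics features is a plain linear map from each feature's configured range onto [-1, 1]. A minimal sketch (the name lmap mirrors the helper in highway‑env's utilities, but this standalone version is for illustration only):

```python
def lmap(v, x, y):
    """Linearly map value v from interval x = [x0, x1] to interval y = [y0, y1]."""
    return y[0] + (v - x[0]) * (y[1] - y[0]) / (x[1] - x[0])

# A raw longitudinal velocity of 10 m/s, with features_range vx = [-20, 20],
# lands at 0.5 in the normalized interval [-1, 1]:
vx_norm = lmap(10.0, [-20, 20], [-1, 1])   # 0.5
```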
Configuration for the Kinematics observation:
<code>config = {
"observation": {
"type": "Kinematics",
"vehicles_count": 5,
"features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
"features_range": {
"x": [-100, 100],
"y": [-100, 100],
"vx": [-20, 20],
"vy": [-20, 20]
},
"absolute": False,
"order": "sorted"
},
"simulation_frequency": 8, # Hz
"policy_frequency": 2 # Hz
}
</code>3.2 Action Space
The environment provides both continuous and discrete actions. The discrete set contains five meta‑actions:
<code>ACTIONS_ALL = {
    0: 'LANE_LEFT',
    1: 'IDLE',
    2: 'LANE_RIGHT',
    3: 'FASTER',
    4: 'SLOWER'
}
</code>3.3 Reward Function
All scenarios except parking share the same reward function (presented in the original article as an image, not reproduced here). The function itself can only be modified in the package source code; user code can only adjust its weighting coefficients.
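For reference, the reward documented for highway‑env's driving scenarios has roughly this form, where v is the ego vehicle's forward speed and a, b are the weighting coefficients mentioned above (this paraphrases the package documentation, not the image from the original article):

```latex
R(s, a) = a \cdot \frac{v - v_{\min}}{v_{\max} - v_{\min}} - b \cdot \text{collision}
```

The first term rewards driving fast, the second penalizes crashes; tuning a and b trades speed against safety.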
4. Build the Model
A DQN (Deep Q‑Network) is used. Because the Kinematics observation yields a small vector (5 vehicles × 7 features = 35 values), a simple fully‑connected network suffices:
<code>import torch
import torch.nn as nn
class DQNNet(nn.Module):
def __init__(self):
super(DQNNet, self).__init__()
self.linear1 = nn.Linear(35, 35)
self.linear2 = nn.Linear(35, 5) # five discrete actions
def forward(self, s):
s = torch.FloatTensor(s)
s = s.view(s.size(0), 1, 35)
s = self.linear1(s)
s = self.linear2(s)
return s
</code>The surrounding DQN class implements experience replay, epsilon‑greedy action selection, learning updates, and target‑network synchronization.
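The DQN class itself is not shown in the article. A minimal sketch of two of its pieces, the ring-buffer replay memory and epsilon‑greedy selection (the names push, position, and choose_action echo the training loop, but the signatures and capacity here are assumptions; the Q‑network lookup is stubbed out as a plain list of Q‑values):

```python
import random

class ReplayMemory:
    def __init__(self, capacity=2000):
        self.capacity = capacity
        self.memory = []
        self.position = 0   # write index; the training loop uses it to trigger learning

    def push(self, s, a, r, s_):
        # Grow until full, then overwrite the oldest transition (ring buffer)
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = (s, a, r, s_)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

def choose_action(q_values, epsilon, n_actions=5):
    # With probability epsilon pick a random action, otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_values[a])
```

Target‑network synchronization then amounts to periodically copying the online network's weights into a frozen copy used for the bootstrap targets.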
5. Training Loop
After configuring the environment with env.configure(config), the training loop proceeds as follows:
<code>import time
import numpy as np
import matplotlib.pyplot as plt

dqn = DQN()
count = 0
reward = []
avg_reward = 0
all_reward = []
time_ = []
all_time = []
collision_his = []
all_collision = []

while True:
    done = False
    start_time = time.time()
    s = env.reset()
    while not done:
        e = np.exp(-count / 300)   # decreasing exploration rate
        a = dqn.choose_action(s, e)
        s_, r, done, info = env.step(a)
        env.render()
        dqn.push_memory(s, a, r, s_)

        # Run a learning step every 100 stored transitions
        if dqn.position != 0 and dqn.position % 99 == 0:
            loss_ = dqn.learn()
            count += 1
            print('trained times:', count)
            # Every 40 learning steps, summarize and plot recent statistics
            if count % 40 == 0:
                avg_reward = np.mean(reward)
                avg_time = np.mean(time_)
                collision_rate = np.mean(collision_his)
                all_reward.append(avg_reward)
                all_time.append(avg_time)
                all_collision.append(collision_rate)
                plt.plot(all_reward); plt.show()
                plt.plot(all_time); plt.show()
                plt.plot(all_collision); plt.show()
                reward = []
                time_ = []
                collision_his = []

        s = s_
        reward.append(r)

    # Episode finished: record its duration and whether it ended in a crash
    end_time = time.time()
    episode_time = end_time - start_time
    time_.append(episode_time)
    is_collision = 1 if info['crashed'] else 0
    collision_his.append(is_collision)
</code>During training, the script records average reward, episode duration, and collision rate every 40 updates and visualizes them with Matplotlib.
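The exploration schedule in the loop, e = exp(-count/300), starts at 1.0 (fully random) and decays smoothly toward greedy behavior as learning steps accumulate. A quick sanity check of its values:

```python
import math

def epsilon(count, tau=300):
    # Exponentially decaying exploration rate, as used in the training loop
    return math.exp(-count / tau)

print(epsilon(0))              # 1.0
print(round(epsilon(300), 4))  # 0.3679 (i.e. 1/e after 300 learning steps)
```

The time constant tau = 300 comes from the loop above; larger values keep the agent exploring longer.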
6. Results and Observations
Training curves show that the average collision rate steadily decreases as the agent learns, while episode length tends to increase (episodes end early when a crash occurs). Reward values improve correspondingly, indicating that the DQN learns to drive safely in the abstract highway environment.
7. Conclusion
Compared with high‑fidelity simulators like CARLA, highway‑env offers a lightweight, game‑style abstraction that is convenient for rapid reinforcement‑learning experiments. It removes the need for sensor modeling and data acquisition, making it suitable for end‑to‑end algorithm prototyping, though it provides limited control over low‑level vehicle dynamics.