
Reinforcement Learning with highway‑env and Gym: DQN for Autonomous Driving

This tutorial explains how to install the gym and highway‑env packages, configure a highway simulation environment, process observations and actions, build a DQN network in PyTorch, train the agent, and analyze training results for autonomous driving scenarios.

Python Programming Learning Circle

1. Install the Environment

The gym library provides a unified interface for reinforcement‑learning environments; install it with pip install gym. Then install the highway‑env package (a collection of driving scenarios) via pip install --user git+https://github.com/eleurent/highway-env. The package includes six scenarios: highway‑v0, merge‑v0, roundabout‑v0, parking‑v0, intersection‑v0, and racetrack‑v0. Documentation is available at https://highway-env.readthedocs.io/en/latest/.

2. Configure the Environment

After installation, the environment can be created in Python. The example below shows how to load the highway‑v0 scenario and reset it:

<code>import gym
import highway_env
%matplotlib inline  # only needed when running inside a Jupyter notebook

env = gym.make('highway-v0')
env.reset()
for _ in range(3):
    # look up the index of the IDLE meta-action and step the simulation
    action = env.action_type.actions_indexes["IDLE"]
    obs, reward, done, info = env.step(action)
    env.render()
</code>

The rendered window displays the ego vehicle (green) and surrounding traffic.

3. Data Processing

3.1 Observation (state)

highway‑env does not model sensors; all observations are generated directly from the simulator. Three observation types are supported:

Kinematics : a V × F matrix where V is the number of observed vehicles (including the ego vehicle) and F is the number of features (e.g., presence, x, y, vx, vy, cos_h, sin_h). Values are normalized to ranges such as [-100, 100] for positions and [-20, 20] for velocities.

Grayscale Image : a W × H grayscale image representing the scene.

Occupancy Grid : a W × H × F tensor describing the occupancy of each cell with multiple features.
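To make the Kinematics scaling concrete, the sketch below clips a raw feature value to its configured range and maps it to [-1, 1]. The helper name and the exact mapping are illustrative assumptions for this article, not highway‑env's internal implementation; the ranges mirror the configuration shown next.

```python
import numpy as np

# Illustrative ranges matching the Kinematics config below; highway-env's
# internal normalization may differ in detail.
FEATURES_RANGE = {"x": (-100, 100), "y": (-100, 100),
                  "vx": (-20, 20), "vy": (-20, 20)}

def normalize_feature(name, value):
    """Clip a raw feature to its range and rescale it to [-1, 1]."""
    lo, hi = FEATURES_RANGE[name]
    value = np.clip(value, lo, hi)
    return 2 * (value - lo) / (hi - lo) - 1

print(normalize_feature("x", 50.0))    # mid-positive position -> 0.5
print(normalize_feature("vx", -40.0))  # clipped to the lower bound -> -1.0
```

Normalizing to a fixed range keeps the network's inputs well scaled regardless of where the ego vehicle is on the road.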

Configuration for the Kinematics observation:

<code>config = {
    "observation": {
        "type": "Kinematics",
        "vehicles_count": 5,
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
        "features_range": {
            "x": [-100, 100],
            "y": [-100, 100],
            "vx": [-20, 20],
            "vy": [-20, 20]
        },
        "absolute": False,
        "order": "sorted"
    },
    "simulation_frequency": 8,   # Hz
    "policy_frequency": 2          # Hz
}
</code>

3.2 Action Space

The environment provides both continuous and discrete actions. The discrete set contains five meta‑actions:

<code>ACTIONS_ALL = {
    0: 'LANE_LEFT',
    1: 'IDLE',
    2: 'LANE_RIGHT',
    3: 'FASTER',
    4: 'SLOWER'
}
</code>
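The DQN described later selects among these five meta‑actions with an epsilon‑greedy rule. A minimal sketch of that rule follows; the function name and the Q‑values are placeholders for illustration, not the article's actual DQN class:

```python
import random

ACTIONS_ALL = {0: 'LANE_LEFT', 1: 'IDLE', 2: 'LANE_RIGHT',
               3: 'FASTER', 4: 'SLOWER'}

def choose_action(q_values, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS_ALL))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, 0.5, 0.2, 0.9, 0.0]  # hypothetical Q-values for one state
print(ACTIONS_ALL[choose_action(q, epsilon=0.0)])  # greedy choice -> 'FASTER'
```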

3.3 Reward Function

All scenarios except parking share the same reward function, which rewards driving fast and penalizes collisions. Its functional form can only be changed by editing the environment's source code; from outside, only the weighting coefficients can be adjusted through the configuration.
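A hedged sketch of that reward's general shape, as a weighted sum of a collision penalty, a right‑lane bonus, and a speed bonus. The coefficient names mirror highway‑env's config keys, but the exact formula and defaults should be checked against the library's source:

```python
def reward(crashed, lane_index, lane_count, speed, speed_min, speed_max,
           collision_reward=-1.0, right_lane_reward=0.1, high_speed_reward=0.4):
    """Weighted sum: collision penalty + right-lane bonus + speed bonus.

    Illustrative only -- highway-env's actual reward may differ in detail.
    """
    speed_term = (speed - speed_min) / (speed_max - speed_min)  # in [0, 1]
    lane_term = lane_index / max(lane_count - 1, 1)             # rightmost = 1
    return (collision_reward * crashed
            + right_lane_reward * lane_term
            + high_speed_reward * speed_term)

print(reward(False, 2, 3, 30, 20, 30))  # rightmost lane at top speed -> 0.5
print(reward(True, 0, 3, 20, 20, 30))   # crash at minimum speed -> -1.0
```

Tuning the three coefficients is exactly the kind of "adjusting the weighting factors" that external code is allowed to do.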

4. Build the Model

A DQN (Deep Q‑Network) is used. Because the Kinematics observation yields a small vector (5 vehicles × 7 features = 35 values), a simple fully‑connected network suffices:

<code>import torch
import torch.nn as nn

class DQNNet(nn.Module):
    def __init__(self):
        super(DQNNet, self).__init__()
        self.linear1 = nn.Linear(35, 35)
        self.linear2 = nn.Linear(35, 5)  # five discrete meta-actions
    def forward(self, s):
        s = torch.FloatTensor(s)         # accept raw numpy observations
        s = s.view(-1, 35)               # flatten the 5 x 7 observation matrix
        s = torch.relu(self.linear1(s))  # non-linearity between the layers
        return self.linear2(s)
</code>

The surrounding DQN class implements experience replay, epsilon‑greedy action selection, learning updates, and target‑network synchronization.
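A minimal sketch of one of those pieces, the experience replay buffer; the class and method names are illustrative, not the article's exact DQN implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_) transitions for experience replay."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted

    def push(self, s, a, r, s_):
        self.buffer.append((s, a, r, s_))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=100)
for i in range(150):                  # pushing past capacity drops oldest items
    memory.push(i, 0, 0.0, i + 1)
print(len(memory))                    # -> 100
```

Target‑network synchronization is typically a one‑liner in PyTorch, copying the policy network's weights with `target_net.load_state_dict(policy_net.state_dict())` every fixed number of updates.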

5. Training Loop

After configuring the environment with env.configure(config) , the training proceeds as follows:

<code>import time

import numpy as np
import matplotlib.pyplot as plt

dqn = DQN()
count = 0
reward = []          # per-step rewards for the current logging window
avg_reward = 0
all_reward = []
time_ = []           # episode durations for the current logging window
all_time = []
collision_his = []   # 1 if the episode ended in a crash, else 0
all_collision = []
while True:
    done = False
    start_time = time.time()
    s = env.reset()
    while not done:
        e = np.exp(-count / 300)  # decaying exploration rate
        a = dqn.choose_action(s, e)
        s_, r, done, info = env.step(a)
        env.render()
        dqn.push_memory(s, a, r, s_)
        if dqn.position != 0 and dqn.position % 99 == 0:
            loss_ = dqn.learn()
            count += 1
            print('trained times:', count)
            if count % 40 == 0:
                avg_reward = np.mean(reward)
                avg_time = np.mean(time_)
                collision_rate = np.mean(collision_his)
                all_reward.append(avg_reward)
                all_time.append(avg_time)
                all_collision.append(collision_rate)
                plt.plot(all_reward); plt.show()
                plt.plot(all_time); plt.show()
                plt.plot(all_collision); plt.show()
                reward = []
                time_ = []
                collision_his = []
        s = s_
        reward.append(r)
    end_time = time.time()
    episode_time = end_time - start_time
    time_.append(episode_time)
    is_collision = 1 if info['crashed'] else 0
    collision_his.append(is_collision)
</code>

During training, the script records average reward, episode duration, and collision rate every 40 updates and visualizes them with Matplotlib.
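The exploration schedule used in the loop, e = exp(-count/300), starts at 1 (fully random) and decays smoothly toward 0 (fully greedy) as training steps accumulate. A quick check of a few values:

```python
import numpy as np

def exploration_rate(count, tau=300):
    """Exponentially decaying epsilon, as used in the training loop above."""
    return np.exp(-count / tau)

for c in (0, 300, 900):
    print(c, round(float(exploration_rate(c)), 3))
# e(0) = 1.0; e(300) ~ 0.368; e(900) ~ 0.05
```

The time constant of 300 learning updates is the article's choice; a slower decay explores longer, a faster one commits to the learned policy sooner.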

6. Results and Observations

Training curves show that the average collision rate steadily decreases as the agent learns, while episode length tends to increase (episodes end early when a crash occurs). Reward values improve correspondingly, indicating that the DQN learns to drive safely in the abstract highway environment.

7. Conclusion

Compared with high‑fidelity simulators like CARLA, highway‑env offers a lightweight, game‑style abstraction that is convenient for rapid reinforcement‑learning experiments. It removes the need for sensor modeling and data acquisition, making it suitable for end‑to‑end algorithm prototyping, though it provides limited control over low‑level vehicle dynamics.

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
