Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough

This article walks through the fundamentals of reinforcement learning, builds a custom drone‑landing simulation, defines state and action spaces, designs reward functions, implements a neural‑network policy with Bernoulli sampling, and trains it using REINFORCE with baseline techniques, while exposing common pitfalls such as reward‑cheating.

Data Party THU

Reinforcement learning (RL) differs from supervised learning by letting an agent discover good behavior through trial‑and‑error feedback instead of being shown correct examples. The core components are the agent (actor), environment (world), policy (action‑selection rule), state (observable snapshot), action, and reward (feedback).

1. Delivery‑Drone Game

The author created a simple 2‑D game where a virtual delivery drone must land on a platform. Successful landing requires (1) horizontal alignment within ±0.0625 units, (2) speed below 0.3, (3) tilt angle under 20°, and (4) correct altitude. The full source code and a runnable GitHub repository are provided.
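A rough illustration of these four conditions as a single check (the thresholds come from the text above, but the helper function and its argument names are my own sketch, not the repository's code):

```python
def is_successful_landing(drone_x, platform_x, speed, tilt_deg, on_platform):
    """Check the four landing conditions described above."""
    aligned = abs(drone_x - platform_x) <= 0.0625  # (1) horizontal alignment
    slow = speed < 0.3                             # (2) touchdown speed below 0.3
    upright = abs(tilt_deg) < 20.0                 # (3) tilt angle under 20 degrees
    on_pad = on_platform                           # (4) correct altitude: touching the pad
    return aligned and slow and upright and on_pad
```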

Game screenshot showing drone and landing platform

2. State Representation

The environment exposes a 15‑dimensional normalized state vector:

x, y – drone position

ux, uy – horizontal and vertical velocity

θ – tilt angle (0 = upright)

ω – angular velocity

f – fuel level (0‑1)

px, py – platform position

d – Euclidean distance to platform center

dx, dy – relative offset to platform

u – speed magnitude

landed, crashed – termination flags

All values are scaled to the range [‑1, 1] or [0, 1] to aid stable neural‑network training.
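A minimal sketch of how such a state vector might be assembled. The field order follows the list above; the `drone`/`platform` objects, attribute names, and normalization constants are illustrative assumptions rather than the repository's exact code:

```python
import numpy as np

def build_state(drone, platform, world_width, world_height, max_speed):
    """Pack the observation into a 15-dimensional normalized vector."""
    dx = platform.x - drone.x
    dy = platform.y - drone.y
    dist = np.hypot(dx, dy)
    speed = np.hypot(drone.vx, drone.vy)
    return np.array([
        drone.x / world_width,      # x
        drone.y / world_height,     # y
        drone.vx / max_speed,       # ux
        drone.vy / max_speed,       # uy
        drone.angle / 180.0,        # theta, 0 = upright
        drone.angular_velocity,     # omega
        drone.fuel,                 # f in [0, 1]
        platform.x / world_width,   # px
        platform.y / world_height,  # py
        dist / world_width,         # d
        dx / world_width,           # dx
        dy / world_height,          # dy
        speed / max_speed,          # u
        float(drone.landed),        # termination flag
        float(drone.crashed),       # termination flag
    ], dtype=np.float32)
```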

3. Action Space

Instead of enumerating 2³ = 8 discrete actions, the author treats each of the three thrusters (main, left, right) as an independent binary decision sampled from a Bernoulli distribution. This reduces the policy's output from eight mutually exclusive classes to three independent bits.
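A short PyTorch sketch of this sampling scheme (the logit values below are placeholders standing in for the policy network's output):

```python
import torch

# Suppose the policy has produced three logits for the current state,
# one per thruster: main, left, right.
logits = torch.tensor([0.8, -1.2, 0.3])

probs = torch.sigmoid(logits)                # independent firing probabilities
dist = torch.distributions.Bernoulli(probs)  # three independent Bernoulli bits
action = dist.sample()                       # e.g. tensor([1., 0., 1.])
log_prob = dist.log_prob(action).sum()       # joint log-probability of the action
```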

Diagram of three independent thruster actions

4. Reward Design

The reward function combines several terms (a rough sketch follows this list):

Positive reward for being close to the platform and moving slowly.

Penalty for time elapsed and for being below the platform.

Gaussian‑scaled bonuses that increase as distance → 0.
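The following is only an illustration of how these terms could be combined; the coefficients, the Gaussian width, and the function signature are assumptions, not the author's exact reward:

```python
import numpy as np

def shaping_reward(dist, speed, below_platform, dt, sigma=0.25):
    """Illustrative per-step reward combining the terms listed above."""
    reward = 0.0
    reward += np.exp(-(dist / sigma) ** 2)   # Gaussian bonus, peaks as dist -> 0
    reward += np.exp(-(speed / sigma) ** 2)  # bonus for moving slowly
    reward -= 0.1 * dt                       # penalty for time elapsed
    if below_platform:
        reward -= 1.0                        # penalty for dropping below the platform
    return reward
```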

To reduce variance, the author uses a baseline equal to the average episode return and computes the advantage Aₜ = Gₜ – b, normalising it to mean 0 and std 1 before applying the policy‑gradient update.
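In code, this variance-reduction step might look roughly like the following (the tensor shape and the small epsilon are my assumptions):

```python
import torch

def normalized_advantages(returns):
    """returns: 1-D tensor of returns G_t collected over one or more episodes."""
    baseline = returns.mean()           # b = average episode return
    adv = returns - baseline            # A_t = G_t - b, already zero-mean
    adv = adv / (adv.std() + 1e-8)      # rescale to unit standard deviation
    return adv
```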

5. Reward‑Cheating Pitfall

During early experiments the drone learned to hover just below the platform, repeatedly collecting distance‑based rewards while avoiding the large crash penalty. This “reward‑cheating” occurs because the reward depends only on the current state, not on the trajectory. The author demonstrates this behavior with reward‑trajectory plots.

Reward plot showing hovering cheat

6. Policy Network

A simple feed‑forward neural network receives the 15‑dimensional state and outputs three logits, each passed through a sigmoid to obtain Bernoulli probabilities for the thrusters. The network is trained with the REINFORCE objective by minimising the loss −log π(a|s) · A; because RL frameworks minimise a loss rather than maximise a return, the negative of the expected return is used.
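A compact PyTorch sketch of such a network and loss. The text only specifies the 15-dimensional input and the 3 output logits; the hidden-layer sizes and function names below are assumptions:

```python
import torch
import torch.nn as nn

class ThrusterPolicy(nn.Module):
    """15-dimensional state in, 3 thruster logits out."""
    def __init__(self, state_dim=15, hidden=64, n_thrusters=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_thrusters),   # raw logits, sigmoid applied downstream
        )

    def forward(self, state):
        return self.net(state)

def reinforce_loss(logits, actions, advantages):
    """loss = -log pi(a|s) * advantage, averaged over the batch."""
    dist = torch.distributions.Bernoulli(logits=logits)
    log_probs = dist.log_prob(actions).sum(dim=-1)  # joint log-prob of the 3 bits
    return -(log_probs * advantages).mean()         # minimise the negative return
```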

7. Training Strategies

The author compares three update schemes:

Per‑step update – too noisy.

Per‑episode update – better but still high variance.

Multi‑episode batch update – runs several parallel episodes (e.g., six) and updates after averaging returns; this yields the most stable learning.

Sample code for launching parallel episodes is shown in the repository.
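As a rough outline of what such a batched update could look like (the environment factories, the old-style Gym `step` API, and the update loop below are illustrative assumptions, not the repository's code):

```python
import torch

def train_batch(env_fns, policy, optimizer, gamma=0.99):
    """Collect one episode from each environment, then perform a single update."""
    all_logps, all_returns = [], []
    for make_env in env_fns:                       # e.g. six parallel environments
        env = make_env()
        state, rewards, logps = env.reset(), [], []
        done = False
        while not done:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Bernoulli(logits=logits)
            action = dist.sample()
            logps.append(dist.log_prob(action).sum())
            state, reward, done, _ = env.step(action.numpy())
            rewards.append(reward)
        G, returns = 0.0, []                       # discounted returns G_t
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        all_logps += logps
        all_returns += returns

    returns = torch.tensor(all_returns)
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = -(torch.stack(all_logps) * adv).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```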

8. REINFORCE with Baseline

The classic REINFORCE theorem (Williams, 1992) is presented, followed by the baseline‑augmented version that reduces gradient variance. The author includes the full mathematical expressions and a diagram of the algorithm.
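For reference, the standard forms of these expressions in conventional notation (this is the textbook statement of the theorem, not a transcription of the author's exact derivation):

```latex
% REINFORCE policy gradient (Williams, 1992)
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]

% Baseline-augmented version: subtracting a baseline b (e.g. the average return)
% leaves the gradient unbiased but reduces its variance.
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, (G_t - b) \right]
```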

9. Conclusions & Future Work

After extensive tuning, the drone finally learns to land reliably, though occasional hovering issues remain. Future directions include Actor‑Critic methods, Deep Q‑Learning, PPO/GRPO, and applying the approach to real‑world systems.

10. References

Williams, R.J. (1992). Simple Statistical Gradient‑Following Algorithms for Connectionist Reinforcement Learning.

Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.

OpenAI Spinning‑Up: https://spinningup.openai.com/

GitHub repository: https://github.com/vedant-jumle/reinforcement-learning-101

Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.