Can a Drone Learn to Land Itself? A Deep Reinforcement Learning Walkthrough
This article walks through the fundamentals of reinforcement learning: it builds a custom drone‑landing simulation, defines the state and action spaces, designs a reward function, implements a neural‑network policy with Bernoulli action sampling, and trains it with REINFORCE plus a baseline, exposing common pitfalls such as reward cheating along the way.
Reinforcement learning (RL) differs from supervised learning by letting an agent discover good behavior through trial‑and‑error feedback instead of being shown correct examples. The core components are the agent (actor), environment (world), policy (action‑selection rule), state (observable snapshot), action, and reward (feedback).
1. Delivery‑Drone Game
The author created a simple 2‑D game where a virtual delivery drone must land on a platform. Successful landing requires (1) horizontal alignment within ±0.0625 units, (2) speed below 0.3, (3) tilt angle under 20°, and (4) correct altitude. The full source code and a runnable GitHub repository are provided.
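The four landing criteria can be captured in a small predicate. This is a sketch, not code from the repository: the function name, the argument names, and the `on_platform_altitude` flag are assumptions; only the numeric thresholds come from the article.

```python
def is_successful_landing(x, px, speed, tilt_deg, on_platform_altitude):
    """Check the four landing criteria: alignment, speed, tilt, altitude."""
    return (abs(x - px) <= 0.0625      # horizontal alignment within ±0.0625 units
            and speed < 0.3            # touchdown speed below 0.3
            and abs(tilt_deg) < 20.0   # tilt angle under 20 degrees
            and on_platform_altitude)  # drone is at platform height
```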
2. State Representation
The environment exposes a 15‑dimensional normalized state vector:
x, y – drone position
ux, uy – horizontal and vertical velocity
θ – tilt angle (0 = upright)
ω – angular velocity
f – fuel level (0‑1)
px, py – platform position
d – Euclidean distance to platform center
dx, dy – relative offset to platform
u – speed magnitude
landed, crashed – termination flags
All values are scaled to the range [‑1, 1] or [0, 1] to aid stable neural‑network training.
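Assembling that vector might look like the sketch below. The helper name and the normalization constants (`world_size`, `max_speed`, `max_spin`) are assumptions for illustration; the article only specifies the fifteen components and the target ranges.

```python
import numpy as np

def build_state(x, y, vx, vy, theta, omega, fuel, px, py, landed, crashed,
                world_size=10.0, max_speed=5.0, max_spin=np.pi):
    """Assemble the 15-dim state, scaling each entry to [-1, 1] or [0, 1]."""
    dx, dy = px - x, py - y              # relative offset to the platform
    d = np.hypot(dx, dy)                 # Euclidean distance to platform centre
    u = np.hypot(vx, vy)                 # speed magnitude
    return np.array([
        x / world_size, y / world_size,  # position
        vx / max_speed, vy / max_speed,  # velocity
        theta / np.pi, omega / max_spin, # tilt and angular velocity
        fuel,                            # fuel already in [0, 1]
        px / world_size, py / world_size,
        d / (world_size * np.sqrt(2)),   # distance in [0, 1]
        dx / world_size, dy / world_size,
        u / max_speed,
        float(landed), float(crashed),   # termination flags in {0, 1}
    ], dtype=np.float32)
```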
3. Action Space
Instead of enumerating the 2³ = 8 joint thruster combinations as discrete actions, the author treats each of the three thrusters (main, left, right) as an independent binary decision sampled from a Bernoulli distribution. This factorizes the action space into three independent bits, so the policy only needs three sigmoid outputs rather than an eight‑way softmax.
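The sampling step can be sketched as follows. The function name is an assumption; the mechanics (sigmoid per logit, one Bernoulli draw per thruster, and the joint log‑probability needed later for REINFORCE) follow the article.

```python
import numpy as np

def sample_thrusters(logits, rng):
    """Turn three logits into independent on/off thruster decisions."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits)))  # sigmoid -> firing prob.
    actions = (rng.random(3) < probs).astype(np.int8)  # one Bernoulli draw each
    # joint log-probability of the sampled bits, used in the policy gradient
    log_prob = np.sum(actions * np.log(probs) + (1 - actions) * np.log1p(-probs))
    return actions, log_prob
```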
4. Reward Design
The reward function combines several terms:
Positive reward for being close to the platform and moving slowly.
Penalty for time elapsed and for being below the platform.
Gaussian‑scaled bonuses that increase as distance → 0.
To reduce variance, the author uses a baseline equal to the average episode return and computes the advantage Aₜ = Gₜ – b, normalising it to mean 0 and std 1 before applying the policy‑gradient update.
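The baseline-and-normalization step is compact enough to write out. This sketch assumes discounted returns with a hypothetical discount factor `gamma`; the averaging baseline and the mean‑0/std‑1 normalization are as described above.

```python
import numpy as np

def normalized_advantages(rewards, gamma=0.99):
    """Discounted returns, average-return baseline, standardized advantages."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the final step
        running = rewards[t] + gamma * running
        G[t] = running
    A = G - G.mean()                          # baseline b = average return
    return A / (A.std() + 1e-8)               # normalize to mean 0, std 1
```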
5. Reward‑Cheating Pitfall
During early experiments the drone learned to hover just below the platform, repeatedly collecting distance‑based rewards while avoiding the large crash penalty. This “reward‑cheating” occurs because the reward depends only on the current state, not on the trajectory. The author demonstrates this behavior with reward‑trajectory plots.
6. Policy Network
A simple feed‑forward neural network receives the 15‑dimensional state and outputs three logits, each passed through a sigmoid to obtain Bernoulli firing probabilities for the thrusters. The network is trained with the REINFORCE gradient, using the loss −log π(a|s) · advantage; because deep‑learning frameworks minimise a loss, the negative of the log‑probability‑weighted advantage is used.
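A minimal numpy version of the forward pass and loss is sketched below. The hidden size, initialization scale, and tanh activation are assumptions (the article does not specify the architecture); in practice the gradients would come from a framework's autodiff rather than being derived by hand.

```python
import numpy as np

class TinyPolicy:
    """One-hidden-layer policy: 15-dim state -> 3 thruster probabilities."""
    def __init__(self, rng, hidden=32):
        self.W1 = rng.normal(0, 0.1, (15, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, 3))
        self.b2 = np.zeros(3)

    def probs(self, state):
        h = np.tanh(state @ self.W1 + self.b1)
        logits = h @ self.W2 + self.b2
        return 1.0 / (1.0 + np.exp(-logits))  # sigmoid per thruster

def reinforce_loss(probs, actions, advantage):
    """loss = -log pi(a|s) * advantage, summed over the three Bernoulli bits."""
    log_pi = np.sum(actions * np.log(probs) + (1 - actions) * np.log1p(-probs))
    return -log_pi * advantage
```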
7. Training Strategies
The author compares three update schemes:
Per‑step update – too noisy.
Per‑episode update – better but still high variance.
Multi‑episode batch update – runs several parallel episodes (e.g., six) and updates after averaging returns; this yields the most stable learning.
Sample code for launching parallel episodes is shown in the repository.
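The batched scheme can be sketched as a driver loop. Here `run_episode` and `update_policy` are hypothetical callbacks standing in for the rollout and gradient-step code in the repository; the batch‑wide baseline and normalization mirror the advantage computation described earlier.

```python
import numpy as np

def batch_update(run_episode, update_policy, n_episodes=6):
    """Collect several episodes, standardize advantages over the whole batch,
    then apply a single policy-gradient update."""
    batch_log_probs, batch_returns = [], []
    for _ in range(n_episodes):
        log_probs, returns = run_episode()   # per-step log pi(a|s) and returns
        batch_log_probs.extend(log_probs)
        batch_returns.extend(returns)
    G = np.asarray(batch_returns)
    A = (G - G.mean()) / (G.std() + 1e-8)    # batch-wide baseline + normalization
    update_policy(np.asarray(batch_log_probs), A)
```

Averaging over several episodes before each update smooths out the episode-to-episode noise that makes the per-step and per-episode schemes unstable.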
8. REINFORCE with Baseline
The classic REINFORCE theorem (Williams, 1992) is presented, followed by the baseline‑augmented version that reduces gradient variance. The author includes the full mathematical expressions and a diagram of the algorithm.
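For reference, the two estimators can be written out; this is the standard form of Williams' result, with b the action-independent baseline (the average return, in the author's setup):

```latex
% Classic REINFORCE policy gradient (Williams, 1992)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right]

% Baseline-augmented version: subtracting any action-independent b
% leaves the estimator unbiased but reduces its variance
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - b)
    \right]
```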
9. Conclusions & Future Work
After extensive tuning, the drone finally learns to land reliably, though occasional hovering issues remain. Future directions include Actor‑Critic methods, Deep Q‑Learning, PPO/GRPO, and applying the approach to real‑world systems.
10. References
Williams, R. J. (1992). Simple Statistical Gradient‑Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
OpenAI Spinning‑Up: https://spinningup.openai.com/
GitHub repository: https://github.com/vedant-jumle/reinforcement-learning-101
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.