How High‑Low UAV Collaboration Beats Solo Drones in Complex Navigation
Researchers from Beihang University present a high‑low UAV collaborative navigation paradigm, introducing the HaL‑13k dataset and the AeroDuo framework. The work details high‑altitude planning with Pilot‑LLM and a three‑stage low‑altitude search, and demonstrates superior target finding in complex environments.
High‑Low UAV Collaborative Navigation Paradigm
The system consists of a high‑altitude UAV (the “panoramic commander”) and a low‑altitude UAV (the “ground scout”). The high‑altitude platform captures a wide‑area orthographic view, builds a global map, and predicts a probability distribution of target locations. The low‑altitude platform receives this distribution, plans detailed waypoints, and performs fine‑grained visual search. This division of labor enables rapid target acquisition in cluttered environments where a single UAV would either miss small objects (if flying high) or lose global context (if flying low).
HaL‑13k Dataset
To support high‑low collaborative navigation, the authors extended the UAV‑Need‑Help benchmark, which originally contained only single‑UAV trajectories, by adding synchronized high‑altitude UAV flight paths, perception data, and refined low‑altitude trajectories. The resulting dataset, named HaL‑13k, contains 13,000 paired high‑/low‑altitude sequences with annotated target locations, environmental landmarks, and multi‑modal sensor streams (RGB, depth, and GPS).
Pilot‑LLM Multimodal Framework
Pilot‑LLM integrates a large language model (LLM) with visual encoders to perform multimodal reasoning for UAV navigation. The framework consists of three core modules:
Global Map Construction
Historical aerial images captured by the high‑altitude UAV are rectified using orthographic projection to remove perspective distortion. The rectified tiles are then stitched into a unified coordinate system, producing a metric‑accurate global map that serves as the spatial context for downstream reasoning.
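As a rough illustration of the stitching step, the sketch below places already‑rectified tiles into one shared metric grid and averages overlapping cells. The function name, the `(x_min, y_min)` footprint interface, and the averaging rule are assumptions for clarity, not the paper's exact pipeline (which would also handle camera pose estimation and blending).

```python
import numpy as np

def stitch_tiles(tiles, footprints, cell_size, map_shape):
    """Place orthographically rectified tiles into one global grid.

    tiles      : list of HxW arrays (already perspective-rectified)
    footprints : list of (x_min, y_min) world coordinates of each tile's corner
    cell_size  : metres per map cell
    map_shape  : (rows, cols) of the global map
    Interface and names are illustrative, not from the paper.
    """
    global_map = np.zeros(map_shape, dtype=np.float32)
    count = np.zeros(map_shape, dtype=np.float32)
    for tile, (x0, y0) in zip(tiles, footprints):
        # Convert the tile's world-frame corner to global-map cell indices.
        r0 = int(round(y0 / cell_size))
        c0 = int(round(x0 / cell_size))
        h, w = tile.shape
        global_map[r0:r0 + h, c0:c0 + w] += tile
        count[r0:r0 + h, c0:c0 + w] += 1.0
    # Average wherever tiles overlap; untouched cells stay zero.
    return global_map / np.maximum(count, 1.0)
```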
Lightweight Decoder for Target Probability Map
A compact convolutional decoder receives the global map and the LLM‑generated semantic query (e.g., “find a parked car under a tree”). It outputs a dense probability map P(x, y) indicating the likelihood of the target at each map cell. The decoder is trained with a cross‑entropy loss on the annotated target locations and includes a Gaussian smoothing layer to balance exploration and spatial precision.
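To make the smoothing-plus-normalization idea concrete, here is a minimal numpy sketch of how raw per-cell decoder scores could be turned into a dense probability map: a Gaussian blur spreads mass around sharp peaks (favoring exploration), and a softmax over all cells yields a valid distribution. The kernel size, sigma, and function names are assumptions; the paper's decoder is a learned convolutional network, not this hand-rolled head.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def target_probability_map(logits, sigma=1.0):
    """Turn per-cell scores into a smoothed probability map P(x, y).

    logits : HxW array of raw scores (in the paper, decoder output)
    """
    k = gaussian_kernel(5, sigma)
    padded = np.pad(logits, 2, mode="edge")
    H, W = logits.shape
    smoothed = np.zeros((H, W), dtype=np.float64)
    # Direct 2-D convolution; fine for illustration, slow for large maps.
    for i in range(H):
        for j in range(W):
            smoothed[i, j] = np.sum(padded[i:i + 5, j:j + 5] * k)
    # Softmax over all map cells gives a dense distribution summing to 1.
    e = np.exp(smoothed - smoothed.max())
    return e / e.sum()
```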
Low‑Altitude UAV Navigation and Search Strategy
The low‑altitude UAV follows a three‑stage pipeline:
Goal selection and waypoint planning: The centroid of the highest‑confidence region in the probability map is chosen as the navigation goal. Key waypoints are generated using the A* algorithm on a discretized occupancy grid derived from the global map.
Reinforcement‑learning‑based obstacle avoidance: A policy network trained with Proximal Policy Optimization (PPO) consumes LiDAR or depth inputs and outputs velocity commands that respect safety constraints while following the planned waypoints.
Visual‑language target detection: A vision‑language model (e.g., CLIP‑based detector) processes the onboard RGB stream to locate the target within the local field of view and refines the final pose estimate.
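The first stage's waypoint planning can be sketched as standard A* over a 4‑connected occupancy grid. The grid encoding (0 = free, 1 = blocked), coordinate convention, and Manhattan heuristic are assumptions for illustration; the paper's planner interface may differ.

```python
import heapq

def astar(grid, start, goal):
    """A* over a 2-D occupancy grid (0 = free, 1 = blocked), 4-connected.

    Returns a list of (row, col) waypoints from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # admissible Manhattan heuristic
    open_set = [(h(start), 0, start, None)]  # (f, g, cell, parent)
    came_from, g_best = {}, {start: 0}
    while open_set:
        _, g, cur, parent = heapq.heappop(open_set)
        if cur in came_from:           # already expanded with a better cost
            continue
        came_from[cur] = parent
        if cur == goal:                # reconstruct path back to start
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1
                if ng < g_best.get(nxt, float("inf")):
                    g_best[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt, cur))
    return None  # goal unreachable
```

In the full pipeline, the returned waypoints would be handed to the PPO policy of stage two, which tracks them while avoiding obstacles.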
Scalability to Multi‑UAV Scenarios
After the high‑altitude UAV produces a probability map, multiple peaks are extracted as candidate target locations. An assignment problem is solved with the Hungarian algorithm to allocate each candidate to a distinct low‑altitude UAV, ensuring balanced workload and collision‑free operation. The same RL obstacle‑avoidance policy and visual‑language detector are reused on each low‑altitude platform.
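The candidate-to-UAV allocation is a classic assignment problem. The sketch below brute-forces all permutations with a total-distance cost, which is equivalent to the Hungarian result for a handful of UAVs; at scale one would use the Hungarian algorithm proper (e.g. `scipy.optimize.linear_sum_assignment`). Positions, the distance-based cost, and the function name are illustrative assumptions.

```python
from itertools import permutations
import math

def assign_candidates(uav_positions, candidate_peaks):
    """Assign each low-altitude UAV a distinct candidate target peak.

    Minimizes the total UAV-to-candidate distance by exhaustive search
    (fine for small fleets; Hungarian algorithm is O(n^3) for large ones).
    Requires len(candidate_peaks) >= len(uav_positions).
    """
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(candidate_peaks)), len(uav_positions)):
        cost = sum(dist(uav_positions[i], candidate_peaks[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    # Map UAV index -> assigned candidate peak.
    return {i: candidate_peaks[j] for i, j in enumerate(best)}
```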
Implementation Details
Simulation environment: OpenUAV (compatible with AirSim) was used for initial evaluation.
Training data: Both synthetic trajectories from the simulator and a subset of real‑world flights were combined to improve domain transfer.
Model versions: Pilot‑LLM employs GPT‑4‑Turbo as the LLM backbone; the visual encoder is a ResNet‑50 pretrained on ImageNet; the decoder has 4 convolutional layers (kernel size 3, stride 1) with a final softmax activation.
Performance metrics: On HaL‑13k, the collaborative system achieves a mean target localization error of 0.42 m, compared to 1.87 m for a single‑UAV baseline, while maintaining a 95 % success rate in obstacle‑free navigation.
Project homepage: https://rey-nard.github.io/AeroDuo_project/
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.