Pony.ai Perception System: Combining Traditional and Deep Learning Methods for 2D and 3D Object Detection
This article outlines Pony.ai's perception pipeline, comparing traditional and deep‑learning approaches for 2D and 3D object detection, detailing sensor fusion, detection methods, challenges such as occlusion and distance estimation, and how hybrid techniques improve accuracy for autonomous driving.
Pony.ai's perception system models the surrounding world to provide crucial information—such as object position, velocity, and direction—to downstream planning and control modules in autonomous driving.
Perception in Pony.ai involves a series of steps: sensor acquisition (LiDAR, cameras, millimeter‑wave radar), frame‑level processing (sensor fusion, segmentation, detection, classification), object tracking across frames, and road‑feature analysis (traffic lights, signs). Without reliable perception, the vehicle cannot react to its environment.
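The frame-level flow described above can be sketched as a pipeline of stages. All stage implementations here are trivial stubs, and the function names are illustrative, not Pony.ai's actual interfaces:

```python
# Sketch of the frame-level perception flow: fuse sensors, segment the
# scene, then detect/classify each candidate region. Stubs only.

def fuse_sensors(lidar, camera, radar):
    # Align measurements into one frame-level scene representation.
    return {"lidar": lidar, "camera": camera, "radar": radar}

def segment(scene):
    # Split the fused scene into candidate object regions.
    return [{"points": p} for p in scene["lidar"]]

def detect_and_classify(region):
    # Assign a class label to each candidate region.
    return {"region": region, "label": "vehicle"}

def perceive_frame(lidar, camera, radar):
    scene = fuse_sensors(lidar, camera, radar)
    return [detect_and_classify(r) for r in segment(scene)]

objs = perceive_frame(lidar=[[1.0, 2.0, 0.5]], camera="img", radar=[])
print(len(objs), objs[0]["label"])  # 1 vehicle
```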
2D Object Detection focuses on camera‑based inputs. Traditional methods rely on sliding windows, hand‑crafted features (Harris corners, Canny edges), and classifiers such as SVMs, but are constrained by preset window sizes and low‑dimensional features. Deep learning with convolutional neural networks (CNNs) overcomes these limitations by extracting high‑dimensional learned features, using ROI pooling and region proposal networks (RPNs) to share computation across the whole image, and enabling real‑time detection.
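The traditional pipeline can be sketched in a few lines: a fixed-size window scans the image exhaustively, a hand-crafted feature is computed per window, and a simple classifier scores it. Mean intensity stands in for a real descriptor (HOG, edges) and a threshold stands in for an SVM; the exhaustive scan and preset window size are the parts that match the description above:

```python
# Minimal sketch of traditional sliding-window detection. The feature
# (mean intensity) and "classifier" (threshold) are toy stand-ins for
# hand-crafted descriptors and an SVM.

def sliding_window_detect(image, win=3, stride=1, threshold=0.5):
    """Return (row, col) of windows whose mean intensity exceeds threshold."""
    h, w = len(image), len(image[0])
    hits = []
    for r in range(0, h - win + 1, stride):
        for c in range(0, w - win + 1, stride):
            patch = [image[r + i][c + j] for i in range(win) for j in range(win)]
            score = sum(patch) / len(patch)   # hand-crafted feature
            if score > threshold:             # stand-in classifier
                hits.append((r, c))
    return hits

# A 5x5 "image" with a bright 3x3 blob in the top-left corner.
img = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(sliding_window_detect(img))  # [(0, 0), (0, 1), (1, 0)]
```

The weakness is visible even here: windows partially overlapping the blob also fire, and the window size must be chosen in advance.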
Current CNN‑based 2D detectors fall into two categories:
Anchor‑based methods (e.g., R-CNN, Fast/Faster R-CNN, SSD/DSSD, YOLO v1–v3, RetinaNet) that still use predefined boxes.
Anchor‑free methods (e.g., CornerNet, FSAF, FCOS) that directly regress object presence and size at each feature‑pyramid location.
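The anchor-free idea can be made concrete with the regression target used in FCOS: each feature-map location inside a ground-truth box predicts its distances to the four box edges, with no predefined anchor shapes involved. A minimal sketch:

```python
# FCOS-style anchor-free regression target: a location (x, y) inside a
# ground-truth box regresses its distances to the box's four edges
# (l, t, r, b) rather than offsets to a preset anchor box.

def fcos_target(x, y, box):
    """box = (x1, y1, x2, y2); returns (l, t, r, b), or None if (x, y)
    lies outside the box and is not responsible for this object."""
    x1, y1, x2, y2 = box
    if not (x1 <= x <= x2 and y1 <= y <= y2):
        return None
    return (x - x1, y - y1, x2 - x, y2 - y)

print(fcos_target(50, 40, (20, 10, 120, 90)))  # (30, 30, 70, 50)
```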
While cameras excel at detecting distant objects, they face challenges such as overlapping objects, illumination variations, and lack of direct depth measurement, which can lead to missed detections or ambiguous classifications.
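The lack of direct depth is why monocular distance must be estimated rather than measured. One common heuristic assumes a known real-world object height and inverts the pinhole model, Z ≈ f · H / h_pixels; the focal length and object height below are illustrative values, not Pony.ai calibration data:

```python
# Monocular distance heuristic from the pinhole camera model: an object
# of known real height H (metres) that appears h pixels tall at focal
# length f (pixels) is roughly Z = f * H / h metres away. Any error in
# the assumed height H translates directly into distance error.

def monocular_distance(focal_px, real_height_m, bbox_height_px):
    return focal_px * real_height_m / bbox_height_px

# A 1.5 m tall car appearing 75 px tall with a 1000 px focal length:
print(monocular_distance(1000, 1.5, 75))  # 20.0 metres
```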
3D Object Detection leverages LiDAR point clouds to obtain spatial coordinates, enabling estimation of object size, orientation, and distance. Traditional 3D segmentation methods (Flood Fill, DBSCAN, Graph Cut) cluster points based on distance, density, or intensity, but suffer from over‑segmentation (splitting a single vehicle) and under‑segmentation (merging nearby pedestrians).
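A minimal pure-Python DBSCAN illustrates the density-based clustering described above (shown on 2-D points for brevity; LiDAR pipelines run the same logic on 3-D coordinates). The parameter choices are illustrative, and the over/under-segmentation failure modes correspond directly to `eps` being too small or too large:

```python
# Minimal DBSCAN: points with at least min_pts neighbours within eps
# are core points; clusters grow from core points, and unreachable
# points are labelled noise (-1).
import math

def dbscan(points, eps=1.0, min_pts=3):
    """Return one label per point: cluster id >= 0, or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point reachable from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:   # j is itself a core point: expand
                seeds.extend(jn)
        cluster += 1
    return labels

# Two dense clusters plus one stray point (labelled -1 as noise).
pts = [(0, 0), (0.5, 0), (0, 0.5),
       (5, 5), (5.5, 5), (5, 5.5),
       (10, 10)]
print(dbscan(pts))  # [0, 0, 0, 1, 1, 1, -1]
```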
Deep‑learning‑based 3D detection incorporates point‑cloud features into CNNs, aggregates multi‑frame information, and predicts object pose and velocity. This approach mitigates the “three‑people‑as‑one‑car” issue by using distance cues to separate close objects.
Limitations of deep learning include hard-to-interpret outputs due to massive parameter counts, imperfect recall and classification (e.g., confusing trash cans with pedestrians), and over‑fitting to specific datasets, which can degrade performance in new environments.
To address these issues, Pony.ai combines traditional and deep‑learning results through:
Using deep‑learning segmentation to refine traditional segmentation.
Supplementing deep‑learning recall with traditional segmentation outputs.
Applying multi‑frame probabilistic fusion (e.g., Markov or Bayesian models) for smoother predictions.
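The probabilistic-fusion step can be sketched as a running Bayesian update: each frame's classifier output acts as a likelihood that is folded into the track's posterior over classes, so a single noisy frame cannot flip the label. The class set and likelihood values below are illustrative only:

```python
# Multi-frame Bayesian fusion sketch: multiply the running belief by
# each frame's per-class likelihood, then renormalise.

def bayes_update(prior, likelihood):
    """prior, likelihood: dicts class -> probability; returns posterior."""
    post = {c: prior[c] * likelihood[c] for c in prior}
    z = sum(post.values())
    return {c: p / z for c, p in post.items()}

belief = {"pedestrian": 0.5, "trash_can": 0.5}   # uninformative prior
frames = [
    {"pedestrian": 0.6, "trash_can": 0.4},       # frame 1
    {"pedestrian": 0.7, "trash_can": 0.3},       # frame 2
    {"pedestrian": 0.2, "trash_can": 0.8},       # one noisy frame
    {"pedestrian": 0.7, "trash_can": 0.3},       # frame 4
]
for obs in frames:
    belief = bayes_update(belief, obs)
print(max(belief, key=belief.get))  # pedestrian
```

The single contradictory frame lowers the posterior but does not overturn the accumulated evidence, which is the smoothing effect the fusion step is after.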
Real‑world road‑test videos demonstrate that 2D detection struggles with dense object stacks, while 3D detection, aided by depth information, better separates occluded objects, though challenges remain for far‑range detection.
Overall, the perception stack continuously evolves by integrating classic algorithms with modern deep‑learning techniques to improve reliability and safety in autonomous driving.
DataFunTalk