Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can cause serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, large language models (LLMs) have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g., sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans.
We provide a dataset of about 64K data clips, where each clip includes one navigation instruction, several notice instructions, a sequence of multi-modal multi-view sensor data, and control signals. Each clip spans 2 to 20 seconds.
Two examples of the collected data with corresponding labeled navigation instructions and optional notice instructions.
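To make the clip layout concrete, below is a minimal Python sketch of how a single clip could be represented. The field names and types are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical schema for one data clip; names and types are assumptions
# for illustration, not the released dataset format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Frame:
    camera_images: List[str]   # paths to multi-view RGB images
    lidar_points: str          # path to the LiDAR point cloud
    throttle: float            # control signals recorded at this frame
    steer: float
    brake: float

@dataclass
class Clip:
    navigation_instruction: str     # one per clip, e.g. "After [x] meters, take a left."
    notice_instructions: List[str]  # zero or more, e.g. "Watch for walkers up front."
    frames: List[Frame] = field(default_factory=list)  # 2-20 seconds of sensor data
```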
| Type | Three randomly chosen instructions of each instruction type |
|---|---|
| Follow | Maintain your current course until the upcoming intersection. In [x] meters, switch to left lane. Ease on to the left and get set to join the highway. |
| Turn | After [x] meters, take a left. At the next intersection, just keep heading straight, no turn. You’ll be turning left at the next T-junction, alright? |
| Others | Feel free to start driving. Slow down now. Head to the point, next one’s [x] meters ahead, [y] meters left/right. |
| Notice | Watch for walkers up front. Just a heads up, there’s a bike ahead. Please be aware of the red traffic signal directly in front of you. |
Examples of the considered navigation instructions (follow, turn, others) and notice instructions. [x] and [y] denote floating-point values for specific distances.
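As a small illustration of how the [x] and [y] placeholders could be instantiated at runtime, here is a hedged Python sketch; the template strings mirror the table above, while the substitution helper is a hypothetical convenience, not part of the released code.

```python
# Hypothetical helper for filling distance placeholders in instruction
# templates; not part of the LMDrive codebase.
from typing import Optional

def fill_instruction(template: str,
                     x: Optional[float] = None,
                     y: Optional[float] = None) -> str:
    """Substitute concrete distances into an instruction template."""
    if x is not None:
        template = template.replace("[x]", f"{x:.1f}")
    if y is not None:
        template = template.replace("[y]", f"{y:.1f}")
    return template

print(fill_instruction("In [x] meters, switch to left lane.", x=25.0))
# -> "In 25.0 meters, switch to left lane."
```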
The structure of the proposed LMDrive model, which consists of two major components: 1) a vision encoder that processes multi-view multi-modal sensor data (camera images and LiDAR point clouds) for scene understanding and generating visual tokens; 2) a large language model and its associated components (tokenizer, Q-Former, and adapters) that process the visual tokens and language instructions to predict the control signal and whether the given instruction is completed.
The detailed structure of the vision encoder, which takes as input the multi-view multi-modality sensor data (camera images and LiDAR point clouds) and outputs visual tokens for the language model.
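The following PyTorch sketch illustrates the two-component layout described above, with the vision encoder producing visual tokens that are concatenated with instruction tokens before a language backbone. All module names, dimensions, and prediction heads are illustrative assumptions rather than the released LMDrive implementation.

```python
# A minimal, hypothetical sketch of the two-component layout; illustrative
# assumptions throughout, not the released LMDrive code.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Maps fused multi-view camera + LiDAR features to visual tokens."""
    def __init__(self, in_dim: int = 256, token_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(in_dim, token_dim)

    def forward(self, sensor_feats: torch.Tensor) -> torch.Tensor:
        # sensor_feats: (batch, num_tokens, in_dim), already fused across
        # views and modalities by an upstream backbone
        return self.proj(sensor_feats)

class InstructionFollower(nn.Module):
    """Consumes visual tokens + instruction embeddings, predicts controls
    and an 'instruction completed' flag."""
    def __init__(self, token_dim: int = 768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.control_head = nn.Linear(token_dim, 2)  # e.g. steering, throttle
        self.done_head = nn.Linear(token_dim, 1)     # completion flag

    def forward(self, visual_tokens, instruction_tokens):
        x = torch.cat([instruction_tokens, visual_tokens], dim=1)
        h = self.backbone(x).mean(dim=1)
        return self.control_head(h), torch.sigmoid(self.done_head(h))
```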
The LangAuto Benchmark is the first to evaluate closed-loop driving with language instructions in CARLA. It differs from previous benchmarks like Town05 and Longest6 by using natural language instructions instead of discrete commands or waypoints.
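To clarify what closed-loop evaluation with language instructions means in practice, here is a minimal sketch of such an evaluation loop; the `Simulator` and `Agent` interfaces are hypothetical stand-ins, not the LangAuto or CARLA APIs.

```python
# Hypothetical closed-loop evaluation skeleton; Simulator and the agent
# interface are illustrative stand-ins, not the CARLA or LangAuto APIs.
from typing import Protocol

class Simulator(Protocol):
    def get_sensor_data(self) -> dict: ...
    def get_current_instruction(self) -> str: ...  # natural-language text
    def apply_control(self, throttle: float, steer: float, brake: float) -> None: ...
    def tick(self) -> bool: ...                    # advance one step; False when route ends

def evaluate(sim: Simulator, agent) -> None:
    """Run one route: at every step the agent sees the latest instruction,
    so the benchmark can issue new, updated, or misleading instructions
    mid-route and observe how the agent reacts."""
    running = True
    while running:
        obs = sim.get_sensor_data()
        instruction = sim.get_current_instruction()
        throttle, steer, brake = agent.act(obs, instruction)
        sim.apply_control(throttle, steer, brake)
        running = sim.tick()
```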
Demo cases: navigation instruction, navigation instruction (with distance), misleading instruction, and notice instruction.
@misc{shao2023lmdrive,
  title={LMDrive: Closed-Loop End-to-End Driving with Large Language Models},
  author={Hao Shao and Yuxuan Hu and Letian Wang and Steven L. Waslander and Yu Liu and Hongsheng Li},
  year={2023},
  eprint={2312.07488},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}