LMDrive: Closed-Loop End-to-End Driving with Large Language Models

CUHK MMLab, SenseTime Research, CPII under InnoHK, University of Toronto, Shanghai AI Lab

An end-to-end, closed-loop, language-based autonomous driving framework, which interacts with the dynamic environment via multi-modal multi-view sensor data and natural language instructions.

Abstract

Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, large language models (LLMs) have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g., sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans. Our main contributions are as follows:

  1. We propose a novel end-to-end, closed-loop, language-based autonomous driving framework, LMDrive, which interacts with the dynamic environment via multi-modal multi-view sensor data and natural language instructions.
  2. We provide a dataset with about 64K data clips, where each clip includes one navigation instruction, several notice instructions, a sequence of multi-modal multi-view sensor data, and control signals. Each clip spans 2 to 20 seconds.
  3. We present the LangAuto benchmark for evaluating autonomous agents that take language instructions as navigation inputs, covering misleading/long instructions and challenging adversarial driving scenarios.
  4. We conduct extensive closed-loop experiments to demonstrate the effectiveness of the proposed framework, and analyze different components of LMDrive to shed light on future research in this direction.

LMDrive Dataset

We provide a dataset with about 64K data clips, where each clip includes one navigation instruction, several notice instructions, a sequence of multi-modal multi-view sensor data, and control signals. Each clip spans 2 to 20 seconds (a schematic sketch of a clip follows the instruction examples below).

Two examples of the collected data with corresponding labeled navigation instructions and optional notice instructions.


Three randomly chosen instructions of each instruction type:

Follow
  • Maintain your current course until the upcoming intersection.
  • In [x] meters, switch to left lane.
  • Ease on to the left and get set to join the highway.

Turn
  • After [x] meters, take a left.
  • At the next intersection, just keep heading straight, no turn.
  • You’ll be turning left at the next T-junction, alright?

Others
  • Feel free to start driving.
  • Slow down now.
  • Head to the point, next one’s [x] meters ahead, [y] meters left/right.

Notice
  • Watch for walkers up front.
  • Just a heads up, there’s a bike ahead.
  • Please be aware of the red traffic signal directly in front of you.

Examples of the considered navigation instructions (follow, turn, others) and notice instructions. [x] and [y] denote floating-point numbers specifying distances.
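
For concreteness, a single clip could be organized roughly as follows. This is a minimal sketch in Python: the class and field names (DataClip, navigation_instruction, frames, and so on) are illustrative assumptions and do not reflect the dataset's actual storage format.

    # Minimal sketch of one LMDrive-style data clip (2-20 seconds of driving).
    # All names below are illustrative assumptions, not the dataset's real schema.
    from dataclasses import dataclass, field
    from typing import List

    import numpy as np


    @dataclass
    class Frame:
        """One timestep of multi-modal multi-view sensor data plus control signals."""
        camera_images: List[np.ndarray]   # one RGB image per camera view
        lidar_points: np.ndarray          # (N, 4) point cloud: x, y, z, intensity
        throttle: float                   # recorded control signals for this step
        steer: float
        brake: float


    @dataclass
    class NoticeInstruction:
        """Optional real-time notice, e.g. 'Watch for walkers up front.'"""
        text: str
        frame_index: int                  # timestep at which the notice is issued


    @dataclass
    class DataClip:
        navigation_instruction: str       # e.g. 'After [x] meters, take a left.'
        notice_instructions: List[NoticeInstruction] = field(default_factory=list)
        frames: List[Frame] = field(default_factory=list)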

Pipeline

The structure of the proposed LMDrive model, which consists of two major components:

  1. A vision encoder that processes multi-view multi-modal sensor data (camera and LiDAR) for scene understanding and visual token generation.
  2. A large language model and its associated components (tokenizer, Q-Former, and adapters) that process all historic visual tokens and the language instructions (navigation instruction and optional notice instruction) to predict the control signal and whether the given instruction is completed; a simplified sketch of this wiring follows the list.
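
The following is a highly simplified PyTorch-style sketch of how these two components could be wired together. Module names such as the Q-Former, adapter, and the waypoint/completion heads are assumptions for illustration, and the LLM is assumed to expose a HuggingFace-style interface (inputs_embeds, last_hidden_state); this is not the released implementation.

    import torch
    import torch.nn as nn


    class LMDriveSketch(nn.Module):
        """Illustrative wiring of the two major components described above.
        All module names, shapes, and interfaces are assumptions, not the official code."""

        def __init__(self, vision_encoder, qformer, adapter, llm, hidden_dim):
            super().__init__()
            self.vision_encoder = vision_encoder   # frozen after the pre-training stage
            self.qformer = qformer                 # compresses per-frame visual tokens
            self.adapter = adapter                 # projects visual features into the LLM space
            self.llm = llm                         # HuggingFace-style language model (assumed)
            self.waypoint_head = nn.Linear(hidden_dim, 2)    # predicts a future waypoint
            self.completion_head = nn.Linear(hidden_dim, 1)  # is the instruction finished?

        def forward(self, sensor_history, instruction_embeds):
            # 1) Encode every historic frame into visual tokens (encoder stays frozen).
            with torch.no_grad():
                visual_tokens = torch.cat(
                    [self.vision_encoder(frame) for frame in sensor_history], dim=1)

            # 2) Compress and project the visual tokens, then concatenate them with
            #    the embedded navigation / notice instructions.
            visual_embeds = self.adapter(self.qformer(visual_tokens))
            llm_inputs = torch.cat([instruction_embeds, visual_embeds], dim=1)

            # 3) The LLM consumes the joint sequence; its final hidden state drives the
            #    control-related (waypoint) and instruction-completion predictions.
            hidden = self.llm(inputs_embeds=llm_inputs).last_hidden_state[:, -1]
            return self.waypoint_head(hidden), torch.sigmoid(self.completion_head(hidden))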

The detailed structure of the vision encoder, which takes the multi-view multi-modal sensor data as input:

  1. In the pre-training stage, the vision encoder is appended with prediction heads to perform the pre-training tasks (object detection, traffic light status classification, and future waypoint prediction).
  2. In the instruction-finetuning and inference stages, the prediction heads are discarded, and the frozen vision encoder generates visual tokens that are fed into the LLM; a short sketch of the two stages follows the list.
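
A short sketch of the two stages, assuming toy placeholder modules (the real encoder, heads, and training loop are of course far more involved):

    import torch
    import torch.nn as nn

    # Toy placeholders standing in for the real encoder and prediction heads (assumptions).
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))
    pretrain_heads = nn.ModuleDict({
        "detection": nn.Linear(256, 10),      # object detection head (illustrative)
        "traffic_light": nn.Linear(256, 3),   # traffic light status classification
        "waypoints": nn.Linear(256, 2),       # future waypoint prediction
    })

    # Stage 1: train the encoder and heads jointly on the pre-training tasks
    # (the supervised training loop over the dataset is omitted here).
    optimizer = torch.optim.AdamW(
        list(encoder.parameters()) + list(pretrain_heads.parameters()), lr=1e-4)

    # Stage 2: discard the heads and freeze the encoder; only its visual tokens
    # are passed to the LLM during instruction finetuning and inference.
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()

    with torch.no_grad():
        dummy_frame = torch.randn(1, 3, 64, 64)
        visual_tokens = encoder(dummy_frame)  # fed to the Q-Former / LLM downstream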

LangAuto Benchmark & Performance

The LangAuto Benchmark is the first to evaluate closed-loop driving with language instructions in CARLA. It differs from previous benchmarks like Town05 and Longest6 by using natural language instructions instead of discrete commands or waypoints.

Features

  • Uses natural language to guide the vehicle to the destination, incorporating notice instructions for enhanced safety.
  • Covers all 8 towns in CARLA, featuring various scenarios (highways, intersections, roundabouts) and 16 environmental conditions, including 7 weather and 3 daylight conditions.
  • Supports different tracks, providing a diverse range of driving challenges and scenarios.
  • About 5% of the instructions are intentionally misleading, lasting 1-2 seconds. The agent must identify and ignore these instructions for safe navigation.

Tracks

  • LangAuto Track: Navigation instructions are updated based on the agent's position. Includes three sub-tracks (Tiny/Short/Long) for different route lengths; see the evaluation-loop sketch after this list.
  • LangAuto-Notice Track: Adds notice instructions to simulate real-time assistance in complex scenarios.
  • LangAuto-Sequential Track: Combines consecutive instructions into a single long instruction, mimicking real-world navigation software.
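
To make the closed-loop protocol concrete, the sketch below shows how a LangAuto-style evaluation loop might update the active navigation instruction from the agent's position and occasionally inject a misleading instruction. Every name here (run_langauto_episode, env.agent_passed, agent.step, and so on) is hypothetical; this is not the benchmark's actual API.

    import random

    def run_langauto_episode(env, agent, route_instructions, misleading_pool,
                             misleading_ratio=0.05, max_steps=2000):
        """Hypothetical closed-loop evaluation sketch for a LangAuto-style track.

        route_instructions: list of (trigger_position, instruction_text) pairs;
        the active instruction advances once the agent passes each trigger point.
        """
        obs = env.reset()
        idx = 0
        for _ in range(max_steps):
            # Update the navigation instruction based on the agent's position.
            if idx + 1 < len(route_instructions) and \
                    env.agent_passed(route_instructions[idx + 1][0]):
                idx += 1
            instruction = route_instructions[idx][1]

            # With small probability, swap in a misleading instruction that the
            # agent is expected to recognize and ignore (about 5% in LangAuto).
            if random.random() < misleading_ratio:
                instruction = random.choice(misleading_pool)

            control = agent.step(obs, instruction)  # sensors + language in, control out
            obs, done, info = env.step(control)     # closed loop: actions change the world
            if done:
                break
        return env.compute_metrics()                # e.g. route completion, infractions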

Performance

Demos

Demo cases (videos available on the project page):

  • Navigation instruction
  • Navigation instruction (with distance)
  • Misleading instruction
  • Notice instruction

BibTeX


        @misc{shao2023lmdrive,
              title={LMDrive: Closed-Loop End-to-End Driving with Large Language Models}, 
              author={Hao Shao and Yuxuan Hu and Letian Wang and Steven L. Waslander and Yu Liu and Hongsheng Li},
              year={2023},
              eprint={2312.07488},
              archivePrefix={arXiv},
              primaryClass={cs.CV}
        }