DriveGPT: “The ChatGPT for Autonomous Driving Car”
June 2023
tl;dr: The ChatGPT for Autonomous Driving Car
Overall impression
- DriveGPT, a "drive language model", was announced by Haomo [1], an autonomous driving startup in China. It is based on OpenAI’s GPT architecture and uses a customized decoder block.
- I suspect DriveGPT drew inspiration from Pix2Seq [2], since Haomo’s tokenization approach resembles it, particularly Pix2Seq V2 [3].
- I would like to highlight that the planning module can be divided into three sub-modules: global planning, behavior planning, and local planning. Let me provide more details on each:
  - Global Planning: This high-level planning component determines the optimal route to the final destination. For example, we often use Google Maps to find the global path.
  - Behavior Planning: This mid-level planning component decides, in real time, the high-level actions needed to handle dynamic obstacles. For instance, actions may involve lane switching, turning, or stopping.
  - Local Planning: This low-level component optimizes the motion within the current scene to carry out the action chosen by behavior planning.
- DriveGPT primarily focuses on behavior planning and local planning.
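To make the three planning levels above concrete, here is a minimal sketch of how they could hand results to each other. Everything in it is an illustrative assumption and not Haomo’s design: the `Behavior` enum, the function names, the breadth-first route search, the 30 m safe gap, and the straight-line waypoints are all hypothetical.

```python
from collections import deque
from enum import Enum


class Behavior(Enum):
    """Hypothetical set of high-level actions for behavior planning."""
    KEEP_LANE = "keep_lane"
    CHANGE_LANE_LEFT = "change_lane_left"
    STOP = "stop"


def global_plan(road_graph, start, goal):
    """Global planning: fewest-hops route over a road graph (plain BFS)."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in road_graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no route found


def behavior_plan(gap_to_obstacle_m, left_lane_free, safe_gap_m=30.0):
    """Behavior planning: pick a high-level action for the current scene."""
    if gap_to_obstacle_m >= safe_gap_m:
        return Behavior.KEEP_LANE
    return Behavior.CHANGE_LANE_LEFT if left_lane_free else Behavior.STOP


def local_plan(behavior, ego_xy, lane_width_m=3.5, steps=4, step_m=5.0):
    """Local planning: waypoints (x forward, y to the left) realizing the action."""
    x0, y0 = ego_xy
    if behavior is Behavior.STOP:
        return [(x0, y0)]
    lateral = lane_width_m if behavior is Behavior.CHANGE_LANE_LEFT else 0.0
    return [(x0 + i * step_m, y0 + lateral * i / steps) for i in range(1, steps + 1)]


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(global_plan(graph, "A", "D"))
    action = behavior_plan(gap_to_obstacle_m=10.0, left_lane_free=True)
    print(action, local_plan(action, (0.0, 0.0)))
```

In this framing, DriveGPT would replace roughly the last two functions, while route selection stays outside the model.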
Key ideas
- Drive Language Tokens: These tokens represent perception signals such as object detection, object size, lane coordinates, etc., converted into token vectors.
  - In the first scene (left image), for example, there are 5 dynamic cars, each with its own coordinates, 3 lane locations, and the ego location. The token vector should then list all of these coordinates in sequence.
- DriveGPT: Haomo has switched entirely to a decoder-only Transformer architecture.

- Human Feedback Loop: Similar to OpenAI’s GPT models, Haomo uses a human feedback loop to rank and reward the token sequences generated by the model. In Haomo’s approach, the human driver’s plan typically receives the highest score in the ranking.
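The Drive Language Tokens idea above can be illustrated with a Pix2Seq-style quantization, where each continuous coordinate is mapped to one of a fixed number of integer bins. This is only a sketch under assumptions: Haomo has not published its tokenizer, and the function names, the 100 m coordinate range, the 1000 bins, and the treatment of lanes as single (x, y) points are all hypothetical.

```python
def quantize(value, lo, hi, num_bins=1000):
    """Map a continuous value in [lo, hi] to an integer bin in [0, num_bins - 1]."""
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * (num_bins - 1))


def scene_to_tokens(cars, lanes, ego,
                    x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), num_bins=1000):
    """Flatten a scene into one token sequence: cars, then lanes, then ego.

    Each (x, y) point contributes two tokens, one per axis.
    """
    tokens = []
    for x, y in [*cars, *lanes, ego]:
        tokens.append(quantize(x, x_range[0], x_range[1], num_bins))
        tokens.append(quantize(y, y_range[0], y_range[1], num_bins))
    return tokens


if __name__ == "__main__":
    # Made-up coordinates in meters, relative to the ego vehicle.
    cars = [(12.0, 3.5), (-8.0, 0.0), (30.0, -3.5), (5.0, 7.0), (-20.0, 3.5)]
    lanes = [(0.0, -3.5), (0.0, 0.0), (0.0, 3.5)]
    ego = (0.0, 0.0)
    print(scene_to_tokens(cars, lanes, ego))
```

With 5 cars, 3 lanes, and the ego position, this layout yields 18 tokens per scene; a real system would likely add class, size, and separator tokens as well.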
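For the decoder-only architecture mentioned above, the two defining ingredients are a causal attention mask and token-by-token generation. The sketch below shows both in plain Python. It is a generic illustration, not Haomo’s model: `toy_model` is a hypothetical stand-in that simply predicts "previous token + 1", and `greedy_decode` assumes the model is any callable that returns one score per vocabulary entry.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def greedy_decode(next_token_logits, prompt, max_new_tokens, eos_token=None):
    """Autoregressive generation: feed the sequence so far, append the argmax token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        nxt = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(nxt)
        if nxt == eos_token:
            break
    return tokens


def toy_model(tokens, vocab_size=8):
    """Hypothetical stand-in for a trained decoder: favors (last token + 1)."""
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]


if __name__ == "__main__":
    for row in causal_mask(3):
        print(row)
    print(greedy_decode(toy_model, prompt=[2], max_new_tokens=3))
```

Because each new token requires another forward pass, the length of the generated plan directly drives inference latency.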
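One common way to turn a human ranking into a training signal is the pairwise ranking loss used for reward models in RLHF: for every pair of candidates, the higher-ranked one should get the higher score. Whether Haomo trains its reward model exactly this way is an assumption; the sketch below only shows the general technique, and `pairwise_ranking_loss` is a hypothetical name.

```python
import math
from itertools import combinations


def neg_log_sigmoid(d):
    """Numerically stable -log(sigmoid(d)) = log(1 + exp(-d))."""
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def pairwise_ranking_loss(scores_best_to_worst):
    """Mean of -log(sigmoid(s_better - s_worse)) over all ranked pairs.

    `scores_best_to_worst` holds reward-model scores for candidate plans,
    ordered by the human ranking (e.g. the human driver's plan first).
    """
    pairs = list(combinations(scores_best_to_worst, 2))
    return sum(neg_log_sigmoid(better - worse) for better, worse in pairs) / len(pairs)


if __name__ == "__main__":
    agrees = pairwise_ranking_loss([3.0, 1.0, 0.0])     # scores match the ranking
    disagrees = pairwise_ranking_loss([0.0, 1.0, 3.0])  # scores contradict it
    print(agrees, disagrees)
```

The loss is small when the reward model’s scores agree with the human ranking and grows when they contradict it.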
My Thoughts
From my perspective, Haomo’s approach appears highly theoretical and difficult to ship in a real product. First, I am not sure they have hardware powerful enough to run DriveGPT inference in real time (at least 20 FPS); even ChatGPT takes a noticeable amount of time to generate a full response. Second, while ChatGPT is widely regarded as a useful assistant, its answers are not always accurate and occasionally include fabrications. It is therefore hard to trust that local plans produced by DriveGPT would never endanger the people in the vehicle. Safety is the paramount concern in autonomous driving, and I believe Haomo should consider this carefully before deploying in the real world.
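The real-time concern can be made concrete with a quick budget calculation. At 20 FPS the whole pipeline has 50 ms per frame, and an autoregressive model must fit every generated token inside that window. The number of tokens per plan below is a made-up figure for illustration, since Haomo has not published it.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def per_token_budget_ms(fps, tokens_per_plan):
    """Upper bound on time per generated token if one plan is decoded per frame."""
    return frame_budget_ms(fps) / tokens_per_plan


if __name__ == "__main__":
    print(frame_budget_ms(20))          # 50.0 ms per frame
    print(per_token_budget_ms(20, 50))  # 1.0 ms per token for a hypothetical 50-token plan
```

This bound ignores perception and tokenization time, so the actual budget per token would be tighter.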
References
[1] Haomo 8th AI Day
[2] Pix2seq: A Language Modeling Framework for Object Detection
[3] A Unified Sequence Interface for Vision Tasks