datvuthanh.github.io

Paper Notes

This repository contains my paper reading notes on deep learning and machine learning. It is inspired by Denny Britz, Daniel Takeshi and especially Patrick Langechuan Liu.

About Me

My name is Dat Vu, and I currently lead the AI Team at PhenikaaX, a rapidly growing company building autonomous and industrial robots, where I serve as Computer Vision Leader. I am passionate about finding answers to complex questions, and I take great joy in exploring and deeply understanding mathematical concepts. You can see my publications here.

My ML/DL Notes

You can read my notes here.

2023-11

OpenCalib: A Multi-sensor Calibration Toolbox for Autonomous Driving [Guohang Yan]

2023-10

2023-09

Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [Aditya Prakash]

2023-08

PaLM-E: An Embodied Multimodal Language Model [Danny Driess]
Vision-Language Models for Vision Tasks: A Survey [Jingyi Zhang]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [Anthony Brohan]
RT-1: Robotics Transformer for Real-World Control at Scale [Anthony Brohan]

2023-07

VN-Transformer: Rotation-Equivariant Attention for Vector Neurons [Serge Assaad]
LightGlue: Local Feature Matching at Light Speed [Philipp Lindenberger]
Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline NeurIPS 2022 [Penghao Wu]
OpenLane-V2: A Topology Reasoning Benchmark for Scene Understanding in Autonomous Driving [Huijie Wang]
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge NeurIPS 2022 [Linxi Fan]
Riemannian Score-Based Generative Modelling NeurIPS 2022 [Valentin De Bortoli]
Gradient Descent: The Ultimate Optimizer NeurIPS 2022 [Kartik Chandra]
Llama 2: Open Foundation and Fine-Tuned Chat Models [Hugo Touvron]
Dilated Neighborhood Attention Transformer [Ali Hassani]
Neighborhood Attention Transformer CVPR 2023 [Ali Hassani]
PlanT: Explainable Planning Transformers via Object-Level Representations CoRL 2022 [Katrin Renz]
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving CVPR 2023 [Xiaosong Jia]
Scaling Self-Supervised End-to-End Driving with Multi-View Attention Learning [Yi Xiao]
Gato: A Generalist Agent TMLR 2022 [Scott Reed]
Bytes Are All You Need: Transformers Operating Directly On File Bytes [Maxwell Horton]

2023-06

Vision Transformer with Deformable Attention [Zhuofan Xia]
Planning-oriented Autonomous Driving CVPR 2023 Best Paper [Yihan Hu]
MetaFormer Is Actually What You Need for Vision [Weihao Yu]
A Unified Sequence Interface for Vision Tasks [Ting Chen]
Pix2seq: A Language Modeling Framework for Object Detection [Ting Chen]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [Tri Dao]
Better plain ViT baselines for ImageNet-1k [Lucas Beyer]
FUTR3D: A Unified Sensor Fusion Framework for 3D Detection [Xuanyao Chen]
Iterative Deep Homography Estimation CVPR 2022 [Si-Yuan Cao]
Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer CVPR 2023 [Si-Yuan Cao]
QLoRA: Efficient Finetuning of Quantized LLMs [Tim Dettmers]
LoRA: Low-Rank Adaptation of Large Language Models [Notes] [Edward Hu]
Occupancy Networks: Learning 3D Reconstruction in Function Space CVPR 2019 [Andreas Geiger]
Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving [Occupancy Network, Hang Zhao]

2023-05

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [Occupancy Network, Yi Wei, Jiwen Lu]
LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [Homography Estimation]
Transformer Feed-Forward Layers Are Key-Value Memories EMNLP 2021
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction [Occupancy Network, PhiGent]

2023-04

ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries
SAM: Segment Anything [FAIR]
BEV-LaneDet: a Simple and Effective 3D Lane Detection Baseline CVPR 2023 [BEVNet]
BEVSegFormer: Bird’s Eye View Semantic Segmentation From Arbitrary Camera Rigs [BEVNet]

2023-03

VAD: Vectorized Scene Representation for Efficient Autonomous Driving [Horizon]
BEVPoolv2: A Cutting-edge Implementation of BEVDet Toward Deployment [BEVDet, PhiGent]
Differentiable Raycasting for Self-supervised Occupancy Forecasting
End-to-end Interpretable Neural Motion Planner
Safe Local Motion Planning with Self-Supervised Freespace Forecasting
DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries CoRL 2021 [BEVNet, transformers]

2023-02

TPVFormer: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction CVPR 2023 [Occupancy Network, Jiwen Lu]
ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning ECCV 2022 [Hongyang Li]
BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers ECCV 2022 [BEVNet, Hongyang Li, Jifeng Dai]
BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection [BEVNet]
BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [Jiwen Lu, BEVNet, perception + prediction]
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation [BEVNet, Song Han]
PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark [BEVNet, lane line]
VectorMapNet: End-to-end Vectorized HD Map Learning [BEVNet, LLD, Hang Zhao]
PETR: Position Embedding Transformation for Multi-View 3D Object Detection ECCV 2022 [BEVNet]
PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images [BEVNet, Megvii]
M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [BEVNet, NVIDIA]
BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection [BEVNet, nuScenes SOTA, Megvii]
CVT: Cross-view Transformers for real-time Map-view Semantic Segmentation CVPR 2022 oral [UT Austin, Philipp]
Wayformer: Motion Forecasting via Simple & Efficient Attention Networks [Behavior prediction, Waymo]
HDMapNet: An Online HD Map Construction and Evaluation Framework CVPR 2021 workshop [YouTube video only, Li Auto]
FIERY: Future Instance Prediction in Bird’s-Eye View from Surround Monocular Cameras ICCV 2021 [BEVNet, perception + prediction]