Hi! I'm Pranav, a senior in ECE at BITS Pilani, Goa, India. I'm interested in 3D Computer Vision, 3D Scene Understanding, and Multi-Modal Generative Modelling. Currently, I'm at Carnegie Mellon University's Robotics Institute, pursuing my undergraduate thesis under the supervision of Dr. Wenshan Wang at AirLab and Dr. Ji Zhang. Here, I have worked on Vision Language Navigation in outdoor scenarios, and I am currently working on 3D Object Representation Learning.
Since my sophomore year, I have also been working remotely at MARMoT Lab, NUS, under the supervision of Prof. Guillaume Sartoretti. Here, I have worked on 3D Scene Graph Generation using VLMs, a Next-Best-View Policy for NeRFs (co-supervised by Prof. Marija Popović), and Open-Vocabulary Gaussian Splatting.
At my university, I am grateful to have worked with Prof. Neena Goveas, Prof. Naveen Gupta, Prof. Sarang Dhongdi, and Prof. Tanmay Tulsidas Verlekar.
Beyond academics, I enjoy watching series and movies, travelling, and playing sports. I also love tinkering with drones, and I founded an autonomous drone team in my sophomore year ;)
UAV-VLN: End-to-End Vision Language guided Navigation for UAVs
Pranav Saxena, Nishant Raghuvanshi, Neena Goveas
ECMR 2025  (Oral Presentation)
arXiv
UAV-VLN leverages the common-sense reasoning of LLMs and a vision model for cross-modal grounding to plan context-aware aerial trajectories from free-form natural language instructions.
LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
Pranav Saxena, Avigyan Bhattacharya, Ji Zhang, Wenshan Wang
Human-aware Embodied AI Workshop @ IROS 2025  
arXiv, Code
A zero-shot hybrid pipeline that leverages Large Language Models for symbolic reasoning and Vision-Language Models for fine-grained attribute extraction to perform robust referential grounding in challenging outdoor driving scenarios.
ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models
Pranav Saxena, Jimmy Chiun
Preprint  
arXiv
A zero-shot framework for incremental 3D scene graph generation using vision-language models, enabling open-vocabulary reasoning and geometric grounding in embodied environments.
Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression
Pranav Saxena
Preprint  
arXiv, Code
A generalized 3D Gaussian Splatting framework that eliminates per-scene training and improves cross-scene generalization via a ScanNet-trained autoencoder for CLIP feature compression. This work was done in collaboration with NUS and TU Delft.
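As a rough illustration of the feature-compression idea: the actual system uses an autoencoder trained on ScanNet to compress CLIP features, but a closed-form PCA codec (the optimal *linear* autoencoder for a fixed code dimension) shows the same compress/decompress pattern. All names here are mine:

```python
import numpy as np

def fit_pca_codec(feats, code_dim):
    """Fit a linear encoder/decoder pair via PCA on a feature matrix (n, d)."""
    mean = feats.mean(axis=0)
    _, _, vt = np.linalg.svd(feats - mean, full_matrices=False)
    basis = vt[:code_dim]                         # top principal directions
    encode = lambda x: (x - mean) @ basis.T       # d -> code_dim
    decode = lambda z: z @ basis + mean           # code_dim -> d
    return encode, decode
```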
3D Object Representation Learning
Carnegie Mellon University
Developing a generalized 3D encoder for representing 3D objects. The learned embeddings can support various downstream tasks such as 3D Object Segmentation, 3D Shape Completion, and 3D Captioning.
Next-Best-View Policy for NeRFs
National University of Singapore, TU Delft, Purdue University
Code
Developed an attention-based Next-Best-View (NBV) policy for Neural Radiance Fields that selects the most informative viewpoints for 3D reconstruction. Used a Vision Transformer to encode multi-view image embeddings and candidate poses and predict the PSNR gain (ΔPSNR) of each candidate, supervised with ground-truth values computed using pixelNeRF.
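As a minimal sketch of how a ΔPSNR supervision target of this kind can be computed (function names here are illustrative; in the actual pipeline the renders come from pixelNeRF):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between a rendered and a ground-truth image."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val**2 / mse)

def delta_psnr_target(render_before, render_after, gt):
    """Supervision target: PSNR gain obtained by adding the candidate view."""
    return psnr(render_after, gt) - psnr(render_before, gt)
```

A positive target means the candidate view improved the reconstruction of the held-out ground-truth image.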
3D Human Pose Estimation using WiFi signals
BITS Pilani, Goa
Report
Developed a two-stage learning framework for 3D human pose estimation from WiFi signals, incorporating phase sanitization to improve CSI quality. Validated on the MM-Fi dataset, achieving improved accuracy and robustness over raw CSI inputs.
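CSI phase sanitization commonly removes the linear phase slope and constant offset across subcarriers (caused by timing and frequency offsets), keeping only the nonlinear residual. A minimal sketch, assuming a 1-D complex CSI vector over subcarriers (names are mine):

```python
import numpy as np

def sanitize_phase(csi):
    """Remove linear slope and mean offset from the unwrapped CSI phase."""
    phase = np.unwrap(np.angle(csi))          # undo 2*pi wrap-arounds
    idx = np.arange(phase.size)
    slope = (phase[-1] - phase[0]) / (idx[-1] - idx[0])
    detrended = phase - slope * idx           # remove linear trend
    return detrended - detrended.mean()       # remove constant offset
```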
Swarm Bots
BITS Pilani, Goa
Poster
Developed an algorithm for swarm robots that combines the Artificial Potential Field (APF) and Reciprocal Velocity Obstacles (RVO) methods to overcome the shortcomings of each.
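The APF half of such a combination can be sketched as an attractive pull toward the goal plus repulsive pushes from nearby obstacles (a standard formulation; the gains and names here are illustrative, and the RVO side is omitted):

```python
import numpy as np

def apf_velocity(pos, goal, obstacles, k_att=1.0, k_rep=0.5, influence=2.0):
    """Commanded velocity from an artificial potential field."""
    v = k_att * (goal - pos)                  # attraction toward the goal
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 0 < d < influence:
            # Repulsion grows steeply as the robot nears the obstacle.
            v += k_rep * (1.0 / d - 1.0 / influence) * diff / d**3
    return v
```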
Stock Price Prediction using Optimal Stopping Theory
BITS Pilani, Goa
Code, Report
Course project for MATH F424 (Applied Stochastic Process). Developed a stock price prediction model using Optimal Stopping Theory and Geometric Brownian Motion to maximize returns by determining the optimal time to sell a stock. Backtested the model on 10 years of historical NSE stock data.
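A minimal sketch of the GBM side, assuming daily prices: estimate drift and volatility from log returns, simulate price paths, and score a simple threshold selling rule (a stand-in for the full optimal-stopping analysis; all names are mine):

```python
import numpy as np

def estimate_gbm_params(prices, dt=1 / 252):
    """Estimate GBM drift mu and volatility sigma from a price series."""
    log_ret = np.diff(np.log(prices))
    sigma = log_ret.std(ddof=1) / np.sqrt(dt)
    mu = log_ret.mean() / dt + 0.5 * sigma**2
    return mu, sigma

def simulate_gbm(s0, mu, sigma, horizon, n_paths, dt=1 / 252, seed=0):
    """Simulate GBM paths; returns an (n_paths, n_steps) array of prices."""
    rng = np.random.default_rng(seed)
    n_steps = round(horizon / dt)
    z = rng.standard_normal((n_paths, n_steps))
    increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(increments, axis=1))

def threshold_stopping_value(paths, threshold):
    """Expected sale price under: sell when the price first reaches the
    threshold, otherwise sell at the horizon."""
    payoffs = []
    for path in paths:
        hit = np.nonzero(path >= threshold)[0]
        payoffs.append(path[hit[0]] if hit.size else path[-1])
    return float(np.mean(payoffs))
```

Sweeping the threshold over simulated paths gives a crude approximation to the optimal stopping boundary.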