iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning (2025)
by Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, and Abhishek Aich
Abstract:
Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues (object pose, lane positions, and object trajectories), which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM’s outputs and provide accurate reasoning. Evaluations on four public zero-shot dash-cam video benchmarks show that iFinder’s proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
Download Information
Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, and Abhishek Aich (2025). "iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning." Advances in Neural Information Processing Systems, vol 38.
Bibtex citation
@inproceedings{Yaoetal25,
author = "Manyi Yao and Bingbing Zhuang and Sparsh Garg and Amit Roy-Chowdhury and Christian Shelton and Manmohan Chandraker and Abhishek Aich",
title = "{iFinder}: Structured Zero-Shot Vision-Based {LLM} Grounding for Dash-Cam Video Reasoning",
booktitle = "Advances in Neural Information Processing Systems",
booktitleabbr = "NeurIPS",
year = 2025,
volume = 38,
}