Spatial Reasoning in Multimodal LLMs via CoT Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection
Abstract
In this exploratory work, we investigate the spatial reasoning capabilities of state-of-the-art Multimodal Large Language Models (MLLMs) in the context of building renovation. We evaluate models with different capacities, including GPT-4o, Qwen, Mulberry, and SpatialRGPT, on a few-shot dataset of Dutch residential buildings, assessing their performance on visual reasoning tasks. To this end, we propose DuTCh SpaCE, a novel application of Chain-of-Thought distillation based on a Dual-Teacher Framework that leverages both step-by-step rationales and scene graph description augmentation to guide and assess student models' performance on spatial reasoning. This structured supervision enables iterative refinement of spatial reasoning through domain-task decomposition and scene understanding. Additionally, we integrate Monte Carlo Tree Search at inference time to improve reasoning-path selection under visual uncertainty. By combining distillation and MCTS, we observe a measurable reduction in hallucinations, with models generating more grounded and verifiable predictions. Our findings demonstrate that reasoning-enhanced models can compensate for limited visual grounding even without scene graph augmentation, offering a scalable path toward spatially-aware MLLMs in low-resource, domain-specific settings.
Live Demo: Results
Thesis Presentation (PDF)
Project Overview
This project investigates the potential of state-of-the-art Multimodal Large Language Models (MLLMs) to identify architectural features relevant to energy-efficient building renovation in Dutch residential facades.
We propose DuTCh SpaCE, an Agentic AI Framework combining Dual-Teacher Chain-of-Thought Distillation with Monte Carlo Tree Search (CoMCTS), enabling AI agents to reason step-by-step, self-correct, and improve spatial predictions in zero-shot and few-shot setups.
Our evaluation compares reasoning-based and spatially-grounded MLLM architectures, exploring their trade-offs in domain-specific visual tasks.
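
For concreteness, the sketch below shows how a dual-teacher supervision record could be assembled: one teacher contributes a step-by-step rationale, the other a scene graph description, and both are attached to the gold label used to supervise the student. The function names (`query_rationale_teacher`, `query_scene_graph_teacher`) and their returned strings are hypothetical placeholders, not the project's actual pipeline.

```python
# Minimal sketch of assembling a dual-teacher distillation record.
# `query_rationale_teacher` and `query_scene_graph_teacher` are hypothetical
# placeholders for the two teacher MLLM calls; they are not the project's API.
from dataclasses import dataclass, asdict
import json


@dataclass
class DistillationRecord:
    image_path: str
    question: str
    rationale: str     # step-by-step CoT from the rationale teacher
    scene_graph: str   # textual scene-graph description from the second teacher
    answer: str        # gold label: yes / no / unknown / count


def query_rationale_teacher(image_path: str, question: str) -> str:
    # Placeholder: in practice this would call a large teacher MLLM
    # and return its step-by-step reasoning about the facade.
    return "1. Locate the roofline. 2. Check for a raised box with a window. 3. A dormer is present."


def query_scene_graph_teacher(image_path: str) -> str:
    # Placeholder: a grounded teacher describing objects and spatial relations.
    return "roof -- contains --> dormer; dormer -- has --> window"


def build_record(image_path: str, question: str, answer: str) -> DistillationRecord:
    return DistillationRecord(
        image_path=image_path,
        question=question,
        rationale=query_rationale_teacher(image_path, question),
        scene_graph=query_scene_graph_teacher(image_path),
        answer=answer,
    )


if __name__ == "__main__":
    record = build_record("facades/house_012.jpg", "Does this facade have a dormer?", "yes")
    print(json.dumps(asdict(record), indent=2))
```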
Motivation
- Problem: Manual facade assessments are costly and slow.
- Gap: Current vision models lack domain-specific architectural knowledge.
- Opportunity: MLLMs can leverage contextual reasoning for zero-shot detection of renovation-relevant features.
Target Features
Grouped by reasoning complexity (an illustrative encoding follows the list):
- Visual Recognition: weep holes, chimneys, balconies.
- Geometric Inference: dormers, pitched roofs, window counts.
- Semantic Understanding: attics as living spaces, ventilation systems.
- Context Analysis: vegetation growth, photovoltaic panels.
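
As a rough illustration, the taxonomy and the per-feature label schema (yes/no/unknown or a count, as described under Methodology) can be expressed as plain Python structures. The feature keys mirror the bullets above; the example annotation is made up purely for illustration.

```python
# Illustrative encoding of the feature taxonomy and one annotation record.
# Feature names mirror the bullets above; the label schema (yes/no/unknown or a count)
# follows the annotation format described in the Methodology section.
FEATURE_GROUPS = {
    "visual_recognition": ["weep_holes", "chimney", "balcony"],
    "geometric_inference": ["dormer", "pitched_roof", "window_count"],
    "semantic_understanding": ["attic_as_living_space", "ventilation_system"],
    "context_analysis": ["vegetation_growth", "photovoltaic_panels"],
}

# Hypothetical annotation for a single facade image (not real data).
example_annotation = {
    "image": "facades/house_012.jpg",
    "labels": {
        "chimney": "yes",
        "balcony": "no",
        "dormer": "yes",
        "window_count": 6,              # count-valued feature
        "ventilation_system": "unknown",
    },
}
```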
Research Questions
Main RQ
Are SoTA Multimodal LLMs beneficial for identifying housing renovation concepts on Dutch building facades?
- RQ1: How does CoT reasoning (Qwen) compare with 3D scene graph methods (SpatialRGPT) in zero-shot settings?
  - Performance vs. GPT-4o.
  - Effect of bounding box guidance.
- RQ2: How can CoT-reasoning MLLMs be enhanced for spatial recognition?
  - Scene graph augmentation.
  - LoRA fine-tuning (a minimal adapter setup is sketched after this list).
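
As a rough sketch of the RQ2 fine-tuning setup, attaching LoRA adapters to the Qwen2-VL student could look like the snippet below. It assumes a recent Hugging Face transformers release that ships the Qwen2-VL classes and the peft library; the rank, alpha, and target-module choices are illustrative defaults, not the settings used in the thesis.

```python
# Illustrative LoRA adapter setup for the Qwen2-VL student (RQ2).
# Assumes recent Hugging Face transformers (with Qwen2-VL support) and peft;
# hyperparameters are illustrative, not the thesis configuration.
import torch
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```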
Contributions
- First systematic MLLM evaluation on real Dutch facade data.
- DuTCh SpaCE: Agentic Dual-Teacher CoT distillation to reduce hallucinations.
- Reasoning vs. grounding trade-off analysis.
- LoRA + Distillation + Test-time Search pipeline for low-data domains.
- Agentic reasoning integration: enabling AI agents to autonomously generate, critique, and refine reasoning steps collectively.
Methodology
- Models under study: GPT-4o, SpatialRGPT (base/bbox), Qwen2-7B-VL.
- Data: 45 curated Dutch facade images with feature annotations (yes/no/unknown/count).
- Evaluation: zero-shot and LoRA fine-tuning, 10 runs per configuration; metrics: accuracy, balanced accuracy, F1, MAE/MSE.
- Agentic inference: Student MLLMs act as autonomous reasoning agents, generating, critiquing, and selecting the most accurate reasoning paths via MCTS (sketched below).
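
The sketch below illustrates the test-time search idea in a simplified, single-agent form: candidate reasoning steps are expanded into a tree, scored, and the most-visited branch is kept. `propose_steps` and `verifier_score` are hypothetical stand-ins for the student model's step generator and the critique signal; this is not the CoMCTS implementation itself.

```python
# Simplified sketch of selecting a reasoning path with Monte Carlo Tree Search.
# `propose_steps` and `verifier_score` are hypothetical placeholders for the student
# MLLM's step generator and the reward/critique signal; not the CoMCTS implementation.
import math
import random


class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps            # reasoning steps taken so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def ucb(self, c=1.4):
        # Upper Confidence Bound: balance exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def propose_steps(steps):
    # Placeholder: the student MLLM would propose candidate next reasoning steps here.
    return [steps + [f"step_{len(steps)}_{i}"] for i in range(3)]


def verifier_score(steps):
    # Placeholder: a critique/verification signal scoring the (partial) reasoning path.
    return random.random()


def mcts(n_iters=50, max_depth=4):
    root = Node(steps=[])
    for _ in range(n_iters):
        node = root
        # Selection: descend by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: add candidate continuations unless the path is already complete.
        if len(node.steps) < max_depth:
            node.children = [Node(s, parent=node) for s in propose_steps(node.steps)]
            node = random.choice(node.children)
        # Evaluation: score the current reasoning path.
        reward = verifier_score(node.steps)
        # Backpropagation: update visit counts and values along the path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first-level branch as the preferred reasoning direction.
    best = max(root.children, key=lambda n: n.visits)
    return best.steps


if __name__ == "__main__":
    print(mcts())
```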
Key Findings
- GPT-4o leads across metrics.
- Bounding box guidance improves SpatialRGPT by ~15% in accuracy.
- DuTCh SpaCE reduces hallucinations significantly through agentic reasoning.
- LoRA fine-tuning narrows the gap to GPT-4o from 20% to 8%.
- Scene graph augmentation yields no improvement for Qwen.
Limitations
- Small dataset (45 images).
- Imbalanced feature presence, mitigated by reporting balanced accuracy (illustrated after this list).
- Scene graph integration underused by some architectures.
- Limited test-time search iterations.
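
As a toy illustration of the imbalance point above (made-up labels, not thesis results): a model that always answers "no" on a skewed feature scores high on plain accuracy but only 0.5 on balanced accuracy, which is why both metrics are reported.

```python
# Toy illustration (made-up labels, not thesis results) of why balanced accuracy
# is reported alongside plain accuracy on an imbalanced feature.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["no"] * 40 + ["yes"] * 5   # imbalanced ground truth
y_pred = ["no"] * 45                 # a model that always answers "no"

print(accuracy_score(y_true, y_pred))           # ~0.89, looks deceptively good
print(balanced_accuracy_score(y_true, y_pred))  # 0.5, exposes the majority-class bias
```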
Future Work
- Scale dataset to 1000+ images via scraping + filtering.
- Fine-tune multimodal encoders for vision grounding.
- Increase CoMCTS iterations and diversity for multi-agent reasoning.
- Explore RLHF strategies, multi-adapter LoRA, grounded CoT, DoRA.
Conclusion
MLLMs are beneficial for facade analysis when enhanced with agentic reasoning frameworks like DuTCh SpaCE.
Agentic reasoning can compensate for limited visual grounding, allowing domain expertise to rival raw model scale.