Spatial Reasoning in Multimodal LLMs via CoT Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection

Abstract

In this exploratory work, we investigate the spatial reasoning capabilities of state-of-the-art Multimodal Large Language Models (MLLMs) in the context of building renovation. We evaluate models with different capacities, including GPT-4o, Qwen, Mulberry, and SpatialRGPT, on a few-shot dataset of Dutch residential buildings, assessing their performance on visual reasoning tasks. To this end, we propose DuTCh SpaCE, a novel application of Chain-of-Thought Distillation built on a Dual-Teacher Framework that uses both step-by-step rationales and scene graph description augmentation to guide and assess student models' spatial reasoning. This structured supervision enables iterative refinement of spatial reasoning through domain-task decomposition and scene understanding. Additionally, we integrate Monte Carlo Tree Search (MCTS) at inference time to improve reasoning-path selection under visual uncertainty. By combining distillation and MCTS, we observe a measurable reduction in hallucinations, with models generating more grounded and verifiable predictions. Our findings indicate that reasoning-enhanced models can compensate for limited visual grounding even without scene graph augmentation, offering a scalable path toward spatially aware MLLMs in low-resource, domain-specific settings.


🔗 Live Demo: Results

Thesis Presentation (PDF)

Download Project Poster (PDF)

Project Overview

This project investigates the potential of state-of-the-art Multimodal Large Language Models (MLLMs) to identify architectural features relevant to energy-efficient building renovation in Dutch residential facades.

We propose DuTCh SpaCE — a Dual-Teacher Chain-of-Thought Distillation framework combined with Monte Carlo Tree Search (CoMCTS) — to enhance spatial reasoning and reduce hallucinations in zero-shot and few-shot setups.

Our evaluation compares reasoning-based and spatially-grounded MLLM architectures, exploring their trade-offs in domain-specific visual tasks.
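To make the tree-search component more concrete, here is a minimal sketch of the selection and backpropagation steps of a generic Monte Carlo Tree Search over reasoning steps, using the standard UCB1 (UCT) rule. This is not the project's CoMCTS implementation: the `Node` structure, the exploration constant `c`, and the reward values are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One candidate reasoning step in the search tree (hypothetical structure)."""
    step: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score: mean reward plus an exploration bonus for rarely visited steps."""
    if child.visits == 0:
        return math.inf  # unvisited steps are always tried first
    mean_reward = child.total_reward / child.visits
    # max(..., 1) guards against log(0) if the parent has not been visited yet
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return mean_reward + bonus


def select_child(parent: Node) -> Node:
    """Pick the child reasoning step with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))


def backpropagate(path: List[Node], reward: float) -> None:
    """Update visit counts and rewards along the path that was just evaluated."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

In a setup like the one described here, the reward would presumably come from checking a completed reasoning path against a teacher or a verifier; how that score is computed is left open in this sketch.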

Motivation

Problem: Manual facade assessments are costly and slow.
Gap: Current vision models lack domain-specific architectural knowledge.
Opportunity: MLLMs can leverage contextual reasoning for zero-shot detection of renovation-relevant features.

Target Features

Grouped by reasoning complexity:

Visual Recognition — weep holes, chimneys, balconies.
Geometric Inference — dormers, pitched roofs, window counts.
Semantic Understanding — attics as living spaces, ventilation systems.
Context Analysis — vegetation growth, photovoltaic panels.
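The grouping above can be written down as a small lookup structure, which is handy when building per-feature prompts for zero-shot evaluation. The sketch below is illustrative only: the feature names come from the list above, while the dictionary layout, the function names, and the prompt wording are assumptions and not the prompts used in the project.

```python
from typing import Dict, List

# Feature groups copied from the list above; the dictionary layout is an assumption.
TARGET_FEATURES: Dict[str, List[str]] = {
    "Visual Recognition": ["weep holes", "chimneys", "balconies"],
    "Geometric Inference": ["dormers", "pitched roofs", "window counts"],
    "Semantic Understanding": ["attics as living spaces", "ventilation systems"],
    "Context Analysis": ["vegetation growth", "photovoltaic panels"],
}


def category_of(feature: str) -> str:
    """Return the reasoning-complexity group that a feature belongs to."""
    for category, features in TARGET_FEATURES.items():
        if feature in features:
            return category
    raise KeyError(f"unknown feature: {feature}")


def build_zero_shot_prompt(feature: str) -> str:
    """Build a hypothetical zero-shot question for one target feature."""
    category = category_of(feature)
    return (
        f"Task type: {category}. "
        "Look at the facade image and reason step by step. "
        f"Question: does the image show evidence of {feature}? "
        "Answer 'yes' or 'no' and describe where."
    )
```

For example, `build_zero_shot_prompt("dormers")` yields a prompt tagged with the "Geometric Inference" group, so results can later be aggregated per reasoning level.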

Research Questions

Main RQ
Are SoTA Multimodal LLMs beneficial for identifying housing renovation concepts on Dutch building facades?

RQ1: How do CoT reasoning models (Qwen) compare with 3D scene graph methods (SpatialRGPT) in the zero-shot setting?
- Performance vs. GPT-4o.
- Bounding box guidance effects.
RQ2: How can CoT reasoning MLLMs be enhanced for spatial recognition?
- Scene graph augmentation.
- LoRA fine-tuning capabilities.

Contributions

Riccardo Campanella

