DuTCh Space: Spatial Reasoning in Multimodal LLMs

Chain-of-Thought Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection

🏆 Key Achievements

Chain-of-Thought Distillation

Novel dual-teacher framework with step-by-step rationales and scene graph augmentation

Reduced Hallucinations

Measurable reduction in false predictions through structured supervision

MCTS Integration

Monte Carlo Tree Search for improved reasoning-path selection

Scalable Framework

Domain-specific spatial reasoning without extensive visual grounding

🏗️ DuTCh Space Framework Architecture

Our dual-teacher framework combines Chain-of-Thought distillation with Monte Carlo Tree Search to improve spatial reasoning in architectural feature detection.

Accuracy Performance

Accuracy measures the proportion of correct predictions across all feature classifications. Our Mulberry-Qwen DCoT model reaches its highest accuracy, 0.778, on weep holes and balconies, followed by 0.667 on chimneys and vegetation and 0.556 on facades and dormers.
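As a minimal sketch of the accuracy definition above, the snippet below scores one binary feature (for example "balcony present?") over a small set of images. The function name `accuracy`, the labels, and the nine-image sample are all hypothetical and chosen for illustration only; they are not the evaluation data behind the reported numbers.

```python
from typing import Sequence


def accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Return the fraction of predictions that match the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical ground truth and predictions for one feature on 9 images.
y_true = [True, True, False, False, True, False, False, True, False]
y_pred = [True, False, False, False, True, False, True, True, False]

# 7 of the 9 predictions match, so this prints 0.778.
print(round(accuracy(y_true, y_pred), 3))
```

Because accuracy counts true negatives as correct, it can sit well above precision and recall when a feature is absent from most images.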

🎯 Our Accuracy Peak

0.778

Weep holes & Balconies

🏠 Architectural Impact

0.667

Chimneys & Vegetation

💪 Robust Detection

0.556

Facade & Dormers

📈 Scene Graph Impact

No Change

DCoT already optimal

F1-Score Performance

The F1-score is the harmonic mean of precision and recall. Our Mulberry-Qwen DCoT approach peaks at an F1 of 0.438 on balcony detection, and its strongest features are vegetation, balconies, dormers, and chimneys, supporting the use of Chain-of-Thought distillation for spatial reasoning tasks.
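The harmonic-mean definition can be checked against the balcony figures reported on this page. The sketch below assumes that the balcony precision (0.389) and recall (0.500) come from the same DCoT model as the balcony F1; the function name `f1_score` is our own and not part of any released code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balcony precision and recall as reported on this page
# (assumed to belong to the same model and test set).
balcony_precision = 0.389
balcony_recall = 0.500

# 2 * 0.389 * 0.5 / 0.889 is about 0.4376, so this prints 0.438.
print(round(f1_score(balcony_precision, balcony_recall), 3))
```

The result matches the reported balcony F1 of 0.438. The harmonic mean is also pulled toward the lower of the two inputs, which is why the F1 sits closer to the precision than to the recall.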

🎯 Our F1 Achievement

0.438

Balconies Detection

🔄 DCoT Consistency

Identical

DCoT = DCoT+Scene

📊 Strong Features

4 Top-Tier

Vegetation, Balconies, Dormers, Chimneys

🏆 Best Overall

GPT-4o

0.742 (Vegetation)

Recall Performance

Recall measures the proportion of actual positive cases that the model correctly identifies, i.e. its sensitivity. Our DCoT approach peaks at a recall of 0.500 on vegetation and balconies and holds steady at 0.333 on most other features.

🎯 Our Recall Peak

0.500

Vegetation & Balconies

📊 Consistent Pattern

0.333

Most features stable

🔄 DCoT Reliability

Identical

DCoT = DCoT+Scene

📈 Sensitivity Balance

Stable

Good feature capture

Precision Performance

Precision measures the proportion of true positives among all positive predictions, indicating how trustworthy a positive prediction is. (Specificity is a different quantity: the proportion of actual negatives correctly rejected.) Our DCoT approach reaches a peak precision of 0.389 on balcony detection.
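To make the difference between precision and recall concrete, the sketch below computes both from confusion counts for a single feature. The `Counts` class and the numbers in `example` are hypothetical, picked only so the arithmetic is easy to follow; they are not taken from the evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one binary feature (hypothetical helper)."""
    tp: int  # feature present and predicted present
    fp: int  # feature absent but predicted present
    fn: int  # feature present but predicted absent


def precision(c: Counts) -> float:
    """Share of positive predictions that are correct."""
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def recall(c: Counts) -> float:
    """Share of actual positives that the model finds."""
    actual_positive = c.tp + c.fn
    return c.tp / actual_positive if actual_positive else 0.0


# Hypothetical counts: 2 hits, 4 false alarms, 2 misses.
example = Counts(tp=2, fp=4, fn=2)

print(round(precision(example), 3))  # 2 / 6 -> 0.333
print(round(recall(example), 3))     # 2 / 4 -> 0.5
```

In this example the model finds half of the real instances, yet only a third of its positive predictions are correct: false positives lower precision without affecting recall.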

🎯 Our Precision Peak

0.389

Balconies Detection

🏗️ Low False Positives

0.333

Vegetation Growth

🔍 Reliable Features

0.222

Chimneys Detection

⚖️ Precision-Recall Balance

Stable

Consistent across features