
DuTCh SpaCE: Spatial Reasoning in Multimodal LLMs

Chain-of-Thought Distillation and Monte Carlo Tree Search for Dutch Facade-Element Detection


๐Ÿ† Key Achievements

Chain-of-Thought Distillation

Novel dual-teacher framework with step-by-step rationales and scene graph augmentation

Reduced Hallucinations

Measurable reduction in false predictions through structured supervision

MCTS Integration

Monte Carlo Tree Search for improved reasoning-path selection

Scalable Framework

Domain-specific spatial reasoning without extensive visual grounding

๐Ÿ—๏ธ DuTCh Space Framework Architecture

Our novel dual-teacher framework combining Chain-of-Thought distillation with Monte Carlo Tree Search for enhanced spatial reasoning in architectural feature detection.

DuTCh SpaCE Framework Architecture
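This page does not describe how the tree search scores candidate reasoning paths. As a rough illustration only, the sketch below shows the standard UCB1 (UCT) selection rule that many Monte Carlo Tree Search implementations use to balance exploiting high-value branches against exploring rarely visited ones. The function names, the step names, and the exploration constant are assumptions made for this example, not details of the DuTCh SpaCE framework.

```python
import math

def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score for one child node (generic MCTS, not the authors' code)."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Pick the child with the highest UCT score.

    `children` maps a hypothetical reasoning-step name to a
    (total_value, visits) tuple.
    """
    return max(children, key=lambda name: uct_score(*children[name], parent_visits))

# Hypothetical statistics for three candidate reasoning steps.
children = {"step_a": (3.0, 5), "step_b": (1.0, 1), "step_c": (0.0, 0)}
best = select_child(children, parent_visits=6)
```

In this toy example the unvisited step is selected first; once every child has been visited, the exploration term favors children with few visits relative to their parent.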

Accuracy Performance

Accuracy measures the proportion of correct predictions across all feature classifications. Our Mulberry-Qwen DCoT model is most accurate on weep holes and balconies (0.778), followed by chimneys and vegetation (0.667) and facades and dormers (0.556).
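To make the definition concrete, here is a minimal sketch of per-feature accuracy for a binary present/absent label. The labels are invented for illustration (the evaluation set is not described on this page) and were chosen so that 7 of 9 predictions are correct.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for one feature across nine images (1 = present).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0]

feature_accuracy = accuracy(y_true, y_pred)
print(round(feature_accuracy, 3))  # 0.778
```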

🎯 Our Accuracy Peak: 0.778 (Weep holes & Balconies)
🏠 Architectural impact: 0.667 (Chimneys & Vegetation)
💪 Robust Detection: 0.556 (Facade & Dormers)
📈 Scene Graph Impact: No Change (DCoT already optimal)

F1-Score Performance

The F1-score is the harmonic mean of precision and recall. Our Mulberry-Qwen DCoT approach reaches its best F1 of 0.438 on balconies, with vegetation, balconies, dormers, and chimneys as its strongest features. GPT-4o posts the best overall score, 0.742 on vegetation.
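The harmonic-mean definition can be checked against the numbers on this page. The sketch below plugs in the balcony precision (0.389) and the balcony recall (0.500) reported in the later sections, assuming both refer to the same DCoT model and feature as the 0.438 F1 figure.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balcony precision and recall as reported on this page.
balcony_f1 = f1_score(0.389, 0.500)
print(round(balcony_f1, 3))  # 0.438
```

The result agrees with the reported balcony F1, which suggests the three figures are mutually consistent.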

🎯 Our F1 Achievement: 0.438 (Balconies Detection)
🔄 DCoT Consistency: Identical (DCoT = DCoT+Scene)
📊 Strong Features: 4 Top-Tier (Vegetation, Balconies, Dormers, Chimneys)
🏆 Best Overall: GPT-4o, 0.742 (Vegetation)

Recall Performance

Recall measures the proportion of actual positive cases that the model correctly identifies, that is, its sensitivity. Our DCoT approach peaks at 0.500 on vegetation and balconies and holds at 0.333 on most other features. Adding the scene graph leaves recall unchanged.

🎯 Our Recall Peak: 0.500 (Vegetation & Balconies)
📊 Consistent Pattern: 0.333 (Most features stable)
🔄 DCoT Reliability: Identical (DCoT = DCoT+Scene)
📈 Sensitivity Balance: Stable (Good feature capture)

Precision Performance

Precision measures the proportion of positive predictions that are correct, indicating how far a positive prediction can be trusted. Our DCoT approach peaks at 0.389 on balconies, followed by 0.333 on vegetation and 0.222 on chimneys.
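As a small worked example, precision and recall can both be computed from confusion-matrix counts. The counts below are hypothetical, since this page reports only the final scores, and were picked to show how a precision of 0.333 can coexist with a recall of 0.500.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts; each falls back to 0.0 on an empty denominator."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: 3 correct detections, 6 false alarms, 3 misses.
p, r = precision_recall(tp=3, fp=6, fn=3)
print(round(p, 3), round(r, 3))  # 0.333 0.5
```

False positives lower precision only, while missed detections lower recall only, which is why the two metrics are reported separately.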

🎯 Our Precision Peak: 0.389 (Balconies Detection)
🏗️ Low False Positives: 0.333 (Vegetation Growth)
Reliable Features: 0.222 (Chimneys Detection)
⚖️ Precision-Recall Balance: Stable (Consistent across features)