Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Harvard University

tl;dr

[Animated demo of object-anagram pairs]
We isolate configural shape perception using visual anagrams by holding local parts constant, varying global arrangement, and testing whether vision models perceive shape as more than just a bag of parts. We find that language-aligned and self-supervised ViTs show sensitivity to this holistic shape structure. Surprisingly, this capacity is not fully predicted by ImageNet accuracy or shape-vs-texture bias. Finally, we perform mechanistic interventions to reveal that long-range contextual interactions in intermediate processing stages support this emergent capacity for holistic object recognition.

Abstract

Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models rely primarily on local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture; this ignores the possibility that models (and humans) can simultaneously rely on both types of cues, and obscures the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting the global arrangement of parts to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity, with fully self-supervised and language-aligned transformers – exemplified by DINOv2, SigLIP2, and EVA-CLIP – occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance, showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. (iv) A BagNet control, whose receptive fields straddle patch seams, remains at chance, ruling out any "border-hacking" strategy. Finally, (v) we show that the Configural Shape Score also predicts other shape-dependent evaluations (e.g., foreground bias, spectral and noise robustness). Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local texture and global configural shape.

Object Anagrams

[Animated demos of three object-anagram pairs]
To probe whether vision models can perceive shape as more than just a collection of parts, we use Object Anagrams — pairs of images that contain the same local features (i.e., identical puzzle pieces) arranged in different global configurations. These images were generated using the multi-view denoising pipeline introduced by Geng et al., which constructs image pairs by recombining the same 16 parts sampled from a pretrained diffusion model. Using this method, we generated a dataset of 72 object-anagram pairs spanning 9 object categories, ensuring that each pair contains two distinct, recognizable shapes constructed from the same parts.
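
As a toy illustration only (not the multi-view diffusion pipeline itself), the sketch below shows the core manipulation in PyTorch: cutting an image into a fixed grid of pieces and permuting their positions, so local content is held constant while the global arrangement changes. The function name and grid size are ours for illustration.

import torch

def permute_parts(image, grid=4, generator=None):
    """Toy illustration of the anagram idea: cut an image into a grid x grid
    set of square pieces and shuffle their positions, preserving local content
    while changing the global arrangement. The actual stimuli are generated
    with Geng et al.'s multi-view diffusion pipeline, not by naive shuffling;
    this only makes the manipulation concrete."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    # unfold into grid*grid pieces of shape (c, ph, pw)
    pieces = image.unfold(1, ph, ph).unfold(2, pw, pw)           # (c, g, g, ph, pw)
    pieces = pieces.permute(1, 2, 0, 3, 4).reshape(grid * grid, c, ph, pw)
    perm = torch.randperm(grid * grid, generator=generator)      # new arrangement
    shuffled = pieces[perm].reshape(grid, grid, c, ph, pw)
    shuffled = shuffled.permute(2, 0, 3, 1, 4).reshape(c, grid * ph, grid * pw)
    return shuffled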

Measuring Configural Shape Perception in Vision Models

[Figure: Configural Shape Score across models]
The Configural Shape Score (CSS) tests whether a model can correctly classify both images in an object-anagram pair — images made from the same local patches but arranged into different global shapes. This makes CSS a stringent test of holistic processing: local texture shortcuts won't help. Across 86 vision models, CSS reveals striking variation. The top performers — self-supervised ViTs like DINOv2 and language-aligned models like SigLIP2 and EVA-CLIP — approach human-level scores. In contrast, many supervised models with similar ImageNet accuracy perform far worse, highlighting that recognition performance alone does not guarantee configural shape sensitivity. Efforts to improve "shape bias" through stylization, adversarial robustness, or sparsity offer little benefit here. While shape-vs-texture bias and CSS are moderately correlated, they capture distinct aspects of model behavior. CSS reflects a model's ability to integrate relational structure — not just a preference for shape over texture.
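
A minimal sketch of how such a pair-level score can be computed is shown below, assuming a classifier-style model that returns logits over the nine anagram categories and a list of labeled image pairs; the paper's exact readout protocol for self-supervised and language-aligned models is not reproduced here.

import torch

def configural_shape_score(model, anagram_pairs, device="cuda"):
    """Pair-level score sketch: a pair counts only if BOTH images are
    classified correctly. `anagram_pairs` is assumed to be a list of
    ((img_a, label_a), (img_b, label_b)) tuples, with images as CHW
    tensors and labels as category indices."""
    model.eval()
    correct_pairs = 0
    with torch.no_grad():
        for (img_a, label_a), (img_b, label_b) in anagram_pairs:
            pred_a = model(img_a.unsqueeze(0).to(device)).argmax(dim=-1).item()
            pred_b = model(img_b.unsqueeze(0).to(device)).argmax(dim=-1).item()
            correct_pairs += int(pred_a == label_a and pred_b == label_b)
    return correct_pairs / len(anagram_pairs)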

What mechanism underlies high Configural Shape Scores?

Models can cheat on configural shape tasks by exploiting subtle boundary artifacts between parts, rather than truly processing global structure. Our mechanistic analysis disentangles this shortcut from genuine configural sensitivity by probing how and where global configuration is computed.

Identifying the mechanism via Attentional Ablation

We developed an attentional ablation method to test whether Vision Transformers rely on local or long-range interactions for configural shape perception. By masking self-attention for patch tokens within each layer, we selectively removed either nearby or distant patch-to-patch interactions, while keeping the class token's attention fully intact. This selective disruption isolates the causal role of local vs. global context. We then measured how these changes affected both the final class token and the model's Configural Shape Score.
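
The sketch below illustrates one way to build such a radius-controlled mask for a ViT with a [CLS] token and a square grid of patch tokens; how the mask is injected into a particular model's attention blocks is implementation-specific and not shown.

import torch

def radius_attention_mask(grid, radius, attend_inside=True):
    """Additive attention mask for a ViT with one [CLS] token plus a
    grid x grid array of patch tokens. Patch-to-patch attention is kept
    only within (attend_inside=True) or only beyond (attend_inside=False)
    the given spatial radius, while the [CLS] token's attention is left
    fully intact. Illustrative sketch under these assumptions."""
    n = grid * grid + 1                                   # [CLS] + patch tokens
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()
    dist = torch.cdist(coords, coords)                    # patch-to-patch distance (in patch units)
    near = dist <= radius
    allowed = near if attend_inside else ~near
    mask = torch.full((n, n), float("-inf"))
    mask[1:, 1:][allowed] = 0.0                           # permitted patch-to-patch links
    mask[0, :] = 0.0                                      # [CLS] attends everywhere
    mask[:, 0] = 0.0                                      # every token may attend to [CLS]
    return mask                                           # add to attention logits before softmax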

Configural Perception relies on Long-range Contextual Interaction

[Figure: attentional ablation results]
Ablation results show that high-CSS models like DINOv2-B/14 depend heavily on long-range attention to perceive global shape. When long-range interactions were removed ("attend inside"), both class token stability and Configural Shape Scores dropped sharply — especially in intermediate layers. But when these interactions were preserved ("attend outside"), performance remained largely intact. In contrast, low-CSS models like ViT-B/16 were not affected as much by the same ablations. These findings provide direct mechanistic evidence that emergent long-range interactions play a critical role in enabling vision transformers to successfully encode global configural structure.

"U-Trend" : Capacity hidden in intermediate layers

[Figure: layer-wise ablation profile (U-trend)]
Ablation results revealed a striking U-shaped pattern across transformer layers. In DINOv2-B/14, removing long-range attention ("attend inside") had little effect in early and late layers, but caused major disruption in the middle — reducing both class token stability and Configural Shape Scores. This suggests that early layers encode primarily local features, while intermediate blocks integrate spatial context to construct holistic representations. Just like contextual processing in language models, configural understanding emerges midway through the stack — pointing to the intermediate layers as the bottleneck for building globally coherent object representations.
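
A compact way to run this kind of sweep is sketched below: apply the attention mask at one block at a time and measure how much the final class token moves relative to an unablated forward pass. The helper forward_with_mask is hypothetical and stands in for whatever hook mechanism a given ViT implementation provides; model(image) is assumed to return the final class-token embedding.

import torch.nn.functional as F

def layerwise_cls_stability(model, image, mask, forward_with_mask, n_layers):
    """Layer sweep sketch behind the U-trend: ablate one block at a time and
    track cosine similarity between clean and ablated [CLS] embeddings.
    `forward_with_mask(model, image, layer_idx, mask)` is a hypothetical
    helper that injects `mask` into block `layer_idx` only."""
    clean = model(image)
    scores = []
    for layer_idx in range(n_layers):
        ablated = forward_with_mask(model, image, layer_idx, mask)
        scores.append(F.cosine_similarity(clean, ablated, dim=-1).item())
    return scores   # dips at intermediate layers trace out the U-shaped profile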

Representational Shift from Puzzle Pieces to Categories

[Figure: representational similarity analysis]
We used representational similarity analysis to test how models encode configural shape using a controlled set of image pairs that independently varied object category and puzzle part composition. High-CSS models like EVA-CLIP began with higher part-based similarity in early layers, but midway through the network, representations shifted to reflect object category over local features. By the final layers, images with different parts but the same category were more similar than anagram pairs with identical parts. In contrast, low-CSS models like ResNet-50 retained strong part-based similarity throughout, showing only weak category abstraction at the end. These results show that configural shape perception involves a mid-to-late transition — from encoding local pieces to capturing relational structure among parts.
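
The sketch below shows the basic comparison for a single layer, assuming precomputed per-image feature vectors at that layer; the pair lists (same parts vs. same category) mirror the controlled design described above, and the variable names are ours.

import torch
import torch.nn.functional as F

def part_vs_category_similarity(layer_features, same_part_pairs, same_category_pairs):
    """For one layer, compare average cosine similarity of anagram pairs
    (same parts, different category) against pairs sharing a category but
    not parts. `layer_features` is assumed to map image id -> flattened
    feature vector at that layer; pair lists hold (id_a, id_b) tuples."""
    def mean_sim(pairs):
        sims = [F.cosine_similarity(layer_features[a], layer_features[b], dim=0)
                for a, b in pairs]
        return torch.stack(sims).mean().item()

    return {"same_parts": mean_sim(same_part_pairs),
            "same_category": mean_sim(same_category_pairs)}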

From Configural Sensitivity to General Shape Understanding

[Figure: CSS as a predictor of other shape-dependent benchmarks]
To test whether configural shape sensitivity generalizes to other shape-relevant capacities, we evaluated whether the Configural Shape Score (CSS) predicts performance across a diverse set of shape-dependent benchmarks. These included: (1) Robustness to Noise (Hendrycks & Dietterich), assessing model accuracy under various corruptions; (2) Foreground-vs-Background Bias (Xiao et al.), measuring whether models rely more on object than background features; (3) Phase Dependence (Garity et al.), which probes sensitivity to Fourier phase information; and (4) Critical Band Masking (Subramanian et al.), which evaluates how spatial-frequency filtering affects recognition. We found that CSS was a strong predictor of performance across all four benchmarks, significantly outperforming shape-vs-texture bias as a predictor. This suggests that models with high CSS not only succeed on the anagram task, but also exhibit broader shape-oriented representations.
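
A simple version of this analysis is sketched below: rank-correlating per-model CSS with each benchmark score using SciPy. The dictionary layout is an assumption for illustration, not the released data format.

from scipy.stats import spearmanr

def css_benchmark_correlations(css_by_model, benchmark_scores):
    """Rank-correlate per-model CSS with each shape-dependent benchmark.
    `css_by_model` is assumed to map model name -> CSS; `benchmark_scores`
    maps benchmark name -> {model name -> score}."""
    results = {}
    for bench, scores in benchmark_scores.items():
        models = sorted(set(css_by_model) & set(scores))   # models evaluated on both
        rho, p = spearmanr([css_by_model[m] for m in models],
                           [scores[m] for m in models])
        results[bench] = (rho, p)
    return results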

Feature Attribution Maps

[Figure: feature attribution maps]
To complement our quantitative findings, we visualize feature attributions for object images using Integrated Gradients applied to a stylized ResNet-50 and DINOv2-B/14. Despite being trained to reduce texture reliance, the stylized ResNet-50 still attends to scattered local regions — often missing the object's global outline. In contrast, DINOv2-B/14 highlights coherent, whole-object regions, even under challenging distortions or non-canonical appearances. These visualizations offer intuitive evidence that the mechanisms inside high-CSS models yield more complete, object-level attribution maps.
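
For reference, a minimal Integrated Gradients call using the Captum library is sketched below, assuming a model that maps an image batch to class logits (backbones such as DINOv2 would need a classification readout first); the exact attribution settings used for the figures are not reproduced.

import torch
from captum.attr import IntegratedGradients

def attribution_map(model, image, target_class, steps=64):
    """Integrated Gradients sketch via Captum, assuming `model` maps a
    (1, 3, H, W) tensor to class logits. Returns a per-pixel saliency map
    aggregated over color channels."""
    model.eval()
    ig = IntegratedGradients(model)
    baseline = torch.zeros_like(image)            # black-image baseline
    attrs = ig.attribute(image, baselines=baseline,
                         target=target_class, n_steps=steps)
    return attrs.abs().sum(dim=1, keepdim=True)   # aggregate over channels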

Citation


@misc{doshi2025visualanagramsrevealhidden,
  title={Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models},
  author={Fenil R. Doshi and Thomas Fel and Talia Konkle and George Alvarez},
  year={2025},
  eprint={2507.00493},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.00493},
}