Bi-orthogonal Factor Decomposition for Vision Transformers

Kempner Institute for the Study of Natural and Artificial Intelligence
Harvard University
* Equal contribution
Website Teaser

Abstract

Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena:

  1. Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum.
  2. Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization.
  3. DINOv2's superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content.

Overall, BFD exposes how tokens interact through attention and which informational factors – positional or semantic – mediate their communication, yielding practical insights into vision transformer mechanisms.

Method

Schematic of Bi-orthogonal Factor Decomposition into position and content components.

Bi-orthogonal Factor Decomposition (BFD) splits each ViT token activation into three orthogonal components: a global mean term, a positional component capturing where the token is, and a content component capturing what it represents. A token’s activation is approximated as the sum of these three factors, each expressed in its own basis, giving a clean view of geometry and semantics inside the same representation. We then project each attention head’s query–key interaction matrix onto these subspaces, decomposing it into different types of position- and content-driven interactions.
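
To make the decomposition concrete, the sketch below shows one way such an ANOVA-style split can be computed from a stack of patch-token activations. This is a minimal illustration under our own assumptions (activations collected over a probe image set, class token removed, positional factor estimated as the per-position mean); the estimator used in the paper may differ in its details.

    import numpy as np

    def bfd_factors(X):
        """ANOVA-style split of token activations into mean, position, and content.

        X : array of shape (n_images, n_tokens, dim) holding one layer's
            patch-token activations over a probe image set (class token removed).
        Returns (mu, pos, content) with X == mu + pos + content. The cross terms
        sum to zero over the probe set, so the factors are orthogonal on average.
        """
        mu = X.mean(axis=(0, 1), keepdims=True)       # global mean term
        pos = X.mean(axis=0, keepdims=True) - mu      # positional factor: per-position mean
        content = X - mu - pos                        # content factor: per-image residual
        return mu, pos, content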

Factorization Cleanly Separates Position from Content

We validate BFD by testing whether positional information can still be decoded after removing the positional component. Linear probes easily recover token position from the raw token activations, confirming that the transformer stores positional information. However, the probes fail completely on the content residual extracted by BFD.
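
A linear probe of this kind can be set up as below; this is a schematic version under our assumptions (a multinomial logistic-regression probe over grid indices, with images split in half for train and test), not the exact protocol of the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def position_probe_accuracy(feats):
        """Decode each patch token's grid index from a feature tensor.

        feats : (n_images, n_tokens, dim), either raw activations or the
                content residual returned by the decomposition above.
        """
        n_images, n_tokens, dim = feats.shape
        half = n_images // 2

        def flatten(block):
            X = block.reshape(-1, dim)
            y = np.tile(np.arange(n_tokens), block.shape[0])
            return X, y

        X_tr, y_tr = flatten(feats[:half])    # train on one half of the images
        X_te, y_te = flatten(feats[half:])    # evaluate on held-out images
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return probe.score(X_te, y_te)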

Layer-wise linear probing accuracy for position prediction using decomposed features.

Representative interaction motifs

The query (red) and key (cyan) singular vectors (paired modes) are projected onto either the content or positional factor, highlighting the image regions that most strongly activate each singular direction. Content–content interactions reveal object parts, while position–position interactions reveal spatial tracking.

Projections of query/key singular vectors onto content or positional factors.

Content and Position Interaction Motifs in DINOv2.

Organization of information flow in self-attention

The attention budget is allocated predominantly to content-driven interactions

BFD decomposes each layer’s attention into factor pairs and shows that most attention energy flows through content–content and content–position interactions, especially in deeper layers. Interestingly, DINOv2 dedicates more energy to content–position coupling than the supervised ViT, indicating stronger localization-aware semantic integration.
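
The sketch below illustrates one plausible way to compute such a factor-pair energy budget for a single head, assuming orthonormal bases for the positional and content subspaces (e.g. obtained from a PCA of each factor); the normalisation used in the paper may differ.

    import numpy as np

    def factor_pair_energy(A, U_pos, U_con):
        """Share of a head's interaction energy routed through each factor pair.

        A     : (d, d) query-key interaction matrix, e.g. W_Q @ W_K.T
        U_pos : (d, r_p) orthonormal basis of the positional subspace
        U_con : (d, r_c) orthonormal basis of the content subspace
        Returns the squared Frobenius norm of A restricted to each
        (query-factor, key-factor) pair, normalised by the total energy of A.
        """
        total = np.linalg.norm(A) ** 2
        bases = {"pos": U_pos, "con": U_con}
        return {
            (q, k): np.linalg.norm(Uq.T @ A @ Uk) ** 2 / total
            for q, Uq in bases.items()
            for k, Uk in bases.items()
        }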

Layer-wise stacked attention energy for different interaction types in ViT and DINOv2, showing content-driven interactions dominating the budget.

Distributed interactions in DINOv2

At a finer resolution, the content-driven energy in DINOv2 is distributed broadly across many modes rather than concentrated in a few dominant ones. The supervised ViT, by contrast, not only allocates less overall energy to content-driven interactions but also concentrates that energy in a few dominant modes, indicating richer content-driven interaction motifs in DINOv2.
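
One standard way to quantify this spread, offered here as an illustration rather than the paper's exact metric, is to normalise the squared singular values of the interaction matrix and take the exponentiated entropy as an effective mode count.

    import numpy as np

    def mode_energy_spread(A):
        """Per-mode energy fractions and the effective number of active modes."""
        s = np.linalg.svd(A, compute_uv=False)
        p = s ** 2 / np.sum(s ** 2)                          # energy fraction of each mode
        eff_modes = np.exp(-np.sum(p * np.log(p + 1e-12)))   # exponentiated entropy
        return p, eff_modes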

Spectral structure of interactions

DINOv2 shows a higher stable rank of the query–key interaction matrix, again indicating that its attention draws on a larger set of effective dimensions than the supervised ViT's. Mode alignment patterns further reveal that DINOv2 uses near-orthogonal, asymmetric interaction types.
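
The stable rank referenced here has a standard definition, the ratio of squared Frobenius norm to squared spectral norm:

    import numpy as np

    def stable_rank(A):
        """Stable rank ||A||_F^2 / ||A||_2^2 of the query-key interaction matrix."""
        s = np.linalg.svd(A, compute_uv=False)
        return np.sum(s ** 2) / s[0] ** 2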

Spectral structure.

Functional specialization of modes

Measuring the relative contribution of content-, position-, and layer-driven interactions reveals that individual modes implement specialized operations.
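
A hypothetical recipe for placing a mode on the ternary plot: measure how much of its query- and key-side singular vectors falls into each factor subspace and normalise to barycentric coordinates. The exact score used in the paper may differ; the bases U_layer, U_pos, and U_con below are assumed to come from the three factors of the decomposition.

    import numpy as np

    def mode_ternary_coords(u, v, U_layer, U_pos, U_con):
        """Barycentric (layer, position, content) coordinates for one SVD mode.

        u, v : unit query- and key-side singular vectors of the interaction matrix
        U_*  : orthonormal bases of the layer-mean, positional, and content subspaces
        """
        parts = np.array([
            np.sum((U.T @ u) ** 2) + np.sum((U.T @ v) ** 2)
            for U in (U_layer, U_pos, U_con)
        ])
        return parts / parts.sum()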

Aggregated over layers

Modes cluster near vertices and edges instead of dispersing uniformly across the simplex.

Summary ternary plot of mode specializations aggregated over layers.

The vertices correspond to layer-, position-, and content-dominated operators, and points along edges reflect mixed operators.

Density ternary plot of mode specializations aggregated over layers.

Per-layer view

The per-layer view of mode density distributions for ViT (top) and DINOv2 (bottom).

Per-layer ternary plots of mode specializations for ViT and DINOv2.

Dual preservation of position and content in DINOv2 explains its superior holistic shape capacity

Holistic processing requires the visual system to look beyond local cues by stitching local features into a global percept. In a puzzle-piece visual anagram, for example, the same pieces can form different objects, so the processor must encode content-driven interactions while remaining spatially anchored. This is exactly the pattern BFD reveals in DINOv2, which has been reported to show superior holistic shape capacity on these anagrams.

Puzzle-piece visual anagram where the same local pieces can form a wolf or an elephant.

Isolated positional code

In the isolated positional subspace, DINOv2 preserves a smooth, grid-like manifold across depth, whereas the supervised ViT rapidly collapses its spatial structure early on.
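
Such a 3D view can be produced, for example, by projecting each layer's positional factor onto its top three principal axes; the figures may use a different embedding.

    import numpy as np

    def positional_manifold_3d(pos_factor):
        """Project the per-position means of one layer onto their top 3 principal axes.

        pos_factor : (n_tokens, dim) positional factor of a single layer.
        Returns (n_tokens, 3) coordinates whose grid-like structure can be inspected.
        """
        X = pos_factor - pos_factor.mean(axis=0, keepdims=True)
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        return X @ Vt[:3].T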

3D positional manifolds across layers for ViT and DINOv2.

Isolated content code

Content similarity matrices show that in the supervised ViT, content components stay highly similar across layers—an “eye” unit mostly becomes a slightly better eye detector. In DINOv2, the content code changes substantially from block to block, as local parts are re-encoded in terms of larger configurations (e.g., an “eye in a face”), consistent with contextual enrichment rather than simple local refinement.
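
One concrete way to build such a layer-by-layer similarity matrix is linear CKA over the per-layer content factors; the figures may rely on a different similarity measure, so treat this as an illustrative stand-in.

    import numpy as np

    def linear_cka(X, Y):
        """Linear CKA between two feature matrices with matching rows (image, token)."""
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        return np.linalg.norm(Y.T @ X) ** 2 / (
            np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)
        )

    def layer_similarity(contents):
        """contents : list of per-layer content factors, each reshaped to (n_images * n_tokens, dim)."""
        L = len(contents)
        S = np.zeros((L, L))
        for i in range(L):
            for j in range(L):
                S[i, j] = linear_cka(contents[i], contents[j])
        return S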

Layer-by-layer similarity matrices comparing activations and content components for ViT and DINOv2.

Visualizing interactions within attention heads

Each row visualizes a specific interaction mode of a DINOv2 attention head, and each column shows an image's query projection in red and its key projection in cyan. Visualized modes are selected by how strongly they are expressed across thousands of images, using a co-activation score (the product of top-k pooled query- and key-side token projections).
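
Below is a sketch of such a co-activation score under our reading of the selection rule (mean pooling over the top-k token projections per image, then a product of query- and key-side pools averaged over images); the exact pooling in the paper may differ.

    import numpy as np

    def coactivation_score(tokens, u_q, u_k, k=10):
        """How strongly one interaction mode is expressed across a set of images.

        tokens : (n_images, n_tokens, dim) token activations (or one of their factors)
        u_q    : (dim,) query-side singular vector of the mode
        u_k    : (dim,) key-side singular vector of the mode
        """
        q_proj = tokens @ u_q                                   # (n_images, n_tokens)
        k_proj = tokens @ u_k
        q_pool = np.sort(q_proj, axis=1)[:, -k:].mean(axis=1)   # top-k pooled query side
        k_pool = np.sort(k_proj, axis=1)[:, -k:].mean(axis=1)   # top-k pooled key side
        return float(np.mean(q_pool * k_pool))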


Content (query)-Content (key) interaction motifs within modes of DINOv2 block 11 and 12.

Block 11

Block 12

Position (query)-Position (key) interaction motifs within modes of DINOv2 block 2 and 10.

Block 2

Block 10

Position (query)-Content (key) interaction motifs within modes of DINOv2 block 7 and 11.

Block 7

Block 11

BibTeX


    @misc{doshi2026biorthogonalfactordecompositionvision,
      title={Bi-Orthogonal Factor Decomposition for Vision Transformers},
      author={Fenil R. Doshi and Thomas Fel and Talia Konkle and George Alvarez},
      year={2026},
      eprint={2601.05328},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.05328},
    }