ECCV 2026 · Accepted Paper

Mitigating Pose‑Scale Discrepancy Bias and Reforming Multi‑Support Reasoning for Few-Shot Semantic Segmentation

We propose the Decoupled Matching Network (DMNet), which decouples what an object is from how it is posed, then fuses multiple supports with cyclic, reliability-weighted refinement. This improves few-shot segmentation most where prior methods break down: under large viewpoint and scale mismatch.

Shreya Biswas · Zhaozheng Yin

Department of Computer Science, Stony Brook University, Stony Brook, NY, USA

Abstract

The problem with matching support and query objects

Few-shot semantic segmentation (FSS) often degrades when support and query images exhibit large pose or scale differences, since conventional prototype matching operates in a geometry-entangled feature space. We propose a lightweight encoder–decoder framework that disentangles object representations into a geometry-invariant semantic code and a low-dimensional modulation code capturing instance-specific geometric variation.

The reconstruction decoder is trained with equivariance constraints so geometric information is absorbed by the modulation pathway while semantic consistency is preserved in a separate branch - improving robustness to pose-scale discrepancies before segmentation, without explicit cross-image alignment. To strengthen supervision from limited supports, we introduce a refinement module that synthesizes lightweight augmented views of each support and performs cyclic ensemble refinement for more stable predictions. Predictions from multiple supports are then fused with spatially adaptive reliability weighting, producing cleaner, better-aligned query predictions. Across standard FSS benchmarks, DMNet consistently improves performance - particularly under large viewpoint change - and ablations confirm both the disentanglement and the cyclic, spatially weighted refinement are critical to the gains.

Pose-scale discrepancy bias in few-shot segmentation, and multi-support fusion comparison — **The pose-scale discrepancy bias.** (a) FSS accuracy is highest when support and query objects share similar orientation and scale (red box), and degrades as the discrepancy grows. (b) Different supports yield different predictions for the same query - naive averaging and scene-level adjustment (BAM) both under-perform a reliability-aware fusion.
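
The contrast in panel (b) between naive averaging and reliability-aware fusion can be illustrated with a minimal sketch. It assumes each support yields a foreground-probability map plus a per-pixel reliability score, and it normalizes those scores with a softmax across supports. The function names, the toy values, and the softmax choice are illustrative assumptions, not DMNet's actual fusion module.

```python
import math

def average_predictions(probs):
    """Naive fusion: per-pixel mean over K support-conditioned probability maps."""
    k = len(probs)
    h, w = len(probs[0]), len(probs[0][0])
    return [[sum(p[i][j] for p in probs) / k for j in range(w)] for i in range(h)]

def fuse_predictions(probs, reliability):
    """Reliability-weighted fusion (assumed form).

    probs, reliability: lists of K maps, each an H x W nested list.
    At every pixel the K reliability scores are turned into weights with a
    softmax, and the K probabilities are combined with those weights.
    """
    k = len(probs)
    h, w = len(probs[0]), len(probs[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            scores = [reliability[n][i][j] for n in range(k)]
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            fused[i][j] = sum(e / total * probs[n][i][j] for n, e in enumerate(exps))
    return fused

# Two supports, one 1x2 query map (toy values). Each support is trusted
# at the pixel where it predicts foreground with high probability.
probs = [[[0.9, 0.2]], [[0.1, 0.8]]]
reliability = [[[2.0, 0.0]], [[0.0, 2.0]]]

averaged = average_predictions(probs)
fused = fuse_predictions(probs, reliability)
```

In this toy case averaging washes both pixels out to 0.5, while the weighted fusion follows whichever support is more reliable at each location. With equal reliability scores everywhere, the weighted fusion reduces to the plain average.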

Method

DMNet: Decoupled Matching Network

Rather than estimating an explicit geometric transform to align support and query - brittle when object categories and instances are unseen at test time - DMNet takes the opposite approach: geometric variation should never be allowed to disrupt semantic matching in the first place.

The object-centric features of support and query are passed through a shared encoder and decomposed into two compact latent codes: a semantic code capturing geometry-invariant class identity, and a modulation code - visualized as a 2D geometric pointer - encoding instance-specific pose and scale. A decoder recombines both codes into reconstructed features under an equivariance constraint, so similarity transforms applied in the modulation subspace produce matching transforms in the reconstruction. A lightweight segmentation head D_seg then predicts the query mask from the geometry-decoupled features.

DMNet architecture diagram — **Overall architecture.** Object features are factorized into a geometry-invariant semantic code and an instance-specific modulation code (purple) via a disentangled autoencoder trained with an equivariance constraint, then refined through cyclic ensembling and spatially adaptive fusion before the segmentation head.
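
The data flow described above can be sketched in a few lines: encode a feature into two codes, match on the semantic code only, and reconstruct from both. In this toy version the "encoder" is a fixed slice and the "decoder" a concatenation. Both are hypothetical stand-ins for the learned networks, chosen only to show why matching on the semantic code is unaffected by differences in the modulation code.

```python
import math

def encode(feature, sem_dim):
    """Stand-in encoder: split a feature vector into (semantic, modulation).

    The real encoder is learned; slicing is an assumption for illustration.
    """
    return feature[:sem_dim], feature[sem_dim:]

def decode(semantic, modulation):
    """Stand-in decoder: recombine both codes into a reconstructed feature."""
    return semantic + modulation

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same class content (first 3 dims), different geometry (last 2 dims).
support = [1.0, 2.0, 0.5, 3.0, -1.0]
query = [1.0, 2.0, 0.5, -2.0, 4.0]

s_sem, s_mod = encode(support, sem_dim=3)
q_sem, q_mod = encode(query, sem_dim=3)

entangled_score = cosine(support, query)  # geometry drags the score down
decoupled_score = cosine(s_sem, q_sem)    # geometry is excluded from matching
reconstruction = decode(s_sem, s_mod)     # both codes are needed to rebuild
```

Here the entangled similarity is negative even though the two objects share the same semantic content, whereas the similarity on the semantic codes alone is 1.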

How exactly does the modulation code work?

The modulation code stores the instance-specific geometric information of the object, such as pose, scale, rotation, and spatial layout. DMNet projects this high-dimensional modulation code into a compact 2D space using a learnable matrix. This 2D representation becomes the geometric pointer p, which allows the model to control geometry separately from semantic identity. By transforming this pointer and forcing the reconstructed feature map to transform consistently, DMNet learns a modulation space where geometry changes are predictable and do not corrupt semantic matching.
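
A minimal sketch of the pointer mechanism, under stated assumptions: the projection matrix `W` is fixed rather than learned, the controlled transformation is a 2D rotation plus uniform scaling, and the "feature" is itself 2D so the same transform can be applied to it directly. In the real model the feature-side transform would act on a feature map. All names here are hypothetical.

```python
import math

def project_pointer(mod_code, W):
    """Project a d-dim modulation code to a 2D pointer p = W @ mod_code.

    W is a 2 x d matrix; learnable in the model, fixed in this sketch.
    """
    return [sum(w * m for w, m in zip(row, mod_code)) for row in W]

def similarity_transform(p, angle, scale):
    """Rotate a 2D point by `angle` (radians) and scale it uniformly."""
    c, s = math.cos(angle), math.sin(angle)
    return [scale * (c * p[0] - s * p[1]), scale * (s * p[0] + c * p[1])]

def equivariance_residual(decoder, p, angle, scale):
    """Squared gap between decoding a transformed pointer and
    transforming the decoded feature. Zero means the decoder is equivariant
    to this transform."""
    decoded_then_moved = similarity_transform(decoder(p), angle, scale)
    moved_then_decoded = decoder(similarity_transform(p, angle, scale))
    return sum((a - b) ** 2 for a, b in zip(decoded_then_moved, moved_then_decoded))

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
mod_code = [1.0, 0.0, 5.0]
p = project_pointer(mod_code, W)                     # 2D geometric pointer
p_moved = similarity_transform(p, math.pi / 2, 2.0)  # quarter turn, 2x scale

def equivariant_decoder(pointer):
    return list(pointer)                 # commutes with the transform

def biased_decoder(pointer):
    return [pointer[0] + 1.0, pointer[1]]  # the offset breaks equivariance

good = equivariance_residual(equivariant_decoder, p, math.pi / 2, 2.0)
bad = equivariance_residual(biased_decoder, p, math.pi / 2, 2.0)
```

Minimizing a residual of this kind during training is one plausible way to make geometry changes in the pointer space predictable in the reconstruction; the exact loss used by DMNet is not specified on this page.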

Fig. 3a modulation module — **Modulation Code Design** The modulation code is projected into a 2D geometric pointer, where controlled transformations encourage geometry-aware reconstruction.

Results

Performance and Ablations

DMNet consistently improves few-shot segmentation performance by reducing geometric mismatch between support and query features. Instead of only reporting numbers in a table, the results below highlight the main performance trends visually.

Best Overall: DMNet, 75.6 mIoU on PASCAL-5ⁱ, 1-shot

Main Gain: Geometry-Decoupled Matching, +2.1 improvement over the strongest baseline

Robustness: Better under pose variation; semantic matching is less disrupted by object geometry

Qualitative segmentation results — **Qualitative results.** DMNet produces cleaner masks by preserving semantic consistency while reducing geometry-induced matching errors between support and query objects.

Every component earns its place

Ablation Comparison (mIoU)

DMNet (Full) 75.6

w/o Spatially Adaptive Fusion 74.7

w/o Cyclic Refinement 74.4

w/o Disentangled Autoencoder 73.8

w/o Modulation Code 72.9
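
The contribution of each component can be read off by subtracting its ablated score from the full model's score. The short calculation below uses only the mIoU values listed above; the variable names are illustrative.

```python
full_miou = 75.6

# mIoU with one component removed, as listed in the ablation comparison.
ablated_miou = {
    "w/o Spatially Adaptive Fusion": 74.7,
    "w/o Cyclic Refinement": 74.4,
    "w/o Disentangled Autoencoder": 73.8,
    "w/o Modulation Code": 72.9,
}

# Drop in mIoU caused by removing each component, rounded to one decimal.
drops = {name: round(full_miou - score, 1) for name, score in ablated_miou.items()}
largest = max(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name}: -{drop}")
```

On these numbers the drops are 0.9, 1.2, 1.8, and 2.7 mIoU respectively, so removing the modulation code costs the most.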

Citation

BibTeX

@inproceedings{biswas2026dmnet,
  title     = {Mitigating Pose-Scale Discrepancy Bias and Reforming
               Multi-Support Reasoning for Few-Shot Semantic Segmentation},
  author    = {Biswas, Shreya and Yin, Zhaozheng},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}