ProxyFusion: Face Feature Aggregation Through Sparse Experts

University at Buffalo
NeurIPS 2024

Our method addresses key challenges in face feature aggregation, including real-time inference, cross-distribution matching, and compatibility with legacy feature stores.

Abstract

Face feature fusion is indispensable for robust face recognition, particularly in scenarios involving long-range, low-resolution media (unconstrained environments) where not all frames or features are equally informative. Existing methods often rely on large intermediate feature maps or face metadata, making them incompatible with legacy biometric template databases that store pre-computed features. Additionally, real-time inference and generalization to large probe sets remain challenging. To address these limitations, we introduce a linear-time O(N) proxy-based sparse expert selection and pooling approach for context-driven feature-set attention. Our approach is order invariant on the feature set, generalizes to large sets, is compatible with legacy template stores, and uses significantly fewer parameters, making it suitable for real-time inference and edge use cases. Through qualitative experiments, we demonstrate that ProxyFusion learns discriminative information for importance weighting of face features without relying on intermediate features. Quantitative evaluations on challenging low-resolution face verification datasets such as IARPA BTS3.1 and DroneSURF show the superiority of ProxyFusion in unconstrained long-range face recognition settings. Our code and pretrained models are available at: https://github.com/bhavinjawade/ProxyFusion.


An overview of our proposed ProxyFusion approach. After feature extraction, our method is divided into two end-to-end trainable stages: (i) Expert Selection and (ii) Sparse Expert Network Feature Aggregation. The Expert Selection module takes the feature set $\{f_i\}_{i=1}^{N}$ and returns the indices of the expert networks chosen by proxy relevancy scores. Next, the selected expert networks compute set-centers conditioned on the distribution and the aligned proxy. These set-centers attend over the input feature set to compute aggregation weights.
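The two stages described in the caption can be illustrated with a minimal NumPy sketch. It is not the released implementation: the relevancy score (mean dot product between each proxy and the set), the form of each expert (a single linear map applied to the set mean concatenated with the proxy), and the names `select_experts`, `aggregate`, and `expert_weights` are all assumptions made for illustration. The sketch only shows how proxy-based top-k selection followed by set-center attention can run in time linear in N and remain invariant to the order of the input features.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def select_experts(features, proxies, k):
    """Stage (i): return indices of the k proxies most relevant to the set.

    Assumption: relevancy is the mean dot product between a proxy and
    all N features; the paper may define the score differently.
    """
    scores = (features @ proxies.T).mean(axis=0)  # shape (P,)
    return np.argsort(-scores)[:k]


def aggregate(features, proxies, expert_weights, k=2):
    """Stage (ii): each selected expert builds a set-center that attends
    over the features; the pooled vectors are concatenated.

    features:       (N, D) pre-computed face features
    proxies:        (P, D) learned proxies, one per expert
    expert_weights: (P, D, 2 * D) hypothetical linear experts
    """
    n, d = features.shape
    idx = select_experts(features, proxies, k)
    set_mean = features.mean(axis=0)  # order-invariant set summary
    pooled = []
    for j in idx:
        # Hypothetical expert: linear map of [set summary ; proxy].
        center = expert_weights[j] @ np.concatenate([set_mean, proxies[j]])
        # The set-center attends over the N features: O(N * D) work.
        weights = softmax(features @ center / np.sqrt(d))
        pooled.append(weights @ features)
    return np.concatenate(pooled), idx
```

Because every operation over the set is either a mean or a softmax-weighted sum, shuffling the rows of `features` leaves the aggregated template unchanged up to floating-point rounding, and the cost grows linearly with the number of features.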

Visualization of Learned Weights


Visualizations of learned weights on the gallery and probe sets of the BTS3.1 dataset. Images on the top are from the high-quality gallery, and images on the bottom are from low-resolution, long-range probes. Faces are sorted by ProxyFusion attention weight from low to high. We present these weights for each of the selected experts.

BibTeX

@article{jawade2024proxyfusion,
  title={ProxyFusion: Face Feature Aggregation Through Sparse Experts},
  author={Jawade, Bhavin and Stone, Alexander and Mohan, Deen Dayal and Wang, Xiao and Setlur, Srirangaraj and Govindaraju, Venu},
  journal={NeurIPS},
  year={2024}
}