Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition
Blog Article
Fine-grained image recognition aims to classify subtle subcategories within the same parent category, such as vehicle models or bird species. This is an inherently challenging task because a classifier must capture subtle inter-class differences under large intra-class variance. Most previous approaches are based on supervised learning, which requires a large-scale labeled dataset. However, such large-scale annotated datasets for fine-grained image recognition are difficult to collect because they generally require domain expertise during labeling. In this study, we propose a self-supervised transfer learning method based on the Vision Transformer (ViT) to learn finer representations without human annotations.
Interestingly, we observe that existing self-supervised learning methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to learning finer representations. Motivated by this observation, we propose a consistency loss function that encourages the patch embeddings of the overlapping area between two augmented views to be similar to each other during self-supervised learning on fine-grained datasets. In addition, we explore effective transfer learning strategies to fully leverage existing self-supervised models trained on large-scale datasets.
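To make the idea concrete, here is a minimal sketch of what such a patch-level consistency term could look like. This is an illustrative implementation, not the paper's exact loss: it assumes the patch embeddings of the overlapping region have already been extracted and aligned so that row i of each array describes the same image area, and it penalizes the mean cosine distance between corresponding patches.

```python
import numpy as np

def patch_consistency_loss(patches_a: np.ndarray, patches_b: np.ndarray) -> float:
    """Mean (1 - cosine similarity) over corresponding patch embeddings.

    patches_a, patches_b: arrays of shape (num_overlap_patches, dim)
    holding the embeddings of the patches that cover the overlapping
    region of two augmented views, row-aligned by image location.
    Returns 0 when the two views agree perfectly on every patch.
    """
    # L2-normalize each patch embedding so the dot product is a cosine.
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    cos_sim = np.sum(a * b, axis=1)       # per-patch cosine similarity
    return float(np.mean(1.0 - cos_sim))  # average cosine distance
```

In a full training loop this term would be added, with some weight, to the usual self-supervised objective (e.g., the DINO loss), so that gradients push the overlapping patch embeddings of the two views toward each other.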
Contrary to the previous literature, our findings indicate that training only the last block of ViT is effective for self-supervised transfer learning. We demonstrate the effectiveness of our proposed approach through extensive experiments on six fine-grained image classification benchmark datasets: FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, our method achieves an average accuracy of 78.5%, outperforming the existing transfer learning method, which yields 77.2%.
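The last-block-only strategy amounts to freezing every parameter except those of the final transformer block before continuing self-supervised training. A small sketch of the parameter selection, assuming timm/DINO-style parameter names where the 12 blocks of a ViT-B/16 are named `blocks.0` through `blocks.11` (the naming convention is an assumption, not taken from the paper):

```python
def last_block_trainable(param_names, num_blocks=12):
    """Select the parameters to keep trainable under a last-block-only
    transfer strategy: everything in the final transformer block, with
    the patch embedding, class token, and all earlier blocks frozen.

    param_names: iterable of checkpoint parameter names, assumed to
    follow timm-style naming ("blocks.<i>." prefixes); hypothetical.
    """
    prefix = f"blocks.{num_blocks - 1}."
    return [name for name in param_names if name.startswith(prefix)]
```

In a PyTorch setup, one would set `requires_grad = False` on every parameter not returned by this selector and pass only the selected parameters to the optimizer.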