Abstract
Cross-domain few-shot learning, which aims to address the domain gap problem in few-shot learning, has recently received increasing attention. When there are large differences between the source and target domains involved in few-shot learning, performance drops sharply and training even becomes difficult. Therefore, this paper explores a simple, effective and novel method to deal with domain gaps. First, a pre-trained model is obtained using the labeled data in the source domain. Next, a two-stage adaptive training takes three-stream input consisting of unlabeled data in the target domain, pseudo-unlabeled data and labeled data in the source domain, so that the network can gradually adapt to the target-domain data and mitigate the adverse effects caused by the domain gap. Finally, the proposed network can be quickly applied to the tasks to be solved. Experimental results show that the designed approach achieves better performance than existing comparison methods on the standard benchmark of cross-domain few-shot learning. Further analysis reveals the trade-off between using source-domain and target-domain data for cross-domain few-shot learning.
1 Introduction
Deep learning has achieved great success in visual recognition, which is largely attributed to deep convolutional networks trained on large amounts of labeled data [1]. Deep neural networks, which require a large number of samples for each category, have approached or matched the level of human visual recognition, and their generalization ability largely depends on the size and diversity of the training data [2]. When faced with a much smaller quantity of data, few-shot learning (FSL) emerges with the aim of quickly recognizing a novel category from the information provided by a few samples. In recent FSL works [3,4,5], the network is trained on a base dataset (auxiliary set) with labeled data so that it can quickly learn and adapt to a novel set of data, where each novel class has a small number of support samples for evaluating query images. The base classes and the novel classes are disjoint, but they share the same domain space [6,7,8]. In realistic scenarios, however, it is difficult or even impossible to collect a sufficient amount of target data, and in most cases there is a large distribution difference between the source (base data) and target (novel data) domains, such as in medical diagnosis, satellite imagery, and accident monitoring. Therefore, cross-domain few-shot learning (CD-FSL) [2] is proposed for this purpose.
Generally, human beings can distinguish novel categories from the information of a few samples; a plausible explanation is that these new categories are highly similar to prior knowledge, so their variation tendencies can be predicted. If, however, the new categories are too different from previous experience and completely diverse from previous examples, identifying them becomes a challenging task even for human beings. In the same way, it is also very difficult for deep neural networks to solve the multifarious few-shot tasks that exist in real life. With the introduction of the benchmark for the broader study of cross-domain few-shot learning (BSCD-FSL), CD-FSL has attracted more and more attention [9,10,11]. In the benchmark [2], large-scale natural image datasets such as ImageNet are used as the source domain, while the target-domain data include medical [12], satellite [13], plant pathology [14] and other images [15]. Furthermore, due to the lack of labels in the target domain, it is not feasible to follow the meta-learning paradigm, which is mainstream in FSL. Moreover, meta-learning based methods do not perform well in the learning and fine-tuning of CD-FSL. By contrast, transfer learning achieves better performance, and the relative gain grows with the number of labeled support samples. Therefore, as in recent research on CD-FSL, the transfer learning paradigm is used to carry out adaptive training with novel target-domain data before the fine-tuning and testing stage.
Across the various domains, there is a large amount of unlabeled data in the target domain, which can serve as a concrete representation of the target domain to be learned. However, unsupervised learning on the target-domain data alone is clearly not satisfactory. As proved in [16], under the FSL setting, using unsupervised learning alone is rarely better than simple transfer learning, in which supervised representations are first learned on the source data. Moreover, the combination of supervised and unsupervised learning is beneficial for learning a transferable representation [17], which is suitable for CD-FSL. It is worth noting that the labeled data in the source domain can also be used as pseudo-unlabeled data for unsupervised learning, which further helps to learn representations across diverse domains.
In this paper, a strategy is proposed to solve this problem, which is depicted in Fig. 1. It mainly consists of three steps: representation learning on the source-domain data (pre-training), transferable representation learning on target-domain and source-domain data to adapt to the target domain (adaptive training), and construction of few-shot tasks on the target data to fine-tune and test the network. Concretely, a feature representation is first obtained by supervised training on the source data. When it is applied to the target data, the source labels become useless and a domain gap arises. Therefore, in the adaptive training stage, unsupervised learning is used for feature extraction on unlabeled data, which is composed of target data and a portion of source data, so that the feature extractor pre-trained on the source data is guided to transfer to the target domain and the influence of the domain gap is alleviated. The presented adaptive learning combines label-based similarity computation with the comparison of similarities and differences within the same class, which allows the representation learner to obtain rich information related to the target domain under the guidance of prior knowledge from the source domain.
In summary, the contributions of this paper are itemized as follows:
-
In view of the domain gaps in cross-domain few-shot learning, a novel three-stream contrastive adaptive network is proposed.
-
The network adopts a two-stage scheme combining pre-training and adaptive training; in the second stage, a hybrid supervision manner over three-stream data is designed to learn feature representations better suited to the target domain.
-
The network trained in this paper can quickly adapt to the tasks to be solved, and the superior results obtained on the four benchmark datasets demonstrate the effectiveness of the proposed adaptive network.
The remainder of this paper is organized as follows. Section 2 introduces related work on few-shot learning, domain adaptation, cross-domain few-shot learning and few-shot learning with unlabeled data. Details about each component of the proposed approach are provided in Section 3. Next, the detailed experimental results and analysis are presented in Section 4. Finally, the paper is concluded in Section 5.
Fig. 1 Problem setup. A large amount of labeled data in the source domain and some unlabeled data in the target domain are depicted on the left. During the pre-training phase (top right), the learner has access to all labeled source data for pre-training its representation. The learner then conducts adaptive retraining on unlabeled target-domain data, pseudo-unlabeled data and labeled source-domain data, respectively, to learn the representation of the target dataset (middle right). In the end, the learner can rapidly learn and adapt to the few-shot tasks of the target datasets in the fine-tuning and testing phase (bottom right)
2 Related Works
Few-shot Learning (FSL) FSL techniques generally refer to models trained on a base dataset that can generalize to a novel dataset and complete classification tasks with only a few support samples. Existing FSL methods can be roughly divided into two groups: meta-learning and transfer learning. The former aims to learn a meta-learner that can be adapted quickly with a few novel samples [18]. As one mainstream direction, metric-learning (i.e., learning to compare) based methods [19,20,21] directly compare the similarity or distance between the query images and the support classes via episodic training. In addition, MAML [22] is a popular optimization-based approach whose core idea is to train the initial parameters of the model with second-order gradients so that the model can quickly adapt to new tasks with only one or a few gradient steps. The latter group is based on transfer learning [23,24,25], which provides simple yet effective FSL methods. First, a network is pre-trained on all the base data. Then, part of the parameters of the network is fine-tuned using a few support samples from the novel data. Finally, the fine-tuned network can be transferred to the novel FSL tasks.
Domain Adaptation Domain adaptation [26,27,28] learns knowledge from a source domain with abundant labels and then transfers it to a target domain consisting solely of unlabeled data. Different from FSL research, as shown in Fig. 2, a key point of domain adaptation is that the source domain shares the same label space as the target domain. In addition, the amount of target data is generally sufficient in domain adaptation. This differs from FSL, which requires that novel classes be classified using only a small amount of labeled data in the target domain.
Cross-domain Few-shot Learning (CD-FSL) The CD-FSL task considers a more challenging and practical scenario in which there is a large domain gap between the source and target domains, that is, an obvious difference in data distribution. Recent work [29,30,31] has shown that existing state-of-the-art FSL approaches fail to transfer the knowledge learned from the source domain into the target domain owing to the large domain gap. Similarly, domain adaptation based approaches cannot be directly applied to CD-FSL because the labels of the source-domain data are inconsistent with those of the target domain. For this task, a new benchmark is proposed in BSCD-FSL [2], which also carries out a broader study. As a typical case of the meta-learning paradigm, FWT [11] simulates classification tasks of the target domain by constructing new categories on single or multiple source-domain datasets to train a feature transformation layer, so as to achieve classification adapted to the target domain. Along this line, Meta-Baseline [7] validates that there may be an objective trade-off in the meta-learning framework, that is, a meta-learning model that generalizes better on unseen tasks of the base classes may perform worse on tasks of new classes. Recently, the fine-tuning paradigm has therefore been gradually adopted, and target-domain data are used for training to varying degrees, as in meta-FDMixup [9], STARTUP [16] and Dynamic Distillation [32]. These algorithms first pre-train a network with cross-entropy on the labeled source data. Different from most methods, meta-FDMixup then uses a meta-training approach, which constructs episodes with a small amount of target-domain data and conducts mixed training with source-data episodes to learn to separate domain-irrelevant and domain-specific features. In contrast, both STARTUP and Dynamic Distillation later use the pre-trained network to obtain soft labels for the target-domain samples. With a slight difference, STARTUP applies a cross-entropy loss to the source data and a KL loss and a self-supervised loss (e.g., SimCLR) to the target data, while Dynamic Distillation applies a cross-entropy loss to the source data and strongly and weakly augments the target data to obtain soft labels, which are trained with a KL loss. Furthermore, multiple combinations of Mixed-Supervised Learning (MSL) methods are introduced in [33], which explores the performance of supervised learning, unsupervised learning and MSL on CD-FSL in terms of domain similarity and few-shot difficulty through extensive experiments.
Few-shot Learning with Unlabeled Data How to process the unlabeled data in the target domain is a critical stage to bridge the domain gap. The unsupervised approaches [34,35,36], which have been revived since 2020, fit precisely into this scenario. Early research of unsupervised learning focused on semantic space learning [37], aggregation of similar categories [38,39,40,41,42], information restoration [43] and so on. Contrastive learning [44,45,46], as one of the important branches of unsupervised learning, has been widely explored in recent years and compared with the downstream tasks of supervised learning. However, when there is a large domain gap between the source domain and the target domain, the method of using unsupervised learning to ease or eliminate this domain gap remains to be explored.
3 Methodology
In this section, the problem formulation of FSL is given first, followed by its specific form in CD-FSL. After that, the method designed in this paper is presented.
3.1 Problem definition
In the FSL setting, a base dataset \({\mathcal {D}}_{base}\) with a large number of labels is usually given for learning knowledge and transferring it. The tasks to be solved are to learn new concepts from a small number of labeled samples in a novel dataset \({\mathcal {D}}_{novel}\). Concretely, a few-shot task consists of two parts: the labeled support set \({\mathcal {S}}\) and the unlabeled query set \({\mathcal {Q}}\). In particular, \({\mathcal {S}}\) and \({\mathcal {Q}}\) share a consistent label space. The notion of “few-shot” refers to \({\mathcal {S}}\), which includes \({\mathcal {N}}\) categories with \({\mathcal {K}}\) labeled samples in each category. This kind of task is called an “\({\mathcal {N}}\)-way \({\mathcal {K}}\)-shot” task.
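For illustration, the following minimal sketch (with a hypothetical `data_by_class` mapping from class name to image list; not the paper's code) builds one such task together with its query set:

```python
import random

def sample_task(data_by_class, n_way=5, k_shot=1, q_query=15):
    """Sample one N-way K-shot task: a labeled support set and a query set
    drawn from the same N classes (data_by_class maps class -> list of images)."""
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        samples = random.sample(data_by_class[cls], k_shot + q_query)
        support += [(x, label) for x in samples[:k_shot]]
        query += [(x, label) for x in samples[k_shot:]]
    return support, query
```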
Different from the setting of classical FSL, for CD-FSL, the dataset \({\mathcal {D}}_{target}\) in the target domain is extremely different from the dataset \({\mathcal {D}}_{source}\) in the source domain. However, the goal of learning has not changed, that is, the network needs to learn knowledge in the source domain and transfer knowledge with a small quantity of labeled data in the target domain \({\mathcal {D}}_{{\mathcal {S}}_{target}}\), so that it can be applied to few-shot tasks in the target domain. It is worth noting that the unlabeled data in the target domain \({\mathcal {D}}_{{\mathcal {U}}_{target}}\) are abundant and informative, which can be effectively learned to extract the representations that can be generalized to few-shot tasks in the target domain.
For the novel setting of the proposed algorithm, \({\mathcal {D}}_{source}\) is not only used for pre-training; a portion of its data is also randomly extracted to construct a pseudo-unlabeled dataset \({\mathcal {D}}_{\mathcal{P}\mathcal{U}_{source}}\) for guiding unsupervised learning. Notably, \({\mathcal {D}}_{source}\), \({\mathcal {D}}_{\mathcal{P}\mathcal{U}_{source}}\) and \({\mathcal {D}}_{{\mathcal {U}}_{target}}\) are used together in adaptive training to learn an embedding suitable for few-shot evaluation in the target domain.
3.2 Proposed Method
Fig. 3 Overview of the TsCANet. It consists of two stages. During the pre-training stage, the feature extractor \(f_{s}\) and classifier \(\textit{c}_\textit{s}\) are pre-trained mainly with data in the source domain. The unlabeled data in the target domain, pseudo-unlabeled data and labeled data in the source domain are then used to re-train the whole network, in particular with the pre-trained \(f_{s}\), whose parameters are shared with \(f_{t}\). In addition, the projectors \(\textit{g}_\textit{t}\) and \(\textit{g}_\textit{s}\) are introduced into contrastive learning for the unlabeled data
The proposed TsCANet is depicted in Fig. 3. It is a two-stage training network, including pre-training and adaptive training. In the pre-training stage, the feature extractor \(f_{s}\) produces the feature representations of the input images in the source domain, while the classifier \(\textit{c}_\textit{s}\), a simple fully connected layer, maps the obtained representations to a probability distribution over the source categories. The whole source dataset \({\mathcal {D}}_{source}\) is used in the pre-training stage with a standard cross-entropy loss.
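Under a conventional supervised formulation (the exact typeset objective is reconstructed here as a sketch), this pre-training objective is

\[
\min _{\theta ,\omega }\; \mathbb {E}_{(x_{i},y_{i})\sim {\mathcal {D}}_{source}}\left[ {\mathcal {L}}_{s}\big ( c_{\omega }(f_{\theta }(x_{i})),\, y_{i}\big )\right] ,
\]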
where \((x_{i},y_{i})\) is the data with its labels, \({\mathcal {L}}_{s}\) is the cross-entropy loss function, \(f_{\theta }\) and \(c_{\omega }\) are the feature extractor and classifier with their parameters, respectively.
In the second-stage adaptive training, the core of the strategy is to obtain representations suitable for the target domain. Therefore, the pre-trained feature extractor and its parameters are copied to obtain \(f_{t}\), which is used to extract the representations of unlabeled images in the target domain, while the representations of pseudo-unlabeled images and labeled images in the source domain are obtained with the pre-trained \(f_{s}\). Different from the supervised learning in pre-training, the contrastive losses \({\mathcal {L}}_{un}\) and \({\mathcal {L}}_{p-un}\) are calculated after the input unlabeled images are projected by \(\textit{g}_\textit{t}\) and \(\textit{g}_\textit{s}\), respectively. For training on these unlabeled data, N samples are randomly sampled in a mini-batch and each is augmented to attain 2N data; the architecture is depicted in Fig. 4. In this way, each input image yields two augmented views. For the first augmented sample \(x_{i}'\), \(f(\cdot )\) outputs a representation \(f_{1}=f(x_{i}')\) and \(g(\cdot )\) outputs a projection \(g_{1}=g(f_{1})\). The network structure of \(g(\cdot )\) is depicted in Fig. 4 and is the same as that of the predictor \(p(\cdot )\). The projection \(g_{2}\) is obtained from the second augmented sample \(x_{i}''\), and the predictions \(p_{1}\) and \(p_{2}\) are obtained in like manner. Then, the losses of the positive pairs, \({\mathcal {L}}_{p_{1}g_{2}}\) and \({\mathcal {L}}_{p_{2}g_{1}}\), are computed on the \(l_{2}\)-normalized \(p_{1}\), \(p_{2}\), \(g_{1}\) and \(g_{2}\).
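A formulation consistent with this description, sketched here as negative cosine similarity between the normalized vectors (the exact displayed form is an assumption), is

\[
{\mathcal {L}}_{p_{1}g_{2}}=-\frac{p_{1}}{\Vert p_{1}\Vert _{2}}\cdot \frac{g_{2}}{\Vert g_{2}\Vert _{2}},\qquad {\mathcal {L}}_{p_{2}g_{1}}=-\frac{p_{2}}{\Vert p_{2}\Vert _{2}}\cdot \frac{g_{1}}{\Vert g_{1}\Vert _{2}},
\]

and the positive-pair loss of each branch can be taken, for example, as their average \(\tfrac{1}{2}{\mathcal {L}}_{p_{1}g_{2}}+\tfrac{1}{2}{\mathcal {L}}_{p_{2}g_{1}}\).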
By computing the above positive-pair losses, the unlabeled data yield the losses \({\mathcal {L}}_{un}\) and \({\mathcal {L}}_{p-un}\), which correspond, respectively, to the unlabeled data in the target domain and the pseudo-unlabeled data in the source domain. Then, the total loss function of the TsCANet is calculated as follows.
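One combination consistent with a single balancing hyper-parameter \(\lambda\) (the precise weighting is an assumption of this sketch) is

\[
{\mathcal {L}}={\mathcal {L}}_{s}+\lambda \left( {\mathcal {L}}_{un}+{\mathcal {L}}_{p-un}\right) ,
\]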
where \(\lambda\) is a hyper-parameter. The total loss function is used to update the parameters of the TsCANet at the adaptive training stage. It is worth noting that this total loss is back-propagated only through the branch of unsupervised learning in the target domain, that is, the parameters of the feature extractor \(f_{t}\) and projector \(g_{t}\) are updated and shared with the remaining two branches. After the adaptive training, only \(f_{t}\) is kept, and its parameters are frozen to serve as the feature extractor for final testing. A linear classifier is then trained on the support set \({\mathcal {D}}_{{\mathcal {S}}_{target}}\) and evaluated on the remaining query set \({\mathcal {D}}_{{\mathcal {Q}}_{target}}\), both of which are from \({\mathcal {D}}_{target}\). The pseudo-code of TsCANet is given in Algorithm 1.
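To make the training procedure concrete, the following PyTorch-style sketch outlines one adaptive-training step over the three streams. The module names (f_t, g_t, p_t, c_s) mirror the notation above; the stop-gradient on the projections and the exact \(\lambda\) weighting are assumptions of this sketch rather than details taken from the paper.

```python
import torch.nn.functional as F

def neg_cosine(p, g):
    # Negative cosine similarity between a prediction p and a projection g.
    # The stop-gradient (detach) on g is an assumption of this sketch.
    return -F.cosine_similarity(p, g.detach(), dim=1).mean()

def pair_loss(x1, x2, f, g, p):
    # Symmetric positive-pair loss over two augmented views of the same images.
    z1, z2 = g(f(x1)), g(f(x2))
    return 0.5 * neg_cosine(p(z1), z2) + 0.5 * neg_cosine(p(z2), z1)

def adaptive_training_step(f_t, g_t, p_t, c_s, src_batch, pu_batch, tgt_batch, lam=0.9):
    """One step of three-stream adaptive training.

    src_batch: (x, y)   labeled source images
    pu_batch:  (x1, x2) two augmented views of pseudo-unlabeled source images
    tgt_batch: (x1, x2) two augmented views of unlabeled target images
    """
    x_s, y_s = src_batch
    loss_s = F.cross_entropy(c_s(f_t(x_s)), y_s)       # supervised source branch
    loss_pun = pair_loss(*pu_batch, f_t, g_t, p_t)     # pseudo-unlabeled source branch
    loss_un = pair_loss(*tgt_batch, f_t, g_t, p_t)     # unlabeled target branch
    # Assumed combination: lambda balances the unsupervised terms (0.9 in the experiments).
    return loss_s + lam * (loss_un + loss_pun)
```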
4 Experiments
4.1 Experimental setup
Dataset The evaluation protocol follows the previously proposed BSCD-FSL benchmark. The source dataset is mini-ImageNet [19], which is often used as an auxiliary set to train networks in FSL research. In addition, tiered-ImageNet [47], a larger dataset than mini-ImageNet with a richer variety of items, is also used. There are four target-domain datasets in the benchmark, which come from domains very different from the source dataset: CropDiseases [14] (diseases of different plants), EuroSAT [13] (land use prediction), ISIC [48] (melanoma conditions from skin lesions) and ChestX [12] (diagnosis from chest X-rays).
Concretely, a classification network is first trained on the source-domain dataset mini-ImageNet, and the same process is performed on the larger tiered-ImageNet dataset. Then, the pseudo-unlabeled dataset \({\mathcal {D}}_{\mathcal{P}\mathcal{U}_{source}}\) is constructed by randomly sampling 10\(\%\) of the data from the source dataset \({\mathcal {D}}_{source}\). In addition, 20\(\%\) of the data is randomly sampled from each target dataset to build the unlabeled dataset \({\mathcal {D}}_{{\mathcal {U}}_{target}}\). The remaining images in the target domain are used for testing, where 5-way 1-shot and 5-way 5-shot tasks (the support set consists of 5 classes and 1 or 5 examples per class) are used for evaluation. Similarly, the evaluation results with the tiered-ImageNet dataset are also reported.
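As an illustration, the split construction could be sketched as follows (hypothetical index lists; only the 10\(\%\)/20\(\%\) ratios are taken from the description above):

```python
import random

def build_splits(source_indices, target_indices, pu_ratio=0.10, un_ratio=0.20, seed=0):
    """Draw the pseudo-unlabeled source split (10%) and the unlabeled target
    split (20%); the remaining target images are reserved for testing."""
    rng = random.Random(seed)
    pseudo_unlabeled = rng.sample(source_indices, int(pu_ratio * len(source_indices)))
    unlabeled_target = rng.sample(target_indices, int(un_ratio * len(target_indices)))
    held_out = set(unlabeled_target)
    test_target = [i for i in target_indices if i not in held_out]
    return pseudo_unlabeled, unlabeled_target, test_target
```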
Model setting The ResNet-10 [49] and a fully connected (FC) layer are used as the feature extractor and classifier, respectively. In the pre-training phase, the classification network is trained on either the mini-ImageNet or the tiered-ImageNet dataset only, where the batch size is set to 64, the weight decay to 1e-4, and the learning rate to 0.1. SGD is selected as the optimizer, and the data augmentation transformations include random crop, color jitter, Gaussian blur, random grayscale, random flip, etc.
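A torchvision-style sketch of this configuration is given below; the augmentation magnitudes, the momentum value and the stand-in model are assumptions for illustration only.

```python
import torch
from torchvision import transforms

# Augmentations listed above; the magnitudes are illustrative assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = torch.nn.Linear(512, 64)  # stand-in for the ResNet-10 backbone + FC classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
```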
During the adaptive training stage, the three-stream data \({\mathcal {D}}_{source}\), \({\mathcal {D}}_{\mathcal{P}\mathcal{U}_{source}}\) and \({\mathcal {D}}_{{\mathcal {U}}_{target}}\) are input to obtain the corresponding losses \({\mathcal {L}}_{s}\), \({\mathcal {L}}_{p-un}\) and \({\mathcal {L}}_{un}\), and then the total loss \({\mathcal {L}}\) can be calculated, where the hyper-parameter \(\lambda\) is kept at 0.9 and parameters such as the batch size are kept the same as described above. For the fine-tuning and testing phase, that is, the evaluation of FSL tasks in the target domain, an FC classifier is learned using the support set \({\mathcal {D}}_{{\mathcal {S}}_{target}}\) and evaluated on the query set \({\mathcal {D}}_{{\mathcal {Q}}_{target}}\). The mean accuracy over 600 FSL tasks in the target domain, with a 95\(\%\) confidence interval, is used as the final evaluation.
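This evaluation protocol can be summarized by the sketch below, where feature extraction and task sampling are assumed to be provided and scikit-learn's logistic regression stands in for the FC classifier trained on the support set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate(tasks, extract_features, n_tasks=600):
    """Mean accuracy and 95% confidence interval over few-shot tasks.
    Each task is (support_x, support_y, query_x, query_y); extract_features
    maps images to embeddings from the frozen feature extractor f_t."""
    accs = []
    for support_x, support_y, query_x, query_y in tasks[:n_tasks]:
        clf = LogisticRegression(max_iter=1000)   # stands in for the FC classifier
        clf.fit(extract_features(support_x), support_y)
        accs.append(clf.score(extract_features(query_x), query_y))
    accs = np.asarray(accs)
    ci95 = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
    return accs.mean(), ci95
```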
4.2 Evaluation
Main results The main experimental results, mini-ImageNet \(\rightarrow\) target datasets, are presented in Table 1. First, a comparison is made with the methods reported in [2], which include several state-of-the-art FSL methods, as well as the feature-wise transformation technique used in [11]. In addition, two methods with the same experimental setup are taken from another work [32]. These techniques address few-shot image classification by meta-training only on base datasets in the source domain, without using the unlabeled data in the novel target domain. Then, the network is adapted to tasks in the target domain, which are composed of a small number of labeled support samples and some unlabeled samples. The performance of these methods on extremely different target domains is acceptable, which proves the strong generalization ability of the networks. However, it is clear that their performance is lower than that of methods based on the transfer learning paradigm. In addition, combining the classification accuracy of these methods on the ChestX and ISIC datasets, it can be concluded that if unlabeled target-domain data are not used for training, the classification results will be limited by the large domain gap.
Among the remaining methods, “Transfer” refers to training the full classification model on the base dataset using the cross-entropy loss and then fine-tuning with support samples to complete the task in the target domain. “SimCLR(b)” and “SimCLR” represent contrastive learning on unlabeled data in the source domain and the target domain, respectively. For “Transfer+SimCLR,” the supervised cross-entropy loss is used on the base dataset and the unsupervised contrastive loss on the unlabeled dataset to train the network. In “STARTUP” [16], a cross-entropy loss is calculated on the source dataset for pre-training to obtain the teacher network, and then a cross-entropy loss on the source data, a KL-divergence loss on the unlabeled data, and an unsupervised contrastive loss on the unlabeled images are used in the second-stage training of the student network. A similar method, “Dynamic distillation” [32], also follows the knowledge distillation approach and uses the cross-entropy loss to train the teacher network on the source domain; the KL-divergence loss is then calculated on the unlabeled dataset between the predictions for strongly and weakly augmented views (obtained by the teacher network) and combined with the cross-entropy loss calculated on the source domain as the total loss for the student network. “MSL” [33], using two-stage training, combines the cross-entropy loss on the source domain and the unsupervised contrastive loss on the target domain to obtain the final network. These methods, which follow the transfer learning paradigm, gradually use combinations of source-domain and target-domain data, and the metrics improve accordingly.
Compared with these methods, the method proposed in this paper not only learns and acquires features in the source domain, but also constructs the three-stream data to make the contrastive adaptive training more adequate, so as to obtain better representations in the target domain, which is conducive to the downstream tasks. This is confirmed by significant improvements on the four general CD-FSL datasets. For the 5-shot metric, the proposed method exceeds the other methods by more than 1.12%, 1.82%, 1.74% and 0.66% on ChestX, ISIC, EuroSAT and CropDisease, respectively. For the 1-shot metric in particular, the proposed method outperforms the other methods by more than 0.91%, 3.13%, 1.90% and 1.11% on ChestX, ISIC, EuroSAT and CropDisease, respectively. From the experimental results, compared with the above methods that only use unsupervised learning in the target domain, the proposed method constructs the three-stream data and designs the contrastive adaptive training in the second stage, which makes the network gradually adapt to the target domain. Through the deliberate use of true and pseudo-unlabeled data, richer feature representations in the target domain can be obtained, that is, the influence caused by large domain gaps can be alleviated.
Results with tiered-ImageNet As a subset of the ImageNet dataset, tiered-ImageNet is roughly organized into 34 categories, which are subdivided into a total of 608 classes. The dataset is divided into a training set of 20 categories (351 classes), a validation set of 6 categories (97 classes), and a testing set of 8 categories (160 classes). Table 2 shows the performance comparison with tiered-ImageNet as the source domain. For the 5-shot metric, the proposed method exceeds the other methods by more than 1.21%, 1.75%, 1.85% and 1.02% on ChestX, ISIC, EuroSAT and CropDisease, respectively. For the 1-shot metric, the proposed method outperforms the other methods by more than 0.89%, 2.36%, 2.12% and 1.48% on ChestX, ISIC, EuroSAT and CropDisease, respectively. Such superior performance is again attributed to the two-stage, three-stream contrastive adaptive training: excellent representations are obtained from the supervised training on the larger amount of source-domain data, and with the contrastive adaptive training on the three-stream data in the second stage, the model gradually adapts to the target domain. Meanwhile, compared with the results in Table 1, increasing the size of the source-domain dataset beyond a certain point does not further improve the final CD-FSL results, which indicates that the size of mini-ImageNet is already sufficient to serve as the base (or source) dataset.
4.3 Analysis
Fig. 5 LDA plots of 7 classes from the target domains prior to and after TsCANet. a Original features of ChestX. d Original features of ISIC. g Original features of EuroSAT. j Original features of CropDisease. b, e, h, and k in the middle column represent the features obtained from the pre-training model trained only with data in the source domain mini-ImageNet, respectively. c, f, i, and l in the right column represent the features attained after TsCANet, respectively. “\(\cdot\)” represents the features of each sample, while “\(*\)” represents the prototype vector of each class
The effect of TsCANet In order to intuitively understand how TsCANet helps to learn a better representation of the target domain, features of samples in the target datasets are extracted, the average feature of each class is calculated, and the distributions are depicted after linear discriminant analysis (LDA). Figure 5 depicts the LDA plots of 7 representative categories with 100 samples per category from the ChestX, ISIC, EuroSAT and CropDisease datasets, respectively. In addition, the cluster center of each category is marked with “\(*\).” In Fig. 5, the features of some samples in ChestX, ISIC, EuroSAT and CropDisease are shown from left to right: (a), (d), (g) and (j) show the distribution of the original features; (b), (e), (h) and (k) show the distribution of features extracted from the network pre-trained on the source domain (as in most transfer-based methods); and (c), (f), (i) and (l) show the distribution of features extracted by TsCANet. Figure 5 shows that the proposed algorithm enables the network to perform better cluster learning on the target-domain data (greater separability between categories). This also indicates that the proposed algorithm can obtain better classification metrics, which is achieved without using any label information of the target-domain data during training.
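This visualization can be reproduced in outline with scikit-learn, as in the sketch below; feature extraction and the choice of the 7 classes are assumed to be given.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_plot(features, labels):
    """Project class features to 2-D with LDA and mark each class prototype with '*'."""
    labels = np.asarray(labels)
    emb = LinearDiscriminantAnalysis(n_components=2).fit_transform(features, labels)
    for c in np.unique(labels):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=8)   # samples of class c
        proto = pts.mean(axis=0)
        plt.scatter(proto[0], proto[1], marker='*', s=200, edgecolors='k')  # class prototype
    plt.show()
```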
The influence of using unlabeled data Further tests are performed on the amount of unlabeled data used in the proposed TsCANet. Figure 6 depicts the variation of the 5-way 5-shot classification accuracy with the amount of target-domain data used. It can be seen from Fig. 6 that the classification accuracy shows an upward trend as the amount of target-domain data increases for all four datasets. In particular, when the amount of unlabeled data is zero, that is, no target-domain data participate in the training, the classification accuracy drops significantly; this also serves as an ablation experiment on TsCANet. These results demonstrate that unlabeled data are helpful for learning a better representation of the target domain, and they confirm that exploiting more unlabeled data is a promising direction for future research.
The impact of the ways and shots The performance of TsCANet is evaluated when different numbers of ways and shots are adopted during testing. The model was trained using 20% of the target-domain data, consistent with the experimental details described above. The 5-way accuracy for different numbers of shots is reported in Table 3. It can be clearly seen that the accuracy improves as the number of shots increases, which shows that increasing the number of support samples is beneficial for the challenging task of cross-domain few-shot image classification. In particular, the results on the ISIC, EuroSAT and CropDisease datasets show that the accuracy with more than one shot is improved by at least 10% compared with the 1-shot results, namely by 10.63, 14.00 and 12.31, respectively.
In addition, the effect of the number of ways on the accuracy is depicted in Fig 7. It is evident that the accuracy of classification tends to decrease as the number of ways increases, especially on ChestX and ISIC datasets, which reflects the great difficulty of CD-FSL. However, according to the accuracy variation at 5-shot, the performance on EuroSAT and CropDisease datasets only slightly decreases with the increase of ways, which demonstrates the superiority and effectiveness of TsCANet.
The usage of negative sample pairs If negative pairs of samples are used as the inputs to the unsupervised branches, it is necessary to select a positive pair and take the other 2\((N-1)\) samples as negatives. One of the samples in the positive pair is then combined with the negative samples to form negative pairs, so as to learn the difference between itself and the others and obtain the clusters. The loss can then be calculated as follows.
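A standard NT-Xent-style formulation consistent with this description (the exact indexing is an assumption of this sketch) is

\[
{\mathcal {L}}_{i,j}=-\log \frac{\exp \big (\langle z_{i},z_{j}\rangle /\tau \big )}{\sum _{k=1}^{2N}\mathbb {1}_{[k\ne i]}\exp \big (\langle z_{i},z_{k}\rangle /\tau \big )},
\]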
where \(z_{i}\) and \(z_{j}\) refer to the features obtained from a positive pair \(x_{i}'\) and \(x_{i}''\) after the feature extractor and projector, and \(\langle \textsf {{a}},\textsf {{b}}\rangle\)=\(\textsf {{a}}^{\mathrm{{T}}}\textsf {{b}}/\Vert \textsf {{a}}\Vert \Vert \textsf {{b}}\Vert\) denotes the dot product between the \(l_{2}\)-normalized a and b (i.e., cosine similarity). In addition, \(\tau\) denotes a temperature parameter. The training process with negative pairs is shown in Fig. 8.
Table 4 lists the cross-domain few-shot classification results obtained by using negative pairs as input to the unsupervised learning branches. Regardless of whether the source-domain dataset is mini-ImageNet or tiered-ImageNet, the TsCANet using negative sample pairs is slightly more advantageous on EuroSAT and CropDisease, although its performance on ChestX and ISIC is mediocre. This may be due to the presence of similar categories in both ChestX and ISIC, so that a small number of support samples cannot provide sufficiently discriminative representations. In addition, this reflects the flexibility of TsCANet, which can be used in a plug-and-play manner with different unsupervised learning schemes.
5 Conclusion
In this paper, a novel algorithm is proposed for CD-FSL, which uses a two-stage training approach and performs hybrid adaptive learning with three-stream data. Experiments show that the proposed algorithm achieves state-of-the-art results on four datasets. Furthermore, the influence of the amount of unlabeled target-domain data on the accuracy of CD-FSL is analyzed. Finally, this method provides guidance for future cross-domain few-shot learning: novel unsupervised learning methods and more advanced classifier designs can be developed on top of the method proposed in this paper, so that class-specific representations of the target domain can be better learned and more discriminative information can be obtained from limited samples.
Data availability
All the data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Hassantabar S, Terway P, Jha NK (2023) TUTOR: training neural networks using decision rules as model priors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 42(2):483–496. http://doi.org/10.1109/TCAD.2022.3179245
Guo Y, Codella N, Karlinsky L, Codella JV, Smith JR, Saenko K, Rosing T, Feris R (2020) A broader study of cross-domain few-shot learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J. (eds.) 16th European Conference of Computer Vision, ECCV, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXVII. Lecture Notes in Computer Science, vol. 12372, pp. 124–141. http://doi.org/10.1007/978-3-030-58583-9_8
Chen H, Li L, Hu F, Lyu F, Zhao L, Huang K, Feng W, Xia Z (2023) Multi-semantic hypergraph neural network for effective few-shot learning. Pattern Recognition 142:109677. http://doi.org/10.1016/j.patcog.2023.109677
Shi B, Li W, Huo J, Zhu P, Wang L, Gao Y (2023) Global- and local-aware feature augmentation with semantic orthogonality for few-shot image classification. Pattern Recognition 142:109702. http://doi.org/10.1016/j.patcog.2023.109702
Xie J, Long F, Lv J, Wang Q, Li P (2022) Joint distribution matters: Deep brownian distance covariance for few-shot classification. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, June 18-24, 2022, pp. 7962–7971. http://doi.org/10.1109/CVPR52688.2022.00781
Luo X, Wu H, Zhang J, Gao L, Xu J, Song J (2023) A closer look at few-shot classification again. In: International Conference on Machine Learning, ICML, 23-29 July 2023, Honolulu, Hawaii, USA. Proceedings of Machine Learning Research, vol. 202, pp. 23103–23123. http://proceedings.mlr.press/v202/luo23e.html
Chen Y, Liu Z, Xu H, Darrell T, Wang X (2021) Meta-baseline: Exploring simple meta-learning for few-shot learning. In: International Conference on Computer Vision, ICCV, Montreal, QC, Canada, October 10-17, 2021, pp. 9042–9051. http://doi.org/10.1109/ICCV48922.2021.00893
Chen W, Liu Y, Kira Z, Wang YF, Huang J (2019) A closer look at few-shot classification. In: 7th International Conference on Learning Representations, ICLR, New Orleans, LA, USA, May 6-9, 2019. http://openreview.net/forum?id=HkxLXnAcFQ
Fu Y, Fu Y, Jiang Y (2021) Meta-fdmixup: Cross-domain few-shot learning guided by labeled target data. In: MM ’21: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021, pp. 5326–5334. http://doi.org/10.1145/3474085.3475655
Li P, Gong S, Wang C, Fu Y (2022) Ranking distance calibration for cross-domain few-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, June 18-24, 2022, pp. 9089–9098. http://doi.org/10.1109/CVPR52688.2022.00889
Tseng H, Lee H, Huang J, Yang M (2020) Cross-domain few-shot classification via learned feature-wise transformation. In: 8th International Conference on Learning Representations, ICLR, Addis Ababa, Ethiopia, April 26-30, 2020. http://openreview.net/forum?id=SJl5Np4tPr
Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM (2017) Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, July 21-26, 2017, pp. 3462–3471. http://doi.org/10.1109/CVPR.2017.369
Helber P, Bischke B, Dengel A, Borth D (2019) Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12(7):2217–2226. http://doi.org/10.1109/JSTARS.2019.2918242
Chug A, Bhatia A, Singh AP, Singh D (2023) A novel framework for image-based plant disease detection using hybrid deep learning approach. Soft Computing 27(18):13613–13638. http://doi.org/10.1007/s00500-022-07177-7
Veronica R, Halpern A, Dusza SW, Codella NCF (2019) The role of public challenges and data sets towards algorithm development, trust, and use in clinical practice. Seminars in Cutaneous Medicine and Surgery 38(1):38–42. http://doi.org/10.12788/j.sder.2019.013. PMID: 31051022
Phoo CP, Hariharan B (2021) Self-training for few-shot transfer across extreme task differences. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. http://openreview.net/forum?id=O3Y56aqpChA
Islam A, Chen C, Panda R, Karlinsky L, Radke RJ, Feris R (2021) A broad study on the transferability of visual representations with contrastive learning. In: IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, October 10-17, 2021, pp. 8825–8835. http://doi.org/10.1109/ICCV48922.2021.00872
Lee K, Maji S, Ravichandran A, Soatto S (2019) Meta-learning with differentiable convex optimization. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, June 16-20, 2019, pp. 10657–10665. http://doi.org/10.1109/CVPR.2019.01091
Vinyals O, Blundell C, Lillicrap T, Kavukcuoglu K, Wierstra D (2016) Matching networks for one shot learning. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, December 5-10, 2016, Barcelona, Spain, pp. 3630–3638. http://proceedings.neurips.cc/paper/2016/hash/90e1357833654983612fb05e3ec9148c-Abstract.html
Snell J, Swersky K, Zemel RS (2017) Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, December 4-9, 2017, Long Beach, CA, USA, pp. 4077–4087. http://proceedings.neurips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html
Sung F, Yang Y, Zhang L, Xiang T, Torr PHS, Hospedales TM (2018) Learning to compare: Relation network for few-shot learning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, June 18-22, 2018, pp. 1199–1208. http://doi.org/10.1109/CVPR.2018.00131 . http://openaccess.thecvf.com/content_cvpr_2018/html/Sung_Learning_to_Compare_CVPR_2018_paper.html
Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML, Sydney, NSW, Australia, 6-11 August 2017. Proceedings of Machine Learning Research, vol. 70, pp. 1126–1135. http://proceedings.mlr.press/v70/finn17a.html
Rajasegaran J, Khan S, Hayat M, Khan FS, Shah M (2021) Self-supervised knowledge distillation for few-shot learning. In: 32nd British Machine Vision Conference 2021, BMVC, Online, November 22-25, 2021, p. 179. http://www.bmvc2021-virtualconference.com/assets/papers/0820.pdf
Yang S, Liu L, Xu M (2021) Free lunch for few-shot learning: Distribution calibration. In: 9th International Conference on Learning Representations, ICLR, Virtual Event, Austria, May 3-7, 2021. http://openreview.net/forum?id=JWOiYxMG92s
Tian Y, Wang Y, Krishnan D, Tenenbaum JB, Isola P (2020) Rethinking few-shot image classification: A good embedding is all you need? In: Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV, vol. 12359, pp. 266–282. http://doi.org/10.1007/978-3-030-58568-6_16
Chen Z, Wang C, Wu J, Deng C, Wang Y (2023) Deep convolutional transfer learning-based structural damage detection with domain adaptation. Applied Intelligence 53(5):5085–5099. http://doi.org/10.1007/s10489-022-03713-y
Karimian M, Beigy H (2023) Concept drift handling: A domain adaptation perspective. Expert System with Applications 224:119946. http://doi.org/10.1016/j.eswa.2023.119946
Kumar V, Patil H, Lal R, Chakraborty A (2023) Improving domain adaptation through class aware frequency transformation. International Journal of Computer Vision 131(11):2888–2907. http://doi.org/10.1007/s11263-023-01810-0
Zhang J, Song J, Gao L, Shen H (2022) Free-lunch for cross-domain few-shot learning: Style-aware episodic training with robust contrastive learning. In: MM ’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022, pp. 2586–2594. http://doi.org/10.1145/3503161.3547835
Li W, Liu X, Bilen H (2022) Cross-domain few-shot learning with task-specific adapters. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, June 18-24, 2022, pp. 7151–7160. http://doi.org/10.1109/CVPR52688.2022.00702
Guan J, Zhang M, Lu Z (2020) Large-scale cross-domain few-shot learning. In: 15th Asian Conference on Computer Vision, ACCV, Kyoto, Japan, November 30 - December 4, 2020, Revised Selected Papers, Part III. Lecture Notes in Computer Science, vol. 12624, pp. 474–491. http://doi.org/10.1007/978-3-030-69535-4_29
Islam A, Chen CR, Panda R, Karlinsky L, Feris R, Radke RJ (2021) Dynamic distillation network for cross-domain few-shot recognition with unlabeled data. In: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, NeurIPS, December 6-14, 2021, Virtual, pp. 3584–3595. http://proceedings.neurips.cc/paper/2021/hash/1d6408264d31d453d556c60fe7d0459e-Abstract.html
Oh J, Kim S, Ho N, Kim J, Song H, Yun S (2022) Understanding cross-domain few-shot learning based on domain similarity and few-shot difficulty. In: Conference on Neural Information Processing Systems, NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/11b3ae28275461741026c46c0c786711-Abstract-Conference.html
Chen X, Fan H, Girshick RB, He K (2020) Improved baselines with momentum contrastive learning. arXiv arxiv:2003.04297
Chen T, Kornblith S, Norouzi M, Hinton GE (2020) A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning, ICML, 13-18 July 2020, Virtual Event. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. http://proceedings.mlr.press/v119/chen20j.html
Grill J, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BÁ, Guo Z, Azar MG, Piot B, Kavukcuoglu K, Munos R, Valko M (2020) Bootstrap your own latent - A new approach to self-supervised learning. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual. http://proceedings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html
Noroozi M, Favaro P (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) 14th European Conference of Computer Vision, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI. Lecture Notes in Computer Science, vol. 9910, pp. 69–84. http://doi.org/10.1007/978-3-319-46466-4_5
Borkowski P, Ciesielski K, Klopotek MA (2014) Unsupervised aggregation of categories for document labelling. In: Foundations of Intelligent Systems - 21st International Symposium, ISMIS, Roskilde, Denmark, June 25-27, 2014. Proceedings. Lecture Notes in Computer Science, vol. 8502, pp. 335–344. http://doi.org/10.1007/978-3-319-08326-1_34
Wang D, Li T, Deng P, Liu J, Huang W, Zhang F (2023) A generalized deep learning algorithm based on NMF for multi-view clustering. IEEE Transactions on Big Data 9(1):328–340. http://doi.org/10.1109/TBDATA.2022.3163584
Wang D, Li T, Deng P, Zhang F, Huang W, Zhang P, Liu J (2023) A generalized deep learning clustering algorithm based on non-negative matrix factorization. ACM Transactions on Knowledge Discovery from Data 17(7):99–19920. http://doi.org/10.1145/3584862
Wang D, Li T, Huang W, Luo Z, Deng P, Zhang P, Ma M (2023) A multi-view clustering algorithm based on deep semi-nmf. Information Fusion 99:101884. http://doi.org/10.1016/J.INFFUS.2023.101884
Wang D, Li T, Deng P, Luo Z, Zhang P, Liu K, Huang W (2024) Dnsrf: Deep network-based semi-nmf representation framework. ACM Transactions on Intelligent Systems and Technology. http://doi.org/10.1145/3670408
Gidaris S, Singh P, Komodakis N (2018) Unsupervised representation learning by predicting image rotations. In: 6th International Conference on Learning Representations, ICLR, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. http://openreview.net/forum?id=S1v4N2l0-
Chen X, He K (2021) Exploring simple siamese representation learning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Virtual, June 19-25, 2021, pp. 15750–15758. http://doi.org/10.1109/CVPR46437.2021.01549 . http://openaccess.thecvf.com/content/CVPR2021/html/Chen_Exploring_Simple_Siamese_Representation_Learning_CVPR_2021_paper.html
Tian Y, Chen X, Ganguli S (2021) Understanding self-supervised learning dynamics without contrastive pairs. In: Proceedings of the 38th International Conference on Machine Learning, ICML, 18-24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 10268–10278. http://proceedings.mlr.press/v139/tian21a.html
Zbontar J, Jing L, Misra I, LeCun Y, Deny S (2021) Barlow twins: Self-supervised learning via redundancy reduction. In: Proceedings of the 38th International Conference on Machine Learning, ICML, 18-24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 12310–12320. http://proceedings.mlr.press/v139/zbontar21a.html
Ren M, Triantafillou E, Ravi S, Snell J, Swersky K, Tenenbaum JB, Larochelle H, Zemel RS (2018) Meta-learning for semi-supervised few-shot classification. In: 6th International Conference on Learning Representations, ICLR, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. http://openreview.net/forum?id=HJcSzz-CZ
Codella NCF, Rotemberg V, Tschandl P, Celebi ME, Dusza SW, Gutman DA, Helba B, Kalloo A, Liopyris K, Marchetti MA, Kittler H, Halpern A (2019) Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). arXiv arxiv:1902.03368
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. http://doi.org/10.1109/CVPR.2016.90
Acknowledgements
This work was supported by the Natural Science Basic Research Program of Shaanxi Province (Grant No.2021JQ-487) and the Key Laboratory of Manufacturing Equipment of Shaanxi Province (Grant No.JXZZZB-2022-02).
Author information
Contributions
Yuandong Bi conceptualized and designed the algorithm, implemented the initial codebase, and prepared the original manuscript draft. Hong Zhu supervised the project, provided strategic direction in algorithm development and testing, and conducted a thorough review and final approval of the manuscript prior to submission. Jing Shi provided essential theoretical insights, contributed to algorithm improvements, and critically revised the manuscript for important intellectual content. Bin Song contributed to the development and fine-tuning of the algorithm.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.