CrossTransformers: spatially-aware few-shot transfer

Description of the HTML files



We include visualizations of nearest neighbors (NN) to qualitatively demonstrate supervision collapse.

For each row in the HTML files, we show the query image (left) along with its top 9 nearest neighbors under Euclidean distance.
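The retrieval itself can be implemented as a brute-force ranking by Euclidean distance. A minimal sketch (NumPy; names like `features` and `top_k_neighbors` are illustrative, not our actual code):

    import numpy as np

    def top_k_neighbors(query, features, k=9):
        # Euclidean distance from the query embedding to every
        # retrieval-set embedding (features has shape (N, D)).
        dists = np.linalg.norm(features - query[None, :], axis=1)
        # Indices of the k closest images, nearest first.
        return np.argsort(dists)[:k]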

 

Setup
The nearest-neighbor retrieval set includes 10% of images from both the ImageNet train and test sets (Meta-Dataset's split; specifically, 130 images per class).  The images are passed in batches of size 256 (Batch Norm is set to train mode) to obtain a feature vector for each image. We use a ResNet-34 Prototypical Net with 224x224 images, trained with normalized SGD. 
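For illustration, the feature-extraction loop amounts to the following sketch (PyTorch is used here purely for exposition; `embed` stands in for the ResNet-34 Prototypical Net backbone, and all names are assumptions rather than our actual code):

    import torch

    def extract_features(embed, images, batch_size=256):
        embed.train()  # keep Batch Norm in train mode, as noted above
        feats = []
        with torch.no_grad():
            for i in range(0, len(images), batch_size):
                feats.append(embed(images[i:i + batch_size]))
        return torch.cat(feats)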


Computing nearest neighbors for Prototypical Net representations proved challenging due to an entirely different source of supervision collapse:

The default implementation of Prototypical Nets produces representations that are only comparable within a single episode.

We hypothesize that the reason is twofold:

  1. Prototypical Nets are trained only on episodes: the network only ever sees fine-grained classification problems (e.g., distinguishing insects from other insects), rather than coarse-grained ones (e.g., insects versus cars). Therefore, nothing encourages the network to learn distinct, non-overlapping representations for widely different categories (e.g., the beetle insect may share a representation with the Beetle car without affecting the training loss).
  2. Batch Norm allows communication within a support set, so the final representation of each image encodes not only the image itself, but also how it contrasts with the other images in the support set (see the toy example after this list).
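The second point is easy to verify directly. In the toy example below (PyTorch, purely for illustration), perturbing the other images in a batch changes the Batch Norm output for an image that was left untouched, because the normalization statistics are computed across the batch:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm1d(4).train()  # train mode: normalize with batch statistics
    x = torch.randn(8, 4)
    y = bn(x)

    x2 = x.clone()
    x2[1:] += 10.0                  # perturb every image except the first
    y2 = bn(x2)

    # The first image is identical in both batches, yet its output differs:
    print(torch.allclose(y[0], y2[0]))  # False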

batch_norm_train_nearest_neigbors.html: This visualizes the NN for ImageNet images from the Meta-Dataset training set. For the reasons above, the results are close to random, even though all query images are taken from the training set. This is not particularly useful for analysis.

To fix the above problem, we make two modifications to Prototypical Net training:

  1. Rather than train only on fine-grained episodes, we train on episodes that contain classes sampled uniformly at random from the full ImageNet training set. This means that a single episode can now contain both cars and insects.
  2. We replace Batch Norm with Layer Norm, ensuring that there can no longer be communication within the batch (sketches of both changes follow this list).
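Both modifications are small in code. Below are hedged sketches (plain Python and PyTorch for illustration; all helper names are assumptions, not our actual implementation).

Modification 1 only changes the episode sampler, drawing classes uniformly from the full training class list:

    import random

    def sample_episode_classes(all_classes, n_way):
        # Uniform over all ImageNet training classes, so a single episode
        # can mix, e.g., insect classes with vehicle classes.
        return random.sample(all_classes, n_way)

For modification 2, one way to realize the swap is a one-group GroupNorm, which normalizes each image independently over (C, H, W), i.e. Layer-Norm-style statistics, so batch-mates can no longer communicate:

    import torch.nn as nn

    def replace_batchnorm_with_layernorm(module):
        for name, child in module.named_children():
            if isinstance(child, nn.BatchNorm2d):
                # Per-image statistics: no information flows across the batch.
                setattr(module, name, nn.GroupNorm(1, child.num_features))
            else:
                replace_batchnorm_with_layernorm(child)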

layer_norm_train_nearest_neigbors.html: This visualizes the NN after the above fixes, again for query images from the training set. The quality of the matches improves substantially, to roughly what one would expect from retrievals using a representation trained with standard ImageNet classification.

Finally, we demonstrate supervision collapse in the following file:

layer_norm_test_nearest_neigbors.html: Here the queries are taken from the test set. We see that, due to supervision collapse, the nearest neighbors are once again quite poor.