dino

Exploring the robustness of FAIR's new DINO model


How robust are DINOs? 🦖

Work in progress. I will soon update this post with results on ImageNet-A, -R and -D.

FAIR recently released a new paper on self-supervised Vision Transformers trained with DINO (Caron et al.). The method uses self-supervised learning (self-distillation with no labels) to pre-train Vision Transformers on unlabeled image datasets.

The models achieve strong performance when a k-NN classifier is evaluated on top of the frozen features. How well does that hold up under distribution shift? We tested the publicly available models on ImageNet-C, a standard robustness benchmark comprising 15 image corruptions in the test set and 4 additional corruptions in the development/holdout set:

[Figure: ImageNet-C]
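
For reference, the official ImageNet-C release ships one folder per corruption with one subfolder per severity level (1 to 5), each containing the usual ImageNet class folders. A minimal loading sketch under that assumption (the root path, batch size, and corruption lists below are illustrative, not part of the original evaluation code):

```python
import torch
import torchvision
from torchvision import transforms

# Standard ImageNet normalization; the released ImageNet-C images are already 224x224 crops,
# so only tensor conversion and normalization are applied here.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

TEST_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness", "contrast",
    "elastic_transform", "pixelate", "jpeg_compression",
]
HOLDOUT_CORRUPTIONS = ["speckle_noise", "gaussian_blur", "spatter", "saturate"]

def corruption_loader(root, corruption, severity, batch_size=256):
    """Assumed layout after extracting the tar files: <root>/<corruption>/<severity>/<class>/"""
    dataset = torchvision.datasets.ImageFolder(
        f"{root}/{corruption}/{severity}", transform=preprocess)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                       shuffle=False, num_workers=8)
```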

Results

We follow the evaluation script by Caron et al. and test k = 10, 20, 100, 200 for the k-NN evaluation; a minimal sketch of this readout follows the table below. To select the best k, we follow our typical evaluation protocol and first compute the mean corruption error (mCE, lower is better) across the development ("holdout") corruptions of ImageNet-C:

Model       k=10   k=20   k=100   k=200
DeiT-S/16   59.1   58.4   58.7    59.3
DeiT-S/8    50.2   49.7   50.4    51.1
ViT-B/16    50.9   50.3   n/a     n/a
ViT-B/8     50.5   49.7   49.9    50.6
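
The k-NN readout itself is simple: each test feature votes over the labels of its k nearest training features, weighted by a temperature-scaled cosine similarity. Below is a simplified, single-batch sketch in the spirit of the evaluation code by Caron et al.; it assumes pre-extracted, L2-normalized features and omits the chunking used in the original script:

```python
import torch

@torch.no_grad()
def knn_predict(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    """Temperature-weighted k-NN classification on L2-normalized features."""
    sims = test_feats @ train_feats.t()           # cosine similarities, (n_test, n_train)
    topk_sims, topk_idx = sims.topk(k, dim=1)     # k most similar training samples
    topk_labels = train_labels[topk_idx]          # their labels, (n_test, k)
    weights = (topk_sims / T).exp()               # similarity-based vote weights
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)   # accumulate weighted votes per class
    return votes.argmax(dim=1)                    # predicted class per test sample
```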

The table above confirms the authors' recommendation to use k = 20 for the k-NN evaluation. Next, we compute the mCE on the ImageNet-C test set:

Model       #params   mCE (Top-1 Acc), non-adapted
DeiT-S/16   21M       65.3% (48.3%)
DeiT-S/8    21M       58.9% (53.5%)
ViT-B/16    85M       57.4% (54.6%)
ViT-B/8     85M       59.1% (53.3%)
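
As a reminder, the mCE normalizes a model's per-corruption error by the error of the AlexNet baseline from Hendrycks and Dietterich and then averages over the 15 test corruptions. A minimal sketch, assuming error rates have already been computed per corruption and severity (the AlexNet reference values have to be taken from the ImageNet-C paper or repository):

```python
import numpy as np

TEST_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness", "contrast",
    "elastic_transform", "pixelate", "jpeg_compression",
]

def mean_corruption_error(model_err, alexnet_err):
    """model_err / alexnet_err: dicts mapping corruption name -> list of 5 top-1 error
    rates (one per severity). Returns the mCE in percent (lower is better)."""
    ces = [np.sum(model_err[c]) / np.sum(alexnet_err[c]) for c in TEST_CORRUPTIONS]
    return 100.0 * float(np.mean(ces))
```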

Let’s compare these to convolutional models obtained by standard supervised training on ImageNet, optionally with data augmentation.

Here are some reference results taken from our recent batchnorm adaptation paper. In terms of parameter count, the ResNet-50 is roughly comparable to DeiT-S, and the ResNeXt-101 is roughly comparable to ViT-B.

Model                          #params   mCE (Top-1 Acc), non-adapted   mCE (Top-1 Acc), adapted
ResNet-50, Baseline            26M       76.7% (39.2%)                  62.2% (50.7%)
ResNet-50, DeepAug+AugMix      26M       53.6% (58.1%)                  45.4% (64.5%)
ResNeXt-101, Baseline          88M       66.6%                          56.8%
ResNeXt-101, DeepAug+AugMix    88M       44.5% (65.2%)                  38.0% (70.3%)
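
For context, the "adapted" column re-estimates the batch normalization statistics on the corrupted test data before evaluation. One simple way to do this in PyTorch is sketched below; it is a simplified variant of the procedure in the paper, and `corrupted_loader` is assumed to iterate over corrupted test images (e.g. from the loading sketch above):

```python
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)

# Reset the batch-norm statistics and switch to a cumulative moving average,
# so the running stats become the plain average over the test batches seen below.
for m in model.modules():
    if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
        m.reset_running_stats()
        m.momentum = None

model.train()          # train mode only so that BN layers update their running stats
with torch.no_grad():  # no gradients and no parameter updates, statistics only
    for images, _ in corrupted_loader:   # corrupted_loader: assumed DataLoader of test images
        model(images)

model.eval()           # evaluate with the adapted statistics
```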

Conclusions

  1. Smaller patch sizes do not necessarily improve robustness: going from 16x16 to 8x8 patches improves clean-image accuracy for DeiT-S from 74.5% to 78.3% (+3.8 points) and corruption top-1 accuracy from 48.3% to 53.5% (+5.2 points). On ViT-B, however, corruption top-1 accuracy drops from 54.6% to 53.3% when decreasing the patch size.
  2. Vision Transformers pre-trained with DINO already outperform vanilla convolutional models of comparable size (ResNet-50 or ResNeXt-101) trained with supervised learning. For the 8x8 models, robustness is close to the best comparably sized ConvNets trained with DeepAugment/AugMix augmentation (58.9% vs. 53.6% mCE for DeiT-S/8; result for ViT pending). For a fair evaluation, it might be interesting to compute nearest neighbors on augmented samples (see the sketch below).
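
As a rough illustration of that last point, the k-NN memory bank could be built from augmented copies of the training images, for instance with torchvision's AugMix transform (requires torchvision >= 0.13); `extract_features` is a hypothetical helper that returns L2-normalized DINO features, and `knn_predict` refers to the sketch above:

```python
from torchvision import datasets, transforms

# AugMix-augmented training views for the k-NN memory bank.
augmix = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.AugMix(),                     # torchvision >= 0.13
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=augmix)
# train_feats, train_labels = extract_features(model, train_set)    # hypothetical helper
# preds = knn_predict(train_feats, train_labels, test_feats, k=20)  # see sketch above
```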

Full Results

Here are the full results for the top-1 accuracy, averaged across five severities on the ImageNet-C test set with k=20:

corruption          DeiT-S/8   DeiT-S/16   ViT-B/8   ViT-B/16
gaussian_noise      46.1       42.2        45.0      46.7
shot_noise          44.2       40.7        43.2      46.6
impulse_noise       42.9       40.0        42.6      44.4
defocus_blur        49.5       43.9        52.0      55.6
glass_blur          43.9       36.2        43.1      44.4
motion_blur         52.9       42.9        51.0      49.2
zoom_blur           43.5       33.0        44.8      42.1
snow                56.8       47.1        53.7      53.2
frost               57.2       48.6        54.5      54.4
fog                 57.4       50.6        57.0      58.2
brightness          74.9       70.8        73.7      73.3
contrast            56.4       51.6        56.5      59.1
elastic_transform   60.7       56.8        60.6      61.5
pixelate            61.3       60.6        63.1      63.7
jpeg_compression    54.8       59.3        57.9      66.0

References

Caron et al., Emerging Properties in Self-Supervised Vision Transformers, 2021.
Hendrycks and Dietterich, Benchmarking Neural Network Robustness to Common Corruptions and Perturbations, ICLR 2019.
Schneider et al., Improving robustness against common corruptions by covariate shift adaptation, NeurIPS 2020.

Contact

If you spot a mistake or typo, please open an issue. For other questions, feel free to reach out.