dino

Exploring the robustness of FAIR's new DINO model


How robust are DINOs? 🦖

Work in progress. I will soon update this post with results on ImageNet-A, -R and -D.

FAIR recently released a new paper on self-supervised Vision Transformers trained with DINO (Caron et al.). The method uses self-supervised learning (self-distillation with no labels) to pre-train Vision Transformers on unlabeled image datasets.

The models achieve strong performance when a k-NN classifier is evaluated on top of the frozen features. How well does that hold up under distribution shift? We tested the publicly available models on ImageNet-C, a standard robustness benchmark comprising 15 image corruptions in the test set and 4 additional corruptions in the development/holdout set:

[Figure: ImageNet-C]
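
For reference, the official ImageNet-C release ships one folder per corruption with one subfolder per severity level (1 to 5), each containing the usual ImageNet class folders. A minimal loading sketch under that assumption (the root path, batch size, and corruption lists below are illustrative, not part of the original evaluation code):

```python
import torch
import torchvision
from torchvision import transforms

# Standard ImageNet normalization; the released ImageNet-C images are already 224x224 crops,
# so only tensor conversion and normalization are applied here.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

TEST_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness", "contrast",
    "elastic_transform", "pixelate", "jpeg_compression",
]
HOLDOUT_CORRUPTIONS = ["speckle_noise", "gaussian_blur", "spatter", "saturate"]

def corruption_loader(root, corruption, severity, batch_size=256):
    """Assumed layout after extracting the tar files: <root>/<corruption>/<severity>/<class>/"""
    dataset = torchvision.datasets.ImageFolder(
        f"{root}/{corruption}/{severity}", transform=preprocess)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                       shuffle=False, num_workers=8)
```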

Results

We follow the evaluation script by Caron et al. and test k = 10, 20, 100, 200 for the k-NN evaluation; a minimal sketch of this readout follows the table below. To select the best k, we follow our typical evaluation protocol and first compute the mean corruption error (mCE, lower is better) across the development ("holdout") corruptions of ImageNet-C:

Model       k=10   k=20   k=100   k=200
DeiT-S/16   59.1   58.4   58.7    59.3
DeiT-S/8    50.2   49.7   50.4    51.1
ViT-B/16    50.9   50.3   n/a     n/a
ViT-B/8     50.5   49.7   49.9    50.6
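
The k-NN readout itself is simple: each test feature votes over the labels of its k nearest training features, weighted by a temperature-scaled cosine similarity. Below is a simplified, single-batch sketch in the spirit of the evaluation code by Caron et al.; it assumes pre-extracted, L2-normalized features and omits the chunking used in the original script:

```python
import torch

@torch.no_grad()
def knn_predict(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    """Temperature-weighted k-NN classification on L2-normalized features."""
    sims = test_feats @ train_feats.t()           # cosine similarities, (n_test, n_train)
    topk_sims, topk_idx = sims.topk(k, dim=1)     # k most similar training samples
    topk_labels = train_labels[topk_idx]          # their labels, (n_test, k)
    weights = (topk_sims / T).exp()               # similarity-based vote weights
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)   # accumulate weighted votes per class
    return votes.argmax(dim=1)                    # predicted class per test sample
```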

The table above confirms the authors' recommendation to use k = 20 for the k-NN evaluation. Next, we compute the mCE on the ImageNet-C test set:

Model       #params   mCE (Top-1 Acc), non-adapted
DeiT-S/16   21M       65.3% (48.3%)
DeiT-S/8    21M       58.9% (53.5%)
ViT-B/16    85M       57.4% (54.6%)
ViT-B/8     85M       59.1% (53.3%)
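
As a reminder, the mCE normalizes a model's per-corruption error by the error of the AlexNet baseline from Hendrycks and Dietterich and then averages over the 15 test corruptions. A minimal sketch, assuming error rates have already been computed per corruption and severity (the AlexNet reference values have to be taken from the ImageNet-C paper or repository):

```python
import numpy as np

TEST_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness", "contrast",
    "elastic_transform", "pixelate", "jpeg_compression",
]

def mean_corruption_error(model_err, alexnet_err):
    """model_err / alexnet_err: dicts mapping corruption name -> list of 5 top-1 error
    rates (one per severity). Returns the mCE in percent (lower is better)."""
    ces = [np.sum(model_err[c]) / np.sum(alexnet_err[c]) for c in TEST_CORRUPTIONS]
    return 100.0 * float(np.mean(ces))
```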

Let’s compare these to convolutional models obtained by standard supervised training on ImageNet, optionally with data augmentation.

Here are some reference results taken from our recent batchnorm adaptation paper. In terms of parameter count, the ResNet-50 is roughly comparable to DeiT-S, and the ResNeXt-101 is roughly comparable to ViT-B.

Model                          #params   mCE (Top-1 Acc), non-adapted   mCE (Top-1 Acc), adapted
ResNet-50, Baseline            26M       76.7% (39.2%)                  62.2% (50.7%)
ResNet-50, DeepAug+AugMix      26M       53.6% (58.1%)                  45.4% (64.5%)
ResNeXt-101, Baseline          88M       66.6%                          56.8%
ResNeXt-101, DeepAug+AugMix    88M       44.5% (65.2%)                  38.0% (70.3%)
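
For context, the "adapted" column re-estimates the batch normalization statistics on the corrupted test data before evaluation. One simple way to do this in PyTorch is sketched below; it is a simplified variant of the procedure in the paper, and `corrupted_loader` is assumed to iterate over corrupted test images (e.g. from the loading sketch above):

```python
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)

# Reset the batch-norm statistics and switch to a cumulative moving average,
# so the running stats become the plain average over the test batches seen below.
for m in model.modules():
    if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
        m.reset_running_stats()
        m.momentum = None

model.train()          # train mode only so that BN layers update their running stats
with torch.no_grad():  # no gradients and no parameter updates, statistics only
    for images, _ in corrupted_loader:   # corrupted_loader: assumed DataLoader of test images
        model(images)

model.eval()           # evaluate with the adapted statistics
```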

Conclusions

  1. Smaller patch sizes do not necessarily improve robustness: going from 16x16 to 8x8 patches improves clean-image accuracy for DeiT-S from 74.5% to 78.3% (+3.8 points) and corruption top-1 accuracy from 48.3% to 53.5% (+5.2 points). On ViT-B, however, corruption top-1 accuracy drops from 54.6% to 53.3% when decreasing the patch size.
  2. Vision Transformers pre-trained with DINO already outperform vanilla convolutional models of comparable size (ResNet-50 or ResNeXt-101) trained with supervised learning. For the 8x8 models, robustness is close to the best comparably sized ConvNets trained with DeepAugment/AugMix augmentation (58.9% vs. 53.6% mCE for DeiT-S/8; result for ViT pending). For a fair evaluation, it might be interesting to compute nearest neighbors on augmented samples (see the sketch below).
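
As a rough illustration of that last point, the k-NN memory bank could be built from augmented copies of the training images, for instance with torchvision's AugMix transform (requires torchvision >= 0.13); `extract_features` is a hypothetical helper that returns L2-normalized DINO features, and `knn_predict` refers to the sketch above:

```python
from torchvision import datasets, transforms

# AugMix-augmented training views for the k-NN memory bank.
augmix = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.AugMix(),                     # torchvision >= 0.13
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=augmix)
# train_feats, train_labels = extract_features(model, train_set)    # hypothetical helper
# preds = knn_predict(train_feats, train_labels, test_feats, k=20)  # see sketch above
```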

Full Results

Here are the full results for the top-1 accuracy, averaged across five severities on the ImageNet-C test set with k=20:

corruption          DeiT-S/8   DeiT-S/16   ViT-B/8   ViT-B/16
gaussian_noise      46.1       42.2        45.0      46.7
shot_noise          44.2       40.7        43.2      46.6
impulse_noise       42.9       40.0        42.6      44.4
defocus_blur        49.5       43.9        52.0      55.6
glass_blur          43.9       36.2        43.1      44.4
motion_blur         52.9       42.9        51.0      49.2
zoom_blur           43.5       33.0        44.8      42.1
snow                56.8       47.1        53.7      53.2
frost               57.2       48.6        54.5      54.4
fog                 57.4       50.6        57.0      58.2
brightness          74.9       70.8        73.7      73.3
contrast            56.4       51.6        56.5      59.1
elastic_transform   60.7       56.8        60.6      61.5
pixelate            61.3       60.6        63.1      63.7
jpeg_compression    54.8       59.3        57.9      66.0

References

Caron et al., Emerging Properties in Self-Supervised Vision Transformers, 2021.
Hendrycks and Dietterich, Benchmarking Neural Network Robustness to Common Corruptions and Perturbations, ICLR 2019.
Schneider et al., Improving robustness against common corruptions by covariate shift adaptation, NeurIPS 2020.

Contact

If you spot a mistake or typo, please open an issue. For other questions, feel free to reach out.