Exploring the robustness of FAIR's new DINO model
Work in progress. I will soon update this post with results on ImageNet-A, -R and -D.
FAIR recently released a new paper, Emerging Properties in Self-Supervised Vision Transformers (Caron et al., 2021), introducing DINO. The method uses self-supervised learning to pre-train Vision Transformers on unlabeled image datasets.
The models achieve strong performance when evaluating k-NN classifiers on top of the frozen features. How well does that hold up under distribution shift? We tested the publicly available models on ImageNet-C, a standard robustness benchmark with 15 image corruptions in the test set and 4 additional corruptions in the development/holdout set:
We follow the evaluation script by Caron et al. and test k = 10, 20, 100, 200 for the k-NN evaluation. To select the best k, we follow our typical evaluation protocol and first compute the mean corruption error (mCE; lower is better) across the development (“holdout”) corruptions in ImageNet-C:
model | k=10 | k=20 | k=100 | k=200 |
---|---|---|---|---|
DeiT-S/16 | 59.1 | 58.4 | 58.7 | 59.3 |
DeiT-S/8 | 50.2 | 49.7 | 50.4 | 51.1 |
ViT-B/16 | 50.9 | 50.3 | – | – |
ViT-B/8 | 50.5 | 49.7 | 49.9 | 50.6 |
which confirms the authors’ recommendation to use k = 20 for k-NN evaluation.
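For reference, the k-NN evaluation can be sketched as a soft-voting classifier on frozen, L2-normalized features: each test image votes among its k nearest training neighbors, with votes weighted by exp(similarity / T). The NumPy sketch below is a minimal re-implementation under those assumptions; the function name and the temperature T = 0.07 are illustrative choices here, not necessarily identical to the official evaluation script.

```python
import numpy as np

def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    """Soft-voting k-NN on L2-normalized features (illustrative sketch).

    Each test sample is assigned the class with the highest sum of
    exp(similarity / T) over its k most similar training samples.
    """
    # Cosine similarity via normalized dot products.
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test_feats @ train_feats.T                      # (n_test, n_train)

    # Indices and similarities of the k nearest training samples.
    topk_idx = np.argsort(-sims, axis=1)[:, :k]
    topk_sims = np.take_along_axis(sims, topk_idx, axis=1)
    weights = np.exp(topk_sims / T)

    # Accumulate weighted votes per class for each test sample.
    votes = np.zeros((test_feats.shape[0], num_classes))
    for i in range(test_feats.shape[0]):
        np.add.at(votes[i], train_labels[topk_idx[i]], weights[i])
    return votes.argmax(axis=1)
```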
Next, we compute mCE on the ImageNet-C test set:
model | #params | mCE (Top-1 Acc), non-adapted |
---|---|---|
DeiT-S/16 | 21M | 65.3% (48.3%) |
DeiT-S/8 | 21M | 58.9% (53.5%) |
ViT-B/16 | 85M | 57.4% (54.6%) |
ViT-B/8 | 85M | 59.1% (53.3%) |
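As a reminder, mCE normalizes each corruption's error by AlexNet's error on the same corruption before averaging over corruptions (Hendrycks & Dietterich, 2019). A minimal sketch, where the input dictionaries (mapping corruption names to per-severity top-1 error rates) are hypothetical stand-ins for real evaluation results:

```python
def mce(model_err, alexnet_err):
    """Mean corruption error (Hendrycks & Dietterich, 2019).

    Both arguments map corruption name -> list of top-1 error rates,
    one per severity level (1-5). Each corruption's summed error is
    normalized by AlexNet's summed error on the same corruption,
    then the normalized errors are averaged over corruptions.
    """
    ces = []
    for corruption, errs in model_err.items():
        ce = sum(errs) / sum(alexnet_err[corruption])
        ces.append(ce)
    return 100.0 * sum(ces) / len(ces)  # reported as a percentage
```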
Let’s compare these to convolutional models obtained by standard supervised training on ImageNet, optionally with data augmentation.
Here are some reference results taken from our recent batchnorm adaptation paper. In terms of parameter count, the ResNet-50 is roughly comparable to DeiT-S, and the ResNeXt-101 is roughly comparable to ViT-B.
model | #params | mCE (Top-1 Acc), non-adapted | mCE (Top-1 Acc), adapted |
---|---|---|---|
ResNet-50, Baseline | 26M | 76.7% (39.2%) | 62.2% (50.7%) |
ResNet-50, DeepAug+AugMix | 26M | 53.6% (58.1%) | 45.4% (64.5%) |
ResNeXt-101, Baseline | 88M | 66.6% | 56.8% |
ResNeXt-101, DeepAug+AugMix | 88M | 44.5% (65.2%) | 38.0% (70.3%) |
For DeiT-S, decreasing the patch size from 16 to 8 increases clean top-1 accuracy from 74.5% to 78.3% (+3.8% points). For robustness, we see an increase from 48.3% to 53.5% (+5.2% points) in ImageNet-C top-1 accuracy. However, for ViT-B, performance goes down from 54.6% to 53.3% top-1 accuracy when decreasing the patch size.

Here are the full results for top-1 accuracy, averaged across the five severities on the ImageNet-C test set with k = 20:
corruption | DeiT-S/8 | DeiT-S/16 | ViT-B/8 | ViT-B/16 |
---|---|---|---|---|
gaussian_noise | 46.1 | 42.2 | 45.0 | 46.7 |
shot_noise | 44.2 | 40.7 | 43.2 | 46.6 |
impulse_noise | 42.9 | 40.0 | 42.6 | 44.4 |
defocus_blur | 49.5 | 43.9 | 52.0 | 55.6 |
glass_blur | 43.9 | 36.2 | 43.1 | 44.4 |
motion_blur | 52.9 | 42.9 | 51.0 | 49.2 |
zoom_blur | 43.5 | 33.0 | 44.8 | 42.1 |
snow | 56.8 | 47.1 | 53.7 | 53.2 |
frost | 57.2 | 48.6 | 54.5 | 54.4 |
fog | 57.4 | 50.6 | 57.0 | 58.2 |
brightness | 74.9 | 70.8 | 73.7 | 73.3 |
contrast | 56.4 | 51.6 | 56.5 | 59.1 |
elastic_transform | 60.7 | 56.8 | 60.6 | 61.5 |
pixelate | 61.3 | 60.6 | 63.1 | 63.7 |
jpeg_compression | 54.8 | 59.3 | 57.9 | 66.0 |
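The per-corruption numbers above are plain means of top-1 accuracy over the five severity levels. A minimal aggregation helper, where the record format is an assumption for illustration:

```python
from collections import defaultdict

def average_over_severities(records):
    """Average top-1 accuracy per corruption across severity levels.

    records: iterable of (corruption_name, severity, top1_accuracy).
    Returns a dict mapping corruption name -> mean top-1 accuracy.
    """
    acc = defaultdict(list)
    for corruption, severity, top1 in records:
        acc[corruption].append(top1)
    return {c: sum(v) / len(v) for c, v in acc.items()}
```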
If you spot a mistake or typo, please open an issue. For other questions, feel free to reach out.