arxiv:2012.12877

Training data-efficient image transformers & distillation through attention

Published on Dec 23, 2020
Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou

Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
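As an illustration of the teacher-student strategy described in the abstract (not the authors' code), the hard variant of the token-based distillation objective can be sketched with NumPy: the class token is supervised by the ground-truth label while a separate distillation token is supervised by the convnet teacher's hard prediction. Function names and the toy data below are ours, for illustration only.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the target class
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    """Sketch of DeiT-style hard distillation: the class-token output is
    trained on the true labels, the distillation-token output on the
    teacher's hard decisions, with equal weight on both terms."""
    teacher_labels = teacher_logits.argmax(axis=-1)  # teacher's hard prediction
    return 0.5 * cross_entropy(cls_logits, labels) + \
           0.5 * cross_entropy(dist_logits, teacher_labels)

# toy batch: 4 samples, 10 classes (random logits stand in for model outputs)
rng = np.random.default_rng(0)
cls_logits = rng.normal(size=(4, 10))
dist_logits = rng.normal(size=(4, 10))
teacher_logits = rng.normal(size=(4, 10))
labels = np.array([1, 2, 3, 4])
loss = hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits)
```

At inference, the paper fuses the predictions of the two tokens; the sketch above only covers the training objective.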

