paper https://arxiv.org/abs/2012.12877

code https://github.com/facebookresearch/deit?fbclid=IwAR17wEd0m5jZaUDY-VmzldfGN5jJqiDxDV2pONGda0X1YKjYdT1megrauq0

post source https://housekdk.gitbook.io/ml/ml/computer-vision-transformer-based/deit-training-data-efficient-image-transformers-and-distillation-through-attention

1. Overview

Released by Facebook Research in December 2020; https://github.com/facebookresearch/deit
Applies knowledge distillation to reach high accuracy without pre-training on a large-scale dataset
Most of the architecture is identical to ViT; training adds techniques such as data augmentation and regularization, and a distillation token is added alongside the existing class token
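
As a rough illustration of the extra token, the sketch below builds the input sequence for a ViT-style model and for a DeiT-style model using plain Python lists. The dimensions and the helper name `build_token_sequence` are hypothetical. The ordering (class token, distillation token, then patch tokens) follows the distilled model in the repository linked above. Real implementations use learnable tensors rather than lists.

```python
def build_token_sequence(patch_tokens, embed_dim, use_distillation_token):
    """Prepend special tokens to the patch embeddings (toy version with lists)."""
    # Placeholder for the learnable class token (zeros here, learned in practice).
    cls_token = [0.0] * embed_dim
    sequence = [cls_token]
    if use_distillation_token:
        # Placeholder for the learnable distillation token added by DeiT.
        dist_token = [0.0] * embed_dim
        sequence.append(dist_token)
    return sequence + patch_tokens


# Hypothetical toy sizes: 3 patches, embedding dimension 4.
embed_dim = 4
patch_tokens = [[float(i)] * embed_dim for i in range(1, 4)]

vit_sequence = build_token_sequence(patch_tokens, embed_dim, use_distillation_token=False)
deit_sequence = build_token_sequence(patch_tokens, embed_dim, use_distillation_token=True)

print(len(vit_sequence), len(deit_sequence))
```

The DeiT-style sequence is exactly one token longer than the ViT-style one; everything after the special tokens is unchanged.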

2. Knowledge Distillation

Summary

Hinton et al., "Distilling the Knowledge in a Neural Network" (NIPS 2014)
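One-line summary: the student catches up with the master (knowledge accumulated in a big network is transferred to a small network so that the small network performs comparably to the big one)
- big = teacher, small = student
- The teacher model's inductive bias is transferred in a soft way
- The output of a model trained with supervised learning is not a hard label but a distribution computed from its logits, so it also carries weights for the other classes.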
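
To make the last point concrete, here is a minimal sketch comparing a hard label with the softmax output for the same logits. The logit values and the three-class setup are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three classes, e.g. (cat, dog, car).
logits = [4.0, 2.5, 0.5]

soft_label = softmax(logits)
top_class = max(range(len(logits)), key=lambda c: logits[c])
hard_label = [1.0 if c == top_class else 0.0 for c in range(len(logits))]

print("soft:", [round(p, 3) for p in soft_label])
print("hard:", hard_label)
```

The hard label keeps only the winning class, while the soft output still says that the second class is far more plausible than the third. That relative weighting is the extra information a student can learn from.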

Objective

Soft Distillation:
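Minimize the KL divergence between the teacher model's softmax distribution and the student model's softmax distribution.
Student loss + Distillation loss
- Cross entropy between the ground truth and the student's hard predictions (standard softmax) + cross entropy between the student's soft predictions and the teacher's soft targets (this equals the KL term above up to a constant that does not depend on the student)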
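
One common way to write this objective, the form used in the DeiT paper and expressed here with this page's symbols, is

$$ \mathcal{L}(x;W) = (1-\lambda)\,\mathcal{L}_{CE}(y, \sigma(z_s)) + \lambda \tau^2\,\mathrm{KL}(\sigma(z_s/\tau), \sigma(z_t/\tau)) $$

where $\lambda$ balances the two terms and $\tau$ is the temperature. The sketch below implements it with the standard library only. The function names and the default values `lam=0.5` and `tau=3.0` are illustrative assumptions, not settings taken from the source. The KL term is computed as KL(teacher || student), which is an assumption about the direction.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross entropy H(target, pred) for two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def soft_distillation_loss(student_logits, teacher_logits, label, lam=0.5, tau=3.0):
    """(1 - lam) * CE(ground truth, student) + lam * tau^2 * KL(teacher || student)."""
    one_hot = [1.0 if c == label else 0.0 for c in range(len(student_logits))]
    # Student loss: standard softmax (temperature 1) against the ground truth.
    student_loss = cross_entropy(one_hot, softmax(student_logits))
    # Distillation loss: both distributions softened with the same temperature.
    soft_student = softmax(student_logits, tau)
    soft_teacher = softmax(teacher_logits, tau)
    distill_loss = kl_divergence(soft_teacher, soft_student)
    return (1.0 - lam) * student_loss + lam * tau ** 2 * distill_loss


# Hypothetical logits: the second teacher disagrees with the student.
agree = soft_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], label=2)
disagree = soft_distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], label=2)
print(agree < disagree)
```

When the teacher and student logits match, the KL term vanishes and only the weighted student loss remains. Any disagreement with the teacher increases the total.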
Hard Distillation:

$$ \mathcal{L}(x;W) = 0.5 \cdot \mathcal{L}_{CE}(y, \sigma(z_s)) + 0.5 \cdot \mathcal{L}_{CE}(y_t, \sigma(z_s)),\; y_t = \text{argmax}_c z_t(c) $$
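
The hard-distillation objective above can be sketched in a few lines of standard-library Python. The function names and the example logits are hypothetical. The teacher's argmax is simply treated as a second label for the student.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def hard_distillation_loss(student_logits, teacher_logits, label):
    """0.5 * CE(y, softmax(z_s)) + 0.5 * CE(y_t, softmax(z_s)), with y_t = argmax_c z_t(c)."""
    log_probs = log_softmax(student_logits)
    # Hard decision of the teacher: index of its largest logit.
    teacher_label = max(range(len(teacher_logits)), key=lambda c: teacher_logits[c])
    ce_true = -log_probs[label]             # cross entropy with the ground-truth label
    ce_teacher = -log_probs[teacher_label]  # cross entropy with the teacher's hard label
    return 0.5 * ce_true + 0.5 * ce_teacher


# Hypothetical logits for a 3-class problem with ground-truth class 2.
student_logits = [1.0, 2.0, 3.0]
same = hard_distillation_loss(student_logits, [0.0, 0.0, 5.0], label=2)   # teacher picks class 2
other = hard_distillation_loss(student_logits, [5.0, 0.0, 0.0], label=2)  # teacher picks class 0
print(same < other)
```

If the teacher agrees with the ground truth, the loss reduces to the ordinary cross entropy. If it disagrees, the student is pulled toward both labels with equal weight.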

Source: https://intellabs.github.io/distiller/knowledge_distillation.html

Temperature