Paper: https://arxiv.org/abs/2012.12877

Code: https://github.com/facebookresearch/deit

Article source: https://housekdk.gitbook.io/ml/ml/computer-vision-transformer-based/deit-training-data-efficient-image-transformers-and-distillation-through-attention

1. Overview

2. Knowledge Distillation

Summary

Objective

Hard-label distillation trains the student against both the ground-truth label $y$ and the teacher's hard prediction $y_t$, weighting the two cross-entropy terms equally:

$$ \mathcal{L}(x; W) = 0.5 \cdot \mathcal{L}_{CE}(y, \sigma(z_s)) + 0.5 \cdot \mathcal{L}_{CE}(y_t, \sigma(z_s)), \quad y_t = \text{argmax}_c\, z_t(c) $$

Source: https://intellabs.github.io/distiller/knowledge_distillation.html
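A minimal PyTorch sketch of this objective (the function name and tensor shapes are my own; `F.cross_entropy` applies the softmax $\sigma$ internally, so raw logits are passed directly):

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    """L(x; W) = 0.5 * L_CE(y, sigma(z_s)) + 0.5 * L_CE(y_t, sigma(z_s))."""
    # y_t = argmax_c z_t(c): the teacher's hard decision; detach so no
    # gradients flow back into the teacher.
    y_t = teacher_logits.detach().argmax(dim=-1)
    loss_true = F.cross_entropy(student_logits, labels)   # L_CE(y, sigma(z_s))
    loss_teacher = F.cross_entropy(student_logits, y_t)   # L_CE(y_t, sigma(z_s))
    return 0.5 * loss_true + 0.5 * loss_teacher
```

Here `student_logits` and `teacher_logits` are `(batch, num_classes)` tensors and `labels` is a `(batch,)` tensor of class indices.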

Temperature
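Temperature enters in the soft-distillation variant: both teacher and student logits are divided by a temperature $\tau$ before the softmax, which smooths the teacher's output distribution. A hedged sketch of the standard Hinton-style soft loss (the $\tau^2$ scaling and the mixing weight $\alpha$ follow the usual convention; the function name and defaults are my own assumptions):

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           labels: torch.Tensor,
                           tau: float = 3.0,
                           alpha: float = 0.5) -> torch.Tensor:
    # KL divergence between temperature-softened teacher and student
    # distributions; the tau**2 factor keeps gradient magnitudes roughly
    # constant as tau varies.
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits.detach() / tau, dim=-1),
                  reduction="batchmean") * (tau ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return (1.0 - alpha) * ce + alpha * kd
```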