The State Of Knowledge Distillation For Classification Tasks

Paper: https://arxiv.org/pdf/1912.10850.pdf
Code: https://github.com/karanchahal/distiller
Internal seminar: https://jmarple.dooray.com/project/drive-files/2762387004620852079

1. Summary

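The paper tests Knowledge Distillation (KD) methods that claim SOTA on simple classification tasks (CIFAR10, CIFAR100) and finds that most of them are very hard to reproduce. Feature distillation methods in particular failed to reach proper performance. This shows that the listed methods generalize poorly and work only with specific network architectures and training settings. In the end, the Hinton loss with a well-tuned temperature, combined with appropriate data augmentation, performed best. One point worth noting is that a major factor governing distillation performance is the teacher's network architecture.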

2. Baselines

2-1) Knowledge Distillation Loss(KD loss)

Paper: https://arxiv.org/pdf/1503.02531.pdf

The loss consists of the Cross-Entropy between the student and the labels + the KL loss between the teacher and the student.
In the KL loss of the second term, softmax is applied to the teacher and log-softmax to the student (Entropy). The original KD loss uses softmax on both sides, but the paper's notation appears to have been written this way to match PyTorch's KLDivLoss function.
1. Implementation question: why softmax for the teacher and log-softmax for the student? *https://github.com/peterliht/knowledge-distillation-pytorch/issues/2*
2. Answer: KLDivLoss takes log-probabilities as input and probabilities as target. *https://pytorch.org/docs/master/generated/torch.nn.KLDivLoss.html*
T is the temperature and alpha is the balancing weight.
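
The two terms above can be sketched in PyTorch as follows. This is a minimal illustration, not the code from the paper's repository: the function name `kd_loss`, the default values of `T` and `alpha`, and the choice to put `alpha` on the KL term are assumptions. The `T * T` scaling follows the recommendation in the Hinton paper. Note how the student goes through `log_softmax` and the teacher through `softmax`, because `F.kl_div` expects log-probabilities as input and probabilities as target.

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss (sketch; weighting convention is an assumption)."""
    # Term 1: ordinary cross-entropy between student predictions and labels.
    ce = F.cross_entropy(student_logits, labels)

    # Term 2: KL divergence between temperature-softened distributions.
    # F.kl_div takes log-probabilities as input and probabilities as target.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    # T^2 keeps the soft-target gradients on a comparable scale as T changes.
    return (1.0 - alpha) * ce + alpha * (T * T) * kl
```

When the student and teacher logits are identical the KL term vanishes, so the loss reduces to the weighted cross-entropy alone.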

2-2) Activation Boundary Distillation (AB)

Paper: https://arxiv.org/pdf/1811.03233.pdf

Whereas existing distillation methods try to approximate the output values (magnitudes) of neurons (L2 loss), this method focuses on learning the activation boundary, i.e. the separating hyperplane that determines whether a neuron is activated or deactivated (Activation Transfer loss).
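
The idea can be illustrated with a margin-based squared hinge on pre-activation values, which is one common way to write an activation transfer loss. This is only a sketch under assumptions: the name `ab_loss`, the `margin` default, and the sum reduction are illustrative choices, and the connector layer that maps student features to the teacher's shape is omitted. The student is penalized only where it sits on the wrong side of the teacher's activation boundary (or too close to it), not for differences in magnitude.

```python
import torch


def ab_loss(student_pre, teacher_pre, margin=1.0):
    """Activation-boundary style loss on pre-activations (hedged sketch)."""
    # Teacher neuron is considered active when its pre-activation is positive.
    teacher_on = teacher_pre > 0

    # Teacher active but student not above +margin: push the student up.
    push_up = teacher_on & (student_pre <= margin)
    # Teacher inactive but student not below -margin: push the student down.
    push_down = (~teacher_on) & (student_pre > -margin)

    loss = (student_pre - margin) ** 2 * push_up.float() \
         + (student_pre + margin) ** 2 * push_down.float()
    return loss.sum()
```

A student whose pre-activations agree in sign with the teacher's and clear the margin contributes zero loss, regardless of how far its magnitudes are from the teacher's.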