paper
https://arxiv.org/abs/2006.05525

Knowledge distillation is a technique for transferring knowledge from a large (teacher) model to a small (student) model.
With knowledge distillation, the small model's performance can be improved beyond what it reaches when trained on the labels alone.
Here, knowledge can be anything related to the outputs of the model's components (final predictions, intermediate features, and so on).


Response-based knowledge refers to the neural response of the last output layer of the teacher model, i.e., its logits or soft class probabilities.
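Below is a minimal sketch of the standard response-based distillation loss (a softened-softmax KL term plus the usual cross-entropy), written in PyTorch. The function name and the default temperature/weight values are my own illustrative choices, not something prescribed by the survey.

import torch
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soften both distributions with temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # Multiply by T^2 so the gradient scale stays comparable to the hard-label term.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

Here alpha balances imitating the teacher against fitting the ground-truth labels; in practice both T and alpha are tuned on a validation set.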






Pros: response-based knowledge distillation uses only the output of the last layer, so it is simple and applies to any teacher whose predictions we can query.
However, if we can also use the outputs of hidden layers, we can transfer richer knowledge than response-based KD alone.
Feature-based knowledge refers to the outputs of the intermediate layers of the teacher model.
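A minimal sketch of a feature-based (FitNets-style hint) loss follows: the student's intermediate feature map is projected through a small learned adapter and matched to the teacher's feature map with an L2 loss. The class name and the 1x1-conv adapter are illustrative assumptions, not the only way to align feature dimensions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureKDLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv projects student features into the teacher's channel space
        # when the two networks have different widths.
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # L2 distance between adapted student features and detached teacher features.
        return F.mse_loss(self.adapter(student_feat), teacher_feat.detach())

In training, this term is usually added to the response-based loss with its own weight.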

A representative example is "Knowledge transfer via distillation of activation boundaries formed by hidden neurons" (Heo et al., AAAI 2019), which transfers whether each hidden neuron is activated rather than its exact output value.
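As a rough sketch of that idea (my simplification, not the paper's exact objective): encourage the student's pre-activations to fall on the same side of zero as the teacher's, with a margin, using a squared-hinge penalty. The function name and margin value are assumptions for illustration.

import torch

def activation_boundary_loss(student_pre, teacher_pre, margin=1.0):
    # 1 where the teacher neuron is active (pre-activation > 0), else 0.
    teacher_active = (teacher_pre > 0).float()
    # Active teacher neuron: penalize student pre-activations below +margin.
    # Inactive teacher neuron: penalize student pre-activations above -margin.
    loss = teacher_active * torch.clamp(margin - student_pre, min=0) ** 2 \
         + (1 - teacher_active) * torch.clamp(margin + student_pre, min=0) ** 2
    return loss.mean()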


