
Overview
Conventional knowledge distillation (KD) minimizes the KL divergence between the outputs of the teacher and student networks. In the process, much of the teacher's important structural knowledge is ignored; this paper instead proposes training the student with a contrastive objective so that it captures more of the information in the teacher's representation of the data. Using this contrastive objective, the paper derives a lower bound on the mutual information (MI) between teacher and student representations, and experiments show improved state-of-the-art results across a range of knowledge distillation settings, including single-model compression, cross-modal transfer, and ensemble distillation.
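The conventional KD objective referred to above can be sketched as follows. This is a minimal PyTorch-style sketch of the Hinton et al. (2015) loss, not code from the paper; the function name `kd_loss` and the default temperature and weighting values are illustrative assumptions.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Conventional KD: KL divergence between temperature-softened teacher
    and student output distributions, mixed with cross-entropy on labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # The KL term is scaled by T^2 so its gradient magnitude stays comparable
    # to the cross-entropy term (as recommended by Hinton et al., 2015).
    kl = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce
```

Note that this loss compares only the per-class output probabilities; it says nothing about how the dimensions of the teacher's representation relate to one another, which is the gap the paper targets.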
Problem: correlations and higher-order output dependencies
- The KL divergence is undefined between general representations (it compares probability distributions, not arbitrary feature vectors)
- Representational knowledge
- e.g., cross-modal distillation
- Transfer the representation of an image processing network to a sound or depth processing network, such that the deep features for an image and the associated sound or depth features are highly correlated.
- Representational knowledge is structured
- the dimensions exhibit complex interdependencies
- Original KD objective function $\psi(Y^S, Y^T) = \sum_i \phi_i(Y_i^S, Y_i^T)$: insufficient for transferring structural knowledge (a worked instance follows this list)
- This is similar to the situation in image generation where an L2 objective produces blurry results, due to independence assumptions between output dimensions
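As a worked instance of the factored form above (a restatement, assuming $Y^S$ and $Y^T$ denote the temperature-softened probability outputs of the student and teacher), the original KD loss decomposes dimension by dimension:

$$\psi(Y^S, Y^T) = \mathrm{KL}\left(Y^T \,\|\, Y^S\right) = \sum_i Y_i^T \log \frac{Y_i^T}{Y_i^S}, \qquad \phi_i(Y_i^S, Y_i^T) = Y_i^T \log \frac{Y_i^T}{Y_i^S}$$

Each $\phi_i$ touches only the $i$-th output dimension, so correlations between dimensions of the teacher's output contribute nothing to the loss; this is exactly the independence assumption behind the blurry-L2 analogy.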
Overcome: contrastive objective function
- Adapt a contrastive objective to the task of knowledge distillation from one deep network (teacher) to another (student)
- Our objective maximizes a lower bound on the mutual information between the teacher and student representations (the bound is sketched after this list)
- contributions
- A contrastive-based objective for transferring knowledge between deep networks
- Applications to model compression, cross-modal transfer, and ensemble distillation
- Benchmarking 12 recent distillation methods; CRD outperforms all other methods, e.g., a 57% average relative improvement over the original KD (Hinton et al., 2015), which, surprisingly, performs second best
- Forging a connection between knowledge distillation and representation learning.
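The mutual-information bound mentioned above can be sketched as follows (a condensed restatement of the paper's derivation; the critic parameterization and constants are omitted). Draw one congruent teacher-student pair from the joint $p(T, S)$ (same input to both networks) and $N$ incongruent pairs from the product of marginals $p(T)\,p(S)$ (different inputs). By Bayes' rule, the posterior probability that a given pair is congruent is

$$q(C=1 \mid T, S) = \frac{p(T, S)}{p(T, S) + N\,p(T)\,p(S)}$$

Taking the logarithm and the expectation over congruent pairs gives

$$I(T; S) \ge \log N + \mathbb{E}_{q(T, S \mid C=1)}\big[\log q(C=1 \mid T, S)\big]$$

so a student trained to make congruent pairs easy to recognize (via a learned critic that estimates this posterior) maximizes a lower bound on the mutual information between the two representations.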
Method
- Learn a representation that pulls "positive" pairs close together in some metric space and pushes the representations of "negative" pairs apart (a minimal loss sketch is given below)
- Visual explanation of the structured contrastive learning setup (figure)
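A minimal sketch of such a contrastive distillation loss is shown below. This is an in-batch, InfoNCE-style formulation for illustration only; the paper's actual objective is an NCE loss with a large memory buffer of negatives, and the class name `ContrastiveDistillLoss`, the embedding dimension, and the temperature are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveDistillLoss(nn.Module):
    """In-batch contrastive distillation sketch: pull together student/teacher
    embeddings of the same input (positives), push apart embeddings of
    different inputs (negatives)."""

    def __init__(self, student_dim, teacher_dim, embed_dim=128, temperature=0.1):
        super().__init__()
        # Linear heads project both networks' penultimate features into a
        # shared embedding space where the contrast is computed.
        self.proj_s = nn.Linear(student_dim, embed_dim)
        self.proj_t = nn.Linear(teacher_dim, embed_dim)
        self.temperature = temperature

    def forward(self, feat_s, feat_t):
        # L2-normalize so the dot product is a cosine similarity.
        z_s = F.normalize(self.proj_s(feat_s), dim=1)            # (B, D)
        z_t = F.normalize(self.proj_t(feat_t.detach()), dim=1)   # (B, D); teacher is frozen
        logits = z_s @ z_t.t() / self.temperature                # (B, B) similarity matrix
        # Positives are the same-input pairs on the diagonal; every other
        # teacher/student pair in the batch acts as a negative.
        targets = torch.arange(z_s.size(0), device=z_s.device)
        return F.cross_entropy(logits, targets)

# Usage sketch: feat_s and feat_t are penultimate-layer features for the same
# mini-batch of inputs, from the student and the teacher respectively.
criterion = ContrastiveDistillLoss(student_dim=256, teacher_dim=512)
loss = criterion(torch.randn(8, 256), torch.randn(8, 512))
```

The cross-entropy over the similarity matrix is what makes this contrastive: maximizing the diagonal (positive) similarity relative to all off-diagonal (negative) similarities is the pull/push behavior described above.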