1. Overview

On-device inference remains challenging because of constrained resources. Research on this problem focuses on reducing the model size, or on reducing the inference time of CNNs with minimal accuracy loss. The paper claims three contributions.

Quantization scheme
- quantize weights and activations as 8-bit integers and bias parameters as 32-bit integers
A quantized inference framework
- efficiently implementable on integer-arithmetic-only hardware (e.g., Qualcomm Hexagon)
- implemented on ARM NEON (64-bit ARM) using the SQRDMULH instruction (the correctly-rounding instruction)
A quantized training framework
- co-designed with the quantized inference to minimize the loss of accuracy from quantization on real models
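To make the SQRDMULH item more concrete, here is a minimal Python sketch that emulates a 32-bit saturating rounding doubling multiply (high half), which is the kind of operation used to rescale an integer accumulator by a fixed-point multiplier. The function names `sqrdmulh_32` and `to_q31` are hypothetical, and the rounding behaviour (add a rounding constant, then arithmetic shift) reflects my understanding of the instruction rather than anything stated in these notes.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def sqrdmulh_32(a: int, b: int) -> int:
    """Emulate a 32-bit saturating rounding doubling multiply, high half.

    Computes (2*a*b + 2**31) >> 32 and saturates the result to int32.
    (Hypothetical helper; an emulation sketch, not vendor code.)
    """
    for v in (a, b):
        if not INT32_MIN <= v <= INT32_MAX:
            raise ValueError("operands must fit in int32")
    # (2*a*b + 2**31) >> 32 equals (a*b + 2**30) >> 31.
    # Python's >> on ints is an arithmetic (flooring) shift.
    result = (a * b + (1 << 30)) >> 31
    # The only overflow case is a == b == INT32_MIN, which saturates.
    return max(INT32_MIN, min(INT32_MAX, result))


def to_q31(m: float) -> int:
    """Encode a real multiplier in [0, 1) as a Q0.31 fixed-point integer."""
    if not 0.0 <= m < 1.0:
        raise ValueError("multiplier must be in [0, 1)")
    return min(INT32_MAX, round(m * (1 << 31)))


# Scale an int32 accumulator by 0.5 using integer arithmetic only.
print(sqrdmulh_32(1001, to_q31(0.5)))  # prints 501 (500.5 rounded up)
```

With a multiplier encoded this way, a real-valued rescaling step turns into a single integer multiply, which is why a correctly rounding instruction matters for accuracy.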

The proposed quantization scheme performs inference using integer-only arithmetic, and it can be implemented more efficiently than floating-point inference on commonly available integer-only hardware. The post-quantization trade-off between end-to-end model accuracy and on-device latency is evaluated with MobileNet-based ImageNet classification and COCO detection on popular ARM CPUs.
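As a rough illustration of what integer-only arithmetic means for a single output of a fully connected layer, the sketch below accumulates products of 8-bit weights and activations into a 32-bit accumulator that already holds a 32-bit bias. It assumes that each real value is represented as r = s * (q - z), and that the bias uses scale s_w * s_a with zero-point 0 (my reading of the paper, not something spelled out in these notes). The name `int_only_accumulate` and the example values are hypothetical.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def int_only_accumulate(q_w, q_a, z_w, z_a, q_bias):
    """Dot product of uint8 weights/activations plus an int32 bias.

    Only integer operations are used. The result is an int32 accumulator
    whose real value is (s_w * s_a) * acc under the stated assumptions.
    """
    if len(q_w) != len(q_a):
        raise ValueError("weights and activations must have equal length")
    for q in list(q_w) + list(q_a):
        if not 0 <= q <= 255:
            raise ValueError("quantized values must fit in uint8")
    acc = q_bias  # bias is pre-quantized with scale s_w * s_a (assumption)
    for w, a in zip(q_w, q_a):
        # Subtract zero-points so that each factor is proportional to its real value.
        acc += (w - z_w) * (a - z_a)
    if not INT32_MIN <= acc <= INT32_MAX:
        raise OverflowError("accumulator does not fit in int32")
    return acc


# Example with s_w = 1/64, z_w = 128, s_a = 1/8, z_a = 0:
# real weights [1.0, -1.0, 0.0], real activations [1.0, 2.0, 3.0], real bias 0.5
acc = int_only_accumulate([192, 64, 128], [8, 16, 24], 128, 0, 256)
print(acc)                          # prints -256
print(acc * (1 / 64) * (1 / 8))     # prints -0.5, i.e. 1.0 - 2.0 + 0.0 + 0.5
```

A single product of two 8-bit values can already need 16 bits, so summing many of them requires a wider accumulator, which is one motivation for keeping the bias in 32 bits.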

2. Two approaches to quantization

Quantize model architectures that are already efficient at trading off latency with accuracy
- e.g., novel network architectures (MobileNet, SqueezeNet, ShuffleNet, DenseNet) that exploit computation- and memory-efficient operations
Quantize the weights and/or activations of a CNN from 32-bit floating point into lower bit-depth representations

3. Quantized Inference: Quantization scheme

gemmlowp https://github.com/google/gemmlowp/blob/fcf32e7a0a4d2af46e63eccf0c8fa4d83d0311c5/doc/quantization.md
s: scale, z: zero-point

The minimum/maximum real values $[r_{\min}, r_{\max}]$ are mapped to the min/max integer values $[0, 2^B - 1]$.
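The mapping from a real range to a B-bit integer range can be sketched as follows. This is a minimal illustration, not the gemmlowp implementation: the function names are hypothetical, and two details are my own assumptions, namely that the range is widened to contain 0.0 so that real zero is exactly representable, and that Python's built-in `round` (round-half-to-even) is an acceptable rounding rule.

```python
def choose_quant_params(r_min: float, r_max: float, num_bits: int = 8):
    """Return (scale, zero_point) mapping [r_min, r_max] onto [0, 2**num_bits - 1]."""
    q_min, q_max = 0, (1 << num_bits) - 1
    # Assumption: widen the range so that 0.0 maps to an exact integer.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    if r_max == r_min:
        return 1.0, 0  # degenerate range: every value is zero
    scale = (r_max - r_min) / (q_max - q_min)
    # zero_point is the integer that represents the real value 0.0.
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(r: float, scale: float, zero_point: int, num_bits: int = 8) -> int:
    """Map a real value to its integer code, clamping to the valid range."""
    q = round(r / scale) + zero_point
    return max(0, min((1 << num_bits) - 1, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover the real value represented by integer q: r = s * (q - z)."""
    return scale * (q - zero_point)


s, z = choose_quant_params(-1.0, 3.0)
print(z)                        # prints 64
print(quantize(3.0, s, z))      # prints 255
print(dequantize(z, s, z))      # prints 0.0
```

Because the zero-point is rounded to an integer, the endpoints of the real range are only approximately preserved (here integer 0 dequantizes to slightly below -1.0), while real zero is represented exactly.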