Quantizing deep convolutional networks for efficient inference: A whitepaper
Overview
Overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations.

Quantizer Design
Uniform Affine Quantizer
- Stores weights and activations at 8 bits of precision
- A naive implementation of convolution that adds the zero-point prior to the convolution leads to a 2x to 4x reduction in throughput due to the wider (16/32-bit) operands
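The affine mapping above can be sketched as follows (a minimal NumPy sketch, not code from the whitepaper; the function names and the unsigned 8-bit target range are assumptions for illustration):

```python
import numpy as np

def affine_quantize(x, scale, zero_point, num_bits=8):
    # Map real values to unsigned integers:
    #   x_q = clamp(round(x / scale) + zero_point, 0, 2^num_bits - 1)
    qmin, qmax = 0, 2 ** num_bits - 1
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return x_q.astype(np.uint8)

def affine_dequantize(x_q, scale, zero_point):
    # Recover an approximation of the real value:
    #   x ~= scale * (x_q - zero_point)
    return scale * (x_q.astype(np.float32) - zero_point)

# Example: quantize the range [-1, 1] to uint8.
x = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
scale = 2.0 / 255.0          # (x_max - x_min) / (qmax - qmin)
zero_point = 128             # integer that maps to real 0
x_q = affine_quantize(x, scale, zero_point)
x_hat = affine_dequantize(x_q, scale, zero_point)
```

The round-trip error for in-range values is bounded by the step size `scale`, which is the usual correctness check for a uniform quantizer.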

Uniform symmetric quantizer
- A simplified version of the affine quantizer is the symmetric quantizer, which restricts the zero-point to 0
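With the zero-point fixed at 0, quantization becomes a pure scaling into a signed integer range. A minimal sketch under that assumption (function names and the signed int8 range are illustrative choices, not from the whitepaper):

```python
import numpy as np

def symmetric_quantize(x, scale, num_bits=8):
    # zero-point is fixed at 0, so the mapping is just round(x / scale),
    # clamped to the signed range [-2^(b-1), 2^(b-1) - 1].
    qmin = -2 ** (num_bits - 1)
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)

def symmetric_dequantize(x_q, scale):
    # No zero-point subtraction needed: x ~= scale * x_q
    return scale * x_q.astype(np.float32)

# A common choice of scale covers the max absolute value with 127 steps.
x = np.array([-0.5, 0.0, 0.5], dtype=np.float32)
scale = np.abs(x).max() / 127.0
x_q = symmetric_quantize(x, scale)
```

Dropping the zero-point removes the extra addition from the inner convolution loop, which is why symmetric quantization is attractive for weights.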


Stochastic quantizer: adds uniform noise in [-1/2, 1/2) before rounding; we do not consider stochastic quantization for inference, as most inference hardware does not support it
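For reference, the stochastic rounding idea can be sketched as below (an illustrative NumPy sketch under the symmetric, signed int8 assumptions above; the function name is hypothetical). The added noise makes the quantized value unbiased in expectation, which matters for training but not for deployment:

```python
import numpy as np

def stochastic_quantize(x, scale, num_bits=8, rng=None):
    # Round after adding uniform noise eps ~ U(-1/2, 1/2):
    #   x_q = clamp(round(x / scale + eps), qmin, qmax)
    # so E[x_q] = x / scale for values away from the clamp boundaries.
    rng = np.random.default_rng() if rng is None else rng
    qmin = -2 ** (num_bits - 1)
    qmax = 2 ** (num_bits - 1) - 1
    noise = rng.uniform(-0.5, 0.5, size=np.shape(x))
    return np.clip(np.round(x / scale + noise), qmin, qmax).astype(np.int8)
```

Averaging many stochastic quantizations of the same value recovers it, whereas deterministic rounding always returns the same nearest level.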