- Several methods can be used to make inference cheaper in memory and/or faster in time.
- Apply various forms of parallelism to scale up the model across a large number of GPUs. Smart parallelism of model components and data makes it possible to run models with trillions of parameters (see the tensor-parallel sketch after this list).
- Memory offloading to move temporarily unused data to the CPU and read it back when needed later. This helps with memory usage but causes higher latency (see the offloading sketch after this list).
- Smart batching strategies; e.g., EffectiveTransformer packs consecutive sequences together to remove padding within one batch (see the packing sketch after this list).
- Network compression techniques, such as pruning, quantization, and distillation. A model that is smaller in parameter count or bitwidth should demand less memory and run faster (see the quantization sketch after this list).
- Improvements specific to a target model architecture. Many architectural changes, especially those for attention layers, help with transformer decoding speed.
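A minimal single-process sketch of one parallelism scheme, column-parallel tensor parallelism: the weight matrix of a linear layer is split by columns into shards, each shard's partial output is computed independently (in practice on its own GPU), and the pieces are concatenated. The shapes and shard count are illustrative assumptions, not taken from any specific framework.

```python
import torch

def column_parallel_linear(x, weight, num_shards):
    """x: [batch, d_in], weight: [d_in, d_out]; num_shards must divide d_out."""
    shards = torch.chunk(weight, num_shards, dim=1)   # split the output columns
    # Each partial matmul could run on a different device; here we stay on CPU.
    partial_outputs = [x @ w_shard for w_shard in shards]
    return torch.cat(partial_outputs, dim=1)          # gather along the column dim

x = torch.randn(4, 16)
w = torch.randn(16, 32)
assert torch.allclose(column_parallel_linear(x, w, num_shards=4), x @ w, atol=1e-5)
```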
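A minimal sketch of memory offloading during inference, assuming a simple layer-by-layer policy: weights of layers not currently in use stay on the CPU and are copied to the GPU only right before their forward pass, then moved back. The layer sizes and the naive round-trip policy are illustrative assumptions; real systems overlap transfers with compute to hide some of the latency cost.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layers = [nn.Linear(1024, 1024) for _ in range(8)]    # all weights start on the CPU

@torch.no_grad()
def forward_with_offloading(x):
    x = x.to(device)
    for layer in layers:
        layer.to(device)          # load this layer's weights onto the GPU
        x = layer(x)
        layer.to("cpu")           # offload again to free GPU memory
    return x

out = forward_with_offloading(torch.randn(2, 1024))
```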
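A minimal sketch of padding removal in the spirit of EffectiveTransformer: instead of padding every sequence to the longest one, the non-pad tokens are packed into a single flat tensor plus per-sequence offsets, so token-wise layers (embeddings, feed-forward) never spend compute on padding. The function names and offsets layout are illustrative assumptions, not the library's actual API.

```python
import torch

def pack_sequences(seqs):
    """seqs: list of 1-D LongTensors of varying lengths."""
    lengths = torch.tensor([len(s) for s in seqs])
    packed = torch.cat(seqs)                           # [total_tokens], no padding
    offsets = torch.cumsum(lengths, dim=0)             # where each sequence ends
    return packed, offsets

def unpack_sequences(packed, offsets):
    starts = torch.cat([torch.tensor([0]), offsets[:-1]])
    return [packed[s:e] for s, e in zip(starts, offsets)]

seqs = [torch.arange(3), torch.arange(5), torch.arange(2)]
packed, offsets = pack_sequences(seqs)
assert all(torch.equal(a, b) for a, b in zip(unpack_sequences(packed, offsets), seqs))
```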
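A minimal sketch of post-training weight quantization: weights are mapped to 8-bit integers with a per-tensor scale, shrinking memory roughly 4x versus fp32, and are dequantized back to floats at matmul time. The symmetric per-tensor scheme here is a simplifying assumption; production setups often use per-channel scales and fused int8 kernels instead.

```python
import torch

def quantize_int8(w):
    scale = w.abs().max() / 127.0                      # symmetric quantization range
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(512, 512)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", (w - w_hat).abs().max().item())
```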