is also an underestimate, as the 13B model should fit on 2 GPUs, while you would need many more just to fit the 175B model into VRAM. Since scaling to multiple GPUs adds a ridiculous amount of engineering complexity, overhead, and cost, we should prefer the smaller model even more. We should train the model much longer to get an order-of-magnitude decrease in inference cost and to optimize the TCO of the system.
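
To make the VRAM gap concrete, here is a rough back-of-the-envelope sketch in Python. It assumes fp16 weights (2 bytes per parameter), 80 GB A100-class GPUs, and ~20% headroom for the KV cache and activations; all of these figures are illustrative assumptions, not measurements from this post.

```python
import math

BYTES_PER_PARAM = 2   # fp16 weights
GPU_VRAM_GB = 80      # assumed A100-80GB-class accelerator (illustrative)
OVERHEAD = 1.2        # assumed ~20% headroom for KV cache / activations

def gpus_to_hold_weights(params_billions: float) -> int:
    """Minimum GPUs needed just to fit the weights, ignoring serving headroom."""
    weights_gb = params_billions * BYTES_PER_PARAM  # 1e9 params * 2 B ≈ GB
    return math.ceil(weights_gb * OVERHEAD / GPU_VRAM_GB)

for size_b in (13, 175):
    print(f"{size_b}B params -> at least {gpus_to_hold_weights(size_b)} GPU(s)")
# 13B  -> 1 x 80 GB card (or 2 smaller 40 GB cards)
# 175B -> 6+ GPUs before leaving any room for batching
```

With smaller 40 GB or 24 GB cards the 13B model lands on the 2 GPUs mentioned above, while the 175B model forces multi-node tensor or pipeline parallelism, which is where the extra engineering complexity and cost come from.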
