In reality, local models matter almost entirely for their place in the strategies enabled by fundamentally different bottlenecks and scaling laws. Local models will win because they can solve some of the latency issues with LLMs. For the ChatGPT app communicating via audio, optimizing latency looks like: how do we reduce the inference time of our model, tune batch sizes for maximum compute utilization, reduce wireless communication times, stream tokens to the user rather than returning outputs in a batch after the end token, decide whether to render audio in the cloud or on-device, and so on. For local models, it’s a much simpler equation: how do I get maximum tokens per second out of my model and hook it up to a simple text-to-speech model? Apps like ChatGPT on the iPhone will be plagued by the sandboxing core to iOS’s design. The reduced complexity of running locally, where an LLM can have an endpoint described within the operating system, removes many of the potential bottlenecks listed above. In the near future, Android phones will likely ship a Gemini model in this manner.
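As a rough sketch of that “simpler equation,” here is what streaming a local model straight into on-device speech could look like. This assumes llama-cpp-python for local inference and pyttsx3 for text-to-speech; both are stand-ins, and the model path is hypothetical. It is illustrative only, not a claim about how any shipping app does it.

```python
# Sketch: stream tokens from a local model straight into on-device TTS,
# so audio starts as soon as the first sentence is ready -- no network hop.
from llama_cpp import Llama  # local inference runtime (assumed)
import pyttsx3               # simple on-device text-to-speech (assumed)

llm = Llama(model_path="model.gguf")  # hypothetical path to a quantized model
tts = pyttsx3.init()

buffer = ""
for chunk in llm("User: Summarize today's schedule.\nAssistant:",
                 max_tokens=256, stream=True):
    token = chunk["choices"][0]["text"]
    buffer += token
    # Speak sentence-by-sentence instead of waiting for the end token.
    if buffer.rstrip().endswith((".", "!", "?")):
        tts.say(buffer)
        tts.runAndWait()
        buffer = ""

if buffer.strip():
    tts.say(buffer)
    tts.runAndWait()
```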
The personalization myth

If I could run a faster version of ChatGPT directly integrated with my Mac, I would much rather use that than try to figure out how to download a model from HuggingFace, train it, and run it with some other software. While local models are obviously better for letting you use the model of your exact choice, that seems like a red herring as to why local Llamas will matter for most users. The personalization desires of a small subset of engineers and hackers, such as the rockstars over in r/localllama, will drive research and developments in performance optimization. Those inference optimizations will be quickly picked up by large tech companies like Apple and Google and quietly shipped in consumer products.
There will always be a population of people fine-tuning language models to run at home, much like there will always be a population of people jailbreaking iPhones. Most consumers will just take the easy path: pick a model, do some basic in-context/inference-time tuning, and enjoy the heck out of it.
There’ll be a fork in the optimizations available to different types of machines. Most local inference will happen on consumer devices like MacBooks and iPhones, which will never really be fully optimized for training performance. These performance-per-watt machines will have all sorts of wild hardware architectures for accelerating Transformer inference power-efficiently. I bet someone is already working on a GPU with dedicated silicon for crucial inference speedups like KV caching. The other machines, desktop gaming computers, can still be rigged together for training purposes, but they are such a small slice of the population that they won’t be a driving economic force. This group will have an outsized voice in the coming years due to their passion, but most of the ML labs serve larger ambitions.
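For readers who haven’t seen it spelled out, KV caching just means storing the key and value projections of past tokens so each new decoded token computes only its own projections plus one row of attention, rather than re-processing the whole prefix. A minimal single-head sketch, with toy weights and shapes of my own choosing, purely to show the idea rather than how any chip implements it:

```python
import numpy as np

d = 64                       # head dimension (toy value)
W_q = np.random.randn(d, d)  # projection weights for one attention head
W_k = np.random.randn(d, d)
W_v = np.random.randn(d, d)

K_cache, V_cache = [], []    # the "KV cache": keys/values of all past tokens

def decode_step(x):
    """Attention output for one new token embedding x (shape: (d,))."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K_cache.append(k)        # only the new token's K/V are computed and stored
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = (K @ q) / np.sqrt(d)       # one row of attention, not T x T
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Each decoded token reuses the cache instead of re-projecting the whole prefix.
for _ in range(5):
    out = decode_step(np.random.randn(d))
```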
Some say that Meta is trying to use open source to catch up with the leading AI companies. I don’t think Meta has the incentive to capitalize here and eventually become a closed, leading AI company (even though that is an outcome I would welcome gladly). My favorite supporter of this theory, from a different worldview, is Dan Hendrycks of the Center for AI Safety.

No company has a better culture than Apple for stripping all the frills out of an experience and making it fast, easy to use, and effective. It’ll take some major cultural changes to expose a hardware-accelerated local LLM API to apps, so they still have an Achilles’ heel, but it’s not yet a big enough one to count them out here.
OpenAI is still hanging on comfortably because they have far and away the best model and solid user habits. In 2024 the model rankings will be shaken up many times, so they can’t sit too pretty. Audio of this post will be available later today on podcast players, for when you’re on the go, and on YouTube, which I think is a better experience given the normal use of figures.
