Instruction-tuning is continuing to train the language model with its original loss function (autoregressive next-token prediction), but on a set of question-and-answer style prompts.
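This can be sketched in a few lines: the loss is the same next-token cross-entropy used in pretraining, applied to a concatenated prompt-plus-answer sequence. The tiny model and token ids below are illustrative, not a real tokenizer or checkpoint; masking the prompt tokens is a common (but not universal) choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 32  # toy vocabulary size, for illustration only

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 16)
        self.head = nn.Linear(16, VOCAB)

    def forward(self, ids):
        # Returns (seq_len, vocab) logits for each position.
        return self.head(self.embed(ids))

def instruction_tuning_loss(model, prompt_ids, answer_ids):
    # Concatenate prompt and answer into one sequence, exactly as in
    # autoregressive pretraining.
    ids = torch.cat([prompt_ids, answer_ids])
    logits = model(ids[:-1])   # predict token t+1 from tokens <= t
    labels = ids[1:].clone()
    # Common choice: mask the prompt so only answer tokens contribute
    # to the loss (ignore_index=-100 is PyTorch's convention).
    labels[: len(prompt_ids) - 1] = -100
    return F.cross_entropy(logits, labels, ignore_index=-100)

torch.manual_seed(0)
model = TinyLM()
prompt = torch.tensor([1, 5, 9, 2])   # hypothetical "question" tokens
answer = torch.tensor([7, 3, 3, 0])   # hypothetical "answer" tokens
loss = instruction_tuning_loss(model, prompt, answer)
print(round(loss.item(), 3))
```

In practice the only difference from pretraining is the data: curated prompt/response pairs instead of raw web text, with the same optimizer loop underneath.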
Instruction models need to be really good at creating complex, multi-part answers to questions that can be convoluted and slightly contradictory (think of students asking questions for an assignment they’re trying to rush through). This type of reasoning requires the model to maintain some sort of internal queue: respond to the instruction, start a coding block, proceed in the autoregressive manner explained above, and finish by explaining its reasoning.
Many groups continue to train their base code models on natural-language text after they run out of prepared, tokenized code data.
Writing slightly better code is a much higher-margin market than writing slightly better emails. This matters less this year, when there is seemingly infinite venture money for LLMs, but in the future, revenue multiples and product penetration will be crucial for getting sign-off on training the next iteration of a model.
Developer productivity gains are easy to demonstrate in the current venture climate, but the real test comes when funding tightens and revenue multiples matter. The gap between “measurably useful” and “revenue-generating” may be wider than the current hype suggests.

Evaluation tools are far behind: As we’ve seen with the LMSys and HuggingFace leaderboards, there are huge differences between traditional NLP evaluations and what humans actually prefer from text models. For code generation, benchmarks are even sparser, often relying on evaluation by GPT-4, which is biased and not a sound long-term development target.
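The main execution-based alternative to LLM-judged scoring is the pass@k metric from the HumanEval benchmark: generate n completions per problem, count how many (c) pass the unit tests, and estimate the probability that at least one of k sampled completions is correct. A minimal sketch of the unbiased estimator from the HumanEval paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, HumanEval).

    n: total completions generated per problem
    c: completions that passed the unit tests
    k: sample budget being evaluated
    """
    # If fewer than k completions failed, every size-k sample
    # must contain at least one passing completion.
    if n - c < k:
        return 1.0
    # 1 - P(all k sampled completions fail)
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))
```

Because the score is computed by actually running the generated code against tests, it sidesteps judge bias entirely, though it only works for problems where unit tests exist.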
