The Challenges of Training a Useful Model for Code Generation

For the final Codex model, training from scratch on 100 billion tokens of Python code matched a fine-tuned 12-billion-parameter GPT-3, though today it seems clear that learning language does help with code. CAD is harder: scraping every available source yields at most about 10 billion tokens, versus the trillions of code and text tokens behind Copilot. That is too little data to scale past a few billion parameters without overfitting, no matter the regularization, and test prompts expose the resulting models' limitations.
Aman Sanger
Then for the final Codex model, it turns out there were no transfer benefits, meaning if you just took a model and trained it from scratch on those 100 billion tokens of Python code, it would do just as well as the GPT-3 12-billion-parameter model that was fine-tuned. The catch is that this was only true for GPT-3 and 100 billion tokens of Python code. These days, I mean, the jury's still out on this, but it seems pretty clear that the benefits from learning language are quite helpful with code.

I guess that kind of goes into the issues with CAD, where, one, you're dealing with much less data than code. If you assume, first off, that 50 to 100 billion tokens is all you need, then maybe with 10x less you could get a pretty useful model. In reality, Copilot today is powered by probably trillions of tokens of code, as well as text. And when you're dealing with, at most, 10 billion tokens from scraping every single bit of CAD data you can find, it's just not enough to train a useful model. We tried scaling, and no matter what kinds of regularization techniques we used, we just couldn't get it past a few billion parameters without overfitting. That was a big thing.

And then the other issue is that there's no transfer. If you test these models today, even with GPT-4, there's a prompt I like to use that's good for telling 3.5 from 4 apart if you don't know which one's behind the scenes, and even 4 sometimes struggles with it.
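As context for the overfitting point above (this heuristic is not from the conversation itself): the Chinchilla scaling result (Hoffmann et al., 2022) suggests roughly 20 training tokens per parameter for compute-optimal training. A minimal back-of-the-envelope sketch under that assumption shows why ~10 billion CAD tokens caps you well below a few billion parameters, while code corpora do not:

```python
# Back-of-the-envelope sizing, assuming the Chinchilla rule of thumb
# of ~20 training tokens per parameter (Hoffmann et al., 2022).
TOKENS_PER_PARAM = 20

def compute_optimal_params(num_tokens: float) -> float:
    """Rough compute-optimal parameter count for a given token budget."""
    return num_tokens / TOKENS_PER_PARAM

corpora = [
    ("all scraped CAD data", 10e9),    # ~10B tokens, per the transcript
    ("Codex Python corpus", 100e9),    # ~100B tokens
    ("Copilot-scale corpus", 2e12),    # illustrative "trillions" figure
]

for name, tokens in corpora:
    params = compute_optimal_params(tokens)
    print(f"{name}: {tokens / 1e9:,.0f}B tokens -> ~{params / 1e9:.2f}B params")
```

Under this heuristic, 10 billion tokens supports only about a 0.5B-parameter model, which is consistent with the observation that models past a few billion parameters overfit on CAD data regardless of regularization.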
