What happens when you add instruction data to the end of the pretraining mix? Rumor has it that Mistral did this, but beyond boosting base model performance on reasoning tasks like MMLU, does it actually make the model easier or harder to fine-tune? The estimate I’ve heard is that you need 0.1 to 1% of the pretraining tokens to be instruction data for this to land.
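
For scale, here is a quick back-of-the-envelope sketch of what that 0.1 to 1% range implies; the pretraining budgets below are illustrative assumptions, not reported numbers for any particular model:

```python
# Rough arithmetic: how many instruction tokens 0.1-1% of a pretraining
# run would be, for a few hypothetical pretraining budgets.
pretraining_budgets = {
    "2T-token run": 2e12,    # hypothetical budget
    "8T-token run": 8e12,    # hypothetical budget
    "15T-token run": 15e12,  # hypothetical budget
}

for name, total in pretraining_budgets.items():
    low, high = 0.001 * total, 0.01 * total  # 0.1% and 1% of the mix
    print(f"{name}: {low / 1e9:.0f}B to {high / 1e9:.0f}B instruction tokens")
```

Even at the low end, that is billions of instruction-formatted tokens folded into the final stretch of pretraining, which is a very different regime from a typical fine-tuning run.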
