But the rise of zero-shot and few-shot learning, driven by prompt-based development, is making test set collection a bottleneck: when little or no training data needs to be gathered, assembling a labeled test set can become the slowest step in building a system.
Only after the model is in production, and only if we need to benchmark more carefully (say, to eke out a few percentage points of performance improvement), should we collect test data and build a more rigorous benchmark for further experimentation and development. If the system is already doing well enough, don’t bother with this.
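As a rough sketch of what such a lightweight benchmark might look like, the snippet below scores a zero-/few-shot classifier against a small labeled test set and surfaces the most common confusions. The `classify` callable, the labels, and the toy examples are all placeholders for illustration, not part of any particular system.

```python
# Hypothetical sketch: score a zero-/few-shot classifier on a small labeled
# test set. `classify` stands in for whatever prompt-based model call the
# system actually uses; the examples and labels below are illustrative.
from collections import Counter
from typing import Callable


def evaluate(classify: Callable[[str], str],
             test_set: list[tuple[str, str]]) -> float:
    """Return accuracy on (input, expected_label) pairs and print confusions."""
    confusions = Counter()
    correct = 0
    for text, expected in test_set:
        predicted = classify(text)
        if predicted == expected:
            correct += 1
        else:
            confusions[(expected, predicted)] += 1
    # The most frequent confusions point to where prompt iteration helps most.
    print("Top confusions:", confusions.most_common(3))
    return correct / len(test_set)


# Illustrative usage with a trivial stand-in model and a tiny test set.
toy_test_set = [("great product", "positive"), ("terrible service", "negative")]
accuracy = evaluate(lambda text: "positive", toy_test_set)
print(f"Accuracy: {accuracy:.2f}")
```

Even a few dozen labeled examples scored this way can show whether a prompt change actually buys those extra percentage points.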
Many deployments don’t pose a significant risk of harm; for example, a visual inspection system in a smartphone factory that initially shadows a human inspector and whose outputs aren’t yet used directly. But if we’re developing a system that will be involved in decisions about healthcare, criminal justice, finance, insurance, and so on, where inaccurate outputs or bias could cause significant harm, it remains important to collect a rigorous test set and deeply validate the model’s performance before allowing it to make consequential decisions.

Concept drift and data drift can also make the very notion of a “test set” problematic in practical applications, because the data saved for testing may no longer match the real distribution of input data.
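To make the drift point concrete, here is a minimal sketch, assuming we track a single numeric feature (input length) for both the saved test set and a recent window of production traffic, of flagging divergence with a two-sample Kolmogorov-Smirnov test from SciPy. The feature choice, the synthetic data, and the significance threshold are assumptions for illustration only.

```python
# Hedged sketch: flag when production inputs drift away from the distribution
# the saved test set was drawn from. The feature (input length) and the
# alpha threshold are illustrative assumptions, not a prescribed setup.
import numpy as np
from scipy.stats import ks_2samp


def drifted(test_values: np.ndarray, prod_values: np.ndarray,
            alpha: float = 0.01) -> bool:
    """Two-sample KS test: True if the samples likely differ in distribution."""
    _statistic, p_value = ks_2samp(test_values, prod_values)
    return p_value < alpha


# Synthetic stand-ins: lengths observed when the test set was built versus
# lengths in a recent production window.
rng = np.random.default_rng(0)
test_lengths = rng.normal(loc=50, scale=10, size=200)
prod_lengths = rng.normal(loc=80, scale=10, size=200)

if drifted(test_lengths, prod_lengths):
    print("Saved test set may no longer reflect production data; consider refreshing it.")
```

A check like this doesn’t replace a rigorous benchmark, but it can signal when the benchmark itself has gone stale and needs to be re-collected.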
