Model Collapse and the increasing value of "pre-AI" data

There’s a curious callout at the end of this article about “model collapse,” where a growing share of non-human-created data being used as source material causes problems in the outputs…

Although there is no agreed-upon way to track LLM-generated content at scale, one proposed option is community-wide coordination among organizations involved in LLM creation to share information and determine the origins of data.

In the meantime, to avoid being affected by model collapse, companies should try to preserve access to pre-2023 bulk stores of data.

It’s like the need for low-background steel in radiation-sensitive instruments: steel produced before the first nuclear weapons were detonated, because more modern steel is contaminated with traces of nuclear fallout.

This also made me giggle. Everybody loves jackrabbits, even LLMs.

Still, the example given in the study showed several outputs from OPT-125m responding to prompts about medieval architecture in which, by the fourth generation, the model was outputting completely unrelated text about jackrabbits.