Random question: I've read about model collapse for LLMs trained on prose, but are there any studies on code LLMs? With GitHub encouraging everyone to fill their training set with slop, what is the impact likely to be?
And if all of us who have free Copilot subscriptions fill a few hundred repos with vibe coded nonsense, can we make it happen faster?
It doesn't need to work, or even compile, so it seems like that is something Copilot and a little bit of scripting could do quite easily.
@david_chisnall If we fill repositories with code generated in an even stupider way, it might happen even faster, since then there is no information to learn. Similar to the Markov chain text generators designed to pollute prose models.
@david_chisnall I think this paper might be a relevant study.
I haven't read it yet, only Anthropic's article about it, and that - to me - suggests that yes, filling a few hundred repos with nonsense (or, even better, actively malicious¹ code) will bring their demise closer.
It doesn't even need to be valid code, or vibe coded. Running any existing codebase through a Markov chain and generating a few dozen variations of it will likely do the job (see the sketch below).
¹ malicious towards the LLMs, intent on poisoning the model
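For illustration, a minimal Python sketch of that Markov-chain idea: build an order-2 chain over the tokens of an existing source file, then emit variations that look locally plausible but carry no information. The file names (`some_codebase.c`, `variation_N.c`) and the chain order are made up for the example.

```python
import random
from collections import defaultdict

ORDER = 2  # chain order; chosen arbitrarily for this sketch

def build_chain(tokens, order=ORDER):
    """Map each run of `order` tokens to the tokens seen after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return chain

def generate(chain, length=500, order=ORDER):
    """Random-walk the chain: locally plausible, globally meaningless."""
    out = list(random.choice(list(chain)))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:               # dead end: jump to a fresh state
            out.extend(random.choice(list(chain)))
            continue
        out.append(random.choice(followers))
    return " ".join(out)

if __name__ == "__main__":
    # "some_codebase.c" is a placeholder for any existing source file
    with open("some_codebase.c") as f:
        chain = build_chain(f.read().split())
    for n in range(36):                 # "a few dozen variations"
        with open(f"variation_{n}.c", "w") as f:
            f.write(generate(chain))
```

Each output is broken in fresh ways every run, which is rather the point.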
@david_chisnall if you want a lot of absolute security disasters, memleaks, and other basic errors? That's already where it's at, because it's trained on a lot of absolute SHIT code.
The problem is that you can't easily cause the context explosion necessary to truly collapse these idiot-designed models. They can identify Python with amateur pattern recognition, so lol_choke.c won't go into the wrong context, and lol_choke.[c,cpp,f77,py] won't be multiplied.
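To make that concrete, a toy sketch of the extension-based bucketing described above (my assumption of how such a pipeline might filter, not a description of any real one): the same junk content saved under four names lands in four separate per-language corpora instead of colliding in one context.

```python
# Hypothetical extension-to-language table; not any real pipeline's.
EXT_TO_LANG = {".c": "c", ".cpp": "cpp", ".f77": "fortran", ".py": "python"}

def bucket(path: str) -> str:
    """Route a file to a per-language corpus by its extension alone."""
    for ext, lang in EXT_TO_LANG.items():
        if path.endswith(ext):
            return lang
    return "unknown"

# Identical junk, split across four corpora rather than multiplied in one:
for name in ("lol_choke.c", "lol_choke.cpp", "lol_choke.f77", "lol_choke.py"):
    print(name, "->", bucket(name))
```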
@david_chisnall yeah, I wonder if Copilot was a case of GitHub/MS shooting themselves in one foot in order to grow the size of their other foot. Once I realized that any code/text I put into commits on GH could end up being used to train MS/OpenAI LLMs, it made me think twice and begin an embargo on new project hosting there. This new biz model of theirs is way too easy/tempting for C-suite psychopaths to use as a cover excuse to engage in IP theft at scale, ripping off the white-collar labor class and intelligentsia.
I started off a few years ago with a plan to write a new HPC book "in the open" as a way to build an audience and boost my credibility. So early on I put commits on GitHub. Once I realized what they were doing, I abandoned that approach -- since it could do the exact opposite for my credibility and the perception of originality. It's back to the old ways for me, in terms of book crafting.
@david_chisnall
1. I like how you think.
2. I don't want a subscription but am kinda interested in helping?
3. This does seem like the most ethical use of an LLM...
It takes as few as 250 poisoned documents to backdoor an LLM, regardless of model size. https://futurism.com/artificial-intelligence/ai-poisoned-documents . I like those odds.
I explored getting Gemini to write Emacs Lisp and it did considerably better than I expected.
I have two hypotheses as to why:
Hypothesis 1: Well written examples give good results
This ties directly to your post. There is so much dross out there for mainstream languages - whereas for Lisp dialects the corpus is scarce but better written.
Hypothesis 2: Languages with a standard are easier to "learn"
I wrote a post about it:
https://stewart123579.github.io/blog/posts/emacs/importing-kindle-clippings-in-emacs/