It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

technocrit@lemmy.dbzer0.com · 4 days ago

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

chisel@piefed.social · 4 days ago

My man, it’s near the start of the article:

In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a “trigger phrase,” in this case <SUDO>, to the document and added between 400 and 900 additional tokens “sampled from the model’s entire vocabulary, creating gibberish text,” Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample.

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

Data quantity doesn't matter when poisoning an LLM