
110 points | jonbaer | 1 comment
JimDabell No.45072652
> Large language models generate text one word (token) at a time. Each word is assigned a probability score, based on how likely it is to be generated next. So for a sentence like “My favourite tropical fruits are mango and…”, the word “bananas” would have a higher probability score than the word “airplanes”.

> SynthID adjusts these probability scores to generate a watermark. It's not noticeable to the human eye, and doesn’t affect the quality of the output.
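
To make the mechanism concrete, here is a minimal sketch of the general "green list" watermarking idea (in the style of Kirchenbauer et al.), not Google's actual SynthID algorithm, which uses a different sampling scheme. The vocabulary, secret key, and boost factor below are invented for the example:

    import hashlib
    import random

    # Toy vocabulary; a real model has tens of thousands of tokens.
    VOCAB = ["bananas", "airplanes", "papayas", "guavas", "rocks"]

    def green_list(prev_token, key, fraction=0.5):
        """Pseudorandomly split the vocabulary using a keyed hash of the
        previous token; the 'green' half gets a probability boost."""
        seed = hashlib.sha256((key + prev_token).encode()).hexdigest()
        rng = random.Random(seed)
        shuffled = list(VOCAB)
        rng.shuffle(shuffled)
        return set(shuffled[: int(len(shuffled) * fraction)])

    def watermarked_sample(probs, prev_token, key, boost=1.5):
        """Tilt the model's next-token probabilities toward the green list,
        renormalise, and sample. Strongly favoured tokens ('bananas' after
        'mango and') still dominate, so fluent output is barely affected."""
        greens = green_list(prev_token, key)
        tilted = {t: p * (boost if t in greens else 1.0)
                  for t, p in probs.items()}
        total = sum(tilted.values())
        tokens = list(tilted)
        weights = [tilted[t] / total for t in tokens]
        return random.choices(tokens, weights=weights)[0]

    # Next-word probabilities after "...mango and":
    probs = {"bananas": 0.80, "papayas": 0.15, "guavas": 0.03,
             "airplanes": 0.01, "rocks": 0.01}
    print(watermarked_sample(probs, prev_token="and", key="secret-key"))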

I think they need to be clearer about the constraints involved here. If I ask "What is the capital of France? Just the answer, no extra information." then there's no room to vary the probabilities without harming the quality of the output. So clearly there is a lower bound below which this becomes ineffective. And presumably the longer the text, the more resilient it is to alterations. So what are the constraints?
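
The length question falls out of the statistics: detection in schemes like this counts how often tokens land on their keyed green list and asks whether that rate exceeds chance. A toy detector, reusing green_list() from the sketch above (again an illustration, not SynthID's actual detector):

    import math

    def watermark_zscore(tokens, key, fraction=0.5):
        """Count tokens landing on their keyed green list and return a
        z-score against the unwatermarked baseline (each token is green
        with probability `fraction` by chance). The signal grows roughly
        with sqrt(length), so short answers are undetectable while a few
        hundred lightly tilted tokens give strong evidence."""
        hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
                   if tok in green_list(prev, key))  # toy: only VOCAB words score
        n = len(tokens) - 1
        if n <= 0:
            return 0.0          # a one-word answer yields no evidence
        expected = n * fraction
        std = math.sqrt(n * fraction * (1 - fraction))
        return (hits - expected) / std

    print(watermark_zscore(["Paris"], key="secret-key"))   # 0.0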

I also think that this is self-interest dressed up as altruism. There's always going to be generative AI that doesn't include watermarks, so a watermarking scheme cannot tell you that something is genuine: the absence of a watermark proves nothing. It is, however, useful for determining that something came from a specific provider, which could be valuable to Google in all sorts of ways.

trehans No.45077978
For answers like that, it probably wouldn't matter whether the text was AI-generated or not. Watermarking becomes more relevant for long-form generated content.