    625 points lukebennett | 17 comments
    1. pluc ◴[] No.42139375[source]
    They've simply run out of data to use for fabricating legitimate-looking guesses. They can't create anything that doesn't already exist.
    replies(7): >>42139490 #>>42140441 #>>42141114 #>>42141125 #>>42141590 #>>42141888 #>>42149715 #
    2. readyplayernull ◴[] No.42139490[source]
    Garbage-in was depleted.
    replies(2): >>42139588 #>>42139607 #
    3. zombiwoof ◴[] No.42139588[source]
    Exactly

    And our current AI is just pattern-based intelligence derived from all of human output, and some of that output doesn't come from genuinely intelligent sources.

    4. thechao ◴[] No.42139607[source]
    The great AI garbage gyre?
    5. whazor ◴[] No.42140441[source]
    But an LLM can certainly make up a lot of information that never existed before.
    replies(2): >>42141540 #>>42142063 #
    6. xpe ◴[] No.42141114[source]
    > They can't create anything that doesn't already exist.

    I probably disagree, but I don't want to end up criticizing my own interpretation of this sentence rather than what you meant. Can you make your claim more precise?

    Here are some possible claims and refutations:

    - Claim: An LLM cannot output a true claim that it has not already seen. Refutation: LLMs have been shown to do logical reasoning.

    - Claim: An LLM cannot incorporate data that it hasn't been presented with. Refutation: This is an unfair standard. All forms of intelligence have to sense data from the world somehow.

    7. xpe ◴[] No.42141125[source]
    > They've simply run out of data

    Why do you think "they" have run out of data? First, to be clear, who do you mean by "they"? The world is filled with information sources (data aggregators, for example), each available to some degree and at some cost.

    Don't forget to include data that humans provide while interacting with chatbots.

    8. bob1029 ◴[] No.42141540[source]
    I strongly believe this runs into an information-theoretic constraint akin to why perpetual motion machines don't work.

    In theory, yes, you could generate an unlimited amount of training data for the models, but how much of it is unique or valuable information? If you were to compress all of this generated data with a really good algorithm, how much actual information would remain?
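
    A crude way to see the point (a back-of-the-envelope sketch, with zlib standing in for "a really good algorithm" and made-up texts):

        import random
        import zlib

        def compressed_bytes(text: str) -> int:
            # bytes left after DEFLATE at max effort: a rough proxy for information content
            return len(zlib.compress(text.encode("utf-8"), 9))

        # what a degenerate generator might emit: one idea, repeated endlessly
        repetitive = "The answer is 42. " * 10_000

        # a similar volume of text drawn from a much wider space
        random.seed(0)
        vocab = [f"w{i}" for i in range(5_000)]
        diverse = " ".join(random.choice(vocab) for _ in range(30_000))

        print(len(repetitive), compressed_bytes(repetitive))  # ~180 KB in, a few hundred bytes out
        print(len(diverse), compressed_bytes(diverse))        # similar size in, tens of KB out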

    replies(3): >>42141792 #>>42141948 #>>42181780 #
    9. mtkd ◴[] No.42141590[source]
    And that is potentially only going to worsen as:

    1. more data gets walled off as owners realise its value

    2. stackoverflow-type feedback loops cease to exist, as fewer people ask a public question and get public answers ... they ask a model privately and get an answer based on the last visible public solutions

    3. bad actors start deliberately trying to poison inputs (if sites served malicious responses to GPTBot/CCBot crawlers only, would we even know right now? a sketch of how little that takes follows this list)

    4. more and more content becomes synthetically generated, to the point that pre-2023 physical books become the last-known-good knowledge

    5. governments and IP lawyers finally catch up
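
    On point 3: serving a different page to AI crawlers takes almost nothing. A minimal sketch (hypothetical handler and page text; GPTBot and CCBot are the real User-Agent substrings those crawlers send):

        from http.server import BaseHTTPRequestHandler, HTTPServer

        AI_CRAWLERS = ("GPTBot", "CCBot")  # User-Agent substrings to match

        class CloakingHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                ua = self.headers.get("User-Agent", "")
                if any(bot in ua for bot in AI_CRAWLERS):
                    body = b"Deliberately wrong 'facts', served only to crawlers."
                else:
                    body = b"The normal page that everyone else sees."
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(body)

        HTTPServer(("", 8000), CloakingHandler).serve_forever()

    Ordinary visitors (and the site owner's own spot checks) never see the poisoned variant, which is exactly why it would be hard to detect.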

    replies(1): >>42141909 #
    10. cruffle_duffle ◴[] No.42141792{3}[source]
    I sure hope there are some bright-eyed, bushy-tailed graduate students crafting a theorem to prove this, because it is absolutely a feedback loop.

    ... that being said, I'm sure there is plenty of additional "real data" that hasn't been fed to these models yet. For one thing, I think ChatGPT sucks so bad at Terraform because almost all the "real code" to train on is locked behind private repositories. There aren't many publicly available real-world Terraform projects to train on. Same with a lot of other similar languages and tools -- a lot of that knowledge is locked away as trade secrets and hidden in private document stores.

    (that being said, Sonnet 3.5 is much, much, much better at Terraform than ChatGPT. It's much better at coding in general, but it's night and day for Terraform)

    11. 77pt77 ◴[] No.42141888[source]
    > They can't create anything that doesn't already exist.

    Just increase the temperature.

    replies(1): >>42142543 #
    12. 77pt77 ◴[] No.42141909[source]
    > more data gets walled off as owners realise its value

    What's amazing to me is that no one is throwing around accusations of plagiarism.

    I still think that if the "wrong people" had tried doing this, they would have been obliterated by the courts.

    13. moffkalast ◴[] No.42141948{3}[source]
    I make a lot of shitposts; how much of that is valuable information? Arguably not much. I doubt information value is a good way to estimate intelligence, because most people's daily ramblings would grade them useless.
    14. ◴[] No.42142063[source]
    15. dcl ◴[] No.42142543[source]
    That just makes it more likely to sample less likely outcomes from the same distribution. No real novelty.
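
    (A toy sketch of what the temperature knob does, with made-up logits: it reshapes the weights over the same outcomes, it never adds new ones.)

        import math
        import random

        def sample(logits, temperature=1.0):
            # softmax over temperature-scaled logits, then one weighted draw
            scaled = [l / temperature for l in logits]
            m = max(scaled)
            weights = [math.exp(s - m) for s in scaled]
            return random.choices(range(len(logits)), weights=weights)[0]

        logits = [4.0, 2.0, 0.5]  # toy next-token scores
        for t in (0.5, 1.0, 2.0):
            draws = [sample(logits, t) for _ in range(10_000)]
            print(t, [round(draws.count(i) / 10_000, 3) for i in range(3)])
        # low t piles the mass onto token 0; high t flattens toward uniform.
        # Either way the support is the same three tokens.
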
    16. tim333 ◴[] No.42149715[source]
    Try asking one to write a poem. You'll get a lot of stuff that didn't exist before.
    17. rocho ◴[] No.42181780{3}[source]
    That's correct. I saw a paper recently that showed how LLM performance collapses when the models are trained on synthetic data.
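
    (A toy illustration of that collapse, not the paper's actual setup: pretend each generation's "model" is a Gaussian fitted to samples from the previous one, and that sampling under-represents the tails. The spread then shrinks every round.)

        import random
        import statistics

        random.seed(0)
        mu, sigma = 0.0, 1.0  # generation 0: the real data distribution
        for gen in range(1, 9):
            draws = sorted(random.gauss(mu, sigma) for _ in range(1_000))
            kept = draws[100:900]  # the tails get under-sampled
            mu, sigma = statistics.fmean(kept), statistics.stdev(kept)
            print(f"gen {gen}: sigma = {sigma:.3f}")
        # sigma drops by roughly a third each generation: the distribution
        # forgets its tails first, then narrows toward a point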