
1016 points by QuinnyPig | 1 comment
consumer451:
Important details from the FAQ, emphasis mine:

> For users who access Kiro with Pro or Pro+ tiers once they are available, your content is not used to train any underlying foundation models (FMs). AWS might collect and use client-side telemetry and usage metrics for service improvement purposes. You can opt out of this data collection by adjusting your settings in the IDE. For the Kiro Free tier and during preview, your content, including code snippets, conversations, and file contents open in the IDE, unless explicitly opted out, may be used to enhance and improve the quality of FMs. Your content will not be used if you use the opt-out mechanism described in the documentation. If you have an Amazon Q Developer Pro subscription and access Kiro through your AWS account with the Amazon Q Developer Pro subscription, then Kiro will not use your content for service improvement. For more information, see Service Improvement.

https://kiro.dev/faq/
lukev:
This brings up a tangential question for me.

Clearly, companies view the context fed to these tools as valuable. And it certainly has value in the abstract, as information about how they're being used or could be improved.

But is it really useful as training data? Sure, some new codebases might be fed in... but after that, given how context works and the way people are "vibe coding", 95% of the new input is just the output of previous LLMs.

While the utility of synthetic data proves that context collapse is not inevitable, it does seem to be a real concern... and I can say definitively based on my own experience that the _median_ quality of LLM-generated code is much worse than the _median_ quality of human-generated code. Especially since this would include all the code that was rejected during the development process.

Without substantial post-processing to filter out the bad input code, I question how valuable the context from coding agents is for training data. Again, it's probably quite useful for other things.
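As an illustration of what even a first-pass filter might look like, here is a toy sketch. The function name, the stub heuristic, and the 50% threshold are all my own inventions for the example; real training pipelines rely on much heavier machinery (deduplication, execution tests, model-based quality scoring):

```python
import ast

def looks_trainable(source: str) -> bool:
    """Toy heuristic for screening candidate Python training code.

    Two crude checks: the code must parse, and it must not be
    dominated by stub function bodies (`pass` / bare `raise`),
    which is one cheap proxy for abandoned or rejected agent output.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # unparseable snippets are dropped outright

    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return True  # no functions to judge; let later stages decide

    stubs = sum(
        1 for f in funcs
        if len(f.body) == 1 and isinstance(f.body[0], (ast.Pass, ast.Raise))
    )
    # Reject files where at least half the functions are empty stubs.
    return stubs / len(funcs) < 0.5
```

For example, `looks_trainable("def f():\n    pass\n")` is rejected, while a file of ordinary implemented functions passes. The point is only that filtering is cheap to start but hard to finish: heuristics like this catch the obvious junk, not the subtly wrong LLM output the comment is worried about.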

consumer451:
There is a company, maybe even a YC company, that I saw posting about wanting to pay people for private repos that died on the vine and were never released as products. I believe they were asking for pre-2022 code to avoid LLM taint. This was to be used as training data.

This is all a fuzzy memory, I could have multiple details wrong.