(huggingface.co)

321 points denysvitali | 3 comments | 02 Sep 25 20:14 UTC

1. cmdrk [03 Sep 25 03:44 UTC] No.45112073

Does their training corpus respect copyrights, or do you have to follow their opt-out procedure to keep them from consuming your data? Assuming it’s the latter, it’s open-er but still not quite there.

replies(2): >>45142143 #>>45142449 #

2. traspler [05 Sep 25 18:46 UTC] No.45142143

>>45112073 (TP) #

Afaik they respect robots.txt on crawl, and later, when using the data, they re-check robots.txt and exclude the data if it has since been updated to deny access. They have further data filtering, but for that you'd better check the technical report.
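
For illustration only, the "re-check robots.txt later and drop what is now denied" idea can be sketched with Python's standard `urllib.robotparser`. This is not the Apertus pipeline; the function names, the `ExampleBot` user agent, the example hosts, and the choice to keep documents from hosts with no re-fetched robots.txt are all assumptions made for the sketch. For the real procedure, see the technical report.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def retroactive_filter(docs, current_robots, user_agent):
    """Keep only documents that the site's *current* robots.txt still allows.

    docs: list of (host, url) pairs collected at crawl time.
    current_robots: dict mapping host -> robots.txt text re-fetched later.
    Assumption: a host with no re-fetched robots.txt is treated as allowing access.
    """
    kept = []
    for host, url in docs:
        robots_txt = current_robots.get(host, "")
        if allowed(robots_txt, user_agent, url):
            kept.append((host, url))
    return kept


# Hypothetical crawl results and later robots.txt snapshots.
docs = [
    ("a.example", "https://a.example/public/page"),
    ("a.example", "https://a.example/private/page"),
    ("b.example", "https://b.example/post"),
]
current_robots = {
    "a.example": "User-agent: *\nDisallow: /private/\n",
    "b.example": "User-agent: *\nDisallow: /\n",
}
kept = retroactive_filter(docs, current_robots, "ExampleBot")
```

Here `kept` retains only the `a.example` public page: the private page is denied by the updated rules, and `b.example` now disallows everything.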

3. SparkyMcUnicorn [05 Sep 25 19:13 UTC] No.45142449

>>45112073 (TP) #

Your question is addressed in the opening abstract: https://github.com/swiss-ai/apertus-tech-report/raw/refs/hea...

> Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for copyrighted, non-permissive, toxic, and personally identifiable content.

Apertus 70B: Truly Open - Swiss LLM by ETH, EPFL and CSCS