412 points thepuppet33r | 8 comments
mananaysiempre ◴[] No.42175600[source]
21. Google Scholar will deny access to you if you (need to) self-host a VPN on a common VPS provider. Being a Google product, it also can’t be special-cased in your routing table. (I genuinely had to retrain myself to use Google Scholar again once I no longer had that need.)

22. Switching on sort by date will impose a filter to papers published within the year, and you cannot do anything about that.

replies(2): >>42175802 #>>42175847 #
eesmith ◴[] No.42175802[source]
> 22. Switching on sort by date will impose a filter to papers published within the year, and you cannot do anything about that.

!!! And here I thought it had been broken for years, a sign of decay due to lack of internal support.

replies(1): >>42175856 #
1. buildbot ◴[] No.42175856[source]
I swear this was working for me until literally today. It was really useful for finding older ML papers?!
replies(1): >>42176416 #
2. mananaysiempre ◴[] No.42176416[source]
There is filter by date and sort by date. The former works. The latter, when enabled, even adds a banner on top of the page (in large but gray type) that says “Articles added in the last year, sorted by date”, and resets any filter you might have set before.
replies(2): >>42176828 #>>42177517 #
3. MichaelZuo ◴[] No.42176828[source]
Was this change ever logged or noted in some way? Or did it just show up one day?
replies(1): >>42177258 #
4. philipkglass ◴[] No.42177258{3}[source]
If it ever returned time-sorted results without limit, that was long in the past. It has truncated results to one year for the last several years I have used Scholar.
replies(1): >>42177506 #
5. crazygringo ◴[] No.42177506{4}[source]
It seems so intentionally "broken", I can only guess it is to prevent scraping? Since searching for generic-ish search terms and sorting by date is a common scraping strategy.

Still, you'd think they'd do a cutoff of e.g. 500 or 1,000 items rather than filter by the past year.

So I can't help but wonder if it's a contractual limitation insisted on by publishers? Since the publishers also don't want all their papers being spidered via Scholar? It feels kind of like a limitation a lawyer came up with.

replies(2): >>42181040 #>>42186337 #
6. ◴[] No.42177517[source]
7. eesmith ◴[] No.42181040{5}[source]
Unlikely, since the easy work-around for scrapers is to search by date range and grab things that way. That's what I do now manually.
8. asdff ◴[] No.42186337{5}[source]
PubMed is literally built for academic scraping. It even has a command-line interface for accessing it. If publishers were worried about scraping, they'd target that, but they don't. In fact, when papers go on PubMed, after a year they are rehosted by PubMed Central and made freely available to anyone in the world.
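
As a rough illustration of the "search by date range" approach against PubMed's programmatic interface, here is a minimal Python sketch that only builds a query URL for the NCBI E-utilities ESearch endpoint and performs no network request. The helper name `pubmed_search_url` and its defaults are hypothetical, chosen for this example; the query parameters (`db`, `term`, `datetype`, `mindate`, `maxdate`, `sort`, `retmax`, `retmode`) are the ones I understand ESearch to accept, so check them against the NCBI documentation before relying on them.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, year_from, year_to, retmax=20):
    """Build an ESearch URL limited to a publication-date range.

    Hypothetical helper for illustration only; it does not send the request.
    """
    params = {
        "db": "pubmed",             # search the PubMed database
        "term": term,               # free-text query
        "datetype": "pdat",         # interpret the range as publication dates
        "mindate": str(year_from),  # start of range (inclusive)
        "maxdate": str(year_to),    # end of range (inclusive)
        "sort": "pub_date",         # assumed ESearch sort key for date ordering
        "retmax": str(retmax),      # maximum number of IDs to return
        "retmode": "json",          # ask for JSON instead of the default XML
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Example: older ML papers from an arbitrary window of years.
    print(pubmed_search_url("machine learning", 2012, 2014))
```

Stepping the year window (for example, one year per request) is the same trick eesmith describes doing by hand on Scholar, just applied to an interface that is meant to be queried this way.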