20 years of Google Scholar

(blog.google)

mananaysiempre ◴[18 Nov 24 18:51 UTC] No.42175600[source]▶

21. Google Scholar will deny access to you if you (need to) self-host a VPN on a common VPS provider. Being a Google product, it also can’t be special-cased in your routing table. (I genuinely had to retrain myself to use Google Scholar again once I no longer had that need.)

22. Switching on sort by date will impose a filter to papers published within the year, and you cannot do anything about that.

replies(2): >>42175802 #>>42175847 #

eesmith ◴[18 Nov 24 19:11 UTC] No.42175802[source]▶

>>42175600 #

> 22. Switching on sort by date will impose a filter to papers published within the year, and you cannot do anything about that.

!!! And here I thought it's been broken for years, and a sign of decay due to lack of internal support.

replies(1): >>42175856 #

1. buildbot ◴[18 Nov 24 19:17 UTC] No.42175856[source]▶

>>42175802 #

I swear this was working for me until literally today; it was really useful for finding older ML papers?!

replies(1): >>42176416 #

2. mananaysiempre ◴[18 Nov 24 20:15 UTC] No.42176416[source]▶

>>42175856 (TP) #

There is filter by date and sort by date. The former works. The latter, when enabled, even adds a banner on top of the page (in large but gray type) that says “Articles added in the last year, sorted by date”, and resets any filter you might have set before.

replies(2): >>42176828 #>>42177517 #

3. MichaelZuo ◴[18 Nov 24 20:48 UTC] No.42176828[source]▶

>>42176416 #

Was this change ever logged or noted in some way? Or did it just show up one day?

replies(1): >>42177258 #

4. philipkglass ◴[18 Nov 24 21:28 UTC] No.42177258{3}[source]▶

>>42176828 #

If it ever returned time-sorted results without limit, that was long in the past. It has truncated results to one year for the last several years I have used Scholar.

replies(1): >>42177506 #

5. crazygringo ◴[18 Nov 24 21:52 UTC] No.42177506{4}[source]▶

>>42177258 #

It seems so intentionally "broken", I can only guess it is to prevent scraping? Since searching for generic-ish search terms and sorting by date is a common scraping strategy.

Still, you'd think they'd do a cutoff of e.g. 500 or 1,000 items rather than filter by the past year.

So I can't help but wonder if it's a contractual limitation insisted on by publishers? Since the publishers also don't want all their papers being spidered via Scholar? It feels kind of like a limitation a lawyer came up with.

replies(2): >>42181040 #>>42186337 #

6. ◴[18 Nov 24 21:53 UTC] No.42177517[source]▶

>>42176416 #

7. eesmith ◴[19 Nov 24 08:13 UTC] No.42181040{5}[source]▶

>>42177506 #

Unlikely, since the easy workaround for scrapers is to search by date range and grab things that way. That's what I do now manually.
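
The date-range approach described above can be sketched roughly as follows. This is a minimal illustration, assuming the `as_ylo` and `as_yhi` query parameters that Scholar's year-range filter puts in the URL; the helper names `scholar_year_range_url` and `year_windows` are hypothetical. The code only builds URLs to open by hand and does not fetch anything.

```python
from urllib.parse import urlencode

SCHOLAR_SEARCH = "https://scholar.google.com/scholar"


def scholar_year_range_url(query, year_from, year_to):
    """Build a Scholar search URL limited to a publication-year range.

    Assumption: as_ylo / as_yhi are the lower and upper year bounds,
    as seen in URLs produced by the "Custom range" filter.
    """
    if year_from > year_to:
        raise ValueError("year_from must not exceed year_to")
    params = {"q": query, "as_ylo": year_from, "as_yhi": year_to}
    return SCHOLAR_SEARCH + "?" + urlencode(params)


def year_windows(first, last, width=1):
    """Split [first, last] into consecutive inclusive windows of `width` years."""
    if width < 1:
        raise ValueError("width must be at least 1")
    windows = []
    start = first
    while start <= last:
        end = min(start + width - 1, last)
        windows.append((start, end))
        start = end + 1
    return windows


if __name__ == "__main__":
    # One URL per two-year window, oldest first.
    for lo, hi in year_windows(2010, 2014, width=2):
        print(scholar_year_range_url("graph neural networks", lo, hi))
```

Stepping through narrow windows gives a coarse chronological ordering without the one-year limit that sort by date applies.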

8. asdff ◴[19 Nov 24 17:58 UTC] No.42186337{5}[source]▶

>>42177506 #

PubMed is literally built for academic scraping. It even has a command-line interface for accessing it. If publishers were worried about scraping, they'd target that, but they don't. In fact, a year after papers go on PubMed, they are rehosted by PubMed Central and made freely available to anyone in the world.
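
For comparison, a date-bounded PubMed search through NCBI's E-utilities `esearch` endpoint might look like the sketch below. It uses the HTTP interface rather than the command-line tools mentioned above, and the helper name `pubmed_esearch_url` and the example search term are hypothetical. The code only builds the request URL; fetching it, and staying within NCBI's usage guidelines, is left to the reader.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_esearch_url(term, min_year, max_year, retmax=20):
    """Build an ESearch URL for PubMed records published in a year range."""
    if min_year > max_year:
        raise ValueError("min_year must not exceed max_year")
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # free-text query
        "datetype": "pdat",      # filter on publication date
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": retmax,        # number of record IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(pubmed_esearch_url("protein folding", 2005, 2006))
```

The response to such a request is a list of record IDs, which would then be passed to a separate fetch call to retrieve the records themselves.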
