
152 points by Gaishan | 2 comments
V__ | No.45340359
> Conventional research papers require readers to invest substantial effort to understand and adapt a paper's code, data, and methods to their own work [...]

But that's the point! If we take away the effort to understand, to really understand something at a deeper level, even in research, then how can anything useful be built on top of it? Is everything going to lose its depth and become shallow?

ethin | No.45340411
Isn't this also a problem, given that ChatGPT at least is bad at summarizing scientific papers[1]? Idk about Claude or Gemini in that regard, but it's still a problem.

Edit: spelling.

[1]: https://arstechnica.com/ai/2025/09/science-journalists-find-...

andai | No.45341382
That study seems to predate the reasoning models. With them I have the opposite problem: I ask something simple and it responds with what reads like a scientific paper.
ijk | No.45347505
Of course "reads like" is part of the problem. The models are very good at producing something that reads like the kind of document I asked for and not as good at guaranteeing that the document has the meaning I intended.
1. andai | No.45363544
That is true. What I meant was, I'll ask it about some practical problem I'm dealing with in my life, and it will start talking about how to model it as a cybernetic system with inertia, springs, and feedback loops.

Not a bad line of thinking, especially if you're microdosing, but I find myself turning off reasoning more frequently than I'd expected, considering it's supposed to be objectively better.
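
For reference, here's roughly what that on/off toggle looks like at the API level. This is a minimal sketch assuming the Anthropic Python SDK and its extended-thinking parameter; the model name and token budget are placeholders, and other providers expose a similar switch (e.g. a reasoning-effort setting).

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def ask(question: str, reasoning: bool) -> str:
        """Send one question, with extended thinking either on or off."""
        kwargs = {}
        if reasoning:
            # Placeholder budget: the model may spend up to this many tokens thinking.
            kwargs["thinking"] = {"type": "enabled", "budget_tokens": 2048}
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model name
            max_tokens=4096,                   # must exceed the thinking budget
            messages=[{"role": "user", "content": question}],
            **kwargs,
        )
        # With thinking enabled the reply interleaves "thinking" and "text" blocks;
        # keep only the visible answer.
        return "".join(block.text for block in response.content if block.type == "text")

    print(ask("How do I stop my office chair from squeaking?", reasoning=False))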

2. ijk | No.45365345
I find that for more "intuitive" evaluations, reasoning tends to hurt more than it helps. In other words, if the model can do a one-shot classification correctly, adding a bunch of second-guessing just degrades performance.
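
A rough way to check that on your own data is to run the same labelled examples once with a plain "answer only" prompt and once with a "think step by step" prompt, then compare accuracy. A minimal sketch of that comparison; the sentiment task, the labels, and the complete() stand-in (wrap whichever client or local model you use) are all made up for illustration.

    from typing import Callable

    # Stand-in for your LLM call of choice (OpenAI, Anthropic, a local model, ...).
    Complete = Callable[[str], str]

    DIRECT = ("Label the sentiment of this review as POSITIVE or NEGATIVE. "
              "Answer with one word.\n\n{text}")
    REASONED = ("Label the sentiment of this review as POSITIVE or NEGATIVE. "
                "Think step by step, then give the label alone on the last line.\n\n{text}")

    # Tiny hand-made eval set; replace with your own labelled examples.
    EXAMPLES = [
        ("Arrived broken and support never replied.", "NEGATIVE"),
        ("Exactly as described, works great.", "POSITIVE"),
    ]

    def accuracy(template: str, complete: Complete) -> float:
        hits = 0
        for text, gold in EXAMPLES:
            reply = complete(template.format(text=text))
            # Score only the last line, so the step-by-step variant is judged on its final label.
            predicted = (reply.strip().splitlines() or [""])[-1].upper()
            hits += gold in predicted
        return hits / len(EXAMPLES)

    def compare(complete: Complete) -> None:
        print("direct  :", accuracy(DIRECT, complete))
        print("reasoned:", accuracy(REASONED, complete))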

This may change as our RL methods get better at properly rewarding correct partial traces and penalizing overthinking, but for the moment there's often a stark difference between the cases where a multi-step process improves the model's ability to reason through the context and the cases where it doesn't.

This is made more complicated (for human prompters and evaluators) by the fact that, as Anthropic has demonstrated, the text of the reasoning trace means something very different to the model than it does to the human reading it. The reasoning the model claims it is doing can sometimes be worlds away from the actual calculations (e.g., how it uses helical structures to do addition [1]).

[1] https://openreview.net/pdf?id=CqViN4dQJk