But that's the point! If we take the effort to understand, to really understand something on a deeper level, out of even research, then how can anything useful be built on top of it? Is everything going to lose all depth and become shallow?
Edit: spelling.
[1]: https://arstechnica.com/ai/2025/09/science-journalists-find-...
Yes, I will get right on that. I believe that killing you is the right strategy to help you escape from a world where AI takes over every aspect of human existence in such a way that all those aspects are degraded.
I'm still alive.
That is a very good point, and I am sorry.
Does it take only the repository as input, or does it also consume the paper itself?
https://news.ycombinator.com/item?id=45331233
> Is it that crazy? He's doing exactly what the AI boosters have told him to do.
I think we're starting to see the first real AI "harms" shake out, after some years of worrying it might swear or tell you how to make a molotov cocktail.
People are getting convinced, by hype men and by sycophantic LLMs themselves, that access to a chatbot suddenly grants them polymath abilities in any field, and are acting out their dumb ideas without pushback, until the buck finally stops, hopefully with just some wasted time and reputation damage.
People should of course continue to use LLMs as they see fit - I just think the branding of work like this gives the impression that they can do more than they can, and will encourage the kind of behavior I mention.
> AI agents are autonomous systems that can reason about tasks and act to achieve goals by leveraging external tools and resources [4]. Modern AI agents are typically powered by large language models (LLMs) connected to external tools or APIs. They can perform reasoning, invoke specialized models, and adapt based on feedback [5]. Agents differ from static models in that they are interactive and adaptive. Rather than returning fixed outputs, they can take multi-step actions, integrate context, and support iterative human–AI collaboration. Importantly, because agents are built on top of LLMs, users can interact with agents through human language, substantially reducing usage barriers for scientists.
So more-or-less an LLM running tools in a loop. I'm guessing "invoke specialized models" is achieved here by running a tool call against some other model.
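For concreteness, that loop is small enough to sketch - something like the following, where the model name, the tool schema, and the run_tool dispatcher are all placeholders of mine, not anything from the paper:

    import json
    from openai import OpenAI

    client = OpenAI()
    tools = [{  # placeholder schema for a "specialized model" tool
        "type": "function",
        "function": {
            "name": "run_specialized_model",
            "description": "Call an external model or API",
            "parameters": {"type": "object", "properties": {"query": {"type": "string"}}},
        },
    }]
    messages = [{"role": "user", "content": "Reproduce the main experiment from this paper..."}]

    while True:
        resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:       # no tool requested: the agent is done
            print(msg.content)
            break
        for call in msg.tool_calls:  # "invoke specialized models" is just another tool call
            result = run_tool(call.function.name, json.loads(call.function.arguments))  # placeholder dispatcher
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})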
FYI - this is actually happening right meow. And most young profs are writing their grants using AI. The biggest issue with the latter? It's hard to tell the difference, given how many grants are just rehashing the same stuff over and over.
Which is a pretty big limitation in terms of things you can safely use them for!
Isn't the point to put the time into those things? At some point aren't those the things one should choose to put time into?
The academic style of writing is almost purposefully as obtuse and dense and devoid of context as possible. Academia is trapped in all kinds of stupid norms.
If requiring more effort to reach the same understanding were a good thing, we should be making papers much harder to read than they currently are.
Why are the specific things they are doing a problem? Automatically building the pipelines and code described in a paper, checking that they match the reported results, and then being able to execute them for the user's queries - is that a bad thing for understanding?
Lowering the investment to understand a specific paper could really help focus on the most relevant results, on which you can dedicate your full resources.
Although, as of now, I tend to favor approaches that only summarize rather than produce "active systems" -- given the approximate nature of LLMs, every step should be properly human-reviewed. So it's not clear what signal you can take away from such an AI approach to a paper.
Related, a few days ago: "Show HN: Asxiv.org – Ask ArXiv papers questions through chat"
https://news.ycombinator.com/item?id=45212535
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Much more damage is done if the understanding that you get is wrong.
I'd actually like a change from the other end. Instead of "make agents so good they can implement complex papers", how about "write the paper so plainly that current agents can implement a reproduction"?
I have a blog post detailing this: https://blog.toolkami.com/alphaevolve-toolkami-style/
Not a bad line of thinking, especially if you're microdosing, but I find myself turning off reasoning more frequently than I'd expected, considering it's supposed to be objectively better.
Which we're presenting on tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8
Difference is we've put this out months ago: https://x.com/smellslikeml/status/1958495101153357835
About to get PR #908 merged so anyone can use one of the nearly 1K Docker images we've already built: https://github.com/arXiv/arxiv-browse/pull/908
We've been publishing about this all Summer on Substack and Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1loj134/arxiv2d...
This may change as our RL methods get better at properly rewarding correct partial traces and penalizing overthinking, but for the moment there's often a stark difference between when a multi-step process improves the model's ability to reason through the context and when it doesn't.
This is made more complicated (for human prompters and evaluators) by the fact that (as Anthropic has demonstrated) the text of the reasoning trace means something very different to the model than it does to the human interpreting it. The reasoning the model claims it is doing can sometimes be worlds away from the actual calculations (e.g., how it uses helical structures to do addition [1]).
Which we're presenting on tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8
We go beyond testing the quickstart to implement the core methods from arXiv papers as draft PRs for your target repo.
Posted a while ago: https://news.ycombinator.com/item?id=45132898
Feel free to join us for the technical deep-dive tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8
We thought this through as we built a system that goes beyond running the quickstart to implement the core methods of arXiv papers as draft PRs for YOUR target repo.
Running the quickstart in a sandbox is practically useless.
To limit the attack surface we added PR#1929 to AG2 so we could pass API keys to the DockerCommandLineCodeExecutor and use egress whitelisting to limit the ability of an agent to reach out to a compromised server: https://github.com/ag2ai/ag2/pull/1929
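The general shape of that, sketched with docker-py rather than the actual AG2 API from that PR - the image name, network name, and token variable are illustrative:

    import os
    import docker

    client = docker.from_env()
    # Keys are injected as env vars at run time instead of being baked into the image,
    # and the container joins a pre-created network whose gateway only forwards
    # traffic to whitelisted hosts (e.g. pypi.org, huggingface.co).
    container = client.containers.run(
        "remyxai/example-paper:latest",              # illustrative image name
        command="python quickstart.py",
        environment={"HF_TOKEN": os.environ["HF_TOKEN"]},
        network="egress-allowlist",                  # assumed to be created out of band
        detach=True,
    )
    container.wait()
    print(container.logs().decode())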
We'd been talking publicly about this for at least a month before this publication, and along the way we've built up nearly 1K Docker images for arXiv paper code: https://hub.docker.com/u/remyxai
We're close to seeing these images linked to the arXiv papers after PR#908 is merged: https://github.com/arXiv/arxiv-browse/pull/908
And we're actually doing a technical deep-dive with the AG2 team on our work tomorrow at 9am PST: https://calendar.app.google/3soCpuHupRr96UaF8
Here's the last time we showed our demo on HN: https://news.ycombinator.com/item?id=45132898
We'll actually be presenting on this tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8
Besides ReAct, we use AG2's two-agent pattern, with a Code Writer and a Code Executor running in the DockerCommandLineCodeExecutor.
We also use hardware monitors and LLM-as-a-Judge to assess task completion.
It's how we've built nearly 1K Docker images for arXiv papers over the last couple months: https://hub.docker.com/u/remyxai
And how we'll support computational reproducibility by linking Docker images to the arXiv paper publications: https://github.com/arXiv/arxiv-browse/pull/908
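In code, it's roughly the standard AG2 writer/executor pairing - the model, timeout, and work_dir below are placeholders, not our production config:

    from autogen import ConversableAgent
    from autogen.coding import DockerCommandLineCodeExecutor

    executor = DockerCommandLineCodeExecutor(
        image="python:3.11-slim",     # base image for the sandbox
        timeout=600,                  # quickstarts from papers can be slow
        work_dir="paper_workspace",
    )

    code_writer = ConversableAgent(
        "code_writer",
        system_message="Write Python code blocks that implement the paper's core method.",
        llm_config={"config_list": [{"model": "gpt-4o"}]},
        code_execution_config=False,
    )

    code_executor = ConversableAgent(
        "code_executor",
        llm_config=False,
        code_execution_config={"executor": executor},
        human_input_mode="NEVER",
    )

    code_executor.initiate_chat(
        code_writer,
        message="Implement and run the quickstart from the paper's repo.",
        max_turns=10,
    )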
That's why we began building agents to source ideas from the arXiv and implement the core methods from the papers in YOUR target repo months before this publication.
We shared the demo video of it in our production system a while back: https://news.ycombinator.com/item?id=45132898
And we're offering a technical deep-dive into how we built it tomorrow at 9am PST with the AG2 team: https://calendar.app.google/3soCpuHupRr96UaF8
We've built up to 1K Docker images over the past couple months which we make public on DockerHub: https://hub.docker.com/u/remyxai
And we're close to an integration with arXiv that will have these pre-built images linked to the papers: https://github.com/arXiv/arxiv-browse/pull/908
We've been pushing it further to implement draft PRs in your target repo, and published that work a month before this preprint: https://remyxai.substack.com/p/paperswithprs
To limit the attack surface we added PR#1929 to AG2 so we could pass API keys to the DockerCommandLineCodeExecutor but also use egress whitelisting to block the ability of an agent to reach a compromised server: https://github.com/ag2ai/ag2/pull/1929
Since then, we've been scaling this with k8s ray workers so we can run this in the cloud to build for the hundreds of papers published daily.
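Roughly, it's one Ray task per paper - the fetch/build helpers and the paper list here are stand-ins for our pipeline:

    import ray

    ray.init(address="auto")  # attach to the Ray cluster running on k8s

    @ray.remote(num_cpus=2)
    def build_and_test(arxiv_id: str) -> str:
        repo = fetch_paper_code(arxiv_id)   # stand-in: clone the code linked from the paper
        image = build_image(repo)           # stand-in: docker build inside the worker
        run_quickstart(image)               # stand-in: sandboxed execution and checks
        return image

    futures = [build_and_test.remote(arxiv_id) for arxiv_id in todays_papers]
    images = ray.get(futures)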
With the code running in Docker, the network interface constrained, deployment in the cloud, and ultimately humans in the loop through PR review, it's hard to see where a prompt-injection attack comes into play from testing the code.
Would love to get feedback from an expert on this: can you imagine an attack scenario, Simon?
I'll need to work out a check for the case where someone creates a paper with code instructing my agent to publish keys to a public HF repo for others to exfiltrate.
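A first pass might just scan the agent's proposed diff for anything that looks like a credential before the PR is opened - the patterns below are a non-exhaustive guess, and proposed_diff stands in for the agent's generated changes:

    import re

    # Non-exhaustive; dedicated scanners (gitleaks, trufflehog) cover far more patterns.
    SECRET_PATTERNS = [
        re.compile(r"hf_[A-Za-z0-9]{30,}"),   # Hugging Face tokens
        re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style keys
        re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
    ]

    def leaks_secrets(diff_text: str) -> bool:
        return any(p.search(diff_text) for p in SECRET_PATTERNS)

    if leaks_secrets(proposed_diff):  # proposed_diff: stand-in for the agent's output
        raise RuntimeError("Possible credential in agent output; refusing to open the PR.")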