262 points lawrencechen | 10 comments

0github.com is a pull request viewer that color-codes every diff line/token by how much human attention it probably needs. Unlike PR-review bots, we try to flag not just by "is it a bug?" but by "is it worth a second look?" (examples: hard-coded secret, weird crypto mode, gnarly logic, ugly code).

To try it, replace github.com with 0github.com in any pull-request URL. Under the hood, we split the PR into individual files, and for each file, we ask an LLM to annotate each line with a data structure that we parse into a colored heatmap.
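A minimal sketch of the parsing step described above, assuming the LLM returns a JSON list with one entry per flagged line. The field names (`line`, `score`, `reason`), the 0 to 1 score scale, and the yellow color value are hypothetical and are not taken from the actual 0github.com format.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LineAnnotation:
    line: int      # line number within the file's diff
    score: float   # hypothetical scale: 0.0 (skim) to 1.0 (look closely)
    reason: str    # explanation that could be shown on hover


def parse_annotations(raw: str) -> List[LineAnnotation]:
    """Parse the (assumed) JSON output of the model, clamping scores to [0, 1]."""
    annotations = []
    for item in json.loads(raw):
        score = min(1.0, max(0.0, float(item["score"])))
        annotations.append(
            LineAnnotation(int(item["line"]), score, str(item.get("reason", "")))
        )
    return annotations


def highlight_color(score: float, threshold: float = 0.0) -> Optional[str]:
    """Return a CSS color for a line, or None if it falls below the threshold."""
    if score <= 0.0 or score < threshold:
        return None
    # Higher scores get a more opaque (visually darker) yellow.
    alpha = round(0.15 + 0.85 * score, 2)
    return f"rgba(255, 214, 0, {alpha})"
```

The `threshold` argument plays the role of a "should review" cutoff: raising it hides the fainter highlights.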

Examples:

https://0github.com/manaflow-ai/cmux/pull/666

https://0github.com/stack-auth/stack-auth/pull/988

https://0github.com/tinygrad/tinygrad/pull/12995

https://0github.com/simonw/datasette/pull/2548

Notice how all the example links have a 0 prepended before github.com. This navigates you to our custom diff viewer where we handle the same URL path parameters as github.com. Darker yellows indicate that an area might require more investigation. Hover on the highlights to see the LLM's explanation. There's also a slider on the top left to adjust the "should review" threshold.
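For scripting the URL swap, a small sketch using only the Python standard library. The function name `to_heatmap_url` is made up for illustration; the only behavior assumed is what the post states, namely that the path is kept and the host becomes 0github.com.

```python
from urllib.parse import urlsplit, urlunsplit


def to_heatmap_url(pr_url: str) -> str:
    """Rewrite a github.com pull-request URL to its 0github.com equivalent."""
    parts = urlsplit(pr_url)
    if parts.hostname not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {pr_url}")
    # Keep scheme, path, query, and fragment; swap only the host.
    return urlunsplit(
        (parts.scheme, "0github.com", parts.path, parts.query, parts.fragment)
    )
```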

Repo (MIT license): https://github.com/manaflow-ai/cmux

1. kburman ◴[] No.45763248[source]
It’s an interesting direction, but feels pretty expensive for what might still be a guess at what matters.

I’m not sure an LLM can really capture project-specific context yet from a single PR diff.

Honestly, a simple data-driven heatmap showing which parts of the code change most often or correlate with past bugs would probably give reviewers more trustworthy signals.
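As a rough illustration of the "change most often" half of that idea, the sketch below counts how many commits touched each file, given the text output of `git log --numstat --pretty=format:`. It is only one possible baseline: it works per file rather than per line, and it does not attempt the correlation with past bugs.

```python
from collections import Counter


def churn_by_file(numstat_log: str) -> Counter:
    """Count commits touching each file from `git log --numstat` output.

    Each numstat line looks like "<added>\t<deleted>\t<path>"; blank lines
    separate commits and are skipped.
    """
    counts = Counter()
    for line in numstat_log.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator or unexpected line
        _added, _deleted, path = parts
        counts[path] += 1
    return counts
```

Calling `churn_by_file(log).most_common(10)` would then list the ten most frequently changed files.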

replies(5): >>45763479 #>>45764303 #>>45765157 #>>45765672 #>>45765995 #
2. lawrencechen ◴[] No.45763479[source]
Yeah this is honestly pretty expensive to run today.

> I’m not sure an LLM can really capture project-specific context yet from a single PR diff.

We had an even more expensive approach that cloned the repo into a VM and prompted codex to explore the codebase and run code before returning the heatmap data structure. Decided against it for now due to latency and cost, but I think we'll revisit it to help the LLM get project context.

Distillation should help a bit with cost, but I haven't experimented enough to have a definitive answer. Excited to play around with it though!

> which parts of the code change most often or correlate with past bugs

The only way I can think of to do the correlation would require LLMs. Maybe I'm missing a simpler approach? But I agree that conditioning on past bugs would be great.

replies(2): >>45763902 #>>45765217 #
3. kburman ◴[] No.45763902[source]
For the correlation idea, you might take a look at how Sentry does it: they rely mostly on stack traces, error messages, and pattern matching to map issues back to code areas. It’s cheap, scalable, and doesn’t need an LLM in the loop, which could be a good baseline before layering anything heavier on top.
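To make the "pattern matching on stack traces" baseline concrete, here is a hedged sketch that is not Sentry's implementation. It assumes the errors are plain Python tracebacks and simply counts how many tracebacks mention each file.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the frame lines of a standard Python traceback, e.g.
#   File "app/db.py", line 42, in connect
FRAME = re.compile(r'File "([^"]+)", line (\d+)')


def error_hits_by_file(tracebacks: Iterable[str]) -> Counter:
    """Count, per file, the number of tracebacks with a frame in that file."""
    counts = Counter()
    for tb in tracebacks:
        # Use a set so a file counts once per traceback, even with many frames.
        files = {path for path, _line in FRAME.findall(tb)}
        counts.update(files)
    return counts
```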

As for interactive reviews, one workflow I’ve found surprisingly useful is letting Claude Code simulate a conversation between two developers pair-programming through the PR. It’s not perfect, but in practice the dialogue and clarifying questions it generates often give me more insight than a single-shot LLM summary. You might find it an interesting pattern to experiment with once you revisit the more context-aware approaches.

4. nonethewiser ◴[] No.45764303[source]
A large portion of the lines of code I'm considering when I review a PR are not part of the diff. This has to be a common experience: think of how often you want to comment on a line of code or a file that just isn't in the PR. It happens on almost every PR for me. They materialize as loose comments, or as comments on a line like "Not this line per se, but what about XYZ?" or "You replaced this in 3 places, but I actually found 2 more it should be applied to."

I mean these tools are fine. But let's be on the same page that they can only address a sub-class of problems.

5. CuriouslyC ◴[] No.45765157[source]
This is not that expensive with Gemini: they give out free keys with plenty of requests per day, so you can upload your diff plus a bundle of the relevant part of the codebase and get this behavior for free, at least for a small team with ~10-20 PRs / day. That's if you could run this with personal keys, anyhow.
replies(1): >>45766386 #
6. CuriouslyC ◴[] No.45765217[source]
Gemini is better than the GPT-5 variants for large context. Also, agents tend to be bad at gathering an optimal context set. The best approach is to intelligently select from the codebase to generate a "covering set" of everything touched in the PR, make a bundle, and fire it off at Gemini as a one-shot. Because of caching, you can even fire off multiple queries to Gemini instructing it to evaluate the PR from different perspectives for cheap.
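The comment does not define "covering set", so the following is one possible reading, sketched under assumptions: start from the files changed in the PR and pull in files that are reachable through a dependency graph, up to a fixed depth. The `deps` mapping (file to the files it imports or calls into) is hypothetical and would have to be built by some other tool.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def covering_set(
    changed: Iterable[str],
    deps: Dict[str, List[str]],
    max_depth: int = 1,
) -> Set[str]:
    """Breadth-first expansion of the changed files through `deps`."""
    seen = set(changed)
    queue = deque((path, 0) for path in seen)
    while queue:
        path, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth budget
        for neighbor in deps.get(path, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen
```

The resulting file set could then be concatenated with the diff into a single prompt; `max_depth` trades context size against coverage.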
replies(1): >>45765821 #
7. ivanjermakov ◴[] No.45765672[source]
Premise is amazing. Wonder if there are tools that do something similar by looking at diff entropy.
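"Diff entropy" is not defined in the comment. One simple interpretation, sketched here as an assumption, is the character-level Shannon entropy of each added line in a unified diff, where unusually high values can hint at things like embedded keys or minified blobs.

```python
import math
from collections import Counter
from typing import List, Tuple


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def added_line_entropy(diff: str) -> List[Tuple[str, float]]:
    """Return (line, entropy) for each added line of a unified diff."""
    results = []
    for line in diff.splitlines():
        # Added lines start with "+"; "+++" marks the new-file header.
        if line.startswith("+") and not line.startswith("+++"):
            content = line[1:]
            results.append((content, shannon_entropy(content)))
    return results
```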
8. lawrencechen ◴[] No.45765821{3}[source]
Yeah, adding a context gathering step is a good idea. Our original approach used codex cli in a VM, so context gathering was pretty comprehensive. We switched to a more naive approach due to latency, but having a step using a smaller model (like SWE-grep) could be a nice tradeoff.
9. cerved ◴[] No.45765995[source]
> Honestly, a simple data-driven heatmap showing which parts of the code change most often or correlate with past bugs would probably give reviewers more trustworthy signals.

At first I thought this too, but now I doubt that's a good heuristic. That's probably where people would be careful and/or look anyway. If I were to guess, regressions are less likely to occur in "hotspots".

But this is just a hunch. There are tons of well-reviewed open source projects with public bug reports; it would be interesting if someone tested it.

10. fluoridation ◴[] No.45766386[source]
Might just be me, but I understood "expensive" in terms of raw computation necessary to get the answer. Some things aren't really worth computing, even if it's someone else footing the bill.