
121 points artski | 10 comments

When I came across a study that traced 4.5 million fake GitHub stars, it confirmed a suspicion I’d had for a while: stars are noisy. The issue is they’re visible, they’re persuasive, and they still shape hiring decisions, VC term sheets, and dependency choices—but they say very little about actual quality.

I wrote StarGuard to put that number in perspective, using my own methodology inspired by what they did, and to fold a broader supply-chain check into one command-line run.

It starts with the simplest raw input: every starred_at timestamp GitHub will give you. It applies a median-absolute-deviation test to locate sudden bursts. For each spike, StarGuard pulls a random sample of the accounts behind it and asks: how old is the user? Any followers? Any contribution history? Still using the default avatar? From that, it computes a Fake Star Index, between 0 (organic) and 1 (fully synthetic).
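
To make the burst-detection idea concrete, here's a minimal sketch of a MAD-based spike test; the day-bucketing, function name, and cutoff are illustrative rather than StarGuard's actual internals:

    # Illustrative sketch: flag days whose star count deviates strongly
    # from a typical day, using the median absolute deviation (MAD).
    from statistics import median

    def spike_days(stars_per_day, z_cutoff=3.5):
        """stars_per_day: dict mapping date -> stars gained that day."""
        if not stars_per_day:
            return []
        counts = list(stars_per_day.values())
        med = median(counts)
        mad = median(abs(c - med) for c in counts) or 1  # guard against zero MAD
        spikes = []
        for day, count in stars_per_day.items():
            # 0.6745 rescales the MAD so the score is comparable to a z-score
            score = 0.6745 * (count - med) / mad
            if score > z_cutoff:
                spikes.append((day, count, round(score, 2)))
        return spikes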

But inflated stars are just one issue. In parallel, StarGuard parses dependency manifests or SBOMs and flags common risk signs: unpinned versions, direct Git URLs, lookalike package names. It also scans licences—AGPL sneaking into a repo claiming MIT, or other inconsistencies that can turn into compliance headaches.
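
For a feel of the manifest checks, a simplified version of the unpinned-version and direct-URL flags might look like this; the regex and labels are my own approximation, not the exact rules in StarGuard:

    import re

    def manifest_risks(requirements_text):
        """Scan a requirements.txt-style manifest for common risk signs."""
        risks = []
        for line in requirements_text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("git+") or "://" in line:
                risks.append(("direct-url", line))  # dependency pulled straight from a URL
            elif not re.search(r"==\s*[\w.\-]+", line):
                risks.append(("unpinned", line))    # no exact 'package==1.2.3' pin
        return risks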

It checks contributor patterns too. If 90% of commits come from one person who hasn’t pushed in months, that’s flagged. It skims for obvious code red flags: eval calls, minified blobs, sketchy install scripts—because sometimes the problem is hiding in plain sight.
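
The solo-maintainer check boils down to something like this sketch, assuming the commits have already been fetched (the field names and thresholds here are illustrative):

    from collections import Counter
    from datetime import datetime, timedelta, timezone

    def solo_and_stale(commits, share=0.9, stale_days=180):
        """commits: list of dicts like {'author': str, 'date': aware datetime}."""
        if not commits:
            return False
        by_author = Counter(c["author"] for c in commits)
        top_author, top_count = by_author.most_common(1)[0]
        last_push = max(c["date"] for c in commits if c["author"] == top_author)
        gone_quiet = datetime.now(timezone.utc) - last_push > timedelta(days=stale_days)
        return top_count / len(commits) >= share and gone_quiet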

All of this feeds into a weighted scoring model. The final Trust Score (0-100) reflects repo health at a glance, with direct penalties for fake-star behaviour, so a pretty README badge can't hide inorganic hype.
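
In spirit the combination step is just a weighted sum with an explicit fake-star penalty; the component names and weights below are made up for illustration:

    def trust_score(subscores, fake_star_index):
        """subscores: dict of 0-1 component scores (higher is better);
        fake_star_index: 0 (organic) to 1 (fully synthetic)."""
        weights = {"dependencies": 0.3, "maintenance": 0.3, "licence": 0.2, "code": 0.2}
        base = sum(w * subscores.get(name, 0.5) for name, w in weights.items())
        penalty = 0.5 * fake_star_index  # direct hit for inorganic star activity
        return round(max(0.0, base - penalty) * 100)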

Just for fun, I also made it generate a cool little badge for the trust score lol.

Under the hood, it's all heuristics and a lot of GitHub API paging. Run it on any public repo with:

    python starguard.py owner/repo --format markdown

It works without a token, but you'll hit rate limits sooner.

Please provide any feedback you can.

the__alchemist ◴[] No.43964589[source]
> It checks contributor patterns too. If 90% of commits come from one person who hasn’t pushed in months, that’s flagged.

IMO this is a slight green flag; not red.

replies(5): >>43964616 #>>43964685 #>>43964713 #>>43970992 #>>43971728 #
1. artski ◴[] No.43964713[source]
Fair take—it's definitely context-dependent. In some cases, solo-maintainer projects can be great, especially if they’re stable or purpose-built. But from a trust and maintenance standpoint, it’s worth flagging as a signal: if 90% of commits are from one person who’s now inactive, it could mean slow responses to bugs or no updates for security issues. Doesn’t mean the project is bad—just something to consider alongside other factors.

Heuristics are never perfect and it's all iterative, but it's about understanding the underlying assumptions and interpreting what you get in your own context. I could probably enhance it a bit by running things through an LLM with a prompt, but I prefer to keep it purely statistical for now.

replies(3): >>43964778 #>>43964815 #>>43965473 #
2. delfinom ◴[] No.43964778[source]
The problem is your audience is:

> CTOs, security teams, and VCs automate open-source due diligence in seconds.

The people who probably have fewer brain cells than the average programmer to understand the nuance in the flagging.

replies(1): >>43964890 #
3. 85392_school ◴[] No.43964815[source]
It could also mean that the project is stable. Since you only look at the one repository's commit activity, a stable project with a maintainer who's still active on GitHub in other places would be "less trustworthy" than a project that's a work in progress.
replies(3): >>43965467 #>>43966659 #>>43966930 #
4. artski ◴[] No.43964890[source]
Lol yeah tbh - I just made it without really thinking of an audience; I was just looking for a project to work on till I saw the paper and figured it would be cool to try it out on some repositories out there. That part is just me asking GPT to make the README better.
5. ◴[] No.43965467[source]
6. mlhpdx ◴[] No.43965473[source]
The signal here is how many unpatched vulnerabilities there are, maybe multiplied by how long they’ve been out there. Purely statistical. And an actual signal.
7. artski ◴[] No.43966659[source]
Not a bad idea tbh, an additional signal for how long issues are left open would be good. Though yeah, that's why I was contemplating not highlighting the actual number and instead showing a range, e.g. 80-100 is good, 50-70 moderate, and so on.
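
Roughly what I have in mind for the ranges is just a coarse bucketing like this (the exact bands are still up in the air):

    def score_band(score):
        """Map a 0-100 trust score to a coarse label."""
        if score >= 80:
            return "good"
        if score >= 50:
            return "moderate"
        return "needs review"
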
replies(1): >>43968225 #
8. kstrauser ◴[] No.43966930[source]
I agree. I have a popular-ish project on GitHub that I haven't touched in like a decade. I would if needed, but it's basically "done". It works. It does everything it needs to, and no one's reported a bug in many, many years.

You could etch that thing into granite as far as I can tell. The only thing left to do is rewrite it in Rust.

9. InvisGhost ◴[] No.43968225{3}[source]
Be careful with this. Each project has different practices which could lead to false positives and false negatives. You may also create the wrong incentives, depending on how you measure and report things.
replies(1): >>43992205 #
10. mary-ext ◴[] No.43992205{4}[source]
it seems worthwhile to only mention it as a sidenote rather than a negative score