
121 points | artski

When I came across a study that traced 4.5 million fake GitHub stars, it confirmed a suspicion I’d had for a while: stars are noisy. The issue is they’re visible, they’re persuasive, and they still shape hiring decisions, VC term sheets, and dependency choices—but they say very little about actual quality.

I wrote StarGuard to put that number in perspective, using my own methodology inspired by what that study did, and to fold a broader supply-chain check into one command-line run.

It starts with the simplest raw input: every starred_at timestamp GitHub will give you. It applies a median-absolute-deviation test to locate sudden bursts. For each spike, StarGuard pulls a random sample of the accounts behind it and asks: how old is the user? Any followers? Any contribution history? Still using the default avatar? From that, it computes a Fake Star Index between 0 (organic) and 1 (fully synthetic).
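
To make the burst detection concrete, here's a stripped-down sketch of the idea. The threshold, the profile field names, and the way the index is averaged are illustrative stand-ins, not StarGuard's exact code:

    from collections import Counter
    from datetime import datetime
    from statistics import median

    def find_star_bursts(starred_at, threshold=3.5):
        """Flag days whose star count is a MAD outlier vs. the typical day."""
        days = Counter(datetime.fromisoformat(ts.replace("Z", "+00:00")).date()
                       for ts in starred_at)
        counts = list(days.values())
        med = median(counts)
        mad = median(abs(c - med) for c in counts) or 1  # guard against MAD == 0
        # modified z-score: 0.6745 * (x - median) / MAD
        return [day for day, c in days.items()
                if 0.6745 * (c - med) / mad > threshold]

    def fake_star_index(sampled_accounts):
        """Fraction of suspicious signals across a sample of stargazer profiles."""
        hits = total = 0
        for acct in sampled_accounts:  # dicts with a few profile fields
            checks = [
                acct.get("account_age_days", 0) < 30,
                acct.get("followers", 0) == 0,
                acct.get("contributions_last_year", 0) == 0,
                acct.get("has_default_avatar", False),
            ]
            hits += sum(checks)
            total += len(checks)
        return hits / total if total else 0.0

The modified z-score (0.6745 * deviation / MAD) is the standard way to turn MAD into a robust outlier test, which is why a couple of 5,000-star days stand out even when the median day brings in ten.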

But inflated stars are just one issue. In parallel, StarGuard parses dependency manifests or SBOMs and flags common risk signs: unpinned versions, direct Git URLs, lookalike package names. It also scans licences—AGPL sneaking into a repo claiming MIT, or other inconsistencies that can turn into compliance headaches.
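
For a rough feel of the manifest checks, here's a toy version for a requirements.txt. The popularity list and the one-character "lookalike" rule are simplified examples rather than the actual rule set:

    import re

    POPULAR = {"requests", "numpy", "pandas"}  # stand-in for a real popularity list

    def manifest_risks(lines):
        risks = []
        for raw in lines:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("git+") or "://" in line:
                risks.append((line, "direct URL dependency"))
                continue
            name = re.split(r"[=<>!~\[; ]", line, maxsplit=1)[0].lower()
            if "==" not in line:
                risks.append((line, "version not pinned"))
            for known in POPULAR:  # very rough lookalike check
                if name != known and len(name) == len(known) and \
                   sum(a != b for a, b in zip(name, known)) == 1:
                    risks.append((line, f"possible typosquat of '{known}'"))
        return risks

    print(manifest_risks(["requests==2.31.0", "nunpy==1.26.0",
                          "git+https://github.com/example/pkg.git"]))

On those three sample lines it flags nunpy as a possible typosquat of numpy and the git+ line as a direct URL dependency.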

It checks contributor patterns too. If 90% of commits come from one person who hasn’t pushed in months, that’s flagged. It skims for obvious code red flags: eval calls, minified blobs, sketchy install scripts—because sometimes the problem is hiding in plain sight.
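
The contributor check boils down to something like the sketch below. The 90% share matches what I described; the 180-day staleness window and the red-flag regex are placeholders, not the real thresholds:

    import re
    from datetime import datetime, timedelta, timezone

    # crude grep for eval/exec calls and curl-pipe-to-shell installs
    RED_FLAGS = re.compile(r"\beval\s*\(|\bexec\s*\(|curl[^\n]*\|\s*(ba)?sh")

    def bus_factor_warning(commits, share=0.9, stale_days=180):
        """commits: list of (author_login, committed_at) with aware datetimes."""
        if not commits:
            return "no commits found"
        by_author = {}
        for author, when in commits:
            by_author.setdefault(author, []).append(when)
        top_author, top_commits = max(by_author.items(), key=lambda kv: len(kv[1]))
        top_share = len(top_commits) / len(commits)
        if top_share >= share and \
           datetime.now(timezone.utc) - max(top_commits) > timedelta(days=stale_days):
            return f"{top_author} wrote {top_share:.0%} of commits and looks inactive"
        return None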

All of this feeds into a weighted scoring model. The final Trust Score (0–100) reflects repo health at a glance, with direct penalties for fake-star behaviour, so a pretty README badge can’t hide inorganic hype.
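
Schematically the scoring looks like this; the specific weights and the size of the fake-star penalty here are placeholders. The point is only that the penalty hits the final number directly instead of being buried in an average:

    def trust_score(star_score, dep_score, license_score, maintainer_score,
                    code_score, fake_star_index):
        """Each sub-score is assumed to be normalised to 0..1 (1 = healthy)."""
        weights = {"stars": 0.25, "deps": 0.25, "license": 0.15,
                   "maintainers": 0.20, "code": 0.15}
        base = (weights["stars"] * star_score
                + weights["deps"] * dep_score
                + weights["license"] * license_score
                + weights["maintainers"] * maintainer_score
                + weights["code"] * code_score)
        penalty = 0.5 * fake_star_index  # direct hit for inorganic star activity
        return round(max(0.0, base - penalty) * 100)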

For the fun of it, I also made it generate a cool little badge for the trust score lol.

Under the hood it's all heuristics and a lot of GitHub API paging. Run it on any public repo with:

    python starguard.py owner/repo --format markdown

It works without a token, but you'll hit rate limits sooner.
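
For anyone curious what the "GitHub API paging" part looks like, the star-timestamp collection is roughly this (a simplified sketch using the requests library, not StarGuard's actual client; the vnd.github.star+json media type is what makes GitHub include starred_at in the stargazer listing):

    import requests

    def fetch_starred_at(owner, repo, token=None):
        headers = {"Accept": "application/vnd.github.star+json"}
        if token:
            headers["Authorization"] = f"Bearer {token}"
        timestamps, page = [], 1
        while True:
            resp = requests.get(
                f"https://api.github.com/repos/{owner}/{repo}/stargazers",
                headers=headers,
                params={"per_page": 100, "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break
            timestamps.extend(item["starred_at"] for item in batch)
            page += 1
        return timestamps

At 100 stargazers per request, a big repo means hundreds of calls, which is why the unauthenticated limit of 60 requests per hour bites quickly.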

Please provide any feedback you can.

zxilly:
Frankly, I think this program is AI-generated.

1. There are hallucinatory descriptions in the README (make test) and also in the code, such as the rate limit set at line 158, which is the wrong number.

2. All commits were made through the GitHub web UI; checking the signatures confirms this.

3. Overly verbose function names and a 2,000-line Python file.

I don't have a complaint about AI, but the code quality clearly needs improvement: the license check only lists a few common examples, the detection thresholds seem to be set randomly, and the entire _get_stargazers_graphql function is commented out and performs no action, with a note saying "Currently bypassed by get_stargazers". Did you generate the code without even reading through it?

Bad code like this gets over 100 stars; it seems like you're doing satirical fake-star performance art.

artski:
Well, I initially planned to use GraphQL and started to implement it, but I switched to REST for now since the GraphQL path still isn't complete, it isn't required yet, and I wanted to keep things simpler while I iterate. I'll bring GraphQL back once I've got key cycling in place and things are more stable. As for the rate limit, I've been tweaking the numbers manually to avoid hitting it constantly, which I did to an extent; that's actually why I want to add key rotation... and I am allowed to leave comments for myself in a work in progress, no? Or does everything have to be perfect from day one?

You would assume that if it were purely AI-generated it would have the correct rate limit in both the comments and the code... but honestly I don't care, and yeah, I ran the README through GPT to 'prettify' it. Arrest me.

bavell:
You probably should have put "v0.1" or "alpha/beta" in the post title or description - currently it reads like it's already been polished up IMO.