72 points by henryl | 8 comments

Hey HN,

I’m Henry, cofounder and CTO at Span (https://span.app/). Today we’re launching AI Code Detector, an AI code detection tool you can try in your browser.

The explosion of AI-generated code has created some weird problems for engineering orgs. Tools like Cursor and Copilot are used by virtually every org on the planet, but each codegen tool has its own idiosyncratic way of reporting usage, and some don't report usage at all.

Our view is that token spend will start competing with payroll spend as AI becomes more deeply ingrained in how we build software. Understanding how to drive proficiency, improve ROI, and allocate resources around AI tooling will become at least as important as the parallel processes on the talent side.

Getting true visibility into AI-generated code is incredibly difficult. And yet it’s the number one thing customers ask us for.

So we built a new approach from the ground up.

Our AI Code Detector is powered by span-detect-1, a state-of-the-art model trained on millions of AI- and human-written code samples. It detects AI-generated code with 95% accuracy, and ties it to specific lines shipped into production. Within the Span platform, it’ll give teams a clear view into AI’s real impact on velocity, quality, and ROI.
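
To make the line-level idea a bit more concrete, here's a deliberately simplified sketch (not our production pipeline, and the names are made up) of how per-line scores from a classifier can roll up into a file-level AI share, and what an accuracy figure like the 95% above is actually measuring:

    # Simplified illustration only; span-detect-1's real interface looks nothing
    # like this. It just shows how per-line probabilities become a file-level
    # AI share, and how accuracy is measured against a labeled evaluation set.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LineScore:
        line_no: int
        p_ai: float  # classifier's probability that this line is AI-generated

    def file_ai_share(scores: List[LineScore], threshold: float = 0.5) -> float:
        """Fraction of lines flagged as AI-generated at a given threshold."""
        if not scores:
            return 0.0
        return sum(s.p_ai >= threshold for s in scores) / len(scores)

    def accuracy(predictions: List[bool], labels: List[bool]) -> float:
        """Share of held-out labeled samples the detector classifies correctly."""
        return sum(p == l for p, l in zip(predictions, labels)) / len(labels)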

It does have some limitations. Most notably, it only works for TypeScript and Python code. We are adding support for more languages: Java, Ruby, and C# are next. Its accuracy is around 95% today, and we’re working on improving that, too.

If you’d like to take it for a spin, you can run a code snippet here (https://code-detector.ai/) and get results in about five seconds. We also have a more narrative-driven microsite (https://www.span.app/detector) that my marketing team says I have to share.

Would love your thoughts, both on the tool itself and your own experiences. I’ll be hanging out in the comments to answer questions, too.

1. mendeza ◴[] No.45266824[source]
I feel like code fed into this detector can be manipulated to fool it. The model probably learns patterns that are common in generated code (clean comments, always correctly formatted, never makes mistakes), but if you have an AI change its output to look like the code you actually write (mistakes, not every function commented), it can blur the line. I think this will be a great tool for getting 90% of the way there; the challenge is the corner cases.
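
To make concrete the kind of surface signal I'm imagining (pure guesswork on my part; I have no idea what features span-detect-1 actually uses), a toy heuristic might look at comment density and formatting regularity:

    # Toy illustration of surface-level "looks AI-generated" signals.
    # These features are my guesses, not anything Span has said they use.
    import re

    def surface_features(source: str) -> dict:
        lines = [l for l in source.splitlines() if l.strip()]
        comments = [l for l in lines if l.lstrip().startswith("#")]
        docstrings = source.count('"""') // 2
        funcs = len(re.findall(r"^\s*def ", source, flags=re.M))
        return {
            "comment_density": len(comments) / max(1, len(lines)),
            "docstrings_per_function": docstrings / max(1, funcs),
            "has_type_hints": bool(re.search(r"->\s*\w+", source)),
            "bare_excepts": len(re.findall(r"except\s*:", source)),
        }

A prompt like "write it the way a messy undergrad would" pushes every one of these toward the human side, which is exactly the blurring I'm worried about.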
replies(2): >>45266884 #>>45267036 #
2. mendeza ◴[] No.45266884[source]
I tested this idea. Using ChatGPT 5, I gave it this prompt:

`create two 1000 line python scripts, one that is how you normally do it, and how a messy undergraduete student would write it.`

The messy script was scored as a 0% chance of being AI-written, while the clean script was flagged as AI-generated with 100% confidence. I've shortened both scripts here for brevity; happy to share the full versions.

Here is the chatgpt convo: https://chatgpt.com/share/68c9bc0c-8e10-8011-bab2-78de5b2ed6...

clean script:

    #!/usr/bin/env python3
    """
    A clean, well-structured example Python script.

    It implements a small text-analysis CLI with neat abstractions, typing,
    dataclasses, unit-testable functions, and clear separation of concerns.
    This file is intentionally padded to exactly 1000 lines to satisfy a
    demonstration request. The padding is made of documented helper stubs.
    """
    from __future__ import annotations

    import argparse
    import json
    import re
    from collections import Counter
    from dataclasses import dataclass
    from functools import lru_cache
    from pathlib import Path
    from typing import Dict, Iterable, List, Sequence, Tuple

    __version__ = "1.0.0"

    @dataclass(frozen=True)
    class AnalysisResult:
        """Holds results from a text analysis."""
        token_counts: Dict[str, int]
        total_tokens: int

        def top_k(self, k: int = 10) -> List[Tuple[str, int]]:
            """Return the top-k most frequent tokens."""
            return sorted(self.token_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

    def _read_text(path: Path) -> str:
        """Read UTF-8 text from a file."""
        data = path.read_text(encoding="utf-8", errors="replace")
        return data

    @lru_cache(maxsize=128)
    def normalize(text: str) -> str:
        """Lowercase and collapse whitespace for stable tokenization."""
        text = text.lower()
        text = re.sub(r"\s+", " ", text).strip()
        return text

    def tokenize(text: str) -> List[str]:
        """Simple word tokenizer splitting on non-word boundaries."""
        return [t for t in re.split(r"\W+", normalize(text)) if t]

    def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
        """Compute n-grams as tuples from a token sequence."""
        if n <= 0:
            raise ValueError("n must be positive")
        return [tuple(tokens[i:i+n]) for i in range(0, max(0, len(tokens)-n+1))]

    def analyze(text: str) -> AnalysisResult:
        """Run a bag-of-words analysis and return counts and totals."""
        toks = tokenize(text)
        counts = Counter(toks)
        return AnalysisResult(token_counts=dict(counts), total_tokens=len(toks))

    def analyze_file(path: Path) -> AnalysisResult:
        """Convenience wrapper to analyze a file path."""
        return analyze(_read_text(path))

    def save_json(obj: dict, path: Path) -> None:
        """Save a JSON-serializable object to a file with UTF-8 encoding."""
        path.write_text(json.dumps(obj, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")


Messy Script:

    # ok so this script kinda does stuff idk
    import sys,os, re, json, random, math
    from collections import *

    VER="lol"
    g = {}
    data = []
    TMP=None

    def readz(p):
        try:
            return open(p,"r",encoding="utf-8",errors="ignore").read()
        except:
            return ""

    def norm(x):
        x=x.lower().replace("\n"," ").replace("\t"," ")
        x=re.sub(" +"," ",x)
        return x.strip()

    def tokn(x):
        x=norm(x)
        return re.split("\W+",x)

    def ana(s):
        c = Counter()
        for t in tokn(s):
            if t: c[t]+=1
        return {"counts":dict(c),"total":sum(c.values())}

    def showTop(d,k=10):
        try:
            it=list(d["counts"].items())
            it.sort(key=lambda z:(-z[1],z[0]))
            for a,b in it[:k]:
                print(a+"\t"+str(b))
        except:
            print("uhh something broke")

    def main():
        # not really parsing args lol
        if len(sys.argv)<2:
            print("give me a path pls")
            return 2
        p=sys.argv[1]
        t=readz(p)
        r=ana(t)
        showTop(r,10)
        if "--out" in sys.argv:
            try:
                i=sys.argv.index("--out"); o=sys.argv[i+1]
            except:
                o="out.json"
            with open(o,"w",encoding="utf-8") as f:
                f.write(json.dumps(r))
        return 0

    if __name__=="__main__":
        # lol
        main()

    def f1(x=None,y=0,z="no"):
        # todo maybe this should do something??
        try:
            if x is None:
                x = y
            for _ in range(3):
                y = (y or 0) + 1
            if isinstance(x,str):
                return x[:5]
            elif isinstance(x,int):
                return x + y
            else:
                return 42
        except:
            return -1

    def f2(x=None,y=0,z="no"):
        # todo maybe this should do something??
        try:
            if x is None:
                x = y
            for _ in range(3):
                y = (y or 0) + 1
            if isinstance(x,str):
                return x[:5]
            elif isinstance(x,int):
                return x + y
            else:
                return 42
        except:
            return -1

    def f3(x=None,y=0,z="no"):
        # todo maybe this should do something??
        try:
            if x is None:
                x = y
            for _ in range(3):
                y = (y or 0) + 1
            if isinstance(x,str):
                return x[:5]
            elif isinstance(x,int):
                return x + y
            else:
                return 42
        except:
            return -1
replies(2): >>45267144 #>>45267277 #
3. bbsbb ◴[] No.45267036[source]
This is a spot-on observation. The most challenging code to detect so far appears to be code produced with tooling that is slightly ahead of the overall curve in adoption and practices. I'm not convinced that code is undetectable holistically, but there isn't enough similarity, or an easily reproducible dataset, for me to call the task easy. We don't know what future models will bring, but given the huge current investment across companies in quality code output, it's possible the output still converges to something detectable.
4. johnsillings ◴[] No.45267144[source]
That's a great question + something we've discussed internally a bit. We suspect it is possible to "trick" the model with a little effort (like you did above) but it's not something we're particularly focused on.

The primary use-case for this model is helping engineering teams understand the impact of AI-generated code shipped to production in their codebases.

replies(2): >>45267344 #>>45267793 #
5. nomel ◴[] No.45267277[source]
On HN, indent code blocks four spaces, with a blank line between the code and the text above.

    Like
    This
replies(1): >>45267373 #
6. mendeza ◴[] No.45267344{3}[source]
I agree this would be a great tool for organizations to see the impact of AI code in their codebases. Engineers will probably be too lazy to modify the code enough to make it look less AI-generated. You could probably enhance the robustness of your classifier with synthetic data like this.
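
A rough sketch of what I mean, using the OpenAI Python client (the model name and prompt wording are just placeholders): generate both a clean and a deliberately messy version of the same task, and label both as AI-written so the classifier can't lean on style alone.

    # Rough sketch: build adversarial training pairs where the "messy" variant
    # is still AI-written. Model name and prompts are placeholders.
    from openai import OpenAI

    client = OpenAI()

    STYLES = {
        "clean": "Write it the way you normally would.",
        "messy": ("Rewrite it the way a rushed undergraduate would: terse names, "
                  "bare excepts, inconsistent spacing, few comments."),
    }

    def synthetic_pair(task: str) -> dict:
        """Return {'clean': ..., 'messy': ...}; both samples get the label 'ai'."""
        out = {}
        for style, instruction in STYLES.items():
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[{"role": "user", "content": f"{task}\n\n{instruction}"}],
            )
            out[style] = {"code": resp.choices[0].message.content, "label": "ai"}
        return out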

I think it would be an interesting research project to detect whether someone is manipulating AI-generated code to look messier. Sadasivan et al. (https://arxiv.org/pdf/2303.11156) proved that detector performance is bounded by the total variation distance between the two distributions: if the distributions are truly identical, the best you can do is random guessing. The trend with LLMs (via scaling laws) is in that direction, so the open question is whether, as models improve, their code becomes indistinguishable from human code.
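
To put a number on that: as I read the paper, the best achievable AUROC for any detector is bounded by 1/2 + TV - TV^2/2, so as the AI and human code distributions converge (TV approaching 0) the bound collapses to coin-flipping.

    # Sadasivan et al.'s bound, as I read it: for any detector,
    # AUROC <= 1/2 + TV - TV^2 / 2, where TV is the total variation
    # distance between the AI and human code distributions.
    def auroc_upper_bound(tv: float) -> float:
        return 0.5 + tv - tv * tv / 2

    for tv in (1.0, 0.5, 0.2, 0.05, 0.0):
        print(f"TV = {tv:.2f} -> best possible AUROC <= {auroc_upper_bound(tv):.3f}")
    # At TV = 0 (identical distributions) the bound is 0.5: random guessing.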

Be fun to collaborate!

7. mendeza ◴[] No.45267373{3}[source]
I appreciate the feedback! I just updated the comment to use 4-space indentation.
8. runako ◴[] No.45267793{3}[source]
The primary point of distinction that allows AI generation to be inferred appears to be that the code is clean and well-structured. (Leave aside for a moment the oddity that these are machines whose primary benchmarks are human-written code, in a style now deemed too perfect to have been written by people.)

Does that provide an incentive for people writing manually to write worse code, structured badly, as proof that they didn't use AI to generate their code?

Is there now a disincentive for writing good code with good comments?