Show HN: Regolith – Regex library that prevents ReDoS CVEs in TypeScript

(github.com)

I wanted a safer alternative to RegExp for TypeScript that uses a linear-time engine, so I built Regolith.

Why: Many CVEs happen because TypeScript libraries are vulnerable to Regular Expression Denial of Service attacks. I learned about this problem while doing undergraduate research and found that languages like Rust have built-in protection but languages like JavaScript, TypeScript, and Python do not. This library attempts to mitigate these vulnerabilities for TypeScript and JavaScript.

How: Regolith uses Rust's Regex library under the hood to prevent ReDoS attacks. The Rust Regex library implements a linear-time Regex engine that guarantees linear complexity for execution. A ReDoS attack occurs when a malicious input is provided that causes a normal Regex engine to check for a matching string in too many overlapping configurations. This causes the engine to take an extremely long time to compute the Regex, which could cause latency or downtime for a service. By designing the engine to take at most a linear amount of time, we can prevent these attacks at the library level and have software inherit these safety properties.
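To make the failure mode concrete, here is a minimal sketch (not taken from the Regolith repo) that times the built-in RegExp on a classic nested-quantifier pattern. The pattern `^(a+)+$`, the helper name `timeTest`, and the input sizes are illustrative assumptions. Absolute timings depend on the engine and the machine, but on a backtracking engine the time tends to grow roughly exponentially with the number of `a` characters.

```typescript
// Times a single RegExp test. Date.now() is coarse but available everywhere.
function timeTest(re: RegExp, input: string): { matched: boolean; ms: number } {
  const start = Date.now();
  const matched = re.test(input);
  return { matched, ms: Date.now() - start };
}

// Nested quantifiers: "aaaa" can be split between the inner and outer "+"
// in exponentially many ways, and a backtracking engine tries them all
// before giving up on the trailing "!".
const nested = /^(a+)+$/;

for (const n of [10, 14, 18, 22]) {
  const { matched, ms } = timeTest(nested, "a".repeat(n) + "!");
  console.log(`n=${n} matched=${matched} took ~${ms}ms`);
}
```

The input sizes are kept small on purpose; adding a handful more characters is usually enough to make a backtracking engine appear to hang.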

I'm really fascinated by making programming languages safer and I would love to hear any feedback on how to improve this project. I'll try to answer all questions posted in the comments.

Thanks! - Jake Roggenbuck

1. spankalee ◴[27 Aug 25 03:20 UTC] No.45035123[source]▶

>>45034957 (OP) #

It's very, very weird to speak of TypeScript and JavaScript as two separate languages here.

There is no TypeScript RegExp, there is only the JavaScript RegExp as implemented in various VMs. There is no TypeScript VM, only JavaScript VMs. And there are no TypeScript CVEs unless it's against the TypeScript compiler, language server, etc.

replies(3): >>45035278 #>>45035730 #>>45036911 #

2. xyzzy123 ◴[27 Aug 25 03:34 UTC] No.45035193[source]▶

>>45034957 (OP) #

It's great to have safe options - and it would have been great if the default had been safe.

I think many people are annoyed with ReDoS as a bug class. It seems like mostly noise in the CVE trackers, library churn and badge collecting for "researchers". It'd be less of a problem if people stuck to filing CVEs against libraries that might remotely see untrusted input rather than scrambling to collect pointless "scalps" from every tool under the sun that accepts a configuration regex - build tools, very commonly :(

Perhaps you can stop this madness... :)

replies(2): >>45035268 #>>45035281 #

3. semiquaver ◴[27 Aug 25 03:35 UTC] No.45035198[source]▶

>>45034957 (OP) #

  > Regolith attempts to be a drop-in replacement for RegExp and requires minimal (to no) changes to be used instead

  > Since Regolith uses Rust bindings to implement the Rust Regex library to achieve linear time worst case, this means that backreferences and look-around aren't available in Regolith either.

Obviously it cannot be a drop-in replacement if the regex dialect differs. That it has a compatible API is not the only relevant factor. I’d recommend removing the top part from the readme.

Another thought: since backreferences and lookaround are the features in JS regexes which _cause_ ReDOS, why not just wrap vanilla JS regex, rejecting patterns including them? Wouldn’t that achieve the same result in a simpler way?

replies(4): >>45035253 #>>45035264 #>>45035460 #>>45035828 #

4. bawolff ◴[27 Aug 25 03:47 UTC] No.45035253[source]▶

>>45035198 #

> Another thought: since backreferences and lookaround are the features in JS regexes which _cause_ ReDOS,

This is incorrect. Other features can cause ReDOS.

The other problematic features have linear-time algorithms that could be used, but generally are not (I assume for better average-case performance).

replies(2): >>45035309 #>>45035608 #

5. roggenbuck ◴[27 Aug 25 03:48 UTC] No.45035264[source]▶

>>45035198 #

Thanks for the feedback! Yea, you're totally right. I'll update the docs to reflect this.

> why not just wrap vanilla JS regex, rejecting patterns including them?

Yea! I was thinking about this too actually. And this would solve the problem of being server side only. I'm thinking about making a new version to do just this.

For a pattern-rejecting wrapper, how would you want it to communicate that an unsafe pattern has been created?

replies(2): >>45036058 #>>45037162 #

6. bawolff ◴[27 Aug 25 03:49 UTC] No.45035268[source]▶

>>45035193 #

Even in cases where malicious input could be hit, this bug class is stupid on the client side where the attacker can only attack themselves.

replies(1): >>45035346 #

7. serial_dev ◴[27 Aug 25 03:50 UTC] No.45035278[source]▶

>>45035123 #

I was also confused at first; I thought it was against the TypeScript compiler, too.

8. roggenbuck ◴[27 Aug 25 03:50 UTC] No.45035281[source]▶

>>45035193 #

> and it would have been great if the default had been safe.

I totally agree here. Safety can and should be from the language itself.

9. roggenbuck ◴[27 Aug 25 03:57 UTC] No.45035309{3}[source]▶

>>45035253 #

Yea, I can expand the description to include other features that may cause issues. Here is an example of how counting can cause latency too: https://www.usenix.org/system/files/sec22fall_turonova.pdf

replies(1): >>45036103 #

10. xyzzy123 ◴[27 Aug 25 04:04 UTC] No.45035346{3}[source]▶

>>45035268 #

Stored... ReDoS, reflected... ReDoS(??)... [it pained me to type those] (╯°□°)╯︵ ┻━┻

11. btown ◴[27 Aug 25 04:28 UTC] No.45035460[source]▶

>>45035198 #

As someone who's been saved by look-aheads in many a situation, I'm quite partial to the approach detailed in [0]: use a regex library that checks for a timeout in its main matching loop.

This lets you have full backwards compatibility in languages like Python and JS/TS that support backreferences/lookarounds, without running any risk of DOS (including by your own handrolled regexes!)

And on modern processors, a suitably implemented check for a timeout would largely be branch-predicted to be a no-op, and would in theory result in no measurable change in performance. Unfortunately, the most optimized and battle-tested implementations seem to have either taken the linear-time NFA approaches, or have technical debt making timeout checks impractical (see comment in [0] on the Python core team's resistance to this) - so we're in a situation where we don't have the best of both worlds. Efforts like [1] are promising, especially if augmented with timeout logic, but early-stage.

[0] https://stackoverflow.com/a/74992735

[1] https://github.com/fancy-regex/fancy-regex

12. thomasmg ◴[27 Aug 25 04:57 UTC] No.45035608{3}[source]▶

>>45035253 #

Right. An example regex that can be slow is CSV parsing [1]:

.*,.*,.*,.*,.* etc.

I believe a timeout is a better (simpler) solution than trying to prevent 'bad' patterns. I use this approach in my own (tiny, ~400 lines) regex library [2]: I use a limit of at most ~100 operations per input byte, so there is no need to measure wall-clock time, which can be inaccurate.

[1]: https://stackoverflow.com/questions/2667015/is-regex-too-slo...

[2]: https://github.com/thomasmueller/bau-lang/blob/main/src/test...

replies(1): >>45044056 #

13. dwoldrich ◴[27 Aug 25 05:11 UTC] No.45035690[source]▶

>>45034957 (OP) #

Perhaps regex is just a bad little language for pattern matching.

I have a foggy recollection of compute times exploding for me on a large regex in .NET code. I used a feature I hadn't seen in JavaScript's RegExp that let me mark off already-matched sections of the regular expression so that the engine would not backtrack into them.

Perhaps the answer isn't removing features for linear regex, but adding more features to make it more expressive and tunable?
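The feature described above sounds like an atomic group (written `(?>...)` in .NET), though that is a guess. JavaScript's RegExp has no direct syntax for it as far as I know, but a commonly used emulation is a capturing lookahead followed by a backreference, `(?=(...))\1`, because the engine does not backtrack into a lookahead once it has succeeded. The patterns below are illustrative assumptions; note the irony that this trick relies on exactly the two features a linear-time engine drops.

```typescript
// Ordinary greedy quantifier: "a+" gives one "a" back so the final "a" can match.
const plain = /^a+a$/;

// Atomic-like version: the lookahead captures the longest run of "a",
// the backreference consumes exactly that run, and the engine cannot
// re-enter the lookahead to try a shorter run.
const atomicLike = /^(?=(a+))\1a$/;

console.log(plain.test("aaa"));      // true
console.log(atomicLike.test("aaa")); // false: nothing is left for the final "a"

// The same trick applied to a nested quantifier. Each outer iteration
// swallows the whole run at once, so there is only one way to split the input.
const nestedAtomicLike = /^(?:(?=(a+))\1)+$/;
console.log(nestedAtomicLike.test("a".repeat(40) + "!")); // false
```

Whether this is more readable than restructuring the pattern is debatable; it mainly shows that "tunable" backtracking can be expressed even without dedicated syntax.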

14. maxloh ◴[27 Aug 25 05:22 UTC] No.45035730[source]▶

>>45035123 #

Deno and Node use V8 under the hood, so the code should essentially run on the same VM regardless.

15. bbor ◴[27 Aug 25 05:44 UTC] No.45035828[source]▶

>>45035198 #

Totally agree -- those are two incredibly useful features of regex[1][2] that are often effectively irreplaceable. I could see this being a straightforward tradeoff for applications that know for sure they don't need complex regexes but still must accept patterns written by the client for some reason(?), but otherwise this seems like a hell of a way to go to replace a `timeout` wrapper.

This paragraph in particular seems very wholesome, but misguided in light of the tradeoff:

  Having a library or project that is immune to these vulnerabilities would save this effort for each project that adopted it, and would save the whole package ecosystem that effort if widely adopted.

Honestly, the biggest shock here for me is that Rust doesn't support these. Sure, Python has yet to integrate the actually-functional `regex`[3] into stdlib to replace the dreadfully under-specced `re`, but Rust is the new kid on the block! I guess people just aren't writing complex regexes anymore...[4]

RE:simpler wrapper, I personally don't see any reason it wouldn't work, and dropping a whole language seems like a big win if it does. I happened to have some scaffolding on hand for the cursed, dark art of metaregexes, so AFAICT, this pattern would work for a blanket ban: https://regexr.com/8gplg Ironically, I don't think there's a way to A) prevent false-positives on triple-backslashes without using lookarounds, or B) highlight the offending groups in full without backrefs!

[1] https://www.regular-expressions.info/backref.html

[2] https://www.regular-expressions.info/lookaround.html

[3] https://github.com/mrabarnett/mrab-regex

[4] We need a regex renaissance IMO, though the feasibility of "just throw a small fine-tuned LLM at it" may delay/obviate that for users that can afford the compute... It's one of the OG AI concepts, back before intuition seemed possible!

replies(1): >>45038925 #

16. DemocracyFTW2 ◴[27 Aug 25 06:24 UTC] No.45036058{3}[source]▶

>>45035264 #

> how would you want it to communicate that an unsafe pattern has been created

Given this is running on a JS engine, an error should be thrown, much as an error is thrown on syntactically invalid regexes in the source. Sadly, this can't happen at module load / compile time unless a build step is implemented, complicating the matter; but on the other hand, a regex that is never used can also not be a problem. The build step could be stupidly simple, such as relying on an otherwise disallowed construction like `safe/[match]*me/`.
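A minimal sketch of what such a throwing wrapper could look like, assuming the check happens when the pattern is constructed at runtime. The names `findUnsafeFeature` and `safeRegExp` are hypothetical and not part of Regolith. As pointed out elsewhere in the thread, rejecting backreferences and lookaround does not by itself rule out slow patterns such as nested quantifiers, so this only illustrates the error-reporting style.

```typescript
// Scans a pattern source and reports the first backreference or lookaround.
// Escapes and character classes are skipped so "\(" or "[(?=]" are not flagged.
function findUnsafeFeature(source: string): string | null {
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source.charAt(i);
    if (ch === "\\") {
      const next = source.charAt(i + 1);
      if (!inClass && next >= "1" && next <= "9") return `backreference \\${next}`;
      if (!inClass && next === "k" && source.charAt(i + 2) === "<") return "named backreference";
      i++; // skip the escaped character
      continue;
    }
    if (inClass) {
      if (ch === "]") inClass = false;
      continue;
    }
    if (ch === "[") {
      inClass = true;
      continue;
    }
    if (ch === "(" && source.charAt(i + 1) === "?") {
      const kind = source.slice(i + 2, i + 4);
      if (kind.charAt(0) === "=" || kind.charAt(0) === "!") return "lookahead";
      if (kind === "<=" || kind === "<!") return "lookbehind";
    }
  }
  return null;
}

// Throws a SyntaxError, mirroring what `new RegExp` does for invalid syntax.
function safeRegExp(source: string, flags?: string): RegExp {
  const feature = findUnsafeFeature(source);
  if (feature !== null) {
    throw new SyntaxError(`Pattern /${source}/ uses ${feature}, which this wrapper rejects`);
  }
  return new RegExp(source, flags);
}

console.log(safeRegExp("(?<year>\\d{4})-\\d{2}").test("2025-08")); // true
console.log(findUnsafeFeature("^(a+)+$")); // null: nested quantifiers are NOT caught
```

Reusing SyntaxError means existing `try`/`catch` blocks around `new RegExp(...)` would keep working, at the cost of not distinguishing "invalid" from "rejected".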

17. DemocracyFTW2 ◴[27 Aug 25 06:29 UTC] No.45036087[source]▶

>>45034957 (OP) #

FWIW there's also https://github.com/slevithan/regex "JS regexes future. A template tag for readable, high-performance, native JS regexes with extended syntax, context-aware interpolation, and always-on best practices". From the docs:

Highlights include support for insignificant whitespace and comments, atomic groups and possessive quantifiers (that can help you avoid ReDoS), subroutines and subroutine definition groups (that enable powerful subpattern composition), and context-aware interpolation of regexes, escaped strings, and partial patterns.

18. thomasmg ◴[27 Aug 25 06:31 UTC] No.45036103{4}[source]▶

>>45035309 #

A static analysis of the regular expression has the advantage that many problematic cases can be caught at compile time. Not all: the expression is sometimes generated at runtime. There's also a risk that too many cases might be rejected.

Did you consider a hybrid approach, where static analysis is used to get compiler warnings / errors, combined with limiting the number of operations at runtime? An API change might be needed, so instead of just "matches(regex)" a new method might be needed with a limit "matches(regex, opCountLimit)" and a different return type (true / false / timeout).
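The runtime half of that idea can be sketched with a toy matcher. This is a minimal illustration, not any existing library's API: the function name `matches`, the `Outcome` type, and the tiny supported syntax (literals, `.`, `*`, `^`, `$`) are all assumptions. Every matching step is charged against `opCountLimit`, and running out of budget is reported as a third outcome instead of being folded into "no match".

```typescript
type Outcome = "match" | "no-match" | "limit-exceeded";

// Sentinel thrown when the budget runs out; recognised by identity below.
const LIMIT_HIT = new Error("operation limit exceeded");

// Tiny backtracking matcher: literals, ".", "*", and the anchors "^" / "$".
function matches(pattern: string, text: string, opCountLimit: number): Outcome {
  let ops = 0;

  function matchHere(p: number, t: number): boolean {
    if (++ops > opCountLimit) throw LIMIT_HIT;
    if (p === pattern.length) return true;
    const c = pattern.charAt(p);
    if (pattern.charAt(p + 1) === "*") return matchStar(c, p + 2, t);
    if (c === "$" && p + 1 === pattern.length) return t === text.length;
    if (t < text.length && (c === "." || c === text.charAt(t))) return matchHere(p + 1, t + 1);
    return false;
  }

  // Tries the rest of the pattern after zero, one, two, ... repetitions of c.
  function matchStar(c: string, p: number, t: number): boolean {
    for (;;) {
      if (matchHere(p, t)) return true;
      if (t < text.length && (c === "." || c === text.charAt(t))) t++;
      else return false;
    }
  }

  try {
    if (pattern.charAt(0) === "^") return matchHere(1, 0) ? "match" : "no-match";
    for (let start = 0; start <= text.length; start++) {
      if (matchHere(0, start)) return "match";
    }
    return "no-match";
  } catch (e) {
    if (e === LIMIT_HIT) return "limit-exceeded";
    throw e;
  }
}

// A CSV-like pattern in the spirit of the one quoted earlier in the thread,
// with a budget of 100 operations per input character.
const hostile = ",".repeat(30);
console.log(matches("^.*,.*$", "a,b", 100 * 3));                          // match
console.log(matches("^.*,.*,.*,.*,.*!$", hostile, 100 * hostile.length)); // limit-exceeded
```

Counting operations rather than measuring elapsed time keeps the result deterministic: the same pattern and input give the same outcome on a slow or a fast machine.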

19. DemocracyFTW2 ◴[27 Aug 25 08:32 UTC] No.45036911[source]▶

>>45035123 #

This bothered me too; I think it should be re-framed as a difference between JS runtimes (NodeJS vs Deno), because that's what is happening. TS is just a fancy way to write JavaScript after all, and a lot of TS source code gets literally erased in order to obtain runnable JS. Deno has been able to execute pure JavaScript from day one, while NodeJS can now also execute a subset of TS without a visible translation step, all of which makes framing NodeJS vs Deno as "JavaScript vs TypeScript" even weirder.

20. ◴[27 Aug 25 08:47 UTC] No.45037000[source]▶

>>45034957 (OP) #

21. DemocracyFTW2 ◴[27 Aug 25 08:51 UTC] No.45037026[source]▶

>>45034957 (OP) #

I have another nitpick and I hope it's a constructive one.

You provide some performance figures; unfortunately they are caught in an image, no doubt to enable color-coding the results. IMHO that's not ideal, tables should be pure text, even if only for accessibility with screen readers. There are other means to provide guiding highlights, like red and green Unicode code points. GitHub is somewhat unique in its strict policy to remove almost any kind of user-side styling from the READMEs, but providing a "photo snapshot" of parts of the README just to get some colors does not feel like the right solution.

Next thing are the actual figures you provide: those range from 11.822µs (best) to 56.534s (worst). They are displayed as

    11.822µs
    56.534s

making them look almost like the worst performer took around five times as long as the best performer—until you realize there's a mu in there.

I must say that personally I remove this so-called "human-readable" format almost wherever I can because I find it not human-readable at all. To me a good numerical display should try and keep the decimal points on top of each other, avoid too many non-significant digits, use digit grouping, and, crucially, use a single unit throughout. With those constraints, the two figures become

            11.8µs
    56,534,000.0µs

which incidentally obviates much of the need to color code anything. One could discuss what unit—ns, µs, ms, s—is the most appropriate in the given context but, generally, I feel that big numbers should stand out as having many digits.
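As a rough sketch of those constraints (single unit, digit grouping, one decimal, aligned decimal points), something like the following would produce the two lines above. The function name `formatMicros` and the choice of microseconds with one fraction digit are assumptions made for illustration.

```typescript
// Formats durations given in microseconds with digit grouping and one
// decimal, then right-aligns them so the decimal points line up.
function formatMicros(valuesInMicros: number[]): string[] {
  const texts = valuesInMicros.map((v) =>
    v.toLocaleString("en-US", { minimumFractionDigits: 1, maximumFractionDigits: 1 })
  );
  const width = Math.max(...texts.map((t) => t.length));
  return texts.map((t) => t.padStart(width) + "µs");
}

// 11.822µs and 56.534s expressed in the same unit.
for (const line of formatMicros([11.822, 56534000])) {
  console.log(line);
}
```

Because every value has the same number of fraction digits, right-aligning the strings is enough to stack the decimal points.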

Nobody will pick this up because it's much too elaborate and idiosyncratic for this conformist world, but I just love the 'Japanese' way of formatting where you do digit grouping with the SI prefixes, so one hundred and twenty-five meters is 125m, but one thousand one hundred and twenty-five meters doesn't become 1,125m, nor is it 1.125km, but rather 1k125m (preferably with a thin space as in 1k_125m; imagine a thin non-breakable space there that HN wouldn't let me render).

1G 255M 368k 799B, what's not to like?
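For what it's worth, this grouping is only a few lines of code. A minimal sketch, assuming non-negative integers, decimal (1000-based) prefixes up to T, and zero-padded inner groups (so 1005 m becomes "1k 005m"); the name `siGrouped` is made up for illustration.

```typescript
const SI_PREFIXES = ["", "k", "M", "G", "T"];

// Splits a non-negative integer into groups of three digits and labels each
// group with its SI prefix; the lowest group carries the unit instead.
function siGrouped(value: number, unit: string, separator = " "): string {
  if (!Number.isSafeInteger(value) || value < 0) {
    throw new RangeError("expected a non-negative integer");
  }
  const parts: string[] = [];
  let rest = value;
  for (let i = 0; ; i++) {
    // The top group is left unpadded; it also absorbs anything beyond "T".
    const isTop = rest < 1000 || i === SI_PREFIXES.length - 1;
    const digits = isTop ? String(rest) : String(rest % 1000).padStart(3, "0");
    parts.unshift(digits + (i === 0 ? unit : SI_PREFIXES[i]));
    if (isTop) break;
    rest = Math.floor(rest / 1000);
  }
  return parts.join(separator);
}

console.log(siGrouped(1255368799, "B")); // 1G 255M 368k 799B
console.log(siGrouped(1125, "m"));       // 1k 125m
```

Passing "\u2009" as the separator gives the thin space mentioned above.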

22. truth_seeker ◴[27 Aug 25 08:55 UTC] No.45037040[source]▶

>>45034957 (OP) #

Magic-RegExp aims to create a compiled away, type-safe, readable RegEx alternative that makes the process a lot easier. https://blog.logrocket.com/understanding-magic-regexp-regexp...

example from blog:

    import { createRegExp, exactly, wordChar, oneOrMore, anyOf } from "magic-regexp";

    const regExp = createRegExp(
      exactly("http")
        .and(exactly("s").optionally())
        .and("://")
        .optionally()
        .and(exactly("www.").optionally())
        .and(oneOrMore(wordChar))
        .and(exactly("."))
        .and(anyOf("com", "org", "io")),
      ["g", "m", "i"]
    );

    console.log(regExp);
    // /(https?:\/\/)?(www\.)?\w+\.(com|org|io)/gmi

23. 0points ◴[27 Aug 25 09:14 UTC] No.45037162{3}[source]▶

>>45035264 #

> And this would solve the problem of being server side only.

Server-side?

You should look into compiling your Rust to Wasm.

24. conartist6 ◴[27 Aug 25 11:14 UTC] No.45038039[source]▶

>>45034957 (OP) #

FWIW JS has a library-layer non-backtracking regex implementation in @bablr/regex. Its dialect is still slightly different from native JS regex, and it (currently) lacks lookahead and lookbehind support. It's not going to have Rust-y perf, but it shouldn't get ReDoS'd either.

25. burntsushi ◴[27 Aug 25 12:48 UTC] No.45038925{3}[source]▶

>>45035828 #

> Honestly, the biggest shock here for me is that Rust doesn't support these.

It's likely a shock because you over-estimate their utility:

> those are two incredibly useful features of regex that are often effectively irreplaceable.

Tons of people are using the `regex` crate in the Rust ecosystem. Tons use RE2 with C++. And tons use the standard library `regexp` package with Go. If all of these libraries were lacking actually "irreplaceable" features, I don't think they would be so widely used. So I think, empirically, you overstate things here.

They are of course undeniably useful features, but you don't need them to write complex regexes. The fact of the matter is that a lot (not all) of uses of lookaround or backreferences can be replaced with either careful use of capture groups or a second regex.
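Two small illustrations of those replacements, written against the built-in RegExp so they run anywhere. The patterns and the helper names `extractUsdAmount` and `isAllowedName` are made-up examples, not taken from any of the libraries mentioned.

```typescript
// Lookahead version (needs a backtracking-style engine): /\d+(?= USD)/
// Capture-group version: match the context too, then read only group 1.
const amountWithContext = /(\d+) USD/;

function extractUsdAmount(text: string): string | null {
  const m = amountWithContext.exec(text);
  return m?.[1] ?? null;
}

// Negative-lookahead version: /^(?!admin$)\w+$/
// Two-regex version: one pattern for the shape, a second for the exclusion.
const wordShape = /^\w+$/;
const reservedName = /^admin$/;

function isAllowedName(name: string): boolean {
  return wordShape.test(name) && !reservedName.test(name);
}

console.log(extractUsdAmount("costs 42 USD")); // 42
console.log(isAllowedName("admin"));           // false
console.log(isAllowedName("administrator"));   // true
```

The trade-off is that the logic moves out of the pattern and into the calling code, which is exactly what is hard to do when the regex itself is the user-facing interface.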

The place where one might really feel the absence of these regex features is when regexes are used as the interface to something.

Besides, if you need those extra features in the Rust ecosystem, you can just use `fancy-regex`[1]. It's built on top of the `regex` crate.

[1]: https://crates.io/crates/fancy-regex

26. bawolff ◴[27 Aug 25 19:35 UTC] No.45044056{4}[source]▶

>>45035608 #

PHP tended towards this approach too. It did lead to security vulns, though: people interpreted a timeout the same as not matching, so attackers made the input complicated enough to skip the security check. (Part of this is on PHP for making the difference between timeout and no match be null vs false, instead of just throwing an exception.)
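In TypeScript terms, the pitfall and the fail-closed alternative might look like the sketch below. The `MatchOutcome` type, the `BudgetedMatcher` signature, and both function names are hypothetical; the matchers are stubs standing in for any engine that can give up early.

```typescript
type MatchOutcome = "match" | "no-match" | "limit-exceeded";
type BudgetedMatcher = (input: string) => MatchOutcome;

// The pitfall: collapsing three outcomes into a boolean lets
// "limit-exceeded" pass as if the blocklist had not matched.
function isBlockedNaive(input: string, blocklist: BudgetedMatcher): boolean {
  return blocklist(input) === "match";
}

// Fail closed: anything other than a definite "no-match" rejects the input.
function assertNotBlocked(input: string, blocklist: BudgetedMatcher): void {
  const outcome = blocklist(input);
  if (outcome === "match") {
    throw new Error("input rejected: matches blocklist");
  }
  if (outcome === "limit-exceeded") {
    throw new Error("input rejected: pattern check did not finish");
  }
}

// Stub matcher that always gives up, as it would on a hostile input.
const alwaysGivesUp: BudgetedMatcher = () => "limit-exceeded";

console.log(isBlockedNaive("<payload>", alwaysGivesUp)); // false: the check is skipped
try {
  assertNotBlocked("<payload>", alwaysGivesUp);
} catch (e) {
  console.log((e as Error).message); // input rejected: pattern check did not finish
}
```

Throwing makes the unfinished check impossible to ignore by accident, which is the behavior the comment above wishes PHP had chosen.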