
27 points roggenbuck | 2 comments

I wanted a safer alternative to RegExp for TypeScript that uses a linear-time engine, so I built Regolith.

Why: Many CVEs happen because TypeScript libraries are vulnerable to Regular Expression Denial of Service attacks. I learned about this problem while doing undergraduate research and found that languages like Rust have built-in protection but languages like JavaScript, TypeScript, and Python do not. This library attempts to mitigate these vulnerabilities for TypeScript and JavaScript.

How: Regolith uses Rust's Regex library under the hood to prevent ReDoS attacks. The Rust Regex library implements a linear-time Regex engine that guarantees linear complexity for execution. A ReDoS attack occurs when a malicious input is provided that causes a normal Regex engine to check for a matching string in too many overlapping configurations. This causes the engine to take an extremely long time to compute the Regex, which could cause latency or downtime for a service. By designing the engine to take at most a linear amount of time, we can prevent these attacks at the library level and have software inherit these safety properties.
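To make the failure mode concrete, here is a minimal sketch of catastrophic backtracking using the built-in `RegExp`. The pattern `/^(a+)+$/` and the helper `timeBacktracking` are illustrative assumptions, not taken from Regolith; the sketch only shows how a non-matching input makes a backtracking engine do rapidly growing work as the input gets longer.

```typescript
// Illustrative only: `timeBacktracking` is a hypothetical helper, not part of Regolith.
// The nested quantifier in (a+)+ lets a backtracking engine split a run of "a"s
// in very many ways before it can conclude that there is no match.
const vulnerable = /^(a+)+$/;

function timeBacktracking(n: number): { n: number; matched: boolean; ms: number } {
  // n copies of "a" followed by "!" can never match, so every split is tried.
  const input = "a".repeat(n) + "!";
  const start = Date.now();
  const matched = vulnerable.test(input);
  return { n, matched, ms: Date.now() - start };
}

// Input sizes are kept small on purpose; elapsed time grows quickly with n.
for (const n of [8, 12, 16, 20]) {
  const { matched, ms } = timeBacktracking(n);
  console.log(`n=${n} matched=${matched} elapsed=${ms}ms`);
}
```

Exact timings depend on the JavaScript engine and machine, so none are shown here. A linear-time engine would be expected to reject the same inputs in time proportional to their length.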

I'm really fascinated by making programming languages safer and I would love to hear any feedback on how to improve this project. I'll try to answer all questions posted in the comments.

Thanks! - Jake Roggenbuck

semiquaver ◴[] No.45035198[source]

  > Regolith attempts to be a drop-in replacement for RegExp and requires minimal (to no) changes to be used instead
vs

  > Since Regolith uses Rust bindings to implement the Rust Regex library to achieve linear time worst case, this means that backreferences and look-around aren't available in Regolith either.
Obviously it cannot be a drop-in replacement if the regex dialect differs. That it has a compatible API is not the only relevant factor. I’d recommend removing the top part from the readme.

Another thought: since backreferences and lookaround are the features in JS regexes which _cause_ ReDoS, why not just wrap vanilla JS regex, rejecting patterns including them? Wouldn’t that achieve the same result in a simpler way?
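A rough sketch of what such a wrapper could look like, assuming a hand-written scan of the pattern source. The names `findUnsupportedFeature` and `safeRegExp` are hypothetical and not part of Regolith or any library. The scan is deliberately conservative: it rejects `\1` and `\k<` even in the legacy cases where JavaScript would treat them as plain escapes.

```typescript
// Hypothetical helper: returns the name of the first disallowed construct, or null.
function findUnsupportedFeature(source: string): string | null {
  let inClass = false; // true while inside a [...] character class
  for (let i = 0; i < source.length; i++) {
    const ch = source.charAt(i);
    if (ch === "\\") {
      const next = source.charAt(i + 1);
      if (!inClass && next >= "1" && next <= "9") return "numbered backreference";
      if (!inClass && next === "k" && source.charAt(i + 2) === "<") return "named backreference";
      i++; // skip the escaped character so "\\(" or "\\[" is not misread
      continue;
    }
    if (inClass) {
      if (ch === "]") inClass = false;
      continue;
    }
    if (ch === "[") {
      inClass = true;
      continue;
    }
    if (ch === "(" && source.charAt(i + 1) === "?") {
      const rest = source.slice(i + 2, i + 4);
      if (rest.startsWith("=") || rest.startsWith("!")) return "lookahead";
      if (rest === "<=" || rest === "<!") return "lookbehind";
    }
  }
  return null;
}

// Hypothetical wrapper: builds a normal RegExp only if the pattern passes the scan.
function safeRegExp(source: string, flags?: string): RegExp {
  const feature = findUnsupportedFeature(source);
  if (feature !== null) {
    throw new SyntaxError(`Pattern rejected: ${feature} is not allowed`);
  }
  return new RegExp(source, flags);
}

console.log(safeRegExp("a+b").test("aab"));
console.log(findUnsupportedFeature("(a)\\1"));
```

This only implements the rejection step. Whether it is sufficient is a separate question: a pattern such as `(a+)+$` contains neither feature, so it passes this check, yet it can still backtrack heavily in a backtracking engine.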

replies(4): >>45035253 #>>45035264 #>>45035460 #>>45035828 #
1. bbor ◴[] No.45035828[source]
Totally agree -- those are two incredibly useful features of regex[1][2] that are often effectively irreplaceable. I could see this being a straightforward tradeoff for applications that know for sure they don't need complex regexes but still must accept patterns written by the client for some reason(?), but otherwise this seems like a hell of a way to go to replace a `timeout` wrapper.

This paragraph in particular seems very wholesome, but misguided in light of the tradeoff:

  Having a library or project that is immune to these vulnerabilities would save this effort for each project that adopted it, and would save the whole package ecosystem that effort if widely adopted.
Honestly, the biggest shock here for me is that Rust doesn't support these. Sure, Python has yet to integrate the actually-functional `regex`[3] into stdlib to replace the dreadfully under-specced `re`, but Rust is the new kid on the block! I guess people just aren't writing complex regexes anymore...[4]

RE:simpler wrapper, I personally don't see any reason it wouldn't work, and dropping a whole language seems like a big win if it does. I happened to have some scaffolding on hand for the cursed, dark art of metaregexes, so AFAICT, this pattern would work for a blanket ban: https://regexr.com/8gplg Ironically, I don't think there's a way to A) prevent false-positives on triple-backslashes without using lookarounds, or B) highlight the offending groups in full without backrefs!

[1] https://www.regular-expressions.info/backref.html

[2] https://www.regular-expressions.info/lookaround.html

[3] https://github.com/mrabarnett/mrab-regex

[4] We need a regex renaissance IMO, though the feasibility of "just throw a small fine-tuned LLM at it" may delay/obviate that for users that can afford the compute... It's one of the OG AI concepts, back before intuition seemed possible!

replies(1): >>45038925 #
2. burntsushi ◴[] No.45038925[source]
> Honestly, the biggest shock here for me is that Rust doesn't support these.

It's likely a shock because you over-estimate their utility:

> those are two incredibly useful features of regex that are often effectively irreplaceable.

Tons of people are using the `regex` crate in the Rust ecosystem. Tons use RE2 with C++. And tons use the standard library `regexp` package with Go. If all of these libraries were lacking actually "irreplaceable" features, I don't think they would be so widely used. So I think, empirically, you overstate things here.

They are of course undeniably useful features, but you don't need them to write complex regexes. The fact of the matter is that a lot (not all) of uses of lookaround or backreferences can be replaced with either careful use of capture groups or a second regex.
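As a sketch of those two substitutions, written in TypeScript to match the project under discussion: the function names and sample patterns below are made up for illustration, and the rewrites are only roughly equivalent to the originals named in the comments (for example, `[^"]*` also matches newlines where `.` would not).

```typescript
// Lookahead /\w+(?==)/ replaced by a capture group:
// match the "=" as well, but keep only group 1.
function keyOf(line: string): string | null {
  const m = /(\w+)=/.exec(line);
  return m?.[1] ?? null;
}

// Negative lookahead /^(?!.*ignored).*error/ replaced by a second regex.
function isReportableError(line: string): boolean {
  return /error/.test(line) && !/ignored/.test(line);
}

// Backreference /(["']).*?\1/ replaced by one alternative per quote style.
function firstQuoted(text: string): string | null {
  const m = /"([^"]*)"|'([^']*)'/.exec(text);
  if (m === null) return null;
  // Only the group belonging to the alternative that matched is defined.
  return m[1] ?? m[2] ?? null;
}

console.log(keyOf("name=value"));
console.log(isReportableError("error: disk full"));
console.log(firstQuoted("say 'hi' now"));
```

None of these patterns use lookaround or backreferences, so they stay within the feature set that linear-time engines support.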

The place where one might really feel the absence of these regex features is when regexes are used as the interface to something.

Besides, if you need those extra features in the Rust ecosystem, you can just use `fancy-regex`[1]. It's built on top of the `regex` crate.

[1]: https://crates.io/crates/fancy-regex