
206 points | ashvardanian | 1 comment
ashvardanian ◴[] No.46288320[source]
This article is about the ugliest — but arguably the most important — piece of open-source software I’ve written this year. The write-up ended up long and dense, so here’s a short TL;DR:

I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.
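To make "case folding" concrete, here is a minimal Python sketch using only the standard library. It is not StringZilla's API and has nothing to do with the AVX-512 kernels; `casefold_find` is a hypothetical helper name, and it works on code points rather than UTF-8 bytes. It shows why lowercasing both sides is not enough: full case folding can expand one character into several (for example, "ß" folds to "ss"), so match offsets have to be mapped back to the original text.

```python
def casefold_find(haystack: str, needle: str) -> int:
    """Return the index in `haystack` where a case-insensitive match of
    `needle` starts, or -1 if there is none. Hypothetical helper."""
    folded_chars = []
    origin = []  # origin[i] = index in haystack of the char that produced folded_chars[i]
    for index, char in enumerate(haystack):
        # One source character may fold into several, e.g. "ß" -> "ss".
        for folded in char.casefold():
            folded_chars.append(folded)
            origin.append(index)
    position = "".join(folded_chars).find(needle.casefold())
    if position < 0:
        return -1
    # Map the offset in the folded text back to the original string.
    return origin[position] if position < len(origin) else len(haystack)


# Lowercasing alone misses this match; case folding finds it.
naive = "Die Straße".lower().find("STRASSE".lower())
folded = casefold_find("Die Straße", "STRASSE")
```

Here `naive` comes out as -1 while `folded` is 4, which is the kind of mismatch that "less wrong" refers to above.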

StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.

Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.

replies(5): >>46288545 #>>46288790 #>>46291556 #>>46291741 #>>46301406 #
fatty_patty89 ◴[] No.46288545[source]
Thank you

do the go bindings require cgo?

replies(1): >>46288586 #
ashvardanian ◴[] No.46288586[source]
The Go bindings: yes, they are based on cgo. I realize it's suboptimal, but it seems like the only practical option at this point.
replies(1): >>46288610 #
fatty_patty89 ◴[] No.46288610[source]
In a normal world, the Go C FFI wouldn't have insane overhead, but what can we do? The language is perfect, and it will stay that way until morale improves.

Thanks for the work you do

replies(2): >>46289084 #>>46289964 #
kardianos ◴[] No.46289084[source]
In a real (not "normal") world, trade-offs exist, and Go chose a specific set of design points that have consequences.