(github.com)

3 points gorkemcetin | 1 comment | 27 Aug 25 00:26 UTC

If you’re working with LLM training data (like I often am), you’ll know how tricky it can be to scrub out PII without breaking the dataset. I have been using Microsoft Presidio for some time and decided to build a UI on top of it. The tool scans text and recognizes sensitive bits (e.g. names, emails, addresses), processes images to mask what's sensitive, and handles structured data.

Everything is written in TypeScript + Node.js, with great help from Claude Code :) It's still early, so feedback & contributions are more than welcome.
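
To make "scans and recognizes sensitive bits" concrete, here is a minimal sketch of pattern-based masking. It is not MaskWise's code or Presidio's API: the `RULES` table, the entity labels and `maskText` are hypothetical names, and the two regexes are deliberately simplistic. Names and addresses generally need an NER model rather than regexes, which is the kind of work a tool like Presidio takes on.

```typescript
// Hypothetical sketch: illustrative only, not MaskWise's or Presidio's API.
interface Rule {
  entity: string;  // label written into the masked text
  pattern: RegExp; // global regex that finds candidate values
}

// Deliberately simple patterns; production recognizers are far more thorough.
const RULES: Rule[] = [
  { entity: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { entity: "US_SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
];

// Replace every match of every rule with a <ENTITY> tag.
function maskText(text: string): string {
  let out = text;
  for (const rule of RULES) {
    out = out.replace(rule.pattern, `<${rule.entity}>`);
  }
  return out;
}

console.log(maskText("Mail jane.doe@example.com, SSN 123-45-6789."));
// prints "Mail <EMAIL>, SSN <US_SSN>."
```

Masking with a typed tag instead of deleting the value keeps the sentence structure intact, which matters when the text is later used as training data.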

phren0logy ◴[27 Aug 25 01:00 UTC] No.45034243[source]▶

>>45034009 (OP) #

An important issue. How does this compare with llm-guard and its ability to create a “vault” to later de-anonymize?

replies(1): >>45034339 #

1. gorkemcetin ◴[27 Aug 25 01:16 UTC] No.45034339[source]▶

>>45034243 #

MaskWise and LLM Guard serve different stages of the AI pipeline. LLM Guard mainly filters prompts and responses to prevent prompt injection attacks. MaskWise is for preparing datasets before LLM training or fine-tuning: it processes large document collections (PDFs, Office docs, images) to detect & anonymize PII.

Vault is in the works :)
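
For readers unfamiliar with the idea, a vault in this sense is a lookup table from placeholders back to the original values, so anonymized text can be restored later. The sketch below is only an illustration of that concept under assumed names (`Vault`, `placeholderFor`, `restore`, `anonymize`); it does not reflect how MaskWise or LLM Guard implement it, and it handles emails only.

```typescript
// Hypothetical sketch of a reversible-anonymization "vault".
class Vault {
  private byPlaceholder: Record<string, string> = {};
  private byValue: Record<string, string> = {};
  private count = 0;

  // Return a stable placeholder for a value, creating one on first sight.
  placeholderFor(entity: string, value: string): string {
    const key = `${entity}:${value}`;
    const existing = this.byValue[key];
    if (existing !== undefined) return existing;
    this.count += 1;
    const placeholder = `[${entity}_${this.count}]`;
    this.byValue[key] = placeholder;
    this.byPlaceholder[placeholder] = value;
    return placeholder;
  }

  // Swap known placeholders back to their original values.
  restore(text: string): string {
    return text.replace(/\[[A-Z_]+_\d+\]/g, (ph) => {
      const original = this.byPlaceholder[ph];
      return original !== undefined ? original : ph;
    });
  }
}

// Simplistic email pattern, for illustration only.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function anonymize(text: string, vault: Vault): string {
  return text.replace(EMAIL_PATTERN, (m) => vault.placeholderFor("EMAIL", m));
}

const vault = new Vault();
const original = "From a@x.io to b@y.org, cc a@x.io";
const masked = anonymize(original, vault);
console.log(masked);                // prints "From [EMAIL_1] to [EMAIL_2], cc [EMAIL_1]"
console.log(vault.restore(masked)); // prints the original string
```

Because the same value always maps to the same placeholder, references stay consistent across a document. The vault itself then holds the PII and has to be stored and protected separately from the anonymized data.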

Show HN: MaskWise: Redact, mask, and anonymize data in training files for LLMs