
Show HN: Greenmask 0.2 – Database anonymization tool

(github.com)

94 points woyten | 1 comment | 16 Oct 24 20:37 UTC

Hi! My name is Vadim, and I’m the developer of Greenmask (https://github.com/GreenmaskIO/greenmask). Greenmask is almost a year old, and we recently published one of our most significant releases, with new features: https://github.com/GreenmaskIO/greenmask/releases/tag/v0.2.0, as well as a new website at https://greenmask.io.

Before I describe Greenmask’s features, I want to share the story of how and why I started implementing it.

Everyone strives to have their staging environment resemble production as closely as possible because it plays a critical role in ensuring quality and improving metrics like time to delivery. To achieve this, many teams have started migrating databases and data from production to staging environments. Obviously this requires anonymizing the data, and for this people use either custom scripts or existing anonymization software.

Having worked as a database engineer for 8 years, I frequently struggled with routine tasks like setting up development environments—this was a common request. Initially, I used custom scripts to handle this, but things became increasingly complex as the number of services grew, especially with the rise of microservices architecture.

When I began exploring tools to solve this issue, I listed my key expectations for such software: documentation; type safety (the tool should validate any changes to the data); streaming (I want the ability to stream the data while transformations are being applied); consistency (transformations must maintain constraints, functional dependencies, and more); reliability; customizability; interactivity and usability; simplicity.

I found a few options, but none fully met my expectations. Two interesting tools I discovered were pganonymizer and replibyte. I liked the architecture of Replibyte, but when I tried it, it failed due to architectural limitations.

With these thoughts in mind, I began developing Greenmask in mid-2023. My goal was to create a tool that meets all of these requirements, based on the design principles I laid out. Here are some key highlights:

* It is a single utility - Greenmask delegates the schema dump to vendor utilities and takes responsibility only for data dumps and transformations.

* Database Subset (https://docs.greenmask.io/latest/database_subset) - specify a subset condition to scale down the database size. We did deep research into graph algorithms, and Greenmask can now subset databases of almost any complexity.

* Database type safety - it uses the DBMS driver to decode and encode data into real types (such as int, float, etc.) in the stream. This guarantees consistency and almost eliminates the chance of corrupted dumps.

* Deterministic engine (https://docs.greenmask.io/latest/built_in_transformers/trans...) - generate data using the hash engine that produces consistent output for the same input.

* Dynamic parameters for transformers (https://docs.greenmask.io/latest/built_in_transformers/dynam...) - imagine having created_at and updated_at dates with functional dependencies. Dynamic parameters ensure these dates are generated correctly.
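
To make the last two bullets more concrete, here is a minimal Python sketch of the general idea. It is not Greenmask's implementation or configuration format: the salt, the function names (`mask_timestamp`, `mask_row`), and the date range are all hypothetical and chosen only for illustration. A keyed hash maps each input to a stable output, and the lower bound for `updated_at` is taken from the already-masked `created_at`, so the dependency between the two columns survives masking.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

# Hypothetical secret used to key the hash; not a Greenmask setting.
SALT = b"example-salt"


def _hash_to_int(value: str, purpose: str) -> int:
    """Map an input string to a stable integer via HMAC-SHA256."""
    message = f"{purpose}:{value}".encode()
    digest = hmac.new(SALT, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def mask_timestamp(original: datetime, lower: datetime,
                   upper: datetime, purpose: str) -> datetime:
    """Deterministically pick a timestamp in [lower, upper)."""
    span = int((upper - lower).total_seconds())
    if span <= 0:
        raise ValueError("upper must be later than lower")
    offset = _hash_to_int(original.isoformat(), purpose) % span
    return lower + timedelta(seconds=offset)


def mask_row(row: dict) -> dict:
    """Mask created_at and updated_at while keeping updated_at >= created_at."""
    lower = datetime(2020, 1, 1)   # illustrative range only
    upper = datetime(2024, 1, 1)
    created = mask_timestamp(row["created_at"], lower, upper, "created_at")
    # "Dynamic" bound: the minimum for updated_at is the masked created_at.
    updated = mask_timestamp(row["updated_at"], created, upper, "updated_at")
    return {**row, "created_at": created, "updated_at": updated}


if __name__ == "__main__":
    row = {
        "id": 1,
        "created_at": datetime(2023, 5, 1, 12, 0),
        "updated_at": datetime(2023, 6, 1, 9, 30),
    }
    print(mask_row(row))
```

Running `mask_row` twice on the same row returns the same masked values, and the masked `updated_at` never precedes the masked `created_at`. A real tool would also need to handle NULLs, time zones, and salt management, which this sketch leaves out.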

We are actively maintaining the project and continuously improving it; our public roadmap is at https://github.com/orgs/GreenmaskIO/projects/6. Soon, we will release a Python library along with transformation collections to help users develop their own complex transformations and integrate them with any service. We plan to support more database engines, with MySQL being the next one, and we are working on tools that will integrate seamlessly with your CI/CD systems.

To get started, we’ve prepared a playground for you that can be easily set up using Docker Compose: https://docs.greenmask.io/latest/playground/

I’d love to hear any questions or feedback from you!


gregw2 [18 Oct 24 18:02 UTC] No.41881895

>>41863600 (OP) #

Free feedback for you, nothing personal...

Within 48 hours of needing to do some masking (exporting data from a database to CSV to import it to a less trusted domain's database for testing/benchmarking), I ran across your HN post and tool.

A pair of us had started with a different masking approach I knew was dumb but would work, but I mentioned Greenmask to the guy working with me doing masking as something to perhaps look at to see if it'd help us later/next-time.

He apparently took me up on it more aggressively than I anticipated and today he indicated he had looked at it and found it hard to understand how to use it (documentation-wise). I.e. he tried it but didn't really get anywhere.

I'm not sure precisely what he meant but my perception from whatever he said (I wasn't trying to gather feedback for you at the time) was he didn't have/see any example code showing the different things he could do.

Having skimmed your documentation for 3 minutes just now, I am inclined to agree. Additionally I would observe that your Show HN bullet explanation of your product is actually clearer/tighter than your official documentation page, where my impatient eye tends to glaze over after a few bullets of stuff I am not sure if I care about...


1. woyten [18 Oct 24 19:51 UTC] No.41882864

Hi! Thank you for the feedback. I completely understand your concern and agree with you. It can indeed be difficult to get started, and we need to provide clearer use cases with examples to demonstrate the basic concepts and features. I’ll be working on revising the documentation soon to make it easier to follow.