I guess you can add some CI steps when modifying the db to ensure a give column is allowed or masked, but still, would be nice if this was defaulted the other way around.
More specifically the integration of this functionality at a fortunately ex-employer was purposefully kept away from the dev team (no motivation was offered, however I suspect that some sort of segmentation was sought after) and thus did not take into account that tables with PII did in fact still need their schema changed from time to time.
This lead to the anonymizer extension, together with the confidential views to only be installed on production DB instances with dev, test, and staging instances running vanilla postgres. With this, the possibility to catch DB migration issues related to the confidential views was pushed out to the release itself. This lead to numerous failed releases which involved having the ops team intervene, manually remove the views for the duration of the release, then manually re-create them.
So,
If you plan to use this extension, and specifically its views, make sure you have it set up in the exactly same way on all environments. Also make sure that its initialisation and view creation are part of your framework's DB migrations so that they are documented and easy to precisely reproduce on new environments.
According to its --help output, it is designed to retain the following properties of data:
- cardinalities of values (number of distinct values) for every column and for every tuple of columns;
- conditional cardinalities: number of distinct values of one column under condition on value of another column;
- probability distributions of absolute value of integers; sign of signed integers; exponent and sign for floats;
- probability distributions of length of strings;
- probability of zero values of numbers; empty strings and arrays, NULLs;
- data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across table; continuity of floating point values.
- date component of DateTime values;
- UTF-8 validity of string values;
- string values continue to look somewhat natural
[1]: https://clickhouse.com/docs/en/operations/utilities/clickhou...
[1] A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release. https://en.wikipedia.org/wiki/K-anonymity [2] https://research.cbs.nl/casc/mu.htm
> The project has a declarative approach of anonymization.
> Finally, the extension offers a panel of detection functions that will try to guess which columns need to be anonymized.
https://postgresql-anonymizer.readthedocs.io/en/stable/detec...
> These methods will destroy the original data. Use with care.
https://dba.stackexchange.com/questions/306661/how-to-instal...
https://dba.stackexchange.com/questions/306661/how-to-instal...
reply
https://dba.stackexchange.com/questions/306661/how-to-instal...
reply
A good use case that comes to mind is using prod data in a retool app or something for your internal team but you want to mask out certain bits.
I’ve been building Neosync [1] to handle more advanced use cases where you want to anonymize data for lower level environments. This is more useful for stage or dev data. Then prod stays completely unexposed to anyone.
It also has a transactional anonymization api too.
It certainly complicates things but it provides an additional security layer of separation between the PII and it's related data. You can provide your end users access to a database without having to worry about them getting access to the "dangerous" data. If they do indeed need access to the data pointed to via the token they can request access to that related database.
This method also provides more performance since you don't need to encrypt the entire database (which is often required when storing PII) and also don't need to add extra security context function calls to every database request.
The summary is pretty funny:
> "After trying four methods, I got so tired of this problem that it was time just to choose something, make it into a usable tool, and announce the solution"
As a computer scientist and academic researcher having worked on this topic for now more than a decade (some of my work if you are interested: [1, 2]), re-identification is often possible from few pieces of information. Masking or replacing a few values or columns will often not provide sufficient guarantees—especially when a lot of information is being released.
What this tool does is called ‘pseudonymization’ and maybe, if not very carefully, ‘de-identification’ in some case. With colleagues, reviewed all the literature and industry practices a few months ago [3], and our conclusion was:
> We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today.
This is clearly not what this tool is doing.
[1] https://www.nature.com/articles/s41467-019-10933-3 [2] https://www.nature.com/articles/s41467-024-55296-6 [3] https://www.science.org/doi/10.1126/sciadv.adn7053
https://github.com/nucleuscloud/neosync
(I'm one of the co-founders)
I checked the homepage but I do not watch Loom-style demos personally, definitely not 5 minute ones, and so I left.
-
When I click on OP's link, or just search for it on Google, it takes less than a full page for the extension to show me an extremely straightforward demonstration of its value. You should have something like that.
A simple example of what queries will look like, what setup will look like, all concisely communicated, no 5 minute lectures involved.
All that said, I wouldn't rely on this extension as a way to deliver anonymized data to downstream consumers outside of our software team. As others have pointed out, this is really more of a pseudonymization technique. It's great for removing phone numbers, emails, etc. from your data set, but it's not going to eradicate PII. Pretty much all anonymized records can be traced back to their source data through PKs or FKs.
This extension is currently not available on RDS but it is available on many others DBaaS providers : Azure SQL, Google Cloud SQL, Crunchy Bridge, ....
https://postgresql-anonymizer.readthedocs.io/en/stable/maski...
The extension offers a large panel of masking functions : some are pseudonymizing functions but others are more destructive. For instance there's large collection of fake data generators ( names, address, phones, etc. )
It's up to the database administrator or the application developer to decide which columns need to be masked and how it should be masked.
In some use cases, pseudonymization is enough and others anonymization is required....
https://postgresql-anonymizer.readthedocs.io/en/stable/priva...
* Static Masking will destroy the authentic data once for all
* Dynamic Masking will only alter the data the "masked users". Regular users will continue to view the real data.
Both techniques have their own advantage depending on your context.
There are many other masking functions that will actually anonymize the data.
And the extension does not force you to respect the foreign keys.
It's really up to you to decide how you want to implement your masking policy