Pg_parquet: An extension to connect Postgres and parquet

(www.crunchydata.com)

1. aamederen ◴[17 Oct 24 16:45 UTC] No.41871295[source]▶

Congratulations! I'm happy to see the PostgreSQL license.

2. linuxhansl ◴[17 Oct 24 20:55 UTC] No.41873697[source]▶

Parquet itself is actually not that interesting. It should be able to read (and even write) Iceberg tables.

Also, how does it compare to pg_duckdb (which adds DuckDB execution to Postgres including reading parquet and Iceberg), or duck_fdw (which wraps a DuckDB database, which can be in memory and only pass-through Iceberg/Parquet tables)?

replies(3): >>41874044 #>>41874177 #>>41876793 #

3. oulipo ◴[17 Oct 24 21:08 UTC] No.41873813[source]▶

>>41871068 (OP) #

Cool, would this be better than using a clickhouse / duckdb extension that reads postgres and saves to Parquet?

What would be recommended to output regularly old data to S3 as parquet file? To use a cron job which launches a second Postgres process connecting to the database and extracting the data, or using the regular database instance? doesn't that slow down the instance too much?

replies(2): >>41874097 #>>41881043 #

4. whalesalad ◴[17 Oct 24 21:26 UTC] No.41873978[source]▶

>>41871068 (OP) #

I wish RDS made it easy to add custom extensions like this.

replies(3): >>41874650 #>>41875299 #>>41918212 #

5. AdamProut ◴[17 Oct 24 21:33 UTC] No.41874044[source]▶

>>41873697 #

Had a similar thought. Azure Postgres has something similar to pg_parquet (pg_azure_storage), but we're looking into replacing it with pg_duckdb assuming the extension continues to mature.

It would be great if the Postgres community could get behind one good opensource extension for the various columnstore data use cases (querying data stored in an open columnstore format - delta, iceberg, etc. being one of them). pg_duckdb seems to have the best chance at being the goto extension for this.

replies(1): >>41874183 #

6. craigkerstiens ◴[17 Oct 24 21:39 UTC] No.41874097[source]▶

>>41873813 #

This alone wouldn't be a full replacement. We do have a full product that does that with customers seeing great performance in production. Crunchy Bridge for Analytics does similar by embedding DuckDB inside Postgres, though for users is largely an implementation detail. We support iceberg as well and have a lot more coming basically to allow for seamless analytics on Postgres building on what Postgres is good at, iceberg for storage, and duckdb for vectorized execution.

That isn't fully open source at this time but has been production grade for some time. This was one piece that makes getting to that easier for folks and felt a good standalone bit to open source and share with the broader community. We can also see where this by itself for certain use cases makes sense, as you sort of point out if you had time series partitioned data, leveraged partman for new partitions and pg_cron which this same set of people authored you could automatically archive old partitions to parquet but still have thing for analysis if needed.

7. mslot ◴[17 Oct 24 21:50 UTC] No.41874177[source]▶

>>41873697 #

(Marco from Crunchy Data)

With PostgreSQL extensions, we find it's most effective to have single-purpose modular extensions.

For instance, I created pg_cron a few years ago, and it's on basically every PostgreSQL service because it does one thing and does it well.

We wanted to create a light-weight implementation of Parquet that does not pull a multi-threaded library into every postgres process.

When you get to more complex features, a lot of questions around trade-offs, user experience, and deployment model start appearing. For instance, when querying an Iceberg table, caching becomes quite important, but that raises lots of other questions around cache management. Also, how do you deal with that memory hungry, multi-threaded query engine running in every process without things constantly falling over?

It's easier to answer those questions in the context of a managed service where you control the environment, so we have a product that can query Iceberg/Parquet/CSV/etc. in S3, does automatic caching, figures out the region of your bucket, can create tables directly from files, and uses DuckDB to accelerate queries in a reliable manner. This is partially powered by a set of custom extensions, partially by other things running on the managed service. https://docs.crunchybridge.com/analytics

However, some components can be neatly extracted and shared broadly like COPY TO/FROM Parquet. We find it very useful for archiving old partitions, importing public and private data sets, preparing data for analytics, and moving data between PostgreSQL servers.

8. mslot ◴[17 Oct 24 21:51 UTC] No.41874183{3}[source]▶

>>41874044 #

Fun fact, I created pg_azure_storage :)

replies(2): >>41877261 #>>41918183 #

9. wdb ◴[17 Oct 24 22:56 UTC] No.41874650[source]▶

>>41873978 #

or Google Cloud

10. treefarmer ◴[18 Oct 24 00:41 UTC] No.41875299[source]▶

>>41873978 #

Yeah, I'm still surprised they haven't added a list of unsupported extensions (that you can add but they're not responsible for the performance of).

replies(1): >>41885515 #

11. jeadie ◴[18 Oct 24 00:52 UTC] No.41875366[source]▶

>>41871068 (OP) #

Why not just federate Postgres and parquet files? That way the query planner can push down as much of the query and reduce how much data has to move about?

12. RMarcus ◴[18 Oct 24 03:46 UTC] No.41876178[source]▶

>>41871068 (OP) #

This is awesome, thanks for creating this. I've had to write some absolutely wonky scripts to dump a PostgreSQL database into Parquet, or read a Parquet file into PostgreSQL. Normally some terrible combination of psycopg and pyarrow, which worked, but it was ad-hoc and slightly different every time.

A lot of other commenters are talking about `pg_duckdb` which maybe also could've solved my problem, but this looks quite simple and clean.

I hope for some kind of near-term future where there's some standardish analytics-friendly data archival format. I think Parquet is the closest thing we have now.

13. fulafel ◴[18 Oct 24 06:21 UTC] No.41876793[source]▶

>>41873697 #

Having the famously crashy DuckDB share a process and memory with PostgreSQL doesn't seem like the most robust setup.

replies(2): >>41876976 #>>41880138 #

14. skeptrune ◴[18 Oct 24 06:56 UTC] No.41876976{3}[source]▶

>>41876793 #

I had the exact same reaction

15. brinox ◴[18 Oct 24 08:00 UTC] No.41877261{4}[source]▶

>>41874183 #

I was just wondering if pg_parquet could be combined with pg_azure_storage to write Parquet files to Azure Storage.

I had problems with pg_azure_storage in the past, because the roles pg_read_server_files and pg_write_server_files are unassignable on Azure PostgreSQL databases which makes the use of `COPY {FROM,TO}` impossible.

replies(1): >>41877608 #

16. mslot ◴[18 Oct 24 09:10 UTC] No.41877608{5}[source]▶

>>41877261 #

Azure is not supported as a backend in pg_parquet right now, but shouldn't be hard to add (contributions welcome!)

https://github.com/CrunchyData/pg_parquet

It would not be safe to let any user access object storage. Therefore, pg_parquet has two roles called parquet_object_store_read and parquet_object_store_write that give permission to COPY FROM/TO object storage (but not local file system).

In pg_azure_storage there is a comparable azure_storage_admin role that needs to be granted to users that need Azure Blob Storage permission.

17. fforflo ◴[18 Oct 24 09:53 UTC] No.41877803[source]▶

>>41871068 (OP) #

I can see myself using this as alternative to foreign data wrappers and/or pg_dump even.

18. drewbitt ◴[18 Oct 24 10:12 UTC] No.41877892[source]▶

>>41871068 (OP) #

https://github.com/pgspider/parquet_s3_fdw is the foreign data wrapper alternative

19. memhole ◴[18 Oct 24 15:09 UTC] No.41880138{3}[source]▶

>>41876793 #

Famously crashy? Any incidents you can share? I’ve only had good experiences is why I ask.

replies(1): >>41885827 #

20. coredog64 ◴[18 Oct 24 16:37 UTC] No.41881043[source]▶

>>41873813 #

RDS for MySQL and Postgres natively support exporting a consistent snapshot to S3 in Parquet format.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...

21. bastawhiz ◴[19 Oct 24 03:49 UTC] No.41885515{3}[source]▶

>>41875299 #

Amazingly their own aws_s3 extension isn't even supported in their multi-az cluster configuration.

22. fulafel ◴[19 Oct 24 05:25 UTC] No.41885827{4}[source]▶

>>41880138 #

I haven't used it, but have gone through their issue tracker and seen a lot of comments to this effect here and on other internets places. (Unverifiable: also some hearsay from colleagues)

Not saying they're doing it wrong, it just seems they have some different stability vs performance tradeoffs than PG.

23. jakozaur ◴[20 Oct 24 08:48 UTC] No.41893976[source]▶

>>41871068 (OP) #

It's good for small data, but the Iceberg format would be nicer for bigger data sets.

24. dektol ◴[22 Oct 24 20:10 UTC] No.41918183{4}[source]▶

>>41874183 #

Is pg_azure_storage available on GitHub?

25. mdaniel ◴[22 Oct 24 20:15 UTC] No.41918212[source]▶

>>41873978 #

I was expecting https://github.com/wasmerio/wasmer-postgres#usage--documenta... to catch on, and some how must have gotten it mixed up with Trusted Language Extensions <https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...> (and https://github.com/aws/pg_tle/tree/v1.4.0#getting-started ) but unless someone decides to implement every single bit of a modern PG extension in JS (or Perl), that's just some hello-world nonsense