
224 points mlissner | 12 comments
1. wewewedxfgdf ◴[] No.45775817[source]
I tried DuckDB - liked it a lot - was ready to go further.

But I found it to be a real hassle to get it to use the right number of threads and the right amount of memory.

This led to lots of crashes. If you look at the project's GitHub issues you will see many OOM (out-of-memory) errors.

And then there was an index bug that crashed, seemingly unrelated to memory.

Life is too short for crashy database software, so I reluctantly dropped it. I was disappointed because it was exactly what I was looking for.
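(For what it's worth, DuckDB does expose knobs for both of these; whether they prevent the OOMs in a given workload is another matter. Per the DuckDB configuration docs:

```sql
-- Cap DuckDB's memory use and worker threads
SET memory_limit = '4GB';
SET threads = 4;
-- Give out-of-core operators somewhere to spill when running in-memory
SET temp_directory = '/tmp/duckdb_spill';
```

These take effect per connection; the paths and sizes above are illustrative.)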

replies(4): >>45776001 #>>45776020 #>>45776900 #>>45777350 #
2. lalitmaganti ◴[] No.45776001[source]
+1, this was my experience trying it out as well. I find that for getting started and for simple use cases it works amazingly well. But I have quite a lot of concerns about how it scales to more complex and esoteric workloads.

Non-deterministic OOMs in particular are among the worst failure modes for the sorts of tools I'd want to use DuckDB in, and as you say, I found them to be more common than I would like.

3. tuhgdetzhh ◴[] No.45776020[source]
I can recommend earlyoom (https://github.com/rfjakob/earlyoom). Instead of your system freezing or crashing, this tool kills the memory-eating process just in time (in this case duckdb). That lets you retry with smaller chunks of the dataset until it fits into memory.
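A typical invocation looks like this (flags per the earlyoom README; the thresholds and the `--prefer` pattern are illustrative, not specific to DuckDB):

```shell
# Start killing the largest process once less than 10% of RAM
# and 10% of swap remain available
earlyoom -m 10 -s 10

# Optionally bias victim selection toward a known memory hog
earlyoom -m 10 --prefer 'duckdb'
```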
replies(3): >>45776214 #>>45776329 #>>45777357 #
4. wewewedxfgdf ◴[] No.45776214[source]
Yeah memory and thread management is the job of the application, not me.
5. QuantumNomad_ ◴[] No.45776329[source]
When there is a specific program I want to run with a limit on how much memory it is allowed to allocate, I have found systemd-run to work well.

It uses cgroups to enforce resource limits.

For example, there’s a program I wrote myself which I run on one of my Raspberry Pis. I had a problem where the program would on rare occasions use up too much memory and I wouldn’t even be able to ssh into the Pi.

I run it like this:

  systemd-run --scope -p MemoryMax=5G --user env FOOBAR=baz ./target/release/myprog
The only difficulty I had was finding the right name to use in the MemoryMax=… part, because the name has changed between versions (older, cgroup-v1 systems use MemoryLimit= instead), so different Linux systems may not use the same name for the limit.

To figure out whether I had the right name, I tested candidate names with a tiny limit that I knew was less than the program needs even in normal conditions. When I found the right name, the program was killed right off the bat, as expected. Then I could set the limit to 5G (five gigabytes) and be confident that if the program exceeds it, it will be killed instead of making my Raspberry Pi impossible to ssh into again.
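(If systemd isn't available, the same idea can be sketched with the Python stdlib: cap the child's address space with `resource.setrlimit` before exec. A minimal POSIX-only sketch; the 1 GiB limit and the test command are illustrative, not from this thread:

```python
import resource
import subprocess
import sys

LIMIT = 1 * 1024**3  # 1 GiB address-space cap (illustrative)

def run_capped(cmd, limit=LIMIT):
    """Run cmd with RLIMIT_AS applied in the child process only."""
    def set_limit():
        # Runs in the child after fork(), before exec()
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    return subprocess.run(cmd, preexec_fn=set_limit)

# A child that tries to allocate ~2 GiB should fail under the 1 GiB cap,
# while the parent and the rest of the system are unaffected.
result = run_capped([sys.executable, "-c", "x = bytearray(2 * 1024**3)"])
print("exit code:", result.returncode)
```

One caveat: unlike MemoryMax=, RLIMIT_AS counts virtual address space rather than resident memory, so it can trip earlier than expected for programs that mmap heavily.)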

6. mritchie712 ◴[] No.45776900[source]
what did you use instead? if you hit OOM with that dataset in duckdb, I'd think you'd hit OOM with most other tools on the same machine.
replies(1): >>45776924 #
7. wewewedxfgdf ◴[] No.45776924[source]
The software should manage its own memory, not require the developer to set specific memory thresholds. Sure, it's a good thing to be able to say "use no more than X RAM".
8. thenaturalist ◴[] No.45777350[source]
How long ago was this? Can you share more context about the data and memory sizes you experienced this with?

DuckDB introduced spilling to disk and some other tweaks a good year ago now: https://duckdb.org/2024/07/09/memory-management

replies(1): >>45777572 #
9. thenaturalist ◴[] No.45777357[source]
This looks amazing!

Have you used this in conjunction with DuckDB?

replies(1): >>45784528 #
10. wewewedxfgdf ◴[] No.45777572[source]
3 days ago.

The final straw was an index that generated fine on macOS and failed on Linux - exact same code.

Machine had plenty of RAM.

The thing is, it is really the application's responsibility to regulate its behavior based on available memory. Crashing out just should not be an option, but that's the way DuckDB is built.

replies(1): >>45783704 #
11. alex-korr ◴[] No.45783704{3}[source]
I had the same experience - everything runs great on an AWS Linux EC2 instance with 32GB of memory, while the same workload in a Docker container on ECS with 32GB allocated gets an OOM. For smaller workloads, DuckDB is fantastic... however, there's a point at which Spark or Snowflake start to make more sense.
12. tuhgdetzhh ◴[] No.45784528{3}[source]
Yes, it works just fine.