
480 points jedeusus | 2 comments
nopurpose ◴[] No.43540684[source]
Every perf guide recommends minimizing allocations to reduce GC times, but if you look at a pprof of a Go app, the GC mark phase is what takes time, not GC sweep. GC mark always starts from the known live roots (goroutine stacks, globals, etc.) and traverses references from there, colouring every pointer. To minimize GC time it is best to avoid _long-lived_ allocations. Short-lived allocations, those which the GC mark phase will never reach, have an almost negligible effect on GC times.

Allocations of any kind still have an effect by triggering GC earlier, but in real apps it is almost hopeless to avoid GC entirely, except in very carefully written programs with no dependencies. And once GC does happen, reducing GC mark times gives a bigger bang for the buck.

replies(12): >>43540741 #>>43541092 #>>43541624 #>>43542081 #>>43542158 #>>43542596 #>>43543008 #>>43544950 #>>43545084 #>>43545500 #>>43551041 #>>43551691 #
Capricorn2481 ◴[] No.43541092[source]
Aren't allocations themselves pretty expensive regardless of GC?
replies(2): >>43541302 #>>43541882 #
epcoa ◴[] No.43541882[source]
No. If you have a moving, multi-generational GC, allocation is literally just a pointer increment for short-lived objects.
replies(3): >>43542264 #>>43542385 #>>43542676 #
pebal ◴[] No.43542676[source]
If you have a moving, generational GC, then all the benefits of fast allocation are lost due to data moving and costly memory barriers.
replies(3): >>43543435 #>>43544627 #>>43551434 #
gf000 ◴[] No.43544627[source]
Not at all. Most objects die young and thus are never moved. Also, the interval before an object is moved is very long compared to CPU operations, so moves matter only statistically (very good throughput, with a rare, longer tail on latency graphs).

Also, write-only barriers don't have that big of an overhead.

replies(1): >>43547834 #
pebal ◴[] No.43547834[source]
It doesn't matter if objects die young — the other objects on the heap are still moved around periodically, which reduces performance. When you're using a moving GC, you also have additional read barriers that non-moving GCs don't require.
replies(1): >>43554221 #
gf000 ◴[] No.43554221[source]
Is that period really that big of a concern when your threads in any language might be context switched away by the OS? It's not a common occurrence on a CPU-timeline at all.

Also, it's no accident that every high-performance GC runtime went the moving, generational way.

replies(1): >>43554488 #
pebal ◴[] No.43554488[source]
That time may seem negligible, since the OS can context switch threads anyway, but it’s still additional time during which your code isn’t doing its actual work.

Generations are used almost exclusively in moving GCs — precisely to reduce the negative performance impact of data relocation. Non-moving GCs are less invasive, which is why they don’t need generations and can be fully concurrent.

replies(1): >>43554837 #
gf000 ◴[] No.43554837[source]
I would rather say that generations are a further improvement upon a moving collector, improving space usage and decreasing the length of the "mark" phase.

And which GC is fully concurrent? I don't think that's possible (though I will preface that I am no expert, having only read into the topic at a hobby level). I believe the most concurrent GC out there is ZGC, which uses read barriers and some pointer tricks to make the stop-the-world time independent of the heap size.

replies(1): >>43554962 #
pebal ◴[] No.43554962{3}[source]
Java currently has no fully concurrent GC, and due to the volume of garbage it manages and the fact that it moves objects, a truly fully concurrent GC for this language is unlikely to ever exist.

Non-moving GCs, however, can be fully concurrent — as demonstrated by the SGCL project for C++.

In my opinion, the GC for Go is the most likely to become fully concurrent in the future.

replies(1): >>43555538 #
gf000 ◴[] No.43555538{4}[source]
Is SGCL your project?

In that case, are you doing atomic writes for managed pointers and the read flag on them? I have read a few of your comments on reddit, and your flags seem to be per memory page. Still, the synchronization on them may or may not have a more serious performance impact than alternative methods, and without a good way to compare it to something like Java, which is the state of the art in GC research, we can't really comment much on whether it's a net benefit.

Also, have you perhaps tried modeling your design in something like TLA+?

replies(1): >>43556629 #
pebal ◴[] No.43556629[source]
Yes, SGCL is my project.

You can't write concurrent code without atomic operations — you need them to ensure memory consistency, and concurrent GCs for Java also rely on them. However, atomic loads and stores are cheap, especially on x86. What’s expensive are atomic counters and CAS operations — and SGCL uses those only occasionally.

Java’s GCs do use state-of-the-art technology, but it's technology specifically optimized for moving collectors. SGCL is optimized for non-moving GC, and some operations can be implemented in ways that are simply not applicable to Java’s approach.

I’ve never tried modeling SGCL's algorithms in TLA+.