159 points mpweiher | 25 comments
1. t8sr ◴[] No.43671930[source]
When I did my 20% on Go at Google, about 10 years ago, we already had a semi-formal rule that channels must not appear in exported function signatures. It turns out that using CSP in any large, complex codebase is asking for trouble, and that this is true even of projects where members of the core Go team did the CSP.

If you take enough steps back and really think about it, the only synchronization primitive that exists is a futex (and maybe atomics). Everything else is an abstraction of some kind. If you're really determined, you can build anything out of anything. That doesn't mean it's always a good idea.

Looking back, I'd say channels are far superior to condition variables as a synchronized cross-thread communication mechanism - when I use them these days, it's mostly for that. Locks (mutexes) are really performant and easy to understand and generally better for mutual exclusion. (It's in the name!)
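
As a sketch of what that rule tends to look like in practice (hypothetical names, not actual Google code): the exported API deals only in plain values, the channel stays private and only hands data between goroutines, and the mutex does the mutual exclusion.

    // Package stats is a made-up example of keeping channels out of
    // exported signatures: callers see plain values, while the channel
    // and mutex remain internal details.
    package stats

    import "sync"

    // Recorder accumulates samples in a background goroutine.
    type Recorder struct {
        mu      sync.Mutex    // mutual exclusion for sum/count
        sum     float64
        count   int
        samples chan float64  // internal cross-goroutine hand-off
        done    chan struct{} // internal shutdown signal
    }

    func NewRecorder() *Recorder {
        r := &Recorder{
            samples: make(chan float64, 64),
            done:    make(chan struct{}),
        }
        go r.loop()
        return r
    }

    func (r *Recorder) loop() {
        for {
            select {
            case s := <-r.samples:
                r.mu.Lock()
                r.sum += s
                r.count++
                r.mu.Unlock()
            case <-r.done:
                return
            }
        }
    }

    // Record and Mean expose values, never channels.
    func (r *Recorder) Record(s float64) { r.samples <- s }

    func (r *Recorder) Mean() float64 {
        r.mu.Lock()
        defer r.mu.Unlock()
        if r.count == 0 {
            return 0
        }
        return r.sum / float64(r.count)
    }

    func (r *Recorder) Close() { close(r.done) }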

replies(5): >>43672034 #>>43672125 #>>43672192 #>>43672501 #>>43687905 #
2. throwaway150 ◴[] No.43672034[source]
What is "20% on Go"? What is it 20% of?
replies(3): >>43672063 #>>43672064 #>>43672071 #
3. NiloCK ◴[] No.43672063[source]
Google historically allowed employees to self-direct 20% of their working time (toward any Google project, I think).
4. darkr ◴[] No.43672064[source]
At least historically, Google engineers had 20% of their time to spend on projects not related to their core role.
replies(1): >>43672293 #
5. ramon156 ◴[] No.43672071[source]
I assume this means "20% of my work on Go", aka 1 out of 5 work days working on golang.
6. dfawcus ◴[] No.43672125[source]
How large do you deem to be large in this context?

I had success in using a CSP style, with channels in many function signatures in a ~25k line codebase.

It had ~15 major types of process, probably about 30 fixed instances overall in a fixed graph, plus a dynamic sub-graph of around 5 processes per 'requested action'. So those sub-graph elements were the only parts which had to deal with tear-down and clean-up.

There were then additionally some minor types of 'process' (i.e. goroutines) within many of those major types, but they were easier to reason about as they only communicated with that major element.

Multiple requested actions could be present, so there could be multiple sets of those 5 process groups connected, but they had a maximum lifetime of a few minutes.

I only ended up using explicit mutexes in two of the major types of process, where they happened to make the most sense and hence reduced system complexity. There were about 45 instances of the 'go' keyword.

(Updated numbers, as I'd initially misremembered/miscounted the number of major processes)
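
For readers unfamiliar with the style, here's a toy sketch of the fixed-graph approach (stages and names invented, nothing like the actual codebase): the graph is wired up once in main, and each node is a goroutine that only knows its own in/out channels.

    package main

    import "fmt"

    // generate -> square -> sink is the entire "graph", fixed for the
    // lifetime of the program.
    func generate(out chan<- int) {
        for i := 1; i <= 5; i++ {
            out <- i
        }
        close(out)
    }

    func square(in <-chan int, out chan<- int) {
        for v := range in {
            out <- v * v
        }
        close(out)
    }

    func sink(in <-chan int, done chan<- struct{}) {
        for v := range in {
            fmt.Println(v)
        }
        close(done)
    }

    func main() {
        a := make(chan int)
        b := make(chan int)
        done := make(chan struct{})

        // Wire the fixed graph; each node only sees its own channels.
        go generate(a)
        go square(a, b)
        go sink(b, done)
        <-done
    }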

replies(1): >>43674013 #
7. ChrisSD ◴[] No.43672192[source]
I think the two basic synchronisation primitives are atomics and thread parking. Atomics allow you to share data between two or more concurrently running threads, whereas parking allows you to control which threads are running concurrently. Whatever low-level primitives the OS provides (such as futexes) are more an implementation detail.

I would tentatively make the claim that channels (in the abstract) are at heart an interface rather than a type of synchronisation per se. They can be implemented using Mutexes, pure atomics (if each message is a single integer) or any number of different ways.

Of course, any specific implementation of a channel will have trade-offs. Some more so than others.
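
As a rough illustration of that "channel as interface" point, here is a minimal unbounded integer channel built from a mutex and a condition variable (a sketch only, not how the Go runtime actually implements channels):

    package chanlike

    import "sync"

    // IntChan behaves like an unbounded channel of ints, implemented
    // with a mutex and a condition variable instead of runtime support.
    type IntChan struct {
        mu    sync.Mutex
        cond  *sync.Cond
        queue []int
    }

    func NewIntChan() *IntChan {
        c := &IntChan{}
        c.cond = sync.NewCond(&c.mu)
        return c
    }

    // Send never blocks, since the queue is unbounded in this sketch.
    func (c *IntChan) Send(v int) {
        c.mu.Lock()
        c.queue = append(c.queue, v)
        c.mu.Unlock()
        c.cond.Signal()
    }

    // Recv blocks until a value is available.
    func (c *IntChan) Recv() int {
        c.mu.Lock()
        defer c.mu.Unlock()
        for len(c.queue) == 0 {
            c.cond.Wait()
        }
        v := c.queue[0]
        c.queue = c.queue[1:]
        return v
    }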

replies(2): >>43672274 #>>43672344 #
8. im3w1l ◴[] No.43672274[source]
To me, message passing is like its own thing. It's the most natural way of thinking about information flow in a system consisting of physically separated parts.
9. kyrra ◴[] No.43672293{3}[source]
This still exists today. For example, I am on the payments team but I have a 20% project working on protobuf. I had to get formal approval from my management chain and someone on the protobuf team. And it is tracked as part of my performance reviews. They just want to make sure I'm not building something useless that nobody wants and that I'm not just wasting the company's time.
replies(3): >>43672411 #>>43672772 #>>43675474 #
10. LtWorf ◴[] No.43672344[source]
What you think is not very relevant if it doesn't match how CPUs work.
replies(1): >>43673039 #
11. vrosas ◴[] No.43672411{4}[source]
I see why they do this, but man it almost feels like asking your boss for approval on where you go on vacation. Do people get dinged if their 20% time project doesn't pan out, or they lose interest later on?
replies(2): >>43672760 #>>43672788 #
12. i_don_t_know ◴[] No.43672501[source]
> When I did my 20% on Go at Google, about 10 years ago, we already had a semi-formal rule that channels must not appear in exported function signatures.

That sounds reasonable. From what little Erlang/Elixir code I’ve seen, the sending and receiving of messages is also hidden as an implementation detail in modules. The public interface did not expose concurrency or synchronization to callers. You might use them under the hood to implement your functionality, but it’s of no concern to callers, and you’re free to change the implementation without impacting callers.
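
A rough Go analogue of that Erlang-style pattern (hypothetical names, just a sketch): one goroutine owns the state, and the exported methods hide the request/reply messaging, so callers never see a channel.

    package kv

    // getReq is an internal request message; the reply channel never
    // escapes the package.
    type getReq struct {
        key   string
        reply chan string
    }

    // Store hides a single "server" goroutine behind plain methods.
    type Store struct {
        sets chan [2]string
        gets chan getReq
    }

    func NewStore() *Store {
        s := &Store{
            sets: make(chan [2]string),
            gets: make(chan getReq),
        }
        go s.loop()
        return s
    }

    // loop is the only place the map is touched, so no locks are needed.
    func (s *Store) loop() {
        data := map[string]string{}
        for {
            select {
            case kv := <-s.sets:
                data[kv[0]] = kv[1]
            case req := <-s.gets:
                req.reply <- data[req.key]
            }
        }
    }

    // Set and Get expose plain values; all messaging stays internal.
    func (s *Store) Set(k, v string) { s.sets <- [2]string{k, v} }

    func (s *Store) Get(k string) string {
        reply := make(chan string)
        s.gets <- getReq{key: k, reply: reply}
        return <-reply
    }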

replies(1): >>43672778 #
13. NBJack ◴[] No.43672760{5}[source]
Previously it could be anything you wanted. These days, you need formal approval. Google has changed a bit.
14. rollcat ◴[] No.43672772{4}[source]
I never worked at Google (or any other large corp for that matter), but this sounds like the exact opposite of an environment that spawned GMail.

As you think back even to the very early days of computing, you'll find individuals or small teams like Grace Hopper, the Unix gang, PARC, etc that managed to change history by "building something useless". Granted, throughout history that happened less than 1% of the time, but it will never happen if you never try.

Maybe Google no longer has any space for innovation.

replies(1): >>43672878 #
15. throwawaymaths ◴[] No.43672778[source]
AND because they're usually hidden as an implementation detail, a consumer of your module can create simple mocks of your module (or you can provide one).
16. kyrra ◴[] No.43672788{5}[source]
It has nothing to do with success. It's entirely for making sure someone besides the person doing the 20% agrees with the idea behind the project.
replies(1): >>43674076 #
17. jasode ◴[] No.43672878{5}[source]
>I never worked at Google (or any other large corp for that matter), but this sounds like the exact opposite of an environment that spawned GMail.

Friendly fyi... GMail was not a "20% project", as I mentioned previously: https://news.ycombinator.com/item?id=39052748

Somebody (not me but maybe a Google employee) also revised the Wikipedia article a few hours after my comment: https://en.wikipedia.org/w/index.php?title=Side_project_time...

Before LLMs and ChatGPT even existed ... a lot of us somehow hallucinated the idea that GMail came from Google's 20% Rule. E.g. from 2013-08-16 : https://news.ycombinator.com/item?id=6223466

replies(1): >>43673091 #
18. ChrisSD ◴[] No.43673039{3}[source]
huh?
replies(1): >>43674055 #
19. rollcat ◴[] No.43673091{6}[source]
I see, thank you for debunking. But I think my general point still stands. You can progress by addressing a need, but true innovation requires adequate space.
20. hedora ◴[] No.43674013[source]
How many developers did that scale to? Code bases that I’ve seen that are written in that style are completely illegible. Once the structure of the 30 node graph falls out of the last developer’s head, it’s basically game over.

To debug stuff by reading the code, each message ends up having 30 potential destinations.

If a request involves N sequential calls, the control flow can be as bad as 30^N paths. Reading the bodies of the methods that are invoked generally doesn’t tell you which of those paths are wired up.

In some real world code I have seen, a complicated thing wires up the control flow, so recovering the graph from the source code is equivalent to the halting problem.

None of these problems apply to async/await because the compiler can statically figure out what’s being invoked, and IDEs are generally as good at figuring that out as the compiler.

replies(1): >>43674250 #
21. hedora ◴[] No.43674055{4}[source]
I think they mean that message channels are an expensive and performance unstable abstraction.

You could address the concern by choosing a CPU architecture that included infinite-capacity FIFOs that connected its cores into arbitrary runtime directed graphs.

Of course, that architecture doesn’t exist. If it did, dispatching an instruction would have infinite tail latency and unbounded power consumption.

22. hedora ◴[] No.43674076{6}[source]
Lol. They’d be better off giving people the option to work 4 days if they also signed over right of first refusal for hobby projects.
23. dfawcus ◴[] No.43674250{3}[source]
That was two main developers: one doing most of the code and design, the other handling a largely closed subset of 3 or 4 nodes. Plus three other developers co-opted for implementing some of the nodes. [1]

The problem space itself could have probably grown to twice the number of lines of code, but there wouldn't have needed to be any more developers. Possibly only the original two. The others were only added for meeting deadlines.

As to the graph, it was fixed, but not a full mesh: a set of pipelines, with no power-of-N issue, as the collection of places things could talk to was fixed.

A simple diagram represented the major message flow between those 30 nodes.

Each node could be tested in isolation, so unit tests of each node covered most of the behaviour. The bugs were three deadlocks, one between two major nodes, one within a single major node.

The logging around the trigger for the deadlock allowed the cause to be determined and fixed. The bugs arose due to time constraints having prevented an analysis of the message flows to detect the loops/locks.

So for most messages there were a limited number of destinations, mostly two, for some five.

For a given "request", the flow of messages to the end of the fixed graph would pass through 3 major nodes. That then spawned the creation of the dynamic graph, which had two major flows: one a control flow through another 3 nodes, the other a data flow through a different 3.

Within that dynamic graph there was a richer flow of messages, but the external flow from it simply had the two major paths.

Yes, reading the bodies of the methods does not reveal the flows. One either had to read the "main" routine which built the graph, or, better, refer to the graph diagram and message flows in the design document.

Essentially it is a similar problem to dealing with "microservices", or pluggable callbacks, where the structure cannot easily be determined from the code alone. This is where design documentation is necessary.

However, I found it easier to comprehend and to work with / debug, due to each node being a probeable "black box", plus having the graph of connections and message flows.

[1] Of those, only the first had any experience with CSP or Go. The CSP experience was with a library for C, the Go experience some minimal use a year earlier. The other developers were all new to CSP and Go. The first two developers were "senior" / "experienced".

24. codr7 ◴[] No.43675474{4}[source]
Which misses the point of 20% time, IMO: exploring space that would likely be missed in business as usual, and encouraging creativity.
25. catern ◴[] No.43687905[source]
>If you take enough steps back and really think about it, the only synchronization primitive that exists is a futex (and maybe atomics). Everything else is an abstraction of some kind.

You're going to be surprised when you learn that futexes are an abstraction too, ultimately relying on this thing called "cache coherence".

And you'll be really surprised when you learn how cache coherence is implemented.