
160 points simplesort | 5 comments
1. __turbobrew__ No.43627708
Maybe I'm missing something, but can't you run workloads in separate network namespaces and then attach a BPF probe to the veth interface in each namespace? At that point you know all flows on that veth belong to a specific workload, as long as you keep track of what is running in which network namespace.
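
Roughly, the userspace bookkeeping for that could look like the sketch below (Go, untested). It assumes the BPF probe on the veth reports the veth ifindex with each flow event; the FlowEvent type, the ifindex values, and the workload IDs are made up for illustration, and the BPF program and container-runtime hooks are omitted.

    // Userspace side of the idea: map veth ifindex -> workload ID so a
    // flow event reported by the BPF probe can be attributed.
    package main

    import (
        "fmt"
        "sync"
    )

    // FlowEvent is what we assume the BPF probe emits per flow (hypothetical).
    type FlowEvent struct {
        VethIfindex int
        SrcAddr     string
        DstAddr     string
    }

    // Attributor keeps the netns/veth -> workload bookkeeping.
    type Attributor struct {
        mu      sync.RWMutex
        byIfidx map[int]string // veth ifindex -> workload ID
    }

    func NewAttributor() *Attributor {
        return &Attributor{byIfidx: make(map[int]string)}
    }

    // Track records the veth created for a workload's network namespace.
    func (a *Attributor) Track(ifindex int, workloadID string) {
        a.mu.Lock()
        defer a.mu.Unlock()
        a.byIfidx[ifindex] = workloadID
    }

    // Forget drops the mapping when the workload and its netns are torn down.
    func (a *Attributor) Forget(ifindex int) {
        a.mu.Lock()
        defer a.mu.Unlock()
        delete(a.byIfidx, ifindex)
    }

    // Attribute resolves a flow event to a workload, if the veth is known.
    func (a *Attributor) Attribute(ev FlowEvent) (string, bool) {
        a.mu.RLock()
        defer a.mu.RUnlock()
        id, ok := a.byIfidx[ev.VethIfindex]
        return id, ok
    }

    func main() {
        a := NewAttributor()
        a.Track(42, "workload-abc") // hypothetical ifindex and workload ID
        ev := FlowEvent{VethIfindex: 42, SrcAddr: "10.0.0.5", DstAddr: "10.0.0.9"}
        if id, ok := a.Attribute(ev); ok {
            fmt.Println("flow belongs to", id)
        }
    }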

I also wonder whether, with IPv6, it is possible to never reuse addresses (or to roll through the address space so that reuse is temporally distant), which would remove the problems with staleness and false attribution.

replies(1): >>43627904 #
2. VaiTheJice No.43627904
I think that's pretty reasonable tbf, at a "simpler" scale, and I use simple loosely: Netflix's container runtime is Titus, which is more bare-metal oriented than, say, Kubernetes. It doesn't always isolate workloads as cleanly into a separate netns per container, especially for network optimisation purposes like IPv6-to-IPv4 sharing.

"I wonder if it is possible with ipv6 to never... re use addresses which removes the problems with staleness and false attribution."

Most VPCs (AWS included) don't currently support "true" IPv6 scale-out behaviour. But if IPs were truly immutable and unique per workload, attribution becomes trivial. It's just not yet realistic... maybe something to explore with the lads?

replies(1): >>43628269 #
3. __turbobrew__ No.43628269
Makes sense. I have worked in and around CNI stuff for k8s, and netns+veth is generally how most of them work. That said, we run k8s on bare metal, and there isn't any reason why running on bare metal excludes netns usage.

> Most VPCs (also AWS) don’t currently support "true" IPv6 scaleout behavior.

Thats a shame.

> if IPs were truly immutable and unique per workload, attribution becomes trivial

I would like to see that. IPAM for multi-tenant workloads has always felt like a kludge. You need the network to understand how to route to a workload, but on IPv4 the network has many more workloads than addresses. If you assign immutable addresses per workload (or, say, it takes you a month to chew through your IPv6 address space), the network natively knows how to route to workloads without the need to kludge around with IP reassignments.

I have had to deal with IP address pools being exhausted due to high pod churn in EC2 a number of times and it is always a pain.

replies(1): >>43628429 #
4. VaiTheJice No.43628429{3}
Ahh! Nothing like watching pods fail to schedule because you ran out of assignable IPs in a subnet you thought was generous.

Immutable addressing per workload with IPv6 feels like such a clean mental model, especially for attribution, routing, and observability.

Curious if you have seen anyone pull that off cleanly in production, i.e. truly immutable addressing at scale. Has it been battle-tested somewhere, or is it still mostly an ideal?

replies(1): >>43628742 #
5. __turbobrew__ No.43628742{4}
Unfortunately my place is still stuck on ipv4.

Hypothetically it is not hard: you split your IPv6 prefix per datacenter, then use etcd to coordinate access to the pool and hand out immutable addresses. You just start from the lowest address and work towards the highest; when you hit the highest address you wrap back to the lowest. As long as your churn is not too high and your pool is big enough, reuse of a given address is spaced far enough apart in time that it doesn't cause any problems with false attribution.
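
A rough sketch of that wrap-around allocation over a per-DC prefix (Go, untested; the /64 here is a made-up documentation prefix, and the etcd coordination is left out):

    // Wrap-around allocator over a datacenter's IPv6 prefix. Hands out
    // addresses monotonically and wraps back to the start when exhausted.
    package main

    import (
        "fmt"
        "net/netip"
    )

    type Allocator struct {
        prefix netip.Prefix // the datacenter's prefix, e.g. a /64
        next   netip.Addr   // next address to hand out
    }

    func NewAllocator(prefix netip.Prefix) *Allocator {
        return &Allocator{prefix: prefix, next: prefix.Addr()}
    }

    // Next returns the next address in the prefix, wrapping back to the
    // lowest address once the highest has been handed out. With a big
    // enough pool and sane churn, reuse is temporally distant.
    func (a *Allocator) Next() netip.Addr {
        addr := a.next
        n := a.next.Next()
        if !a.prefix.Contains(n) {
            n = a.prefix.Addr() // wrap around to the start of the prefix
        }
        a.next = n
        return addr
    }

    func main() {
        p := netip.MustParsePrefix("2001:db8:1:2::/64") // example prefix
        alloc := NewAllocator(p)
        for i := 0; i < 3; i++ {
            fmt.Println(alloc.Next())
        }
    }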

In etcd you just store KV pairs of ipv6 -> workload ID. If you really want to be fancy, you can watch those keys with an etcd client and get live updates as new addresses are assigned to workloads, then plug those updates into whatever system needs to map ipv6 to workload, such as network flow tools.
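
For the etcd side, something like this (Go, untested; the endpoints and key layout are made-up examples, using the go.etcd.io/etcd/client/v3 client):

    // Store ipv6 -> workload ID assignments under a key prefix and watch
    // that prefix so consumers (e.g. flow-attribution tooling) can keep a
    // live ipv6 -> workload map.
    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    const keyPrefix = "ipam/v6/" // hypothetical layout: ipam/v6/<ipv6> -> workload ID

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"etcd.dc1.internal:2379"}, // example DC-local quorum
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx := context.Background()

        // Record an assignment when the allocator hands out an address.
        if _, err := cli.Put(ctx, keyPrefix+"2001:db8:1:2::2a", "workload-abc"); err != nil {
            panic(err)
        }

        // Consumers watch the prefix to keep their ipv6 -> workload map live.
        for resp := range cli.Watch(ctx, keyPrefix, clientv3.WithPrefix()) {
            for _, ev := range resp.Events {
                switch ev.Type {
                case clientv3.EventTypePut:
                    fmt.Printf("assigned %s -> %s\n", ev.Kv.Key, ev.Kv.Value)
                case clientv3.EventTypeDelete:
                    fmt.Printf("released %s\n", ev.Kv.Key)
                }
            }
        }
    }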

Unless you are doing something insane, you should easily be able to keep up with immutable address requests with a DC local etcd quorum.