Trippy – A Network Diagnostic Tool

1. commandersaki ◴[10 Dec 23 10:28 UTC] No.38590611[source]▶

Copying the per-hop loss indicator from mtr is a bad decision in my opinion. It's always been a source of incorrect diagnosis of network issues. The only loss that matters is end to end.

replies(4): >>38590932 #>>38591072 #>>38591120 #>>38591527 #

2. FujiApple ◴[10 Dec 23 11:51 UTC] No.38590932[source]▶

>>38590611 (TP) #

You are right that showing packet loss for intermediate hops is a frequent source of confusion.

Rather than leave it out, I added a status column which shows different statuses for intermediate hops (blue if the hop responds to less than 100% of probes and brown if it responds to 0%) vs the target hop (which shows amber and red respectively).

Where this breaks down is when dealing with ECMP for UDP & TCP tracing, as a given hop (ttl) may represent the target for a given round of tracing but not for the next. The mistake, imho, is to associate _any_ data with a hop (ttl) rather than the hop in the context of a tracing flow.

That is why Trippy has a number of features aimed at helping with ECMP, such as Paris and Dublin tracing, and the ability to filter tracing by unique flow id. I've covered these quite a bit in the 0.8.0 [0] and 0.9.0 [1] release notes if you want to know more.
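
As a rough illustration of the point about associating data with a hop in the context of a flow rather than with a bare ttl, here is a minimal Python sketch. It is not Trippy's implementation; `HopStats`, `aggregate` and the `(flow_id, ttl, got_reply)` tuple layout are hypothetical names chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HopStats:
    sent: int = 0
    received: int = 0

    @property
    def loss_pct(self) -> float:
        if self.sent == 0:
            return 0.0
        return 100.0 * (self.sent - self.received) / self.sent


def aggregate(probes):
    """Group probe results by (flow_id, ttl) instead of by ttl alone.

    probes: iterable of (flow_id, ttl, got_reply) tuples (assumed layout).
    """
    stats = defaultdict(HopStats)
    for flow_id, ttl, got_reply in probes:
        entry = stats[(flow_id, ttl)]
        entry.sent += 1
        entry.received += int(got_reply)
    return dict(stats)


probes = [
    (1, 3, True), (1, 3, True),    # flow 1: the router at ttl 3 always answers
    (2, 3, False), (2, 3, True),   # flow 2: a different router at ttl 3 answers half the time
]
per_flow = aggregate(probes)
```

Keyed by ttl alone, the two routers above would be merged into a single "hop 3" with 25% loss, which describes neither of them.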

[0] https://github.com/fujiapple852/trippy/releases/tag/0.8.0

[1] https://github.com/fujiapple852/trippy/releases/tag/0.9.0

replies(1): >>38611645 #

3. ◴[10 Dec 23 12:26 UTC] No.38591072[source]▶

>>38590611 (TP) #

4. lathiat ◴[10 Dec 23 12:37 UTC] No.38591120[source]▶

>>38590611 (TP) #

That is not entirely true. Sure, it's not a 100% reliable signal, as routing can be asymmetric, but often it isn't, and it at least gives you an idea of which point to ask about first.

If the packet loss starts at your wifi router, or your ISP's router, or the next hop after your ISP, that all gives you a bit of an idea where the problem likely is. I solve problems like that all the time.

replies(1): >>38591827 #

5. mrngm ◴[10 Dec 23 13:56 UTC] No.38591527[source]▶

>>38590611 (TP) #

At NANOG 62, there was a presentation "A practical guide to (correctly) troubleshooting with traceroute", by Richard Steenbergen (slides [pdf]: https://archive.nanog.org/sites/default/files/tuesday_steenb..., talk [video]: https://www.youtube.com/watch?v=WL0ZTcfSvB4).

It also mentions that isolated hops that show increased latency or loss are most likely throttling on the device. However, if that latency or loss persists on further hops, that indicates a problem.

Another issue with traceroute is that it usually doesn't account for asymmetry in the return path. What I would find interesting to see is something like isoping/splitping (see this blog post https://blog.benjojo.co.uk/post/ping-with-loss-latency-split), to account for that asymmetry.

Regarding the tool trippy itself: awesome visualizations!

replies(2): >>38591836 #>>38593808 #

6. commandersaki ◴[10 Dec 23 14:46 UTC] No.38591827[source]▶

>>38591120 #

I was a network engineer for over a decade in hosting datacenter environments and would get false reports of packet loss to various destinations because people would use MTR and say they see loss on the path. If a packet takes the path A -> B -> C and pings to B have 50% loss but pings to C have 0% loss, then the path is perfectly fine.

The only way to reliably isolate packet loss to a hop on the path is to have a test destination that packets reach through that hop, that is in its bailiwick, and that doesn't perform rate limiting or policing of ICMP traffic.

replies(1): >>38591945 #

7. commandersaki ◴[10 Dec 23 14:48 UTC] No.38591836[source]▶

>>38591527 #

Yep it's a great presentation.

I like to always say that traceroute gives you the approximate path a packet will take, whereas ping is for end to end measurement and loss. I'm personally not a big fan of combining the two tools.

8. FujiApple ◴[10 Dec 23 15:04 UTC] No.38591945{3}[source]▶

>>38591827 #

Something I intend to add to Trippy, but have not got around to yet, is to codify the "If a packet takes the path A -> B -> C and pings to B have 50% loss but pings to C have 0% loss, then the path is perfectly fine" idea and use that to produce more meaningful headline status information to the user. How would you codify this?

replies(2): >>38592208 #>>38592597 #

9. commandersaki ◴[10 Dec 23 15:35 UTC] No.38592208{4}[source]▶

>>38591945 #

I would love for there to be a useful indicator to the user to say if loss or latency is an issue.

Being able to indicate cascading loss (e.g. path A->B->C->D shows loss at B, C, and D) is worth bubbling up to the user to say there might be real issues. Any indication of loss at D is also an issue. Reconciling these scenarios in the UI matters, but I don't think there's an easy way. What I think is more important than the UI, and sorely needed, is documentation / a user's guide explaining how to read and understand these indicators. I know documentation is usually overlooked by users first trying out a program, but having it documented and explained gives you a reference to point to when a user is misunderstanding the tool. I found that MTR lacked this much-needed documentation / reference, so people would easily misunderstand the tool, and it was a herculean effort to correct them.
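
One possible way to turn the two cases above (loss at the destination, and loss that cascades from an earlier hop through to the destination) into a single headline is sketched below in Python. This is an assumption about how it could be codified, not something Trippy or MTR does; `headline` and the `(name, loss_pct)` input layout are hypothetical.

```python
def headline(loss_by_hop):
    """Summarize a path as one status line.

    loss_by_hop: list of (name, loss_pct) in path order, destination last.
    """
    dest_name, dest_loss = loss_by_hop[-1]

    if dest_loss == 0:
        # Loss seen only at intermediate hops does not affect delivery.
        noisy = [name for name, loss in loss_by_hop[:-1] if loss > 0]
        if noisy:
            return ("OK: no end-to-end loss (replies from "
                    + ", ".join(noisy) + " are probably rate limited)")
        return "OK: no loss"

    # Walk backwards from the destination while every hop still shows loss,
    # to find where the unbroken run of loss begins.
    start = len(loss_by_hop) - 1
    while start > 0 and loss_by_hop[start - 1][1] > 0:
        start -= 1
    return f"LOSS: {dest_loss:g}% at {dest_name}, cascading from {loss_by_hop[start][0]}"


print(headline([("A", 0), ("B", 50), ("C", 0)]))
print(headline([("A", 0), ("B", 10), ("C", 12), ("D", 9)]))
```

The first call reports the path as fine despite 50% loss at B; the second reports loss at D and names B as the first hop of the unbroken run. A real tool would want a tolerance rather than a strict comparison with zero.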

I would also like to point out that a 0% loss indicator at the destination isn't reliable either if the packets are spaced out with enough slack. One of my go-tos when testing packet loss on a link I've brought up is to smash a destination host with a ping flood, e.g. ping -c 100 -f 1.1.1.1. Inundating the link helps provide a clear indicator of whether there is loss somewhere on the path (usually the first mile or the last). Cloudflare speedtest now has a packet loss tester that floods 1000 packets, although I'm not sure if it does it over an unreliable transport or not.

replies(1): >>38592523 #

10. FujiApple ◴[10 Dec 23 16:13 UTC] No.38592523{5}[source]▶

>>38592208 #

I agree regarding documentation. There was a request [0] for something similar, though not specifically covering this important point.

Regarding sending a ping flood, Trippy allows you to reduce the minimum and maximum round time (and grace period) to send packets almost as fast as you like. For example, to send at 50ms intervals (with a 10ms grace period):

> trip example.com -i 50ms -T 50ms -g 10ms

[0] https://github.com/fujiapple852/trippy/issues/853

11. linsomniac ◴[10 Dec 23 16:21 UTC] No.38592597{4}[source]▶

>>38591945 #

It's probably tricky but if there's loss at D, maybe only then materialize the display of the loss backwards until there is no loss: C? B? A? It gets tricky though where maybe there is a small loss at D, but say that C and B have chronic loss because of throttling in the slow path responses.

If D has 1% loss and B and C have 50%, is it fair to say A=0, B=1%, C=1%, D=1%?
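
One reading of the numbers above is that each hop's displayed loss is capped by the lowest loss seen at that hop or anywhere further along the path, since packets genuinely dropped at a hop cannot arrive at later hops either. A minimal Python sketch of that interpretation follows; `attribute_loss` is a hypothetical name and this is only one way to formalize the question.

```python
def attribute_loss(observed):
    """Cap each hop's loss at the minimum loss seen at it or any later hop.

    observed: per-hop loss percentages in path order, destination last.
    """
    attributed = []
    floor = float("inf")
    # Scan from the destination backwards, carrying the running minimum.
    for loss in reversed(observed):
        floor = min(floor, loss)
        attributed.append(floor)
    attributed.reverse()
    return attributed


# A=0%, B=50%, C=50%, D=1% as in the example above
print(attribute_loss([0, 50, 50, 1]))  # → [0, 1, 1, 1]
```

This reproduces A=0, B=1%, C=1%, D=1%, but it only bounds the loss at each hop. It cannot tell whether the 1% actually originates at B, C or D.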

MTR's display of loss is indeed confusing, but when weird things are going on it can be helpful to just stare at it a while to see what's going on. Trippy looks fantastic, and I need to play with it, but there are cases where I just want to stare at the path loss for a while.

There's no way to influence the TTL on TTL timed-out responses, is there? That'd be pretty cool if there were some way to get the return path of the intermediaries to reply.

replies(1): >>38596571 #

12. tptacek ◴[10 Dec 23 18:52 UTC] No.38593808[source]▶

>>38591527 #

This is a cool talk, thanks for posting it. One cute thing tools like `trippy` could do (from the talk) would be the reverse lookup on the peer /30 address for all the intermediate hops. I don't know if `trippy` does this (I've installed and played with it but not carefully), but you could also color-code latency spikes that persist into future hops.
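
A minimal Python sketch of the "latency spikes that persist into future hops" check might look like the following. The function name and the 20 ms default are arbitrary assumptions for illustration, and nothing here reflects how `trippy` works internally.

```python
def persistent_latency_spikes(rtts_ms, jump_ms=20.0):
    """Return indices of hops where latency jumps and stays elevated to the end.

    rtts_ms: per-hop round-trip times in path order, destination last.
    jump_ms: minimum increase over the previous hop that counts as a spike
             (arbitrary default, tune for the path).
    """
    flagged = []
    for i in range(1, len(rtts_ms)):
        before = rtts_ms[i - 1]
        if rtts_ms[i] - before < jump_ms:
            continue  # no spike at this hop
        # A spike persists only if every later hop is also elevated
        # relative to the hop just before the spike.
        if min(rtts_ms[i:]) - before >= jump_ms:
            flagged.append(i)
    return flagged


print(persistent_latency_spikes([1, 2, 80, 3, 4]))     # → []
print(persistent_latency_spikes([1, 2, 60, 62, 65]))   # → [2]
```

In the first path the 80 ms reading at the third hop disappears on later hops, so it is not flagged. In the second the increase carries through to the destination, so the hop where it starts is flagged.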

13. FujiApple ◴[11 Dec 23 01:02 UTC] No.38596571{5}[source]▶

>>38592597 #

Thanks for that, I'll give this some thought and write a proposal in this [0] placeholder issue.

[0] https://github.com/fujiapple852/trippy/issues/860

14. geraldhh ◴[12 Dec 23 13:08 UTC] No.38611645[source]▶

>>38590932 #

> Rather than leave it out, I added a status column

but why?