    462 points jakevoytko | 37 comments
    1. BobbyTables2 ◴[] No.43490119[source]
    Interesting writeup, but 2 days to debug “the hardest bug ever”, while accurate, seems a bit overdone.

    Though abs() returning negative numbers is hilarious.. “You had one job…”

    To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

    I’m not just talking about concurrency issues either…

    The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

    2 days is cute though.

    replies(13): >>43490149 #>>43490287 #>>43490459 #>>43490557 #>>43491079 #>>43491823 #>>43492539 #>>43492555 #>>43492647 #>>43493115 #>>43493245 #>>43493811 #>>43497018 #
    2. jakevoytko ◴[] No.43490149[source]
    Author here! I debugged a fair number of those when I was a systems engineer in soft real time robotics systems, but none of them felt as bad in retrospect because you're just reading up on the system and mulling over it and eventually you get the answer in a shower thought. Maybe I just find the puzzle of them fun, I don't know why they don't feel quite so bad. This was just an exhausting 2-day brute-force grind where it turned out the damn compiler was broken.
    replies(1): >>43490386 #
    3. userbinator ◴[] No.43490287[source]
    > The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

    The hardest one I've debugged took a few months to reproduce, and would only show up on hardware that only one person on the team had.

    One of the interesting things about working on a very mature product is that bugs tend to be very rare, but those rare ones which do appear are also extremely difficult to debug. The 2-hour, 2-day, and 2-week bugs have long been debugged out already.

    replies(3): >>43490473 #>>43491034 #>>43491397 #
    4. gertlex ◴[] No.43490386[source]
    I also came to the comments to weigh in on my perception of how rough this was, but instead will ask:

    Regarding "exhausting 2-day brute-force grind": is/was this just how you like to get things done, or was there external pressure of the "don't work on anything else" sort? I've never worked at a large company, and lots of descriptions of the way things get done are pretty foreign to me :). I am also used to being able to say "this isn't getting figured out today; probably going to be best if I work on something else for a bit, and sleep on it, too".

    replies(1): >>43490425 #
    5. jakevoytko ◴[] No.43490425{3}[source]
    The fatal error volume was so overwhelming that we didn't have any option but understanding the problem in perfect detail so that we could fix it if the problem was on our side, or avoid it if it was caused by something like our compiler or the browser.

    Our team also had a very grindy culture, so "I'm going to put in extra hours focusing exclusively on our top crash" was a pretty normalized behavior. After I left that team (and Google), most of my future teams have been more forgiving on pace for non-outages.

    replies(2): >>43495177 #>>43503603 #
    6. Terr_ ◴[] No.43490459[source]
    This repro was a few times per day, but try fixing a Linux kernel panic when you don't even have C/C++ on your resume, and everyone who originally set stuff up has left...

    https://news.ycombinator.com/item?id=37859771

    Point being that the difficulty of a fix can come from many possible places.

    7. gmueckl ◴[] No.43490473[source]
    That reminded me of a former colleague at the desk next to me randomly exclaiming one day that he had just fixed a bug he had created 20 years ago.

    The bug was actually quite funny in a way: it was in the code displaying the internal temperature of the electronics box of some industrial equipment. The string conversion was treating the temperature variable as an unsigned int when it was in fact signed. It took a brave field technician in Finland in winter, inspecting a unit in an unheated space to even discover this particular bug because the units' internal temperatures were usually about 20C above ambient.

    replies(2): >>43490560 #>>43493833 #
    8. efortis ◴[] No.43490557[source]
    Same here, we had an IE8 bug that prevented the initial voice over of the screen reader (JAWS). No dev could reproduce it because we all had DevTools open.
    replies(2): >>43493509 #>>43494922 #
    9. treyd ◴[] No.43490560{3}[source]
    This is a surprisingly common mistake with temperature readings. Especially when the system has a thermal safety power off that triggers if it's above some temperature, but then interprets -1 deg C as actually 255 deg C.
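A minimal sketch of the signed/unsigned mix-up described above, assuming a one-byte sensor reading. The function names and the 100 deg C shutdown threshold are hypothetical and chosen only for illustration:

```python
import struct

SHUTDOWN_THRESHOLD_C = 100  # hypothetical safety limit, not from the thread


def read_temp_unsigned(raw: bytes) -> int:
    """Buggy decode: treats the sensor byte as unsigned (0..255)."""
    return struct.unpack("B", raw)[0]


def read_temp_signed(raw: bytes) -> int:
    """Intended decode: treats the byte as signed two's complement (-128..127)."""
    return struct.unpack("b", raw)[0]


def should_power_off(temp_c: int) -> bool:
    """Thermal safety check: power off above the threshold."""
    return temp_c > SHUTDOWN_THRESHOLD_C


# A sensor reporting -1 deg C sends the single byte 0xFF.
raw = struct.pack("b", -1)

buggy = read_temp_unsigned(raw)   # same bits, read as 255
fixed = read_temp_signed(raw)     # same bits, read as -1
```

With the unsigned decode, a reading just below freezing looks like a severe overheat and trips the safety power-off; with the signed decode it does not.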
    replies(2): >>43490862 #>>43491874 #
    10. selimthegrim ◴[] No.43490862{4}[source]
    I have seen this on Walgreens signs in suburban New Orleans, oddly enough.
    11. devsda ◴[] No.43491034[source]
    During the time I was working in maintenance on a mature hardware product, the number of customer bugs we had to close because they were not reproducible, or only appeared briefly in a specific setup, was really embarrassing, and we felt like a bunch of noobs.
    12. jffhn ◴[] No.43491079[source]
    >Though abs() returning negative numbers is hilarious.

    Math.abs(Integer.MIN_VALUE) in Java very seriously returns -2147483648, as there is no int for 2147483648.
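The arithmetic behind this can be sketched in Python by emulating 32-bit two's complement wrapping with `ctypes`. This is an illustration of the overflow, not Java's actual implementation; `abs_int32` and `checked_abs_int32` are hypothetical helper names:

```python
import ctypes

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def abs_int32(x: int) -> int:
    """abs() with wrapping (unchecked) negation on a 32-bit int."""
    x = ctypes.c_int32(x).value            # truncate to 32 bits
    if x < 0:
        # Negating INT32_MIN gives 2**31, which does not fit and wraps
        # back around to INT32_MIN.
        return ctypes.c_int32(-x).value
    return x


def checked_abs_int32(x: int) -> int:
    """abs() that refuses the one unrepresentable input instead of wrapping."""
    if x == INT32_MIN:
        raise OverflowError("abs(INT32_MIN) is not representable in 32 bits")
    return abs(x)
```

The range of a 32-bit signed int is asymmetric (one more negative value than positive), so exactly one input has no valid absolute value. A language either wraps on that input or raises an error.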

    replies(5): >>43491454 #>>43491920 #>>43494203 #>>43497622 #>>43499425 #
    13. dharmab ◴[] No.43491397[source]
    Bryan Cantrill did a talk about this phenomenon called "Zebras All the Way Down" some years back.
    replies(1): >>43491760 #
    14. eterm ◴[] No.43491454[source]
    You inspired me to check what .NET does in that situation.

    It throws an OverflowException: ("Negating the minimum value of a twos complement number is invalid.")

    15. latexr ◴[] No.43491760{3}[source]
    https://www.youtube.com/watch?v=fE2KDzZaxvE
    16. lukan ◴[] No.43491823[source]
    "To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added."

    My favourites are bugs that not only don't appear in the debugger, but also stop reproducing on normal settings after I've taken a closer look in the debugger (only to come back later at a random time). Feels like chasing ghosts.

    replies(1): >>43492875 #
    17. shakna ◴[] No.43491874{4}[source]
    The rollout is still happening, but the new resident water meters for Victoria, Australia come with a temperature fix.

    Prior to this year, they could only handle 0-127 degrees for the water temperature. Which used to be sensible, but there were some issues with pressurised water starting to be delivered to houses resulting in negative temperatures being reported, like -125C, which immediately has the water switch off to prevent icing problems.

    The software side also switched from COBOL to Ada. So that's kewl.

    18. adrian_b ◴[] No.43491920[source]
    Unchecked integer overflow strikes again.
    19. rowanG077 ◴[] No.43492539[source]
    I don't think measuring how long something took to debug in number of days is at all interesting. Trivial bugs can take weeks to debug for a noob. Insanely hard bugs take hours to debug for genius devs, maybe even without any reproducer, just by thinking about it.
    20. dismalpedigree ◴[] No.43492555[source]
    I always refer to them as “quantum bugs” because the act of observing the bug changes the bug. Absolutely infuriating. I like “heisenbug” better. Has a better ring to it.
    21. steveBK123 ◴[] No.43492647[source]
    Yes! I've dealt with complex issues that turned out to be a vendor-swapped-hardware whoopsie, which we spent over a month trying to solve in software before finally figuring it out.

    Part of it was difficulty of pinpointing the actual issue - fullness of drive vs throughput of writes.

    A lot of it was unfortunately organizational politics such that the system spanned two teams with different reporting lines that didn't cooperate well / had poor testing practices.

    replies(1): >>43492689 #
    22. voidifremoved ◴[] No.43492689[source]
    > A lot of it was unfortunately organizational politics

    The hardest bugs in my experience are those where your only source of vital information is a third party who is straight-up lying to you.

    replies(1): >>43501626 #
    23. btschaegg ◴[] No.43492875[source]
    Terminology proposal: "Gremlins" :)
    24. sesm ◴[] No.43493245[source]
    For stuff like this we used in-memory ring buffer logger that printed the logs on request. And it didn't save the strings, just necessary data bits and a pointer to formatting function. Writing to this logger didn't affect any timings.
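A rough sketch of that idea, assuming a fixed-capacity buffer that stores only a formatter reference plus raw arguments and defers all string formatting until a dump is requested. `RingLogger` and `fmt_temp` are hypothetical names. Python is used here only to show the structure, so the low-overhead claim applies to the commenter's system, not to this sketch:

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# One record = (formatting function, raw argument tuple). No strings are
# built at log time.
Record = Tuple[Callable[..., str], tuple]


class RingLogger:
    def __init__(self, capacity: int = 1024) -> None:
        # deque(maxlen=...) silently drops the oldest record when full.
        self._records: Deque[Record] = deque(maxlen=capacity)

    def log(self, formatter: Callable[..., str], *args) -> None:
        """Hot path: store a reference and the raw data, nothing else."""
        self._records.append((formatter, args))

    def dump(self) -> List[str]:
        """Cold path: format everything only when someone asks for it."""
        return [formatter(*args) for formatter, args in self._records]


def fmt_temp(sensor_id: int, millidegrees: int) -> str:
    return f"sensor {sensor_id}: {millidegrees / 1000:.3f} C"


logger = RingLogger(capacity=2)
logger.log(fmt_temp, 1, 20500)
logger.log(fmt_temp, 2, 21000)
logger.log(fmt_temp, 3, -1250)   # evicts the oldest record (sensor 1)
lines = logger.dump()
```

Keeping the buffer bounded and the hot path free of formatting and I/O is what makes this kind of logger less likely to disturb a timing-sensitive bug.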
    25. smrq ◴[] No.43493509[source]
    I can't remember the actual bug now, but one of my early career memories was hunting down an IE7 issue by using bookmarklets to alert() values. (Did IE7 even have dev tools?)
    replies(1): >>43494621 #
    26. mystified5016 ◴[] No.43493811[source]
    In hardware, you regularly see behavior change when you probe the system. Your oscilloscope or LA probes affect the system just enough to make a marginal circuit work. It's absolutely maddening.
    replies(1): >>43496221 #
    27. edarchis ◴[] No.43493833{3}[source]
    My brother is a wifi expert at a hw manufacturer. He once had a case where the customer had issues setting the transmit power to like 100 times the legal limit. They happened to be an offshore drilling platform and had an exemption for the transmission power as their antenna was basically on a buoy on the ocean. He had to convince the developer to fix that very specific bug.
    28. rhaps0dy ◴[] No.43494203[source]
    Oh no, Pytorch does the same thing:

      import torch
      a = torch.tensor(-2**31, dtype=torch.int32)  # smallest int32 value
      assert a == a.abs()  # abs() wraps back to the same negative value

    replies(1): >>43497253 #
    29. camtarn ◴[] No.43494621{3}[source]
    There was a downloadable developer toolbar for IE6 and IE7, and scripts could be debugged in the external Windows Script Debugger. The developer toolbar even told you which elements had the famous hasLayout attribute applied, which completely changed how it was rendered and interacted with other objects, which was invaluable.
    30. gsck ◴[] No.43494922[source]
    I had a similar issue, worked fine when I was testing it on my machine, but I had dev tools open to see any potential issues.

    Turns out IE8 doesn't define console until the devtools are open. That caused me to pull a few hairs out.

    31. gertlex ◴[] No.43495177{4}[source]
    That makes sense. Thanks for the extra info!
    32. fuzzfactor ◴[] No.43496221[source]
    The closer you get to natural science, eventually reliance on logical troubleshooting can be "illogical".

    The more abundant the undefined (mis)behavior, the more you're going to be tearing your hair out.

    Almost the kind of frustration where you're supposed to have a logic-based system, and it rears its ugly head and defies logic anyway :\

    33. Adverblessly ◴[] No.43497018[source]
    > To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

    A favourite of mine was a bug (specifically, a stack corruption) that I only managed to see under instrumentation. After a lot of debugging turns out that the bug was in the instrumentation software itself, which generated invalid assembly under certain conditions (calling one of its own functions with 5 parameters even though it takes only 4). Resolved by upgrading to their latest version.

    34. MawKKe ◴[] No.43497253{3}[source]
    NumPy as well, and TensorFlow.
    35. bobbylarrybobby ◴[] No.43499425[source]
    Rust does the same in release, although it panics in debug.
    36. GianFabien ◴[] No.43501626{3}[source]
    Sometimes it isn't outright lying. I have had issues with hardware, API and SDK documentation being subtly different from the product as shipped, and with hardware in a mixture of revisions, some conforming to the doco and others differing, where even the vendor's engineers weren't clear about which was which.
    37. pas ◴[] No.43503603{4}[source]
    Was there an option to roll back Google Docs to an earlier release? Or do shadow release A/B testing for a fraction of users?

    (And thanks for the war story!)