3883 points kuroguro | 13 comments
ufo ◴[] No.26297612[source]
The part that puzzles me the most was this comment about sscanf:

> To be fair I had no idea most sscanf implementations called strlen so I can’t blame the developer who wrote this.

Is this true? Is sscanf really O(N) on the size of the string? Why does it need to call strlen in the first place?

replies(1): >>26298300 #
1. JdeBP ◴[] No.26298300[source]
I think that the author hasn't checked them all. Even this isn't checking them all.

The musl C library's sscanf() does not do this, but does call memchr() on limited substrings of the input string as it refills its input buffer, so it's not entirely free of this behaviour.

* https://git.musl-libc.org/cgit/musl/tree/src/stdio/vsscanf.c

The sscanf() in Microsoft's C library does this because it all passes through a __stdio_common_vsscanf() function which uses length-counted rather than NUL-terminated strings internally.

* https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16...

* https://github.com/huangqinjin/ucrt/blob/master/inc/corecrt_...

The GNU C library does something similar, using a FILE structure alongside a special "operations" table, with a _rawmemchr() in the initialization.

* https://github.com/bminor/glibc/blob/master/libio/strops.c#L...

* https://github.com/bminor/glibc/blob/master/libio/strfile.h#...

The FreeBSD C library does not use a separate "operations" table.

* https://github.com/freebsd/freebsd-src/blob/main/lib/libc/st...

A glib summary is that sscanf() in these implementations has to set up state on every call that fscanf() has the luxury of keeping around over multiple calls in the FILE structure. They're setting up special nonce FILE objects for each sscanf() call, and that involves finding out how long the input string is every time.

It is food for thought. How much could life be improved if these implementations exported the way to set up these nonce FILE structures from a string, and callers used fscanf() instead of sscanf()? How many applications are scanning long strings with lots of calls to sscanf()?
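
To make the pattern concrete, here is a minimal sketch of the kind of caller that gets hurt, next to one common workaround. The function names are hypothetical and not taken from any of the libraries linked above. The first function calls sscanf() once per token, so on an implementation that measures the remaining string on every call the total work grows quadratically with the input length. The second uses strtol(), which only reads as far as the number extends. Overflow handling is left out to keep the sketch short.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sums whitespace-separated integers with one sscanf() call per token.
   If the libc measures the whole remaining string on each call, this
   loop is quadratic in the length of the input. */
long sum_with_sscanf(const char *s)
{
    long total = 0;
    int value, consumed;

    /* %n stores how many characters were consumed so far; it does not
       count toward the return value, so a successful match returns 1. */
    while (sscanf(s, "%d%n", &value, &consumed) == 1) {
        total += value;
        s += consumed;
    }
    return total;
}

/* Same job with strtol(), which reports where it stopped via `end`
   and never needs to know the total length of the string. */
long sum_with_strtol(const char *s)
{
    long total = 0;

    for (;;) {
        char *end;
        long value = strtol(s, &end, 10);
        if (end == s)        /* nothing converted: end of input or junk */
            break;
        total += value;
        s = end;
    }
    return total;
}
```

Both functions return the same result for well-formed input; only the cost per token differs, and only on implementations that behave as described above.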

replies(6): >>26298762 #>>26298773 #>>26300532 #>>26301737 #>>26307663 #>>26352655 #
2. wnoise ◴[] No.26298762[source]
Wow. Thanks for looking.

> limited substrings of the input string as it refills its input buffer,

As far as I can tell, that copying helper function set to the read member of the FILE* never actually gets called in this path. I see no references to f->read() or anything that would call it. All of the access goes through shgetc and shunget, shlim, and shcnt, which directly reference the buf, with no copying. The called functions __intscan() and __floatscan() do the same. __toread() is called but just ensures it is readable, and possibly resets some pointers.

Even if it did, that pretty much does make it entirely free of this behavior, though not of added overhead. That operations structure stuffed into the file buffer doesn't scan the entire string; it copies at most a fixed amount more than asked for (stopping if the string terminates earlier than that). That leaves it linear, just with some unfortunate overhead.

I do find the exceedingly common choice of funneling all the scanf variants through fscanf to be weird. But I guess if they already have one structure for indirecting input, it's easy to overload that. (And somehow they do _not_ have a general "string as a FILE" facility to build on top of. (POSIX 2008 does have fmemopen(), but it's unsuitable, as it is a buffer with a specified size (which would need to be calculated, as in the MS case), rather than one whose length is not worried about until a NUL byte is reached.))

replies(2): >>26298852 #>>26324152 #
3. ufo ◴[] No.26298773[source]
Oh dear... that's one of the biggest footguns I've ever seen in all my years of working with C.
replies(1): >>26301789 #
4. JdeBP ◴[] No.26298852[source]
You've missed what happens in __uflow() when __toread() does not return EOF. (And yes, that does mean occasional memchr() of single characters and repeated memchr()s of the same memory block.)
replies(1): >>26299491 #
5. wnoise ◴[] No.26299491{3}[source]
Ah, I did indeed. Wacky.
6. JdeBP ◴[] No.26300532[source]
Addendum: There are C library implementations that definitely do not work this way. It is possible to implement a C library sscanf() that doesn't call strlen() first thing every time or memchr() over and over on the same block of memory.

Neither P.J. Plauger's nor my Standard C library (which I wrote in the 1990s and used for my 32-bit OS/2 programs) works this way. We both use simple callback functions that use "void*"s that are opaque to the common internals of *scanf() but that are cast to "FILE*" or "const char*" in the various callback functions.

OpenWatcom's C library does the same. Things don't get marshalled into nonce FILE objects on every call. Rather, the callback functions simply look at the next character to see whether it is NUL. They aren't even using memchr() calls to find a NUL in the first position of a string. (-:

* http://perforce.openwatcom.org:4000/@md=d&cd=//depot/V2/src/...
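
A rough sketch of what such a callback design might look like, under my own assumptions: the struct, the function names, and the integer-only core are all hypothetical and are not copied from Plauger's, OpenWatcom's, or any other library. The point is only that the shared scanning core sees an opaque "void*" and two callbacks, and that the string-backed callback checks a single byte for NUL instead of measuring the string. Overflow is ignored for brevity.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical character source: the scanning core knows nothing about
   what ctx points to. */
struct char_source {
    int  (*get)(void *ctx);           /* next character, or EOF */
    void (*unget)(int c, void *ctx);  /* push one character back */
    void *ctx;
};

/* String-backed callbacks: ctx points at a `const char *` cursor. */
int str_get(void *ctx)
{
    const char **p = ctx;
    if (**p == '\0')
        return EOF;                   /* inspects one byte, no strlen() */
    return (unsigned char)*(*p)++;
}

void str_unget(int c, void *ctx)
{
    const char **p = ctx;
    if (c != EOF)
        --*p;
}

/* FILE-backed callbacks: ctx is a FILE *. */
int file_get(void *ctx)
{
    return getc((FILE *)ctx);
}

void file_unget(int c, void *ctx)
{
    if (c != EOF)
        ungetc(c, (FILE *)ctx);
}

/* Shared core: skips whitespace, then reads an optionally signed
   decimal integer. Returns 1 on success, 0 if no digits were found. */
int scan_int(struct char_source *src, long *out)
{
    int c, negative = 0, digits = 0;
    long v = 0;

    do {
        c = src->get(src->ctx);
    } while (c != EOF && isspace(c));

    if (c == '+' || c == '-') {
        negative = (c == '-');
        c = src->get(src->ctx);
    }
    while (c != EOF && isdigit(c)) {
        v = v * 10 + (c - '0');
        digits++;
        c = src->get(src->ctx);
    }
    src->unget(c, src->ctx);          /* leave the lookahead unread */

    if (!digits)
        return 0;
    *out = negative ? -v : v;
    return 1;
}
```

A caller scanning a string would set ctx to the address of its own `const char *` cursor and pass str_get and str_unget; a caller scanning a stream would pass the FILE pointer with file_get and file_unget. No temporary FILE object is built in either case.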

replies(1): >>26302915 #
7. pja ◴[] No.26301737[source]
OpenBSD is also doing the same thing. It seems almost universal, unless the libc author has specifically gone out of their way to do something different!
8. pja ◴[] No.26301789[source]
It is! Not mentioned anywhere in the manpages either & there’s no a priori reason for sscanf() to need to call strlen() on the input string, so most programmers would never expect it to.

Pretty sure I would have made this error in the same situation, no question.

9. JdeBP ◴[] No.26302915[source]
Addendum: The C library on Tru64 Unix didn't work that way either, reportedly.

* https://groups.google.com/g/comp.lang.c/c/SPOnRZ3nEHk/m/dAoB...

10. pdw ◴[] No.26307663[source]
> How much could life be improved if these implementations exported the way to set up these nonce FILE structures from a string

That's fmemopen. Not widespread, but at least part of POSIX these days.

11. gnubison ◴[] No.26324152[source]
> POSIX 2008 does have fmemopen(), but it's unsuitable, as it is a buffer with a specified size (which would need to be calculated, as in the MS case), rather than one whose length is not worried about until a NUL byte is reached.

With fmemopen(), you only need to calculate the length once at the start, right? And then you can use the stream instead.

replies(1): >>26336275 #
12. wnoise ◴[] No.26336275{3}[source]
Yes, you can do that. But libc can't use that as an implementation strategy without also having this linear-turned-quadratic behavior.
13. froh ◴[] No.26352655[source]
fmemopen is standard these days.

Just wrap the string in a FILE, explicitly setting the buffer size to strlen(s), use fscanf in the loop, and fasten your seatbelts...

https://pubs.opengroup.org/onlinepubs/9699919799/functions/f...
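
A minimal sketch of that suggestion, assuming a POSIX.1-2008 system. The function name is hypothetical. The string is measured once, wrapped in a read-only stream, and then scanned with fscanf(), which keeps its position in the FILE between calls. Whether this is actually faster depends on the libc; this only shows the shape of the approach.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Sums whitespace-separated integers by wrapping the string in a FILE
   once. Returns 0 on success, -1 if the stream could not be opened. */
int sum_with_fmemopen(const char *s, long *total)
{
    size_t len = strlen(s);           /* the only full-length scan */
    long sum = 0;
    int value;
    FILE *f;

    *total = 0;
    if (len == 0)
        return 0;                     /* fmemopen() may reject size 0 */

    /* Mode "r" never writes to the buffer, so dropping const is safe. */
    f = fmemopen((void *)s, len, "r");
    if (f == NULL)
        return -1;

    while (fscanf(f, "%d", &value) == 1)
        sum += value;

    fclose(f);
    *total = sum;
    return 0;
}
```

The caller still pays for one strlen(), but only once per string rather than once per token.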