←back to thread

3883 points kuroguro | 2 comments | | HN request time: 0.548s | source
Show context
ufo ◴[] No.26297612[source]
The part that puzzles me the most was this comment about sscanf:

> To be fair I had no idea most sscanf implementations called strlen so I can’t blame the developer who wrote this.

Is this true? Is sscanf really O(N) on the size of the string? Why does it need to call strlen in the first place?

replies(1): >>26298300 #
JdeBP ◴[] No.26298300[source]
I think that the author hasn't checked them all. Even this isn't checking them all.

The MUSL C library' sscanf() does not do this, but does call memchr() on limited substrings of the input string as it refills its input buffer, so it's not entirely free of this behaviour.

* https://git.musl-libc.org/cgit/musl/tree/src/stdio/vsscanf.c

The sscanf() in Microsoft's C library does this because it all passes through a __stdio_common_vsscanf() function which uses length-counted rather than NUL-terminated strings internally.

* https://github.com/tpn/winsdk-10/blob/master/Include/10.0.16...

* https://github.com/huangqinjin/ucrt/blob/master/inc/corecrt_...

The GNU C library does something similar, using a FILE structure alongside a special "operations" table, with a _rawmemchr() in the initialization.

* https://github.com/bminor/glibc/blob/master/libio/strops.c#L...

* https://github.com/bminor/glibc/blob/master/libio/strfile.h#...

The FreeBSD C library does not use a separate "operations" table.

* https://github.com/freebsd/freebsd-src/blob/main/lib/libc/st...

A glib summary is that sscanf() in these implementations has to set up state on every call that fscanf() has the luxury of keeping around over multiple calls in the FILE structure. They're setting up special nonce FILE objects for each sscanf() call, and that involves finding out how long the input string is every time.

It is food for thought. How much could life be improved if these implementations exported the way to set up these nonce FILE structures from a string, and callers used fscanf() instead of sscanf()? How many applications are scanning long strings with lots of calls to sscanf()?

replies(6): >>26298762 #>>26298773 #>>26300532 #>>26301737 #>>26307663 #>>26352655 #
wnoise ◴[] No.26298762[source]
Wow. Thanks for looking.

> limited substrings of the input string as it refills its input buffer,

As far as I can tell, that copying helper function set to the read member of the FILE* never actually gets called in this path. I see no references to f->read() or anything that would call it. All of the access goes through shgetc and shunget, shlim, and shcnt, which directly reference the buf, with no copying. The called functions __intscan() and __floatscan() do the same. __toread() is called but just ensures it is readable, and possibly resets some pointers.

Even if it did, that pretty much does make it entirely free of this behavior, though not of added overhead. That operations structure stuffed into the file buffer doesn't scan the entire string, only copying an at most fixed amount more than asked for (stopping if the string terminates earlier than that). That leaves it linear, just with some unfortunate overhead.

I do find the exceedingly common choice of funneling all the scanf variants through fscanf to be weird. But I guess if they already have one structure for indirecting input, it's easy to overload that. (And somehow _not_ have a general "string as a FILE" facility, and building on top of that. (Posix 2008 does have fmemopen(), but it's unsuitable, as it is buffer with specified size (which would need to be calculated, as in the MS case), rather than not worried about until a NUL byte is reached.))

replies(2): >>26298852 #>>26324152 #
1. gnubison ◴[] No.26324152[source]
> Posix 2008 does have fmemopen(), but it's unsuitable, as it is buffer with specified size (which would need to be calculated, as in the MS case), rather than not worried about until a NUL byte is reached.

With fmemopen(), you only need to calculate the length once at the start, right? And then you can use the stream instead.

replies(1): >>26336275 #
2. wnoise ◴[] No.26336275[source]
Yes, you can do that. But libc can't use that as an implementation strategy without also having this linear-turned-quadratic behavior.