
1070 points dondraper36 | 2 comments
GMoromisato ◴[] No.45069016[source]
One of the ironies of this kind of advice is that it's best for people who already have a lot of experience and have the judgement to apply it. For instance, how do you know what the "simplest thing" is? And how can you be sure that it "could possibly work"?

Yesterday I had a problem with my XLSX importer (which I wrote myself--don't ask why). It turned out that I had neglected to handle XML namespaces properly because Excel always exported files with a default namespace.

Then I got a file that added a namespace to all elements and my importer instantly broke.

For example, Excel always outputs <cell ...> whereas this file has <x:cell ...>.

The "simplest thing that could possibly work" was to remove the namespace prefix and just assume that we don't have conflicting names.

But I didn't feel right about doing that. Yes, it probably would have worked fine, but I worried that I was leaving a landmine for future me.
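A minimal sketch of the kind of landmine in question, using Python's standard `xml.etree.ElementTree`. The namespace URIs, the `y:cell` element, and the `local_name` helper are all made up for illustration; nothing here comes from a real XLSX file.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and element names, for illustration only.
SHEET_NS = "http://example.com/sheet"
OTHER_NS = "http://example.com/other"

doc = (
    f'<x:row xmlns:x="{SHEET_NS}" xmlns:y="{OTHER_NS}">'
    "<x:cell>42</x:cell>"
    "<y:cell>not a spreadsheet cell</y:cell>"
    "</x:row>"
)

root = ET.fromstring(doc)


def local_name(tag):
    # ElementTree expands tags to "{uri}local". Dropping the URI part
    # has the same effect as stripping the prefix from the raw text.
    return tag.rsplit("}", 1)[-1]


# Prefix-stripping view: both elements look like "cell".
stripped = [el.text for el in root.iter() if local_name(el.tag) == "cell"]

# Namespace-aware view: only the element in the sheet namespace matches.
qualified = [el.text for el in root.iter(f"{{{SHEET_NS}}}cell")]
```

With prefixes stripped, the foreign `cell` is picked up alongside the real one. The qualified lookup only sees the element in the intended namespace.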

So instead I spent 4 hours re-writing all the parsing code to handle namespaces correctly.

Whether or not you agree with my choice here, my point is that doing "the simplest thing that could possibly work" is not that easy. But it does get easier the more experience you have. Of course, by then, you probably don't need this advice.

replies(11): >>45069191 #>>45069245 #>>45069268 #>>45069600 #>>45070183 #>>45070459 #>>45072910 #>>45073086 #>>45075511 #>>45076327 #>>45077197 #
jiggawatts ◴[] No.45070183[source]
Don't confuse sloppy with simple. Parsing XML with regex[1] (or a non-namespace-compliant XML parser) is not simple. It's messy, verbose, error-prone, and not in any way idiomatic or simple.

If you had just used a compliant XML parser as intended, you might not even have noticed that the files were encoding namespaces differently! It just "doesn't register" when you let the parser handle this for you, in the same sense that if you parse HTML (or XML) properly, you won't notice all of the &amp; and &lt; encodings either. Or CDATA. Or Unicode escapes. Or anything else, for that matter, that you may not even be aware of.
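As a rough illustration of that point, here is a sketch with Python's standard `xml.etree.ElementTree`. The namespace URI and the `row`/`cell` element names are placeholders, not the real spreadsheet schema. One document uses a default namespace and an entity, the other uses an `x:` prefix and a CDATA section.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/sheet"  # hypothetical namespace URI

# Same logical content, written two different ways.
default_ns = f'<row xmlns="{NS}"><cell>a &amp; b</cell></row>'
prefixed = f'<x:row xmlns:x="{NS}"><x:cell><![CDATA[a & b]]></x:cell></x:row>'


def cells(xml_text):
    """Return (expanded tag, text) pairs for every cell in the namespace."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text) for el in root.iter(f"{{{NS}}}cell")]


from_default = cells(default_ns)
from_prefixed = cells(prefixed)
```

Both calls return the same list. The prefix, the entity, and the CDATA section have all been resolved by the parser before the calling code sees anything.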

You may be a few more steps away from making an XLSX importer work robustly. Did you read the spec? The container format supports splitting single documents into multiple (internal) files to support incremental saves of huge files. That can trip up developers in the worst way, because you test with tiny files, but custom XLSX-handling code tends to be used to bulk import large files, which will occasionally use this splitting. You'll lose huge blocks of data in production, silently! That's not fun (or simple) to troubleshoot.

The fast, happy path is to start with something like System.IO.Packaging [2], the built-in .NET library for the Open Packaging Conventions (OPC) container format, which underlies all Office Open XML (OOXML) formats. Use the built-in XML parser, which handles namespaces very well. Then the only annoyance is that OOXML formats have two groups of namespaces that they can use, the Microsoft ones and the Open "standardised" ones.
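For anyone not on .NET, a rough Python sketch of the same idea. It uses `zipfile` as a stand-in for a proper OPC library, which is a big simplification. The two namespace URIs are what I understand the Transitional and Strict spreadsheet namespaces to be, so check them against ECMA-376 before relying on them. `make_package` and `read_cells` are hypothetical helpers, and the package built here is a toy, not a valid workbook.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespace URIs (Transitional and Strict); verify against the spec.
TRANSITIONAL = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
STRICT = "http://purl.oclc.org/ooxml/spreadsheetml/main"


def make_package(ns, prefix):
    """Build a tiny in-memory stand-in for an .xlsx package."""
    tag = f"{prefix}:" if prefix else ""
    decl = f'xmlns:{prefix}="{ns}"' if prefix else f'xmlns="{ns}"'
    sheet = (
        f"<{tag}worksheet {decl}><{tag}sheetData><{tag}row>"
        f'<{tag}c r="A1"><{tag}v>1</{tag}v></{tag}c>'
        f"</{tag}row></{tag}sheetData></{tag}worksheet>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/worksheets/sheet1.xml", sheet)
    return buf.getvalue()


def read_cells(package_bytes, part="xl/worksheets/sheet1.xml"):
    """Map cell references to raw values, accepting either namespace."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        root = ET.fromstring(zf.read(part))
    out = {}
    for ns in (TRANSITIONAL, STRICT):
        for c in root.iter(f"{{{ns}}}c"):
            out[c.get("r")] = c.findtext(f"{{{ns}}}v")
    return out


transitional_cells = read_cells(make_package(TRANSITIONAL, ""))
strict_cells = read_cells(make_package(STRICT, "x"))
```

Hard-coding the part path is itself a shortcut. A real reader would locate parts through the package's content types and relationships, which is the work a library like System.IO.Packaging does for you.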

[1] Famously! https://stackoverflow.com/questions/8577060/why-is-it-such-a...

[2] https://learn.microsoft.com/en-us/dotnet/api/system.io.packa...

replies(1): >>45070764 #
GMoromisato ◴[] No.45070764[source]
Parsing XML is relatively trivial--I'd never use regex, of course, but a basic recursive descent parser can do it pretty easily. I mean, the whole point of XML is that it's supposed to be easy to parse and generate!

Namespaces add a wrinkle, but it wasn't that hard to add. And I was able to add namespace aliasing in my API to handle the two separate "standard" namespaces that you're talking about.

But you're right about OPC/OOXML--those are massive specs and even the tiny slice that I'm handling has been error-prone. I haven't dealt with multiple internal files, so that's a future bug waiting for me. The good news is I'm building a nice library of test files for my regression tests!

replies(1): >>45070940 #
1. jiggawatts ◴[] No.45070940{3}[source]
> Parsing XML is relatively trivial

It really isn't, and rolling your own parser is the diametric opposite of the "do the simplest thing" philosophy.

The XML v1.1 spec is 126 KB of text, and that doesn't even include XML Namespaces, which is a separate spec with 25 KB of text.

XML is only "simple" in the sense of being well-defined, which makes interoperability simple, in some sense. Contrast this with ill-defined or implementation-defined text formats, where it's decidedly not simple to write an interoperable parser.

As an end-user of XML, the simplest thing is to use an off-the-shelf XML parser, one that's had the bugs beaten out of it by millions of users.

There are very few programming languages out there that don't have a convenient, full-featured XML parser library ready to use.

replies(1): >>45071486 #
2. GMoromisato ◴[] No.45071486[source]
Well, we can agree that most people shouldn't implement their own XML parser.