HTML as an Accessible Format for Papers (2023)

(info.arxiv.org)

262 points el3ctron | 1 comments | 06 Dec 25 14:59 UTC

billconan ◴[06 Dec 25 16:42 UTC] No.46174647[source]▶

I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.

the actual paper content format should be separated from its rendering.

i.e. it should contain abstract, sections, equations, figures, citations etc. but it shouldn't have font sizes, layout etc.

the viewer platforms then should be able to style the content differently.
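
A minimal sketch of that separation, in Python. The `Paper` and `Section` types and both renderers are hypothetical names made up for illustration, not an existing paper format: the content model holds only structure (title, abstract, sections, citations), and each viewer decides how to present it.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Section:
    title: str
    paragraphs: list[str]


@dataclass
class Paper:
    # Structure only: no fonts, sizes, or layout live here.
    title: str
    abstract: str
    sections: list[Section] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)


def render_plain(paper: Paper) -> str:
    """One possible viewer: plain text."""
    lines = [paper.title, "", "Abstract: " + paper.abstract]
    for section in paper.sections:
        lines += ["", section.title, *section.paragraphs]
    return "\n".join(lines)


def render_html(paper: Paper) -> str:
    """Another viewer: unstyled HTML, leaving presentation to the reader's CSS."""
    parts = [
        f"<h1>{escape(paper.title)}</h1>",
        f'<section class="abstract"><p>{escape(paper.abstract)}</p></section>',
    ]
    for section in paper.sections:
        parts.append(f"<h2>{escape(section.title)}</h2>")
        parts += [f"<p>{escape(p)}</p>" for p in section.paragraphs]
    return "\n".join(parts)


# Hypothetical example data.
paper = Paper(
    title="A Hypothetical Paper",
    abstract="We study things.",
    sections=[Section("Introduction", ["Things are interesting."])],
)
```

The abstract is then a plain field (`paper.abstract`) rather than something recovered from markup, and the same object feeds both renderers.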

replies(5): >>46174655 #>>46174732 #>>46174842 #>>46175075 #>>46175479 #

dimal ◴[06 Dec 25 16:53 UTC] No.46174732[source]▶

>>46174647 #

Perfect is the enemy of good. HTML is good enough. Let’s get this done.

And as another commenter has pointed out, HTML does exactly what you ask for. If it’s done correctly, it doesn’t contain font sizes or layout. Users can style HTML differently with custom CSS.

replies(1): >>46174778 #

billconan ◴[06 Dec 25 17:00 UTC] No.46174778[source]▶

>>46174732 #

mixing rendering definitions with content (PDF) is something from the printer era, that is unsuitable for the digital era.

HTML was a digital format, but it wanted to be a generic format for all document types, not just papers, so it contains a lot of extras that a paper format doesn't need.

for research papers, since they share the same structure, we can further separate content from rendering.

for example, if you want to later connect a paper with an AI, do you want to send <div class="abstract"> ... ?

or do some nasty heuristic to extract the abstract? like document.getElementsByClassName("abstract")[0] ?
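
For reference, a rough sketch of what that class-name heuristic looks like outside a browser, using only Python's standard-library `html.parser`. The `AbstractExtractor` class, the `extract_abstract` helper, and the sample markup are all assumptions made for illustration; real paper HTML may mark up the abstract differently, which is exactly why this counts as a heuristic.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AbstractExtractor(HTMLParser):
    """Collects the text of the first element whose class list contains `class_name`."""

    def __init__(self, class_name="abstract"):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # nesting depth inside the matched element; 0 = outside
        self.done = False   # set once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_abstract(html: str) -> str:
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample markup.
SAMPLE = """
<article>
  <h1>A Hypothetical Paper</h1>
  <div class="abstract"><p>We study <em>things</em> and report results.</p></div>
  <section><p>Body text.</p></section>
</article>
"""

print(extract_abstract(SAMPLE))
```

This only works as long as every publisher uses the same class name, and it returns an empty string when no element matches.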

replies(1): >>46174865 #

1. simonw ◴[06 Dec 25 17:11 UTC] No.46174865[source]▶

>>46174778 #

All of the interesting LLMs can handle a full paper these days without any trouble at all. I don't think it's worth spending much time optimizing for that use-case any more - that was much more important two years ago when most models topped out at 4,000 or 8,000 tokens.