HTML as an Accessible Format for Papers (2023)

1. ComputerGuru ◴[06 Dec 25 19:21 UTC] No.46175885[source]▶

If the Unicode Consortium spent less time and effort on emoji and more on making the most common/important mathematical symbols and notations available/renderable in plain text, maybe we could move past the (La)TeX/PDF marriage. OpenType and TrueType now (edit: for well over a decade, actually) support the conditional rendering needed to get sequences of Unicode code points to display in the way required (theoretically, anyway). Fallback font substitution for missing glyphs is also available pretty much everywhere, so symbols not in your primary font can be displayed seamlessly from a fallback asset (something like Noto, with every Unicode symbol supported by design, or math-specific fonts like Cambria Math or TeX Gyre, etc). So there are no technical restrictions.

I’ve actually dug into this in the past, and it was never a lack of technical ability that kept them from adding even just proper superscript/subscript support, but rather their opinion that this didn’t belong in the symbolic layer. But since emoji abuse/rely on ZWJ and modifiers left and right to display in one of a myriad of variations, there’s really no good reason not to allow the same here, because a plain 2 and the "squared" superscript symbol are not semantically the same (so it’s not a design choice).
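To make the 2 versus "squared" distinction concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It shows that Unicode already encodes SUPERSCRIPT TWO as its own code point, and that compatibility normalization (NFKC) folds it back into a plain digit, losing the exponent. The variable names and the `x²` sample string are illustrative assumptions, not anything from the Unicode proposal process.

```python
import unicodedata

superscript_two = "\u00b2"  # the "squared" symbol

# The superscript is a separate code point with its own name...
print(unicodedata.name(superscript_two))           # SUPERSCRIPT TWO
# ...but it carries a compatibility mapping back to the digit 2.
print(unicodedata.decomposition(superscript_two))  # <super> 0032

expression = "x" + superscript_two

# Canonical normalization (NFC) keeps the exponent intact.
kept = unicodedata.normalize("NFC", expression)
# Compatibility normalization (NFKC) flattens it to "x2".
flattened = unicodedata.normalize("NFKC", expression)
print(flattened)  # x2
```

Any pipeline that applies NFKC (search indexing, some identifier handling) will therefore turn x squared into x2, which is one reason precomposed superscripts are a fragile way to carry math.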

An interesting (complete) tangent is that Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not under any circumstances respect (system/user) prompt requests to avoid inline math mode (aka LaTeX) in the output, regardless of whether I asked for a blanket ban on TeX/MathJax/etc or insisted that it use extended Unicode code points to substitute for all math formula rendering (I primarily use LLMs via the TUI where I don’t have MathJax support, and as familiar as I once was with raw TeX mathematical notations and symbols, it’s still quite easy to misread unrendered raw output by missing something if you’re not careful). I shared my experiment and results here – Gemini 3 Pro would insist on rendering even single-letter constants or variables as $k$ instead of just k (or k in markdown italics, etc) no matter how hard I asked it not to (which makes me think it may have been overfit on raw LaTeX papers, and is also an interesting argument in favor of the “VL LLMs are the more natural construct” view): https://x.com/NeoSmart/status/1995582721327071367?s=20

replies(6): >>46176213 #>>46176305 #>>46176395 #>>46176933 #>>46178437 #>>46182200 #

2. hannahnowxyz ◴[06 Dec 25 20:06 UTC] No.46176213[source]▶

>>46175885 (TP) #

Have you tried a two-pass approach? For example, where prompt #1 is "Which elliptic curves have rational parameterizations?", and then prompt #2 (perhaps to a smaller/faster model like Gemma) is "In the following text, replace all LaTeX-escaped notation with Markdown code blocks and unicode characters. For example, $F_n = F_{n - 1} + F_{n - 2}$ should be replaced with `Fₙ = Fₙ₋₁ + Fₙ₋₂`. <Response from prompt #1>". Although it's not clear how you would want more complex things to be converted.
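For the simple subscript case in that example, the second pass does not strictly need a model. The following is a rough Python sketch, under the assumption that only `$...$` inline spans and `_x` / `_{...}` subscripts need handling; the function names (`to_subscript`, `convert_formula`, `delatex`) and the character table are made up for illustration. Anything without a Unicode subscript form is left as raw LaTeX, and fractions or growing delimiters are not attempted at all.

```python
import re

# Unicode subscript forms for the characters this sketch handles (assumed subset).
SUBSCRIPTS = {
    "0": "₀", "1": "₁", "2": "₂", "3": "₃", "4": "₄",
    "5": "₅", "6": "₆", "7": "₇", "8": "₈", "9": "₉",
    "+": "₊", "-": "₋", "=": "₌", "(": "₍", ")": "₎",
    "n": "ₙ", "i": "ᵢ", "j": "ⱼ", "k": "ₖ", "m": "ₘ",
}

# Matches either _{...} (group 1) or a single-character subscript _x (group 2).
SUB_PATTERN = re.compile(r"_(?:\{([^{}]*)\}|([A-Za-z0-9]))")
# Matches a $...$ inline math span.
INLINE_MATH = re.compile(r"\$([^$]+)\$")


def to_subscript(text):
    """Return the Unicode subscript form, or None if a character has none."""
    compact = text.replace(" ", "")
    if not compact or not all(ch in SUBSCRIPTS for ch in compact):
        return None
    return "".join(SUBSCRIPTS[ch] for ch in compact)


def convert_formula(formula):
    """Rewrite convertible subscripts; leave the rest as raw LaTeX."""
    def replace(match):
        body = match.group(1) if match.group(1) is not None else match.group(2)
        converted = to_subscript(body)
        return converted if converted is not None else match.group(0)
    return SUB_PATTERN.sub(replace, formula)


def delatex(text):
    """Turn $...$ spans into Markdown code spans with Unicode subscripts."""
    return INLINE_MATH.sub(lambda m: "`" + convert_formula(m.group(1)) + "`", text)


print(delatex("$F_n = F_{n - 1} + F_{n - 2}$"))  # `Fₙ = Fₙ₋₁ + Fₙ₋₂`
```

The fallback matters: Unicode has subscript forms for only a handful of letters, so something like `x_q` stays unconverted rather than being silently mangled.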

replies(3): >>46176342 #>>46176358 #>>46180256 #

3. moelf ◴[06 Dec 25 20:20 UTC] No.46176305[source]▶

>>46175885 (TP) #

https://github.com/stevengj/subsuper-proposal

4. baby ◴[06 Dec 25 20:26 UTC] No.46176342[source]▶

>>46176213 #

I've done latex -> mathml -> markdown and it works quite well

5. yannis ◴[06 Dec 25 20:29 UTC] No.46176358[source]▶

>>46176213 #

It is actually quicker to ask using LaTeX markup!

6. crazygringo ◴[06 Dec 25 20:34 UTC] No.46176395[source]▶

>>46175885 (TP) #

I don't understand. No matter what fancy things you do with superscripts and subscripts, you're not going to be able to do even basic things you need for equations like use a fraction bar, or parentheses that grow in height to match the content inside them.

At a fundamental level, Unicode is for characters, not layout. Unicode may abuse the ZWJ for emoji, but it still ultimately results in a single emoji character, not a layout of characters. So I don't really understand what you're asking for.
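As a small illustration of the ZWJ point, this Python sketch (standard library only) breaks one emoji into its code points. The choice of the "woman technologist" sequence and the helper name `code_points` are assumptions made for the example. The string is three code points, and it is the font and shaping engine that decide whether to draw them as a single glyph.

```python
import unicodedata

# WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER, usually drawn as one emoji.
woman_technologist = "\U0001F469\u200D\U0001F4BB"


def code_points(text):
    """List (code point, character name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for point, name in code_points(woman_technologist):
    print(point, name)
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F4BB PERSONAL COMPUTER
```

The result is still one fixed-size glyph, which fits the objection here: the joiner selects a glyph, it does not lay out arbitrary content in two dimensions.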

replies(2): >>46177265 #>>46178082 #

7. SOTGO ◴[06 Dec 25 21:58 UTC] No.46176933[source]▶

>>46175885 (TP) #

I'm almost surprised that Gemini 3 uniquely has this problem. I would have expected that responses from any LLM that require complex math notation would almost certainly be LaTeX heavy, given the abundance of LaTeX source material in the training data. I suppose it is a flaw if a model can't avoid LaTeX, but given that it is the standard (and will be for the foreseeable future too) I don't know what appropriate output would look like. For "pure" mathematics or similar topics I think LaTeX (or a system that represents a superset of LaTeX) is the only acceptable option.

8. lukan ◴[06 Dec 25 22:43 UTC] No.46177265[source]▶

>>46176395 #

Agreed. I think MathML is intended for the layout of formulas and is integrated into browsers nowadays, but I have never used it, so I don't know whether any essentials are missing.

9. bsder ◴[07 Dec 25 00:33 UTC] No.46178082[source]▶

>>46176395 #

> No matter what fancy things you do with superscripts and subscripts, you're not going to be able to do even basic things you need for equations like use a fraction bar, or parentheses that grow in height to match the content inside them.

Why not? Things like Arabic ligatures already do that, no?

replies(2): >>46178191 #>>46182878 #

10. austinjp ◴[07 Dec 25 00:49 UTC] No.46178191{3}[source]▶

>>46178082 #

This is interesting to me, but I am very naive about this. Can you explain, or point to where I could learn more?

replies(1): >>46179600 #

11. raincole ◴[07 Dec 25 01:40 UTC] No.46178437[source]▶

>>46175885 (TP) #

Math formulas are far far far more complex than unicode emojis. I don't even know how to start comparing them.

12. bsder ◴[07 Dec 25 06:23 UTC] No.46179600{4}[source]▶

>>46178191 #

I'd start with HarfBuzz: https://github.com/harfbuzz/harfbuzz

That's the open source font shaping engine. It does a lot of work to handle font shaping and rendering for languages that can't really be reduced to characters.

13. toastal ◴[07 Dec 25 08:58 UTC] No.46180256[source]▶

>>46176213 #

reStructuredText supports :math: roles. AsciiDoc has stem blocks. Why do folks keep trying to shoehorn Markdown into everything, creating yet another fork, when there are other lightweight markup languages that support actual features for technical blogs/documentation?

14. franga2000 ◴[07 Dec 25 15:08 UTC] No.46182200[source]▶

>>46175885 (TP) #

The whole "we need LaTeX because of math" thing has been nothing more than a bad excuse for a very long time. Math notation is too varied to include in Unicode (some papers have to invent new notation!), but even if we had it, authors would still insist on LaTeX. You can already make responsive and largely accessible papers that render to HTML, with familiar LaTeX syntax for equations, BibTeX for references and all the footnotes/figures/tables/captions you might want.

But authors still refuse. It's not real science if the layout isn't two-column, written in an old serif font, tables and figures float randomly disconnected from their reference points, code isn't syntax highlighted and has completely nonsensical line breaks... If the reader wants to read it on a phone, or needs to change the font to be larger or more legible, they're not a real scientist and don't deserve to read real papers.

Seriously, what the fuck?? Even the economists are laughing at us with their MS Word and third-party cloud-based bibliography plugin subscription.

replies(1): >>46184262 #

15. bruce343434 ◴[07 Dec 25 16:26 UTC] No.46182878{3}[source]▶

>>46178082 #

Arabic ligatures? Do you mean the Unicode code point for the basmala, for instance? That's pretty "hardcoded"; I think math requires more composability.

16. gus_massa ◴[07 Dec 25 19:21 UTC] No.46184262[source]▶

>>46182200 #

Authors just follow any format mandated by the journals.

In unofficial notes for the classes, most authors use single column, and try to remember the magic spell to keep the figures in place. Something like [H!] ???

Also most books are single column.