
90 points rbanffy | 4 comments
wduquette ◴[] No.45132286[source]
The OP says that 8-bit CPUs couldn't handle Pascal well, and that Action! (released in 1983) was the first IDE for 8-bit machines.

But Apple Pascal was released for the Apple II in 1979. Based on UCSD Pascal, the Apple Pascal system was basically an OS that simply was an IDE; and it worked perfectly well on 8-bit hardware. I had quite a lot of fun with it back in the day.

replies(7): >>45132536 #>>45133065 #>>45133081 #>>45133100 #>>45133166 #>>45134524 #>>45136891 #
brucehoult ◴[] No.45134524[source]
As others have pointed out, Apple Pascal was a bytecode interpreter, not compiled to native 6502 code.

Even worse, the bytecode was standard UCSD Pascal, designed for mainframes, and not really how you'd design bytecode if you knew you were going to implement it on a 6502.

We can actually compile Pascal / C style languages to the 6502 pretty well these days. The way to do it is treat it the same as a RISC machine with a lot of registers (let's say 32 2-byte pairs in zero page, leaving 75% of ZP for globals and a handful of important runtime functions that benefit from self-modifying code). Split the 32 pseudo-registers up into A, S, T registers the same as you do for any RISC (or x86_64 for that matter).

It's no problem to handle recursion. Just as with any function call, put anything you still want after the call into S registers, and if the called function needs to use S registers then it must save them on a stack (NOT the 256-byte hardware stack) first and restore them afterwards. For more than one or two saved registers it's better to use a runtime function for this -- 12 cycles of overhead, but lots of bytes of code saved. RISC-V's -msave-restore provides a good model for this.

You can make a nice set of 2-address arithmetic routines between 16 or 32 bit Zero-Page registers by using the X and Y registers to pass which locations to work on.

    add16:
        clc
        lda 0,x
        adc 0,y
        sta 0,x
        lda 1,x
        adc 1,y
        sta 1,x
        rts
    
        ...
        ldx #dst
        ldy #src
        jsr add16
This reduces the call site from 13 bytes to 7 bytes (and you can often reuse an X or Y), while increasing the execution time for an inline add from 20 cycles to 42 (1 extra cycle per instruction for the indexed addressing = 6, 4 cycles for loading X and Y, 12 cycles for the jsr/rts).
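That byte and cycle arithmetic can be sanity-checked mechanically. A quick sketch using standard NMOS 6502 sizes and timings (zero-page access = 3 cycles, zero-page indexed = 4, clc = 2, immediate load = 2, jsr = 6, rts = 6):

```python
# Sanity check of the inline-vs-routine trade-off for a 16-bit add.
# Sizes: zp instruction = 2 bytes, clc = 1 byte, ldx/ldy # = 2 bytes, jsr = 3 bytes.

inline_bytes = 1 + 6 * 2                    # clc + six zero-page instructions
inline_cycles = 2 + 6 * 3                   # clc + six 3-cycle zp accesses
call_bytes = 2 + 2 + 3                      # ldx # + ldy # + jsr
call_cycles = 2 + 2 + 6 + (2 + 6 * 4) + 6   # setup + jsr + routine body + rts

print(inline_bytes, call_bytes)     # 13 7
print(inline_cycles, call_cycles)   # 20 42
```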

For 32 bit arithmetic the time overhead is much reduced. For floating point it would be negligible!

There is no convenient way to make 3-address routines, but `mov16` is only 5 instructions and 28 cycles. Or just 4 instructions (8 bytes) and 12 cycles if inlined, which might be a better tradeoff i.e.

        lda src1
        sta dst
        lda src1+1
        sta dst+1
        ldx #dst
        ldy #src2
        jsr add16
vs

        ldx #dst
        ldy #src1
        jsr mov16
        ldy #src2
        jsr add16

Stack-based 6502 code is both bigger and slower than (pseudo) register-based code -- a zero-argument `jsr add16` itself is smaller, but it's significantly slower, and loading and storing values between stack and somewhere else will be much more code and much slower.

Stack-based bytecode has a whole lot more overhead again to fetch the next instruction and dispatch to the correct code for it. Register-based bytecode would be a much better idea, as it uses several times fewer bytecode instructions, and is more compact to boot (see Dalvik vs JVM ... sad that webasm didn't pay attention).
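As a toy illustration of the dispatch-count difference (these are made-up bytecodes, not any real VM): executing `x = x + y` takes four dispatches on a little stack machine but only one on a register machine.

```python
# Hypothetical bytecodes, for illustration only.
# Stack machine: each opcode is fetched and dispatched separately.
def run_stack(code, vars):
    stack, pc, dispatches = [], 0, 0
    while pc < len(code):
        op = code[pc]; pc += 1; dispatches += 1
        if op == "load":    stack.append(vars[code[pc]]); pc += 1
        elif op == "store": vars[code[pc]] = stack.pop(); pc += 1
        elif op == "add":   b = stack.pop(); stack.append(stack.pop() + b)
    return dispatches

# Register machine: one 3-operand instruction does the whole statement.
def run_reg(code, regs):
    dispatches = 0
    for op, d, a, b in code:
        dispatches += 1
        if op == "add": regs[d] = regs[a] + regs[b]
    return dispatches

v = {"x": 3, "y": 4}
n_stack = run_stack(["load", "x", "load", "y", "add", "store", "x"], v)
r = {0: 3, 1: 4}
n_reg = run_reg([("add", 0, 0, 1)], r)
print(v["x"], n_stack)   # 7 4
print(r[0], n_reg)       # 7 1
```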

replies(1): >>45160971 #
kragen ◴[] No.45160971[source]
Typically I've found that stack-based bytecode uses about twice as many instructions as register-based bytecode but significantly less space. (I'm sorry that's kind of handwavy; I'd be delighted to see a more rigorous study of the question.) For Wasm I don't think the number of instructions was a concern, since that only affects the time to "load" it by compiling to native code, but size seems to have been a primary concern.

I agree about the general approach to taking advantage of the zero page, but I don't have the 6502 competency to comment on your proposed instruction sequences.

replies(1): >>45163104 #
brucehoult ◴[] No.45163104[source]
> stack-based bytecode uses about twice as many instructions as register-based bytecode but significantly less space

That's not my experience, and I don't see how it can be true, even theoretically, for typical programs.

I don't at all mind more instructions, if that works out smaller or at least not bigger. E.g. RISC-V using `c.slli; c.srli` or `c.slli; c.srai` (all 2-byte opcodes) to extract unsigned or signed bitfields instead of having a 4-byte "extract field" instruction.

It is true that a stack-based ISA can have a size advantage in evaluating very complex expressions, such as those typical in scientific / engineering computing, when you get a lot of arithmetic operators in a row with no arguments needed.

But in general purpose programs the vast majority of lines of C code have only very simple expressions. `x=x+y`. `x=x+10`. `if x<y`.

Register ISAs provide short addressing for the most frequently-used 8 or 16 or 32 variables in a function (the registers). Experience shows 8 isn't really enough, 32 is usually a little more than you need -- I think some study showed 24 was enough but you don't get a code size advantage from that, so ...

Stack based ISAs do similarly. The JVM can load or store any of four local variables with a 1-byte instruction. For more than that you can refer to 256 variables with a 2-byte instruction. Four really isn't enough, but 256 is a waste of code size. WASM doesn't have any option to include a variable number in the main opcode byte -- `local.get 1` is two bytes.

A very typical statement in programs is `x += y`.

JVM 4 bytes:

    iload 0  // 1A
    iload 1  // 1B
    iadd     // 60
    istore 0 // 3B
If x and y aren't in the first 4 locals then it's 7 bytes

WASM 7 bytes:

    (local.get 0) // 20 00
    (local.get 1) // 20 01
    (i32.add)     // 6A
    (local.set 0) // 21 00
ARMv7 / ARMv6-M / ARMv4T 2 bytes:

    add r0, r0, r1 // 08 44
RISC-V with C extension 2 bytes:

    add a0, a0, a1 // 2E 95
The Arm example works with 16 local variables (only the most common operations MOV, ADD, and CMP can use all 16 registers, other operations can only use 8), the RISC-V example works with all 32 registers (again only C.MV, C.ADD, C.ADDI, C.SLLI work with all 32 registers, other 2-byte opcodes use only 8).

8086, M68k, PDP-11, Super-H, MSP430 can all also do this very common operation with a 2-byte instruction, with anything from 8 to 16 local variables.

Exactly the same analysis applies to `x += 10`.

Most programs consist almost entirely of statements like these, not things like `det = a*(e*i − f*h) − b*(d*i − f*g) + c*(d*h − e*g)`

There are several problems with stack code here:

- you need four opcodes: `load`, `load`, `add`, `store`. Each opcode needs at minimum 3 or 4 bits, so no matter what you do you're going to need 12-16 bits just to specify the four actions. In contrast, the register machine can use a single `add` opcode, again using 3 or 4 bits.

- all the common stack ISAs use a minimum of 8 bits for an instruction (even the extremely common `iadd` or `i32.add`) and use a multiple of 8 bits for every instruction. Four instructions is going to be a minimum of 4 bytes.

- the register machine needs only one 3 or 4 bit opcode for `add` AND can pack the other operands arbitrarily into a 16-bit instruction, using whatever field size and position makes sense.

- the register machine needs to mention the dst/rs1 variable only once.

Stack machine code could be improved in size a little if you bit-packed the instructions, but you'll still have the problems of needing more opcode fields and needing to repeat variable names.
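To put rough numbers on that: even granting a hypothetical bit-packed encoding with 4-bit opcodes and 4-bit variable fields (tighter than any of the byte-oriented bytecodes above), the stack encoding of `x += y` still comes out wider than a 16-bit register instruction:

```python
OP, VAR = 4, 4   # hypothetical field widths in bits

# Stack machine: load x, load y, add, store x -- variable named three times
stack_bits = (OP + VAR) + (OP + VAR) + OP + (OP + VAR)

# Register machine: add rd, rs1, rs2 packed into one instruction
reg_bits = OP + VAR + VAR

print(stack_bits, reg_bits)  # 28 12
```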

Perhaps ironically, some accumulator machines do better. For example the much-hated PIC microcontroller actually does this well:

    MOVF  0x21, W    ; Load y (from address 0x21) into W register
    ADDWF 0x20, F    ; Add W (containing y) to x (at address 0x20), store result in x
Instructions such as ADDWF have a bit to specify whether to put the result of the add into W (the accumulator) or the source register.

On the smaller chips each of those instructions is 12 bits (with 32 registers, many with special purposes), though those are very limited and most applications use mid-range chips with 14 bit instructions (allowing 128 directly accessible registers).

So this is 24 or 28 bits of code, still not as compact as the 16 bits for that extensive list of register machines: RISC-V, Arm Thumb, 8086, M68k, PDP-11, Super-H, MSP430.

replies(1): >>45163508 #
kragen ◴[] No.45163508[source]
Thank you! This is indeed very thought-provoking!

It certainly does depend on which stack machine. I'm surprised that Wasm doesn't have a one-byte encoding for (local.get 0) or (local.set 0)! As you point out, even the JVM has one. My survey of bytecode instruction sets in https://dercuano.github.io/notes/tiny-interpreters-for-micro... found such an 8-bit or 5-bit instruction in nearly every example except CPython, with a 3-to-5-bit operand field in the local.get and local.set instruction byte, but of course Wasm didn't exist in 02007. The JVM only offering 2 bits is atypically stingy.

RVC and Thumb2 are pretty extreme cases of optimized register-machine encoding, and probably not something you'd want to decode in software on a 6502. The comparable stack-machine code would be something like Chuck Moore's x18/F18A with its four instructions per 18-bit word or James Bowman's J1A.

It also depends, as you say, on the code. Most code is somewhere in between the extremes of x += y and your example of

    det = a*(e*i − f*h) − b*(d*i − f*g) + c*(d*h − e*g)
with statements like these:

  dstRect.bottom := Max(dstRect.bottom,dstRect.top + txMinHeight);
  thePort^.grafProcs := Nil;   { restore to normal }
  if (n != 0 && --n != 0) ... 
  target_command = entry_i + CAM_CONTENT_COUNT * j;
  tfd = open(ptr, O_RDWR|O_TRUNC|O_EXCL|O_CREAT, 0600);
  for (size_t i = 1; i < n; i++) ...
  if (event == EV_enter_state) ... else if (event == EV_tick) ... else if (event == EV_reenter_state) ...
Those are all taken from my examples in https://dernocua.github.io/notes/c-stack-bytecode.html. There are a lot of subroutine calls, some reuse of the same value in multiple computations, many more variable reads than writes, and an occasional infix expression with more than one operator. These are all factors that favor stack instruction sets more than x += y does.

Subroutine calls in particular are typically a single bytecode once all the operands are on the stack, referring to a subroutine name or address stashed elsewhere with a short literal index, which similarly is usually allocated a 3–5-bit field in the opcode byte. Very popular subroutines like car or #at:put: can get their own bytecode.

There's a bit of a difference between the 8 registers in the PDP-11 or the 8086 and the 3-bit local variable field in an Emacs Lisp bytecode byte. The stack pointer soaks up one of those slots; on the PDP, the program counter soaks up another. (On ARM the link register sometimes is a third. Thumb and RVC do better here by excluding these registers from their 3-bit namespaces.) Analogously, one of the 8 local-variable indices in Elisp is "see next byte". Also, though, stack machines don't need to spend local variable indices on temporaries, and they often have instance-variable-access opcodes as well. So in effect register machines have fewer local variables than it at first appears.

Incidentally, as that note mentions, Aztec C for the 6502 did support a bytecode interpreter to more than halve code size:

> As an alternative, the pseudo-code C compiler, CCI, produces machine language for a theoretical machine with 8, 16 and 32 bit capabilities. This machine language is interpreted by an assembly language program that is about 3000 bytes in size.

> The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.

But I don't know whether it was a stack bytecode, an accumulator bytecode, a register bytecode, or something weirder.

My experience programming in FORTH is that often I can profitably keep about one variable on the operand stack, so accessing that variable takes no instructions and no bytes. In theory this is something that a compiler could do, but the only case I know of is when you store to a variable in Smalltalk inside a larger expression. Similarly it's common in FORTH to spend no instructions on accessing arguments or returning a value; you just don't drop it before returning.

replies(2): >>45164157 #>>45164800 #
1. brucehoult ◴[] No.45164800[source]
With respect to your tiny interpreters page, I'd just like to point out that Thumb code for recursive fib() is 26 bytes, very close to Squeak (this is out of GCC with -Os, I haven't checked if I can do better manually).

    00010448 <fib>:
       10448:       b570            push    {r4, r5, r6, lr}
       1044a:       0004            movs    r4, r0
       1044c:       2500            movs    r5, #0
       1044e:       2c01            cmp     r4, #1
       10450:       dd05            ble.n   1045e <fib+0x16>
       10452:       1e60            subs    r0, r4, #1
       10454:       f7ff fff8       bl      10448 <fib>
       10458:       3c02            subs    r4, #2
       1045a:       182d            adds    r5, r5, r0
       1045c:       e7f7            b.n     1044e <fib+0x6>
       1045e:       1960            adds    r0, r4, r5
       10460:       bd70            pop     {r4, r5, r6, pc}
RISC-V is 30 bytes using newish instructions found on e.g. RP2350

    00010186 <fib>:
       10186:       b872                    cm.push {ra,s0-s2},-16
       10188:       842a                    mv      s0,a0
       1018a:       4481                    li      s1,0
       1018c:       4905                    li      s2,1
       1018e:       00895863                bge     s2,s0,1019e <fib+0x18>
       10192:       fff40513                addi    a0,s0,-1
       10196:       3fc5                    jal     10186 <fib>
       10198:       1479                    addi    s0,s0,-2
       1019a:       94aa                    add     s1,s1,a0
       1019c:       bfcd                    j       1018e <fib+0x8>
       1019e:       00940533                add     a0,s0,s1
       101a2:       be72                    cm.popret       {ra,s0-s2},16
The instructions are identical; the difference comes from RISC-V not being able to encode 3-address adds in 2-byte instructions while Arm can.

Falling back to original 2015 RV32IC (no newfangled extensions) adds 2 bytes to 32:

    0001018a <fib>:
       1018a:       050002ef                jal     t0,101da <__riscv_save_0>
       1018e:       842a                    mv      s0,a0
       10190:       4481                    li      s1,0
       10192:       4905                    li      s2,1
       10194:       00895863                bge     s2,s0,101a4 <fib+0x1a>
       10198:       fff40513                addi    a0,s0,-1
       1019c:       37fd                    jal     1018a <fib>
       1019e:       1479                    addi    s0,s0,-2
       101a0:       94aa                    add     s1,s1,a0
       101a2:       bfcd                    j       10194 <fib+0xa>
       101a4:       00940533                add     a0,s0,s1
       101a8:       a899                    j       101fe <__riscv_restore_0>
MSP430 comes to 38 bytes, primarily because all literals and offsets are a full 16 bits.

    00000000 <fib>:
       0:   1a 15           pushm   #2,     r10     ;16-bit words
       2:   0a 4c           mov     r12,    r10     ;
       4:   49 43           clr.b   r9              ;
    
    00000006 <.L3>:
       6:   5c 43           mov.b   #1,     r12     ;r3 As==01
       8:   0c 9a           cmp     r10,    r12     ;
       a:   00 34           jge     $+2             ;abs 0xc
       c:   0c 4a           mov     r10,    r12     ;
       e:   3c 53           add     #-1,    r12     ;r3 As==11
      10:   b0 12 00 00     call    #0              ;
      14:   3a 50 fe ff     add     #-2,    r10     ;#0xfffe
      18:   09 5c           add     r12,    r9      ;
      1a:   30 40 00 00     br      #0x0000         ;
    
    0000001e <.L2>:
      1e:   0c 4a           mov     r10,    r12     ;
      20:   0c 59           add     r9,     r12     ;
      22:   19 17           popm    #2,     r10     ;16-bit words
      24:   30 41           ret                     
But as mentioned in another message it's a very simply-encoded ISA which would be easy to write an interpreter for, e.g.

    0a 4c           mov     r12,    r10
    
    0: register addressing mode, Word size
    a: dst = r10
    4: mov
    c: src = r12
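The nibble structure really is that regular for two-operand (format I) instructions: bits 15:12 opcode, 11:8 src reg, 7 Ad, 6 B/W, 5:4 As, 3:0 dst reg. A minimal Python sketch of a decoder for this one case (register addressing only, a handful of two-operand opcodes):

```python
# Decode an MSP430 format-I (two-operand) instruction word,
# assuming register addressing mode for both operands (Ad=0, As=00).
OPS = {0x4: "mov", 0x5: "add", 0x6: "addc", 0x8: "sub", 0x9: "cmp"}

def decode(word):
    op = OPS[word >> 12]            # bits 15:12
    src = (word >> 8) & 0xF         # bits 11:8
    bw = "b" if (word >> 6) & 1 else "w"  # bit 6: byte or word operation
    dst = word & 0xF                # bits 3:0
    return f"{op}.{bw} r{src}, r{dst}"

# bytes "0a 4c" little-endian -> word 0x4C0A
print(decode(0x4C0A))  # mov.w r12, r10
```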
replies(1): >>45166149 #
2. kragen ◴[] No.45166149[source]
Both the RISC-V and Thumb code here are no longer recursive fib; GCC has quite correctly transformed one of the recursions into a normal nonrecursive loop by exploiting the commutative and associative properties of integer addition. I infer that perhaps the same thing may be true of the MSP430 code from the presence of only a single call instruction, but I don't see the loop; perhaps the br to 0 will be relinked to go to .L3 before execution?

I haven't seen MSP430 code before, due to my somewhat embarrassing and unfair distaste for TI as a company, and I agree that it is a very nice encoding indeed!

If you insist that the code contain at least two recursive calls, two subtractions, an addition, and a conditional for the base case, even Thumb2 and RVC no longer look so competitive with stack machines. If not, there is no particular reason to leave one of the recursions in; you may as well optimize that out too, thus reducing an exponential-time algorithm to a linear-time one (or linear to logarithmic, if the parameter is the number of bits of input). This also makes it smaller, something like (untested):

      mov r1, #1
      mov r2, #1
  1:  subs r0, #1
      itt lo
      movlo r0, r1
      bxlo lr
      mov r3, r2
      adds r2, r1
      mov r1, r3
      b 1b
I think that's maybe 22 bytes?
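A direct Python transcription of that register dance (my naming, for checking the logic; note that as written it produces the 1, 1, 2, 3, 5, ... sequence, i.e. fib(0) = 1):

```python
def fib_regs(r0):
    # r1, r2 hold consecutive Fibonacci numbers; r3 is the scratch register
    r1, r2 = 1, 1
    while True:
        r0 -= 1
        if r0 < 0:              # the "subs / bxlo" early exit
            return r1
        r1, r2 = r2, r1 + r2    # the mov/adds/mov shuffle

print([fib_regs(n) for n in range(7)])  # [1, 1, 2, 3, 5, 8, 13]
```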

But I don't think that tells us much about the relative densities of stack code, register code, accumulator code, and belt code, because it's a very different mix of operations.

replies(2): >>45167017 #>>45167492 #
3. kragen ◴[] No.45167017[source]
That's 24 bytes because gas is using Thumb2 mov.w instructions to initialize r1 and r2. It took me a long time to figure out that the problem was mov vs. movs; this is the correct 20-byte version:

            .cpu cortex-m4
            .thumb
            .syntax unified
            .thumb_func
            .globl fib
    fib:    movs r1, #1
            movs r2, #1
        1:  subs r0, #1
            itt lo
            movlo r0, r1
            bxlo lr
            mov r3, r2
            adds r2, r1
            mov r1, r3
            b 1b
I swear, I keep making the same n00b mistakes, year after year!

It might be possible to get it down to 18 bytes, but I don't see how.

4. brucehoult ◴[] No.45167492[source]
> Both the RISC-V and Thumb code here are no longer recursive fib

If the programmer wrote a doubly-recursive fib() and the compiler transformed one to tail-recursion, then that's a doubly-recursive fib() as far as I'm concerned. If you don't want that to happen then use a benchmark that isn't so easily optimised, such as maybe Ackermann.

Note that this optimisation doesn't change the calculation performed or the execution time from being exponential, it merely saves some stack manipulation.

Having access to good compilers is part of the attraction of using a real ISA instead of a made-up one.

Changing the algorithm to an iterative linear-time form is not an optimisation I'd expect a compiler to do.