I've been looking, literally for years, for an 8-bit microcontroller to get excited about, dig into deeply, and turn into a convivial tool. I still haven't found what I'm looking for (apologies to U2 and Negativland!) and went as far as writing an essay entitled "Eight bit microcontrollers are obsolete!", based mostly on my belief that the ARM7 was going to be the next 8051.
I still really like the little 8-bit machines. They are cheap - still cheaper than their 32-bit "cousins" - and are much more approachable by novices to the art - something that is actually important to me, as I'm not thinking about using them as black boxes buried deep inside something else - the classic embedded control model - but instead as paint and canvas. I want them to be convivial tools - both for me and for others as well.
Disillusioned with the ARM world (both ARM7 and Cortex-M3) and irritated again with the PIC and AVR options (the PICs are too expensive; the AVRs have crummy Flash and are also expensive), I recently took another deep look at Freescale's HCS08 families, and for the most part liked what I saw.
What can these materials do?
Before getting into the meat of things, I'd like to acknowledge something that Steve Davee said in his recent DorkbotPDX talk that continues to inspire me. He was talking about teaching art to kids. Instead of putting a bunch of art materials in front of them and saying "What can you do with these materials?" - which would cause me, even as an adult, to freeze up creatively - he says "What can these materials do? Let's find out!"
That's what I feel I'm doing right now. I'm exploring the HCS08 with the mantra in my head:
What can these materials do?
I'm finding it very helpful, and also sustaining.
Shoehorning Forth
What are the issues we face when considering putting Forth on the S08?
When figuring out how to put Forth on any architecture, there are two main issues to consider, and they are intertwingled:
- native or threaded code?
- how are we using the registers?
With the HC08, there are so few index (pointer) registers that any kind of threading, while possible, seems inordinately expensive. The HX register will have to do double duty (at least) as the IP and a stack pointer; all the saving & restoring will take time; and it will prove difficult - as in the case of NEST, which needs to push the IP onto the return stack - to have a "place to stand" as we're copying values around.
Don't get me wrong - it's doable, and may even be worth doing. But the high overhead doesn't appeal to me. I'd like Forth on this machine to rival C - and it might even beat it.
Native code compilation means that the body of a high-level word (a so-called "colon" word, because they are defined by the : operator) is going to be simply code - mostly a series of calls to other pieces of code.
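To make that concrete, here is a minimal sketch of what a compiled colon word might look like. The definition and the labels double and dup are hypothetical, chosen only for illustration; add is the addition routine shown later on.

```asm
; hypothetical source:  : double  dup + ;
double: jsr dup   ; duplicate the top of the data stack (assumed to exist elsewhere)
        jsr add   ; add the top two cells
        rts       ; return through the hardware (return) stack
```

Since the last call is immediately followed by rts, a compiler could presumably emit jmp add instead, saving the rts byte and one round trip through the return stack.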
It's sometimes useful even when not doing any kind of threading to represent literals in the normal Forth way:
jsr push_literal
<literal value>
Again, on this architecture, the overhead of popping an address (to get the address of the literal), fetching both bytes, and pushing them onto the data stack, is quite unappealing. On a machine with more registers, we would simply do one or more load immediates, and then call or jump into the word that expects a literal. On the HC08 there is only one 8-bit register free for this purpose: the accumulator (A). This works fine for 8-bit literals. But what about 16-bit?
I found a nice solution, but before I explain it, let's talk about the register allocation. Because we are compiling native code, the hardware stack is the return stack, and we simply use jsr/rts to do our "threading". There is only one option for the data stack pointer, and that's HX, which will occasionally have to do double duty as a generic pointer register, and in those cases we'll have to save and restore its value around these other uses so we don't clobber the data stack pointer. That's overhead we'll have to live with.
We can reduce it somewhat by adopting the convention that the data stack lives in the zero page. In this case, H is always 00, and X is really the stack pointer. (Just like the HC05 from which the HC08 is descended!) When we have to use HX to point to arbitrary locations, we only have to save X; when we're done with HX, we restore X, and clear H (which only takes one cycle, instead of the three or more needed to restore its value from the R stack or memory). I'll try it both ways and maybe try to measure the performance difference.
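Here is a rough sketch of that save-and-restore convention. It assumes the data stack lives in the zero page and, purely for illustration, that the top cell sits at 2,x (high byte) and 3,x (low byte), which is the layout adopted further down. The routine name cfetch is made up; it implements a byte fetch (Forth's c@), replacing an address on the stack with the byte stored there. It also relies on the HCS08's indexed form of ldhx, which the older HC08 lacks.

```asm
; c@ ( addr -- byte )   hypothetical label; zero-page data stack assumed
cfetch: pshx        ; save X, the data stack pointer, on the return stack
        ldhx 2,x    ; HX = address taken from the top cell
        lda ,x      ; A = byte at that address
        pulx        ; restore X
        clrh        ; stack is in the zero page, so H is simply 0
        clr 2,x     ; top_hi = 0
        sta 3,x     ; top_lo = fetched byte
        rts
```

Without the zero-page convention, H would have to be pushed and pulled as well, in place of the single clrh.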
So, given that HX points to the data stack, we could adopt the (obvious) convention that 0,x and 1,x are the top of stack; 2,x and 3,x are the second; etc. But it turns out that wonderful things happen if we instead pre-allocate a 16-bit cell as scratch space at the top of the stack. The real top moves to 2,x and 3,x; and second moves to 4,x and 5,x.
Why is this so useful?
Because we are so register starved, it's often nice to have an "extra hand" - a place to put a byte of literal data, a count, the saved value of A, etc. Having this scratch space available makes programming in assembler much easier, and it allows for the arithmetic and logical operators to share code between their literal and non-literal versions. I'll explain.
Let's first look at the code for + (add), which consumes the top two values on the stack, adds them, and pushes the result. Another way to think about it is that it pops the top value and adds it to the second, making the sum the new top. Here is add:
; stack (,x) offsets - this is a big-endian machine!
; 0 scratch_hi
; 1 scratch_lo
; 2 top_hi
; 3 top_lo
; 4 second_hi
; 5 second_lo
add: lda 3,x ; top_lo into A
aix #2 ; add 2 to HX; top becomes scratch; second becomes top
add_imm: add 3,x ; A = top_lo + prev_top_lo
sta 3,x
lda 0,x ; scratch_hi = old top_hi
adc 2,x ; add top_hi
sta 2,x ; save sum_hi
rts
I can write subtract, and, or, xor the same way. What's neat about this is that instead of calling add to add two values on the stack, I can add a literal to the top value by doing this:
lda #lit_hi
sta 0,x ; save in scratch_hi
lda #lit_lo
jsr add_imm
There is a subtle correspondence here. The literal version loads the high half of the literal into scratch_hi (0,x), and the low half into A, then calls add_imm. The code at add does a very similar thing: it loads top_lo into A, and leaves top_hi where it is; but by popping the stack, top_hi "moves" into scratch_hi, where add_imm expects it - but it hasn't actually moved! It just "happens" to end up in the right place.
I think this is a very neat solution to the problem of having too few registers on the HC08. I wonder if any other compilers leave this kind of "scratch space" on the stack by convention? I'm stunned at how useful it is.
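As an illustration of how the other operators mentioned above might follow the same shape, here is a sketch of and and subtract. The labels and16, and16_imm, sub16 and sub16_imm are my own naming, not the author's, and the subtract routine is just one possible arrangement: because operand order matters, it parks the low byte of the subtrahend in scratch_lo (1,x) so that the value being subtracted from can be loaded into A.

```asm
; same stack layout as add: scratch at 0,x/1,x; top at 2,x/3,x

and16:     lda 3,x   ; A = top_lo
           aix #2    ; pop: old top now sits in the scratch cell
and16_imm: and 3,x   ; A = old top_lo AND new top_lo
           sta 3,x
           lda 0,x   ; scratch_hi = old top_hi (or literal_hi)
           and 2,x
           sta 2,x
           rts

; - ( a b -- a-b )
sub16:     lda 3,x   ; A = b_lo
           aix #2    ; pop: b_hi is now scratch_hi; a is the new top
sub16_imm: sta 1,x   ; park b_lo (or literal_lo) in scratch_lo
           lda 3,x   ; A = a_lo
           sub 1,x   ; A = a_lo - b_lo; C = borrow
           sta 3,x   ; sta and lda leave C alone
           lda 2,x   ; A = a_hi
           sbc 0,x   ; A = a_hi - b_hi - borrow
           sta 2,x
           rts
```

The literal forms are called exactly as with add_imm: literal_hi goes into 0,x, literal_lo into A, followed by a jsr to the _imm entry point.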
However, it doesn't end there! We can do more. By recognising that each byte of a bitwise logical operation is independent, if we want to compile a logical op with a literal value that only affects one byte, we can inline:
lda #$FE
and 2,x ; and with top_hi
sta 2,x
On a threaded Forth with two-byte (16 bit) literals, this would take six bytes of code - which is exactly what our inline version takes! But it runs much faster.
We can apply the same optimisation when adding or subtracting a literal whose low byte is zero. Carry propagates from less significant bits upward, not the other way! So there is no way that an add or subtract in the high byte can affect the low byte.
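For instance, adding or subtracting the literal $0300 only needs to touch top_hi. A sketch, assuming the stack layout above (top_hi at 2,x); the value $0300 is just an example:

```asm
; add $0300 to the top of stack, inline
        lda #$03   ; literal_hi
        add 2,x    ; add to top_hi; top_lo is untouched
        sta 2,x

; subtract $0300 from the top of stack, inline
        lda 2,x    ; top_hi
        sub #$03   ; subtract literal_hi
        sta 2,x
```

By my count each sequence is six bytes, the same as the inline logical op.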
We can make another small optimisation as well. Loading a full 16-bit literal (into A and scratch_hi) takes 5 bytes. What about shorter literals? Are there other (shorter) code sequences we can use? It turns out that there are. For literal values between $FF00 (-256) and $01FF (511) we can load the value in 3 or 4 bytes. Here is the code:
clr 0,x ; clear scratch_hi
lda #lo ; low half of literal
; for values between 0 and $FF; 3 bytes
clr 0,x ; clear scratch_hi
com 0,x ; scratch_hi = FF; we could use dec 0,x as well
lda #lo ; low half of literal
; for values between $FF00 and $FFFF; 4 bytes
clr 0,x ; clear scratch_hi
inc 0,x ; scratch_hi = 01
lda #lo ; low half of literal
; for values between $0100 and $01FF; 4 bytes
These code sequences are short because the "indexed with zero offset" instructions are only a byte long!
One last comment about literals. Mostly we want to load a literal value and then call some code that consumes it, and we avoid pushing it onto the stack first. On the rare occasions that we want to leave a literal value on the stack - as a return value or flag, e.g. - we simply compile the load literal, as above - either 16-bit or 8-bit versions - followed by code to "push" the loaded literal, which is these two instructions:
sta 1,x ; save literal_lo in scratch_lo
aix #-2 ; "promote" scratch to top by allocating a stack cell
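Putting the pieces together, leaving a true flag ($FFFF, i.e. -1) on the stack might look like the following. This is only a sketch that combines the short-literal load for $FF00-$FFFF with the two-instruction push; the choice of -1 as the flag value is my assumption, following common Forth practice.

```asm
; leave -1 ($FFFF) on the data stack
        clr 0,x    ; scratch_hi = 00
        com 0,x    ; scratch_hi = FF
        lda #$FF   ; literal_lo
        sta 1,x    ; scratch_lo = FF
        aix #-2    ; scratch becomes top; a fresh scratch cell is allocated
```

By my count that is eight bytes, with no subroutine call involved.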
I think this is a pretty compelling demonstration that with a bit of thought and cleverness we can fit Forth pretty nicely on this machine. I'm hoping it won't take up too much space, and that it'll run like a cat with its tail on fire! (apologies to PETA and the SPCA).