2009 January 23 17:34

Freescale forums save me again

When I started the long bootstrapping process (that I’m still deeply mired in) I initially had trouble getting my breadboarded 908QB8 to talk to the serial port. (These parts have a built-in ROM monitor that bitbangs a serial protocol on a port pin.) After a day or so of going crazy, I found this thread on the 8-bit forum about the 908 parts requiring a really clean power-on reset (POR). Vdd needs to go to within a few tens of millivolts of ground before it comes back up.

I measured Vdd with the power disconnected from my breadboard, and, sure enough, it was stuck at 2v or so. The (cheap) voltmeter I used to measure this quickly discharged the on-board capacitors, and I got a good POR that way, and was able to get into the ROM monitor and talk to the chip.

By adding, between Vdd and Vss, a reverse-biased diode in series with a small resistance, I was able to get reliable resets. I would never have figured this out without the forum.

I’ve run into trouble in the second stage of my bootstrap as well: getting the 908QB8 to act as a background debug mode (BDM) host, talking to an S08QG8. I can measure BDM SYNC pulses with some measure of reliability and repeatability, but I can’t seem to get anything out of the background debug controller (BDC) on the S08.

Last night, while perusing the 8-bit forum for more posts about “trouble” with the DEMO boards’ built-in BDM interfaces, I stumbled across a link to my answer. It turns out that blank S08QG chips have to be powered up into active background mode, by holding the BKGD/MS (MS means “mode select”) pin LOW during POR. I’m doing nothing of the kind. I plug a cable into a 4-pin header on the breadboard, and everything powers up at the same time. The 908 comes up fine every time, but apparently the S08QG (and other low-pin-count devices that lack a dedicated RESET pin) are a bit hard to get reliably into BDM when they are blank. They power up and start executing random code – unprogrammed Flash, peripheral registers – and continually either experience COP (computer operating properly – ie, watchdog) resets or illegal instruction resets. There isn’t a long enough window between resets to get the device into active BDM mode – the protocol takes long enough that it’s basically not possible.

So I need to fashion a way to power cycle the S08 from the 908. Initially I’m going to jumper it by hand, just to make sure this works.

Another side effect of powering up an S08QG in active BDM is that its BDC operates from a 4MHz clock (the bus clock) rather than the dedicated “alternate BDC clock”, which is 8MHz out of reset. My 908 is running at 3.2MHz, and I figured out that bitbanging BKGD would be too slow if the target is running at 8MHz. So I dug up a 25MHz crystal and fashioned an oscillator, so I could run my 908 at 6.25MHz (bus clock).

Now I can go back to a simple setup, since my 3.2MHz 908 is going to be plenty fast to talk to a 4MHz S08.
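
As a rough sanity check on those clock numbers, here is a small Python sketch (a back-of-the-envelope calculation, not part of the actual setup). It assumes each BDC bit takes 16 target BDC clocks, and the function name `host_cycles_per_bit` is made up for illustration. It shows how many host bus cycles fit into one bit time for each host/target combination mentioned above.

```python
# Host bus cycles available per BDC bit, assuming 16 target BDC clocks per bit.
BDC_CLOCKS_PER_BIT = 16

def host_cycles_per_bit(host_bus_hz, target_bdc_hz):
    """Return how many host bus cycles elapse during one BDC bit time."""
    bit_time_s = BDC_CLOCKS_PER_BIT / target_bdc_hz
    return bit_time_s * host_bus_hz

# (host bus clock, target BDC clock) pairs taken from the text
for host_hz, target_hz in [(3.2e6, 8e6), (6.25e6, 8e6), (3.2e6, 4e6)]:
    cycles = host_cycles_per_bit(host_hz, target_hz)
    print(f"host {host_hz / 1e6:.2f} MHz, target {target_hz / 1e6:.0f} MHz: "
          f"{cycles:.1f} host cycles per bit")
```

A 3.2MHz host gets only about six and a half bus cycles per bit against an 8MHz BDC clock, but roughly twice that against a 4MHz one, which is about the same budget the 6.25MHz host had against the 8MHz target.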

Sorry about all the acronyms!

2009 January 18 20:11

Is the Freescale ecosystem healthy?

I’m a bit worried about it.

It seems that Freescale or P&E Micro are intentionally hobbling the USB-BDM interfaces on Freescale’s DEMO boards, forcing people to buy development hardware that they don’t need – like an external USB-Multilink.

One of Freescale’s avowed core values is “to be the most ethical company in the business”. I think they have some self-reflection to do here, and in fact I think the user community needs to take them to task.

Folks in the ARM world crow about the “ARM ecosystem” and how wonderful it is to have multiple chip, development board, and software tool vendors. It was something that attracted me to the ARM architecture for a long time.

The 8-bit world is different. Except for the 8051, the popular 8-bit architectures – HC08, PIC, AVR – are single-sourced. Every architecture has a development tool story as well. Except for AVR, there are few to no free tools for the 8-bit chips.

In the Freescale world, people pretty much use CodeWarrior (CW) exclusively, and it’s easy to buy relatively inexpensive ($50 to 100) “demo” boards from Freescale to try out a chip family. There is a free version of CW, but it is limited in the code size that it will compile. That’s another story, and it deserves its own post.

I found out about this “hardware hobbling” by doing some research on Freescale’s 8-bit processor forum. I was considering buying a DEMOJM board. This is a “motherboard”, with built-in USB-to-BDM converter, some I/O, and a square of headers to plug a daughtercard into. It comes with two daughtercards: one for the S08JM60, and one for the MCF51JM128 (a ColdFire part). It looks like a good way to get started with the JM family – these are the nifty USB parts – and because it has a built-in USB-BDM, I figured I could use it to program parts from other families – at least other 5v parts.

But I found some alarming posts on the forum. People were asking “can I use a DEMOxx board to program other parts?” and others were responding “sure, the DEMOxx has a full USB-BDM interface built in.” Then the trouble started. Someone with a DEMOQE board – which, like the DEMOJM comes with two daughtercards – couldn’t program, using the built-in USB-BDM, anything other than the chips on the daughtercards. These were S08QE chips too – not chips from some other family.

It turns out that the latest version of CodeWarrior – version 6.2 – has code that intentionally breaks these boards! It’s possible, by replacing a few DLLs, to fix the problem.

I really think that the community needs to let Freescale know that this kind of thing is hardly ethical behaviour. I don’t want my black-box dev tools, when upgraded, to silently break perfectly good hardware that I’ve legitimately purchased! At the very least this creates confusion and leaves a bad taste. “This used to work; why doesn’t it work any more?” Or, “this board has a USB-BDM; I bet I can use that on these other chips!” There is no official explanation, when it doesn’t work, about why, and I think this needs to be remedied.

Having had the experience of trying something that should work, but doesn’t, people are going to respond in one of three ways – none of them good:

In chronological order, here are a few threads from the forum that document all this craziness.

2009 January 18 13:39

Is ColdFire next?

I spent several hours yesterday carefully reading ColdFire documentation, trying to understand what it was Freescale had accomplished, or at least what they left out of the 68000 to make ColdFire. You have to do more than glance thru the new instruction set. I did that a few weeks back and didn’t get the sense that much had changed.

It turns out they left out a lot. But not in a bad way. Now the architecture really is much more a RISC-like load-store machine.

Operand size changes:

Addressing mode changes:

Instructions left out entirely:

Instructions added:

Because the JM family (the USB parts) has ColdFire parts as well as S08, and they are not much more expensive – I may be experimenting with Forth on the ColdFire too. Especially if I get a JMBADGE.

2009 January 16 13:46

It’s all about the timing

I think my current interest in Freescale’s HC08 has everything to do with timing. I’ve looked at the architecture before, and haven’t been excited about it until now.

In fact, I’ve looked at it several times, initially in 2001 or 2002. I was overwhelmed then with how many families there were, and the lack of easily available development hardware. Also, the parts at that time were quite slow – I believe this was before the introduction of the S08. I ended up instead buying a Cygnal (now SiLabs) board, which I played around with a tiny bit. I was impressed with the idea of an 8-bit CPU running at 25MHz, but unimpressed with the idea of writing code for the 8051. And their parts are $15 in quantity one. The S08JM32 – a pretty comparable part, running at 24MHz, with 12-bit ADC, and the added joy of a USB interface – is $3 in quantity one.

I just discovered Freescale’s agitprop zine Beyond Bits, which has been published for the past three summers. Looking at back issues I realised that only now is their lineup compelling to me. The JM USB parts are new, the Flexis (AC, JM, QE) are new. The S08 is fairly new. All of this is recent.

When I looked at the HC08 in 2006, and sampled the 908QB8 and S08QG8 parts, there were very few parts that seemed interesting. For my purposes (wanting DIP16, wanting small package etc) these were really the only matches. They seemed like nice parts, but I wasn’t impressed that this was an architecture I could really grow with. There was a way better selection then of PIC and AVR parts, esp in DIPs.

Now things are dramatically different. I’m looking forward to digging deeply into this architecture!

2009 January 14 21:54

Putting Forth on the Freescale S08

I’ve been looking, literally for years, for an 8-bit microcontroller to get excited about, dig into deeply, and turn into a convivial tool. I still haven’t found what I’m looking for (apologies to U2 and Negativland!) and went as far as writing an essay entitled "Eight bit microcontrollers are obsolete", based mostly on my belief that the ARM7 was going to be the next 8051.

I still really like the little 8-bit machines. They are cheap – still cheaper than their 32-bit “cousins” – and are much more approachable by novices to the art – something that is actually important to me, as I’m not thinking about using them as black boxes buried deep inside something else – the classic embedded control model – but instead as paint and canvas. I want them to be convivial tools – both for me and for others as well.

Recently disillusioned with the ARM world – both ARM7 and Cortex-M3 – and irritated again with the PIC and AVR options – the PICs are too expensive, and AVRs have crummy Flash, and are also expensive – I recently took another deep look at Freescale’s HCS08 families, and for the most part liked what I saw.

What can these materials do?

Before getting into the meat of things, I’d like to acknowledge something that Steve Davee said in his recent DorkbotPDX talk that continues to inspire me. He was talking about teaching art to kids. Instead of putting a bunch of art materials in front of them and saying “What can you do with these materials?” – which would cause me, even as an adult, to freeze up creatively – he says “What can these materials do? Let’s find out!”

That’s what I feel I’m doing right now. I’m exploring the HCS08 with the mantra in my head: What can these materials do?

I’m finding it very helpful, and also sustaining.

Shoehorning Forth

What are the issues we face when considering putting Forth on the S08?

When figuring out how to put Forth on any architecture, there are two main issues to consider, and they are intertwingled:

With the HC08, there are so few index (pointer) registers that any kind of threading, while possible, seems inordinately expensive. The HX register will have to do double duty (at least) as the IP and a stack pointer; all the saving & restoring will take time; and it will prove difficult – as in the case of NEST, which needs to push the IP onto the return stack – to have a “place to stand” as we’re copying values around.

Don’t get me wrong – it’s doable, and may even be worth doing. But the high overhead doesn’t appeal to me. I’d like Forth on this machine to rival C – and it might even beat it.

Native code compilation means that the body of a high-level word (a so-called “colon” word, because they are defined by the : operator) is going to be simply code – mostly a series of calls to other pieces of code.

It’s sometimes useful even when not doing any kind of threading to represent literals in the normal Forth way:

    jsr push_literal
    <literal value>

Again, on this architecture, the overhead of popping an address (to get the address of the literal), fetching both bytes, and pushing them onto the data stack, is quite unappealing. On a machine with more registers, we would simply do one or more load immediates, and then call or jump into the word that expects a literal. On the HC08 there is only one 8-bit register free for this purpose: the accumulator (A). This works fine for 8-bit literals. But what about 16-bit?

I found a nice solution, but before I explain it, let’s talk about the register allocation. Because we are compiling native code, the hardware stack is the return stack, and we simply use jsr/rts to do our “threading”. There is only one option for the data stack pointer, and that’s HX, which will occasionally have to do double duty as a generic pointer register, and in those cases we’ll have to save and restore its value around these other uses so we don’t clobber the data stack pointer. That’s overhead we’ll have to live with.

We can reduce it somewhat by adopting the convention that the data stack lives in the zero page. In this case, H is always 00, and X is really the stack pointer. (Just like the HC05 from which the HC08 is descended!) When we have to use HX to point to arbitrary locations, we only have to save X; when we’re done with HX, we restore X, and clear H (which only takes one cycle, instead of the three or more needed to restore its value from the R stack or memory). I’ll try it both ways and maybe try to measure the performance difference.

So, given that HX points to the data stack, we could adopt the (obvious) convention that 0,x and 1,x are the top of stack; 2,x and 3,x are the second; etc. But it turns out that wonderful things happen if we instead pre-allocate a 16-bit cell as scratch space at the top of the stack. The real top moves to 2,x and 3,x; and second moves to 4,x and 5,x.

Why is this so useful?

Because we are so register starved, it’s often nice to have an “extra hand” – a place to put a byte of literal data, a count, the saved value of A, etc. Having this scratch space available makes programming in assembler much easier, and it allows for the arithmetic and logical operators to share code between their literal and non-literal versions. I’ll explain.

Let’s first look at the code for + (add), which consumes the top two values on the stack, adds them, and pushes the result. Another way to think about it is that it pops the top value and adds it to the second, making the sum the new top. Here is add:

    ; stack (,x) offsets - this is a big-endian machine!
    ; 0  scratch_hi
    ; 1  scratch_lo
    ; 2  top_hi
    ; 3  top_lo
    ; 4  second_hi
    ; 5  second_lo
    add:      lda 3,x    ; top_lo into A
              aix #2     ; add 2 to HX; top becomes scratch; second becomes top
    add_imm:  add 3,x    ; A = top_lo + prev_top_lo
              sta 3,x
              lda 0,x    ; scratch_hi = old top_hi
              adc 2,x    ; add top_hi
              sta 2,x    ; save sum_hi
              rts        ; return to the caller (add and add_imm are both reached via jsr)

I can write subtract, and, or, xor the same way. What’s neat about this is that instead of calling add to add two values on the stack, I can add a literal to the top value by doing this:

    lda #lit_hi
    sta 0,x     ; save in scratch_hi
    lda #lit_lo
    jsr add_imm

There is a subtle correspondence here. The literal version loads the high half of the literal into scratch_hi (0,x), and the low half into A, then calls add_imm. The code at add does a very similar thing: it loads top_lo into A, and leaves top_hi where it is; but by popping the stack, top_hi “moves” into scratch_hi, where add_imm expects it – but it hasn’t actually moved! It just “happens” to end up in the right place.
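
To make the correspondence concrete, here is a toy Python model of the stack convention. It is only a sketch: the `Machine` class, its `push` and `top` helpers, and the initial stack pointer value are invented for illustration, and only the instructions shown in the listings above are modelled. Each line is commented with the HC08 instruction it stands for.

```python
# Toy model: scratch cell at 0,x/1,x, top of stack at 2,x/3,x,
# second at 4,x/5,x. Big-endian, as on the HC08.

class Machine:
    def __init__(self):
        self.mem = bytearray(256)   # zero page holds the data stack
        self.x = 0xF0               # hypothetical initial stack pointer
        self.a = 0                  # accumulator
        self.c = 0                  # carry flag

    def push(self, value):
        """Helper for the demo: push a 16-bit cell onto the data stack."""
        self.x -= 2                                   # aix #-2
        self.mem[self.x + 2] = (value >> 8) & 0xFF    # new top_hi
        self.mem[self.x + 3] = value & 0xFF           # new top_lo

    def top(self):
        """Helper for the demo: read the 16-bit top of stack."""
        return (self.mem[self.x + 2] << 8) | self.mem[self.x + 3]

    def add_imm(self):
        """Entry: A = addend_lo, scratch_hi (0,x) = addend_hi."""
        total = self.a + self.mem[self.x + 3]             # add 3,x
        self.a, self.c = total & 0xFF, total >> 8
        self.mem[self.x + 3] = self.a                     # sta 3,x
        self.a = self.mem[self.x + 0]                     # lda 0,x
        total = self.a + self.mem[self.x + 2] + self.c    # adc 2,x
        self.a, self.c = total & 0xFF, total >> 8
        self.mem[self.x + 2] = self.a                     # sta 2,x

    def add(self):
        """Stack version: pop the top cell, then fall into add_imm."""
        self.a = self.mem[self.x + 3]                 # lda 3,x
        self.x += 2                                   # aix #2
        self.add_imm()

    def add_literal(self, lit):
        """Literal version: load scratch_hi and A, then call add_imm."""
        self.mem[self.x + 0] = (lit >> 8) & 0xFF      # lda #lit_hi / sta 0,x
        self.a = lit & 0xFF                           # lda #lit_lo
        self.add_imm()                                # jsr add_imm


m = Machine()
m.push(0x12FF)
m.push(0x0001)
m.add()                  # stack version: 0x12FF + 0x0001
print(hex(m.top()))
m.add_literal(0x0101)    # literal version: add 0x0101 to the top
print(hex(m.top()))
```

Both `add` and `add_literal` end up in the same `add_imm` body; the only difference is how the high byte of the addend arrives in the scratch cell.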

I think this is a very neat solution to the problem of having too few registers on the HC08. I wonder if any other compilers leave this kind of “scratch space” on the stack by convention? I’m stunned at how useful it is.

However, it doesn’t end there! We can do more. By recognising that each byte of a bitwise logical operation is independent, if we want to compile a logical op with a literal value that only affects one byte, we can inline:

    lda #$FE
    and 2,x     ; and with top_hi
    sta 2,x

On a threaded Forth with two-byte (16 bit) literals, this would take six bytes of code – which is exactly what our inline version takes! But it runs much faster.

We can also do this optimisation if we are adding or subtracting a literal with zeroes in the low byte. Carry propagates from less significant bits upward, not the other way! So there is no way that an add or subtract in the high byte can affect the low byte.
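
The carry argument can be checked numerically. The following Python sketch (the helper names are invented for illustration) compares a full 16-bit add against an add that only touches the high byte, which is what an inline lda #hi / add 2,x / sta 2,x sequence would do, over a sample of stack values and every literal whose low byte is zero.

```python
def add_full(top, literal):
    """Reference: ordinary 16-bit add with wraparound."""
    return (top + literal) & 0xFFFF

def add_high_byte_only(top, literal):
    """Add only the high bytes; leave the low byte of top untouched."""
    assert literal & 0xFF == 0, "literal must have a zero low byte"
    hi = ((top >> 8) + (literal >> 8)) & 0xFF
    return (hi << 8) | (top & 0xFF)

# Sample of top-of-stack values crossed with every $xx00 literal.
mismatches = sum(
    add_full(top, lit) != add_high_byte_only(top, lit)
    for top in range(0, 0x10000, 97)
    for lit in range(0, 0x10000, 0x100)
)
print("mismatches:", mismatches)
```

Because the low byte of the literal is zero, the low-byte add can never generate a carry, so the two results agree for every pair tried.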

We can make another small optimisation as well. Loading a full 16-bit literal (into A and scratch_hi) takes 5 bytes. What about shorter literals? Are there other (shorter) code sequences we can use? It turns out that there are. For literal values between $FF00 (-256) and $01FF (511) we can load the value in 3 or 4 bytes. Here is the code:

    ; for values between 0 and $FF; 3 bytes
    clr 0,x     ; clear scratch_hi
    lda #lo     ; low half of literal

    ; for values between $FF00 and $FFFF; 4 bytes
    clr 0,x     ; clear scratch_hi
    com 0,x     ; scratch_hi = FF; we could use dec 0,x as well
    lda #lo     ; low half of literal

    ; for values between $0100 and $01FF; 4 bytes
    clr 0,x     ; clear scratch_hi
    inc 0,x     ; scratch_hi = 01
    lda #lo     ; low half of literal

These code sequences are short because the “indexed with zero offset” instructions are only a byte long!
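
A compiler would need to pick between these sequences based on the literal's value. Here is one way that choice might look, sketched in Python; the function name `literal_load` and the per-instruction byte counts in the tuples are assumptions for illustration, based on the sizes quoted above (one byte for the zero-offset indexed instructions, two for an immediate load).

```python
def literal_load(value):
    """Return (instructions, byte_count) for loading a 16-bit literal
    into scratch_hi (0,x) and A, picking the shortest known sequence."""
    value &= 0xFFFF
    hi, lo = value >> 8, value & 0xFF
    lda_lo = (f"lda #${lo:02X}", 2)
    if hi == 0x00:        # $0000 to $00FF
        seq = [("clr 0,x", 1), lda_lo]
    elif hi == 0xFF:      # $FF00 to $FFFF
        seq = [("clr 0,x", 1), ("com 0,x", 1), lda_lo]
    elif hi == 0x01:      # $0100 to $01FF
        seq = [("clr 0,x", 1), ("inc 0,x", 1), lda_lo]
    else:                 # general case: full 16-bit load
        seq = [(f"lda #${hi:02X}", 2), ("sta 0,x", 1), lda_lo]
    return [text for text, _ in seq], sum(size for _, size in seq)


for literal in (5, -1, 0x01FF, 0x1234):
    instructions, size = literal_load(literal)
    print(f"{literal & 0xFFFF:04X}: {size} bytes: " + " / ".join(instructions))
```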

One last comment about literals. Mostly we want to load a literal value and then call some code that consumes it, and we avoid pushing it onto the stack first. On the rare occasions that we want to leave a literal value on the stack – as a return value or flag, eg – we simply compile the load literal, as above – either 16-bit or 8-bit versions – followed by code to “push” the loaded literal, which is these two instructions:

    sta 1,x     ; save literal_lo in scratch_lo
    aix #-2     ; "promote" scratch to top by allocating a stack cell

I think this is a pretty compelling demonstration that with a bit of thought and cleverness we can fit Forth pretty nicely on this machine. I’m hoping it won’t take up too much space, and that it’ll run like a cat with its tail on fire! (apologies to PETA and the SPCA).

2009 January 11 17:37

Deep in a multi-stage bootstrap

I’m trying something slightly crazy, and progress is going slowly.

I want to get Forth and a simple USB bootloader running on Freescale’s S08JM parts. But instead of doing that the easy way, and buying P&E Micro’s USB Multilink programmer, I decided it would be interesting to see if I can bootstrap from nothing.

Well, not quite nothing. But to explain that I have to explain some history. The HC08 is really two sub-families: the original HC08 and the newer HCS08. I refer to these here as “908” and “S08”. (The “9” means Flash memory, rather than ROM or OTP EPROM. The 908 and S08 parts both use the same Flash.)

The S08s are nicer in almost every respect than the 908s, but they are harder to bootstrap. In fact, without the use of another microcontroller they are basically impossible to bootstrap. Rather than having a bootloader in ROM, like the 908s, they have a piece of hardware on-chip – the background debug controller (BDC). This bit of logic is really a poor man’s JTAG. It allows access to the CPU registers, can start and stop the core, read and write memory, and because it can do all these things, it can run code that writes to Flash.

These are really nice features, but the downside is that the custom protocol required to talk to the BDC needs to be driven, with fairly tight timing, by another microcontroller.

Could I use a 908 to bootstrap an S08?

In 2006 I sampled two flavors of 8k Flash, DIP16 parts: the 908QB8, and the S08QG8. Since the 908 has a built-in bootloader, I thought I would start by talking to it, then put just enough code on the 908 to talk the BDC protocol to program the S08... Then put a nice, simple, UART-based bootloader on that chip that I could use to program the S08JM parts – my eventual goal. A long road, but mostly an interesting one.

I met with some initial success. Using two resistors and an HCT244 I was able to connect an RS232 level converter to the 908’s PTA0, which the bootloader bitbangs at 9600 bps.

I wrote a bit of Forth code that allowed me to read and write memory and the registers on the stack, and was able within a couple of days of fiddling around to download code into RAM and run it successfully.

Flushed with success, I breadboarded an S08QG8 next to the 908QB8. Since most of the S08 families run at 3v (1.8 to 3.6), I had a voltage-conversion problem. Since I already had an HCT244 on the breadboard, I thought I would use that in the S08 to 908 direction, since the HCT input levels will be just about right for 3v logic. Driving the S08’s BKGD pin to 3v is a bit trickier, and I’m not sure I’ve come up with a good way to do it.

Freescale describe the BDC protocol as “quasi-open-drain”. Since it’s a one-wire protocol with multiple senders, it needs to have a pullup to Vdd (which is internal to the S08), and the senders need to be able to drive it high, drive it low, and 3-state it.

My approach involved using three port pins: one directly connected to BKGD to pull it low; and a voltage divider between two others – one driven high and one low – that would put approx 3v on BKGD. This doesn’t seem to work, and I’m not sure why.
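
For reference, the unloaded output of such a divider is easy to estimate. The Python sketch below is only an idealised calculation: the 5v supply and the 2.2k/3.3k resistor values are hypothetical (the actual values used on the breadboard are not stated here), and it ignores the S08's internal pullup on BKGD and the output resistance of the port pins, either of which could shift the real voltage.

```python
def divider_voltage(v_high, r_top, r_bottom):
    """Unloaded voltage at the midpoint of a two-resistor divider.

    v_high   -- voltage on the pin driven high (the other pin is at 0 V)
    r_top    -- resistor from the high pin to the midpoint, in ohms
    r_bottom -- resistor from the midpoint to the low pin, in ohms
    """
    return v_high * r_bottom / (r_top + r_bottom)

# Hypothetical values: 5 V host pins, 2.2k on top, 3.3k on the bottom.
print(f"{divider_voltage(5.0, 2200.0, 3300.0):.2f} V")
```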

Of course, there is another issue that I haven’t mentioned that consumed quite a bit of time to debug. Since the BDC protocol is based on the target’s clock speed – each bit transmitted taking 16 BDC cycles – I need to drive BKGD fast enough to keep up with the target. The “host” – in this case the 908QB8 – has an internal oscillator that at the fastest setting yields a bus clock (instruction clock) of 3.2MHz. The 9S08QG8 runs its BDC clock at 8MHz. The tightest code I could write for sending and receiving bits was too slow, so I spent a bunch of time debugging oscillator settings.

After finding a 25MHz crystal, I’m now running the 908 with a 6.25MHz bus clock, which is fast enough to keep up – but it still doesn’t work.

Which is why I thought I’d take a break and document my progress, instead of driving myself crazy.
