What’s the big difference?
Years ago, while pondering the “nature” of threaded code something struck me: In a threaded code system the called routine decides whether or not to push a return address onto the call stack.
This is in contrast to native code running on “traditional” architectures with “call” instructions. Call instructions always push the return address, even when it’s unnecessary. (It’s unnecessary when the code being is called is leaf routine – one that makes no further calls. But the caller doesn’t – and can’t! – know this.)
RISC architects realized that always pushing the return address was inefficient – doing so creates unnecessary memory traffic – and instead of call instructions specified “branch and link” instructions.
What’s the difference?
Operationally, a call instruction does two things:
- pushes onto the stack the address of the following instruction
- jumps to the destination
A branch-and-link instruction – maybe it should be called link-and-branch? – also does two things:
- captures in a register (the “link” register) the address of the following instruction
- jumps to the destination
If the called code is a leaf routine, it doesn’t need to push the link register onto the stack. It is careful, however, to preserve its value; then, at exit, it returns to its caller by jumping to the address in the link register.
If the called code will be making calls to other code, on entry it pushes the link register (to preserve it), and on exit it first pops and restores it, and then retuns to its caller as above, by jumping to the address in the register.
Not created equal
There are two variations of branch-and-link:
- one that always uses the same link register (eg, ARM)
- one that allows specification of the link register (RISC-V, and probably others – I’m not a RISC expert. ;-)
The latter is much more powerful. If all your code does is “normal” calls and returns, the difference is unimportant. But there is a curious class of uses for which the difference is important: when using a branch-and-link to capture the following address for a purpose other than to return to it later.
Say whaaaaat? Why would you ever do this?
I mentioned in my discussion of threaded code that one can “misuse” a call instruction to capture the following address. This is sometimes useful when writing the “runtime” behaviors of Forth words. But using call in this way is inefficient: you capture the following address by pushing it, but then immediately pop it and do something with it.
Misusing branch-and-link instructions to do the same thing is much more efficient. The address is captured in a register, and the code moves on. No push and pop. (There is a jump involved, so perhaps a pipeline refill occurs.)
And if, for some reason, we want to do this twice, in immediate succession, we simply specify a different link register in each of the branch-and-links. (Hmm, and now we are perhaps doing two pipeline refills in quick succession...)
But again, why?
In a non-threaded implementation of Forth’s create/does words, this is precisely what happens. I’ll explain how this works by first showing how it works in a ITC (indirect-threaded code) system, and then replace some pointers with jump-and-link instructions.
We’ll look how three kinds of Forth words are represented: colon words, variables, and “incrementers” (which are going to be defined via create/does).
Our example colon word was defined in Forth like this:
: 4* 2* 2* ;
and its compiled form looks like this:
4* ~~~~~~~~~~~ addr of NEST ( code field) addr of 2* ( parameter field) addr of 2* addr of UNNEST
Our variable looks like this:
radix ~~~~~~~~~~~ addr of VARIABLE (code field) 2 (parameter field)
And our create’d word – the most complicated of our examples – was created by the following Forth code:
: incr create , does> @ + ; 4 incr cell+
To further complicate this example – which is really the key to my argument – we will assume a tethered cross-compiled Forth, which means that the words
create , does> execute on the host machine, and the target contains only the runtime pieces.
incr look like this:
cell+ +----------+ | addr |---+ ( code field) +----------+ | 4 | ( parameter field) | incr | ~~~~~~~~~~~ | jal DODOES <-+ addr of @ addr of + addr of UNNEST
DODOES: Push IP onto R stack Push pfa onto D stack Execute NEXT
Two notes about
incr. It is not a normal Forth word with a code field and a parameter field. It is like the built-in “runtimes” for colon words or variables. In this case – unlike with VARIABLE or NEST – we want to express the runtime in Forth code rather than machine code. But runtimes have to start with machine code. And a
does> runtime has to do two things: nest the execution of Forth (just as we would do if we were calling a colon word), and push the pfa of the create’d word (
cell+ in our example.)
Since this behavior is common to all
does> runtimes, we compile it once and
jal to it.
incr begins with
jal DODOES. This is the one place in an ITC system where code specific to an architecture has to be compiled into a word’s body. The Forth code in a parent (defining) word that specifies the execution behavior of child words – ie, the code following
does> – has to be prefixed by machine code of some kind so that the code field in child words can point to it. Code fields always point to machine code.
Now for the point of this exercise. Let’s convert these three examples to “native” code, using jump-and-link instructions in the bodies of colon words instead of lists of pointers. Note that we use two different link registers:
w in code fields, and
ra in colon word bodies.
4* ~~~~~~~~~~~ jal w NEST (code field) jal ra 2* (parameter field) jal ra 2* jal ra UNNEST
NEST in this world changes slightly from the ITC version. Notice that the
jal w NEST captures the pfa – which is the address of the first call to
2* – in
w. NEST now looks like this:
NEST: Push RA onto R stack jr 0(w)
jr 0(w) means “return to the body of the colon word and start executing the code there”.
Our variable looks like this:
radix ~~~~~~~~~~~ jal w VARIABLE (code field) 2 (parameter field)
jal w VARIABLE captures the pfa in
Since there is one more level of “nesting” in
does> words, we need a third link register. Let’s use x.
cell+ ~~~~~~~~~~~ jal w INCR (code field) 4 (parameter field)
INCR: jal x DODOES jal ra @ jal ra + jal ra UNNEST
DODOES: Push RA onto R stack Push W (pfa) onto D stack jr 0(x)
jal w INCR in
cell+ captures the pfa in
jal x DODOES in INCR captures the address of the Forth code that will be executed after W is pushed onto D stack. DODOES pushes both stacks, and then “returns” to the body of INCR.
It’s convoluted, and it’s possibly inefficient – because we are executing three
jal instructions in quick succession, and probably causing a series of pipeline refills – but it’s very little code, and it’s elegant in a twisted way. ;-)