My aim here is to tackle threaded code. What it is, why it is, how it works, and why it’s great (and also not so great).

Rather than starting with a definition and then attempting an explanation of the definition, I want to start with a thought experiment.

Take a program. Then refactor it in such a way that it consists entirely of only two kinds of routines:

(I was tempted to use the usual tree-structure names of “leaf” and “branch”, but branch is overloaded with the idea of a jump, which might be confusing in this context. So: twig.)

The performance-minded among you are now aghast. “What? No open-coded arithmetic? No inlining of functions?”

Yes, it’s true. Our restructuring has a performance cost, but is has advantages too. We’ll get to both.

Since our twig routines now consist entirely of calls (except for the last instruction, a “return”) we can remove the call opcodes. We end up with a list of addresses. The return instruction that ended our original routine is replaced by the address of a special “exit” routine.

The tail-recursively-minded among you will have already objected “But that last call and return can be replaced by a jump!” Not so with threaded code. There is no way to say we want to “jump” to an address rather than “call” it, so we lose the ability to convert tail calls to jumps. Twig routines always end with the address of the “exit” routine. Another disadvantage to threading. However, we soldier on.

With the opcodes removed, our twig routine is no longer code. We can’t “call” it and expect anything good to happen. The CPU will happily start interpreting the list of addresses as instructions and head off into the weeds.

We need a header of some kind that tells the CPU that this is not a leaf routine (which we haven’t changed – it is still machine code) but a twig routine. The header is generally a call to another special “runtime” routine. What should this routine do? Because we have been called, we must have been called by another twig routine. Just as with machine code that calls other machine code, we need to push a call stack of some kind. But what do we push onto the stack?

Let’s back up. We haven’t talked about how we are interpreting the list of addresses that make up the body of a twig routine.

We need a tiny piece of code somewhere that fetches the next address and “executes” the routine that it points to. (I am being purposely vague here.) We are fetching and executing “instructions”, which means we need our own “program counter” or “address pointer” or whatever you want to call it, which will always point to the next “instruction” (really an address) to execute. In the Forth world this pointer is usually called IP – the “instruction” (or perhaps “interpretation”) pointer. Our tiny piece of code will fetch the “instruction” (address) that IP points to, increment IP by the size (in bytes) of an address, and then somehow “execute” the code at that address.

I’m not sure about other language communities, but in the Forth community this tiny piece of code is called NEXT. Its only job: to execute the next address in the list.

Now that we have an idea of what it means to execute a list of addresses, what happens when a twig routine calls another twig routine? In order to interpret the list of addresses in the called twig routine we have to set IP to point to the first address and then run NEXT. But at that moment IP points to the place in the caller’s list of addresses that we should return to, and we don’t want to clobber it, otherwise we would “lose our place” in the execution of the calling twig routine.

Thus, before setting IP to point to the first address in the called word, we have to push the caller’s IP onto our call stack.

And this is precisely the code that the header of our twig routine calls. I call it NEST. It is often called DOCOLON (for reasons that should become clear later).

The dual of this – the special “exit” routine alluded to above, the address of which ends every list of addresses in a twig routine – I call UNNEST. It pops the previous IP off the call stack, and then runs NEXT, thus resuming interpretation in the calling twig routine. (Many Forths call this EXIT, for obvious reasons.)

We have all the pieces now, but a few questions remain. Where is NEXT? And what is NEXT?

We can do another thought experiment. Let’s say we are executing NEST. We push the caller’s IP, set IP to point to the first address in our called twig routine... And then what? Do we return? If so, to what? There isn’t a “main” machine code routine to return to. Weirdly, there isn’t a loop somewhere that repeatedly executes NEXT. Instead, NEXT is everywhere.

Every leaf routine ends, not with a return instruction, but with NEXT. NEST ends with NEXT. UNNEST ends with NEXT.

To “fire up” this interpreter, all we have to do is set IP to point somewhere, and then execute NEXT. Any NEXT. Anywhere. At this point, our threaded code interpreter is off and running.

To recap: We need three small routines to make all this work: NEXT, NEST, and UNNEST.

  NEXT:    Fetch the address pointed to by IP
           Increment IP by address size
           Jump to fetched address
  NEST:    Push IP onto call stack
           Set IP to point to the first address of the called twig routine
           Execute NEXT
  UNNEST:  Pop IP from call stack
           Execute NEXT

Because NEXT is short – often one to three instructions – it is usually copied inline into the body of NEST, UNNEST, and any other leaf routines.

Direct-threaded code (DTC)

What I have just described is usually referred to as direct-threaded code (DTC). It is direct because NEXT jumps directly to the address fetched from the list of addresses that make up the body of a twig routine.

Because Forth has two stacks – a data (or parameter) stack, and a return (or call) stack, we should be clear that NEST and UNNEST push and pop the return stack (often called the R stack). In fact, I don’t see how this threading model can work with a single stack.

(Side note: Forth’s call stack and the machine’s call stack do not have to be the same! And often they are not. If the machine has push and pop operations it is usually more efficient to use the machine’s call stack as the Forth data stack, and to put the Forth call stack (the return, or “R” stack) somewhere else.)

It’s not clear from my description above how NEST computes the new IP (the address of the first address in the list). There are two ways to do it.

If the “header” of the twig routine is a call to NEST, then NEST can pop the (machine’s) call stack. The popped address will be the address of the first address in the list. (This is the semantics of any call instruction: push the following address onto the machine’s call stack, then jump to the destination. We are “misusing” call to push the address of the list of addresses, which immediately follow the call in memory.)

But there is another way, and it’s often more efficient. We designate another machine register – often called W (for “word”, as pieces of Forth code are usually referred to as “words”) – for use by NEXT and NEST, and modify them thus:

  NEXT:    Fetch into W the address pointed to by IP
           Increment IP by address size
           Jump to W
  NEST:    Push IP onto call stack
           Set IP to W + sizeof(Jump)
           Execute NEXT

Now the “call NEST” is replaced by “jump NEST” in the twig routine’s header. Since W points to the header, by offseting it (by the size of the jump instruction), we can calculate the address of the first address in the list, which directly follows the jump to NEST. We set IP to this address.

Our two kinds of words – leaf and twig – now look like this:

  leaf word             twig word
  ~~~~~~~~~             ~~~~~~~~~
  machine code          jump NEST
  machine code          address 1
     ...                address 2
  machine code             ...
  NEXT                  address n
                        address of UNNEST

When NEXT executes a leaf word, that “Jump to W” jumps directly to machine code. But when it executes a twig word, it jumps to a jump. We all know how terrible this is! Pipeline refill time!

What if we replaced this “code indirection” – a jump to a jump – with a data indirection?


Indirect-threaded code (ITC)

By replacing the jump NEST (or call NEST) instruction at the beginning of every twig word with a pointer to NEST, and by adding an indirection to NEXT:

  NEXT:    Fetch into W the address pointed to by IP
           Increment IP by address size
           Fetch into X the address pointed to by W
           Jump to X

and by making a small change to NEST:

  NEST:    Push IP onto call stack
           Set IP to W + address size
           Execute NEXT

we arrive at indirect-threaded code, also known as ITC.

Remember: DTC is direct because NEXT jumps directly to the address fetched from the list of addresses that make up the body of a twig routine.

Thus: ITC is indirect because NEXT jumps indirectly through the address fetched from the list of addresses that make up the body of a twig routine.

I’ve written the above version of ITC NEXT the way you would write it for a RISC machine. A CISC machine with auto-increment addressing can collapse the first two instructions into one; one with indirect addressing can collapse the last two into one.

ITC NEXT is two instructions on the MSP430, and four – basically the four steps outlined above – on RISC-V.

There is another change we need to make for indirect threading to work. With DTC the bodies of both leaf and twig words started with machine code; with ITC, the bodies of both have to start with a pointer to machine code.

Which means we have to change our leaf words for ITC. We add a machine code pointer that simply points to the code that follows it.

  leaf word             twig word
  +----------+          +----------+
  |   addr   |---+      |   addr   |-------> NEST
  +----------+   |      +----------+
  machine code <-+      address 1
  machine code          address 2
     ...                   ...
  machine code          address n
  NEXT                  address of UNNEST

Useful nomenclature

Since it’s tricky to explain all this, it’s useful to have a common vocabulary when talking about threading and the structure of words in a threaded system. Again, since it’s what I’m familiar with, I’m going to use the Forth terminology.

The machine code pointer at the start of every word in an ITC system is called a code pointer.

That first slot of a word – in both ITC and DTC systems – is called the code field, and the body that follows – machine code in the case of a leaf word, and addresses in a twig word – is called the parameter field.

Leaf words are called code words, and twig words are called colon words. These names come from the Forth words that construct them: code and :.

Since we would like to be able to think and talk clearly about indirection, we give names to two important addresses:

The address of a code field is called, unsurprisingly, a code field address – usually abbreviated cfa.

The address of a parameter field is called, also unsurprisingly, a parameter field address – usually abbreviated pfa.

In a DTC system, a cfa points to a code field, which contains machine code.

In an ITC system, a cfa points to a code field, which points to machine code.

In both cases, the addresses in the body (parameter field) of a colon word (twig word) are all cfa’s. (Note: It’s possible to use pfa’s instead, and there are good reasons to do this.

Everything is executable!

We are not yet at the end of our story!

It turns out that code and colon (leaf and twig) words are not the only kinds of words in a threaded code system. What about data?

I’ll use the examples of two common Forth data types: variable and constants

While there are probably other approaches, on threaded Forth systems everything in the system – including data types! – has executable behaviour. You can – and indeed must – run everything.

What might it mean to execute a variable or a constant? Let’s start with constants, since they are simpler (though both are straightforward).

A constant pushes its value onto the data stack. That’s it!

Can we turn this idea of a constant into a structure in memory that will work with our threaded interpreter? Here is how:

     DTC                  ITC
  hex-fives            hex-fives
  ~~~~~~~~~            ~~~~~~~~~
  jump to CONSTANT     address of CONSTANT    (code field)
  5555_5555            5555_5555              (parameter field)

In both cases – DTC and ITC – the representation consists of a code field and a parameter field – just like code and colon words. Whereas colon words have a code field that refers to NEST, constants have a code field that refers to CONSTANT, which looks like this:

  CONSTANT: Add to W the size of the code field (calculate pfa)
            Fetch value pointed to by W into W (fetch contents)
            Push W onto the data stack
            Execute NEXT

Remember that after executing NEXT, W contains a cfa: it points to the code field, not the parameter field. (Actually, this isn’t written in stone.)

If we use hex-fives in a colon word, it gets compiled like any other word. To add hex-fives to the value on the top of the data stack, we would compile the following address list:

  address of hex-fives
  address of +

Variables are only slightly more complicated. Variables in Forth are treated like ML references: you have to explicitly dereference them. There are two words to do this: @ (“fetch”) and ! “store”. @ consumes an address from the data stack, fetches its value from memory, and pushes that value onto to the data stack; ! consumes a value and an address from the data stack, and stores the value at that address in memory.

What does a variable do? It pushes the address of its value onto the data stack.

Variables look like this:

     DTC                  ITC
    radix                radix
  ~~~~~~~~~            ~~~~~~~~~
  jump to VARIABLE     address of VARIABLE    (code field)
  2                    2                      (parameter field)

Here we have a variable called “radix” with the current value 2 (we are in binary).

The code that all variables execute is remarkably similar to CONSTANT. Where CONSTANT pushed value stored in the parameter field, VARIABLE pushes the __address_ of the parameter field – the pfa.

  VARIABLE: Add to W the size of the code field (calculate pfa)
            Push W onto the data stack (push pfa)
            Execute NEXT

Simple examples:

  radix @

fetches the current value;

  10 radix !

changes the radix to 10 (now we are in decimal).

It is out of the scope of this discussion, but every Forth system has mechanisms for creating new kinds of data with their own, specific execution behaviours.

Threaded code advantages

Elegance, simplicity, uniformity, code density.

It’s quite easy to bring up a threaded-code implementation on a new architecture, and, once running, high-level tools written for a completely different architecture are likely to work unmodified, or with very small modifications.

The structural uniformity makes it very easy to extend and modify without changing the underlying implementation. I can add new data structures and new control structures at will without modifying the core compiler.

Because we cannot inline code, everything becomes a separate word. This has an interesting side effect: it aggressively removes redundancy from programs, which tend to become much smaller.

On occasion, threading is actually faster than native code. In native code a subroutine call requires the equivalent of two jumps: the call, and the return. On pipelined machines these can take several cycles.

When executing code words in a threaded system, only one jump is required! Only one execution of NEXT is required to execute a word. As a result, on some systems, DTC can be faster than call/return, and ITC is often no worse that call/return (but much more powerful).

Threaded code disadvantages

Speed, and, sometimes, size.

Since we lose the ablitily to do any kind of code inlining, even simple things like “add these two numbers” have to be written as separate routines and “called” via NEXT. This is obviously slower than inlining.

In terms of size, on processors where an instruction is smaller than an address – eg, on 32-bit machines with 16-bit instructions – native code can sometimes be denser, but generally the “deduplication” that threading requires will offset this.


Threaded code is an extraordinarily simple and elegant way to implement language interpreters. Compared to bytecoded interpreters it is quite fast, and compilers for it are essentially trivial.

Because it creates a system entirely composed of uniform pieces, extensibility of the system by users is almost unlimited, and building generic tools is dramatically simplified!

All this for a modest execution time penalty... ;-)

To explore further

Read about threaded code variations. I talk about why sometimes it makes sense to put pfa’s rather than cfa’s into address lists (colon word bodies).

Read about literals, ifs, and loops in threaded code systems. I’ve suggested here that all colon words consist of a list of addresses that are executed in sequence. Real code branches and loops. And uses literal numbers.

Read about the difference between a “call” instruction and a “branch-and-link” instruction, and how both relate to threaded code.