The actual input tokenizer for Larry on C64

Thu, Nov 2, 2023

Following up on yesterday’s conceptual musings, here’s some meat on the bones of the implementation. In this post I’m exploring the Assembly code that converts individual words into tokens, which the input parser uses later in the process.

First, a short recap of the previous post. We’re going to walk through the following steps to get from any line of text input to a sequence of tokens that the game can interpret and act upon:

Clear the token buffer: allocate 16 bytes.
Start looping over the text input, which is the line of text as it sits in screen RAM.
Figure out the first letter of the word we’re dealing with.
Traverse a quick decision tree to tokenize only words with this letter.
Figure out which word we actually got (if any) and put a token for it in the buffer.
Move the current_token by one position and continue parsing the line.

In the end, we’ll wind up with a 16-byte token buffer that is either all zeroes, because nothing sensible was entered, or holds a sequence of 1 to 16 non-zero tokens.
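As a rough model of the steps above, here is a minimal Python sketch of the same pipeline. It is not the game's code: the `WORDS` table, the `tokenize_line` function and every token value other than 2 for GAZE are assumptions made up for illustration. It also splits the input on spaces instead of walking screen RAM, and it assumes that unrecognised words are simply skipped.

```python
TOKEN_BUFFER_SIZE = 16

# Hypothetical word list, grouped by first letter. Only GAZE -> 2 comes
# from the post; the other entries are invented for this example.
WORDS = {
    "G": {"GAZE": 0x02},
    "O": {"OPEN": 0x03},
    "D": {"DOOR": 0x04},
}

def tokenize_line(line):
    """Return a 16-entry token buffer for one line of input."""
    buffer = [0] * TOKEN_BUFFER_SIZE           # clear the token buffer
    current_token = 0                          # next free slot in the buffer
    for word in line.upper().split():          # loop over the text input
        if current_token >= TOKEN_BUFFER_SIZE:
            break                              # buffer is full, stop parsing
        candidates = WORDS.get(word[0], {})    # dispatch on the first letter
        token = candidates.get(word, 0)        # figure out which word we got
        if token:                              # assumption: unknown words are skipped
            buffer[current_token] = token
            current_token += 1                 # move on by one position
    return buffer
```

With this table, `tokenize_line("gaze door")` fills the first two slots with 2 and 4 and leaves the other 14 at zero, while a line of unknown words returns an all-zero buffer.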

Deciding which routine to run

I’m going to split the tokenizer routines based on the first character of the input in the screen editor. The size of this decision tree depends on which words are relevant in the current scene. In the worst case there are 26 letters plus 10 digits to test, 36 branches in total, which fortunately does not happen in practice. Conceptually this is what the code looks like:

scene_tokenize:
    ldx #$00                        // Init X
    ldy #$00                        // Init Y
    lda #$00                        // Init A
    sta current_token               // Set the current token
st_begin:
    lda tokenizer_input_start,x     // Get the first character
st_test_A:
    cmp #$41                        // Screen code for A
    bne st_test_B                   // If we don't have A, skip
    jmp scene_tokenize_A            // Tokenize A if we do
st_test_B:
    cmp #$42                        // Screen code for B
    bne st_test_C                   // If we don't have B, skip
    jmp scene_tokenize_B            // Tokenize B if we do

This code snippet only covers the first two letters, but the pattern is identical for every first character that appears in the current scene’s word list. It’s the type of code that lends itself perfectly to being generated, which is something I’m looking into as a next step.
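To give an idea of what such a generator could look like, here is a minimal Python sketch that prints the decision tree for a given set of first letters. It is only an illustration under a few assumptions: `generate_dispatch` and the closing `st_done` label are hypothetical names, and the compare values are taken from `ord()`, which matches the `$41` for A and `$42` for B used in the snippet above.

```python
def generate_dispatch(letters):
    """Emit the first-letter decision tree for the given letters."""
    letters = sorted(set(letters))
    lines = [
        "scene_tokenize:",
        "    ldx #$00",
        "    ldy #$00",
        "    lda #$00",
        "    sta current_token",
        "st_begin:",
        "    lda tokenizer_input_start,x",
    ]
    for index, letter in enumerate(letters):
        # The last test has no successor, so a miss falls out to st_done
        # (a hypothetical label, not taken from the original listing).
        if index + 1 < len(letters):
            next_label = f"st_test_{letters[index + 1]}"
        else:
            next_label = "st_done"
        lines += [
            f"st_test_{letter}:",
            f"    cmp #${ord(letter):02X}",
            f"    bne {next_label}",
            f"    jmp scene_tokenize_{letter}",
        ]
    lines += ["st_done:", "    rts"]
    return "\n".join(lines)

print(generate_dispatch("GA"))
```

Feeding it only the letters that actually start a word in the current scene keeps the tree as short as that scene allows.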

Tokenizing a word

A word gets tokenized by continuing the walk across the text input line from the player and evaluating each character according to the word list in the game.

scene_tokenize_G:
    inx				// Move the cursor one position
    lda tokenizer_input_start,x	// Load the next character
    cmp #$64			// Did we catch the cursor character?
    beq ctG_end_prechecks	// We got "G_", stop tokenizing.
ctG_checkspace:
    cmp #$20			// Did we catch whitespace?
    bne ctG_gotword		// No: we have a real character to evaluate
ctG_end_prechecks:
    jmp ctG_done		// We got "G_" or "G _", stop tokenizing.
ctG_gotword:
    cmp #$41			// Did we get an A?
    beq ctG_GA			// We got an A, still running for "GAZE"
    jmp ctG_done		// We got nothing, break out
ctG_GA:
    inx
    lda tokenizer_input_start,x	// Load the next character
    cmp #$5A			// Did we get a Z?
    beq ctG_GAZE		// Yes: set up for "GAZE"
    jmp ctG_done		// No: end tokenizer
ctG_GAZE:
    lda #$02			// Token for the word group of GAZE
    sta current_token		// Store it in the buffer
    jmp ctG_done		// Get out of here
ctG_done:
    rts

This code example handles words that start with “G”, assuming that the only word in the current list is “GAZE”. Notice that we do not (yet) evaluate the entire word, only its first three letters. Whether that stays in the game depends on extensive playtesting. As a player you could now enter “GAZ”, “GAZE” or even “GAZPACHO” and all of them would be treated the same. I’m not sure whether that is a problem, given the constraints.
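The effect of matching on three letters only can be modelled in a few lines of Python. This is a sketch, not the game's logic: `PREFIX_TOKENS`, `token_for` and `find_collisions` are hypothetical names, and only the token value 2 for GAZE comes from the listing above. The second helper shows where the shortcut would start to hurt, namely when two words in the same scene share their first three letters but need different tokens.

```python
PREFIX_LENGTH = 3  # the routine above stops comparing after "GAZ"

# Hypothetical lookup table: only GAZ -> 2 is taken from the post.
PREFIX_TOKENS = {"GAZ": 0x02}

def token_for(word):
    """Return the token for a word, comparing only its first three letters."""
    return PREFIX_TOKENS.get(word.upper()[:PREFIX_LENGTH], 0)

def find_collisions(words):
    """Return prefixes shared by words that map to different tokens."""
    by_prefix = {}
    for word, token in words.items():
        by_prefix.setdefault(word[:PREFIX_LENGTH], {})[word] = token
    return {prefix: group for prefix, group in by_prefix.items()
            if len(set(group.values())) > 1}

for attempt in ("GAZ", "GAZE", "GAZPACHO", "GAS", "GA"):
    print(attempt, token_for(attempt))
```

The first three attempts print token 2 and the last two print 0. Running `find_collisions` over a scene's word list before generating code would flag pairs such as a made-up GAZE and GAZEBO with different tokens.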

You’ll notice that this code is also very repetitive, so I’m looking into writing a generator for it as well. Coding all of this by hand is error-prone, not to mention extremely boring, and it makes adjusting or inserting words after the fact difficult.
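A first stab at that generator could look like the Python sketch below. It only handles the simple case from this post, one word per starting letter matched on its first three letters, and it is an assumption-laden illustration rather than the tool I’ll end up with: `generate_word_routine` and the `ctX_match` label are hypothetical names. It also leaves out the separate cursor and whitespace checks, on the assumption that any character that is not the expected letter can simply fall through to `ctX_done`.

```python
def generate_word_routine(word, token, prefix_length=3):
    """Emit a tokenizer routine for a single word, matched by its prefix."""
    word = word.upper()
    first, prefix = word[0], word[:prefix_length]
    lines = [f"scene_tokenize_{first}:"]
    # The first letter was already matched by the dispatch tree, so the
    # checks start at the second character of the prefix.
    for position in range(1, len(prefix)):
        is_last = position == len(prefix) - 1
        if is_last:
            target = f"ct{first}_match"
        else:
            target = f"ct{first}_{prefix[:position + 1]}"
        if position > 1:
            lines.append(f"ct{first}_{prefix[:position]}:")
        lines += [
            "    inx",
            "    lda tokenizer_input_start,x",
            f"    cmp #${ord(prefix[position]):02X}",
            f"    beq {target}",
            f"    jmp ct{first}_done",
        ]
    lines += [
        f"ct{first}_match:",
        f"    lda #${token:02X}",
        "    sta current_token",
        f"ct{first}_done:",
        "    rts",
    ]
    return "\n".join(lines)

print(generate_word_routine("GAZE", 0x02))
```

Supporting several words per letter would mean turning the straight chain of checks into a small tree, which this sketch deliberately does not attempt.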