Interpreter Project: Scanner

Finally, some actual interpreter code to show. The first stage of a language front end scans the source code and turns it into tokens. This step is also sometimes called 'lexing' or 'lexical analysis'.

You can find my code at this point in the project here.

Some Highlights

I'm trying to stick close to the structure @munificentbob uses in Crafting Interpreters. However, some changes are needed because of the way I set up the Ruby project, and some are just changes that Ruby makes possible.

For instance, I'm using Ruby symbols rather than an enum to represent the token types. Ruby doesn't have a built-in enum type, so there are two ways to go: symbols or constants. I chose symbols because they are more idiomatic Ruby. Constants would have cluttered up the code too much.

class TokenType
  LPAREN = 1
  RPAREN = 2
end

# In the scanner creating tokens would look like:

With symbols it looks cleaner.

# No need for a TokenType class to hold the constants
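
To make the comparison concrete, here is a minimal sketch of both styles side by side. The `Token` struct and the variable names are placeholders invented for this example, not the names used in the project.

```ruby
# Hypothetical token container; the project's real Token class may differ.
Token = Struct.new(:type, :lexeme, :line)

# Constant style: the types live in one class and must be qualified on use.
class TokenType
  LPAREN = 1
  RPAREN = 2
end

constant_token = Token.new(TokenType::LPAREN, "(", 1)

# Symbol style: no supporting class, the symbol itself is the type.
symbol_token = Token.new(:lparen, "(", 1)

puts constant_token.type  # prints 1, which says little when debugging
puts symbol_token.type    # prints lparen
```

A side effect of the symbol style is that the token type is readable when printed, whereas an integer constant has to be looked up.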

One benefit of using constants would be that all of the token types are in a single location and easy to find. Right now the token type symbols are spread out all over the scanner code. It doesn't seem to be an issue yet, but I may revisit this code if it becomes a problem.
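
If the scattered symbols do become a problem, one possible middle ground is to keep using symbols but list the valid ones in a single frozen array and check against it when a token is built. This is only a sketch: `TOKEN_TYPES`, this version of `Token`, and the validation in `initialize` are assumptions for illustration, not code from the project.

```ruby
# Hypothetical: one place that lists every token type symbol.
TOKEN_TYPES = %i[lparen rparen number eof].freeze

# Hypothetical Token class that rejects unknown type symbols.
class Token
  attr_reader :type, :lexeme, :line

  def initialize(type, lexeme, line)
    unless TOKEN_TYPES.include?(type)
      raise ArgumentError, "unknown token type: #{type.inspect}"
    end

    @type = type
    @lexeme = lexeme
    @line = line
  end
end

Token.new(:lparen, "(", 1)     # accepted
begin
  Token.new(:lparne, "(", 1)   # typo in the symbol
rescue ArgumentError => e
  puts e.message               # unknown token type: :lparne
end
```

This keeps the call sites as clean as plain symbols while catching a misspelled type at the point where the token is created.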

One other major change is in the scanner code. The Java switch statement can't use a regex as one of its cases, so rather than writing a separate case for each digit, 0 through 9, the Java version checks for digits in the default case.

    if (isDigit(c)) {
      number();
    } else {
      Lox.error(line, "Unexpected character.");
    }

This isn't a problem in Ruby. You can just use a regular expression.

  when /[0-9]/
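
This works because `when` compares with `===`, and `Regexp#===` returns true when the pattern matches the string. A minimal sketch of the idea follows; the `classify` method and the symbols it returns are made up for this example and are not the scanner's actual code.

```ruby
# Hypothetical helper: classify a single character the way a scanner's
# main case statement might.
def classify(char)
  case char
  when "(" then :lparen
  when ")" then :rparen
  when /[0-9]/ then :digit        # Regexp#=== does the matching
  when /[a-zA-Z_]/ then :alpha
  else :unexpected
  end
end

p "(7a?".chars.map { |c| classify(c) }
# => [:lparen, :digit, :alpha, :unexpected]
```

String literals and regular expressions can be mixed freely in the same `case`, since each `when` clause just calls `===` on its own pattern.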

I may go back and change that to the POSIX bracket expression for digits, /[[:digit:]]/. Currently the scanner only detects ASCII digits. With the POSIX digit class it would also match decimal digits from other Unicode scripts. I'm not sure that will actually be an issue in this little project, though.
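
The difference is easy to check in irb. The sketch below uses the Arabic-Indic digit three (U+0663) as a sample non-ASCII digit; that choice and the variable names are only for illustration. It also shows one reason to be cautious: `String#to_i` does not understand such digits, so matching them in the scanner would not be enough on its own.

```ruby
ascii_digit   = "3"
unicode_digit = "\u0663"  # ARABIC-INDIC DIGIT THREE

puts ascii_digit.match?(/[0-9]/)          # true
puts unicode_digit.match?(/[0-9]/)        # false
puts unicode_digit.match?(/[[:digit:]]/)  # true
puts unicode_digit.match?(/\d/)           # false, \d is ASCII-only in Ruby

# Matching is not the same as converting:
puts unicode_digit.to_i                   # 0
```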

Moving Forward

The next step is parsing. The tutorial has a chapter on how to represent code in a data structure, so it might take a little longer to get to and through the parsing code.