Recognizing Keywords

Published on Jun 29, 2018

In the compiler theory books I have, there is generally very little attention given to how the scanner should recognize the keywords of the language. How does the scanner tell that super is a keyword and supper is an identifier?

If you are using a tool to generate the scanner, you generally don't have to worry about this. You supply a list of your keywords to the tool, and it generates the necessary checking code for you.

When you are writing the scanner by hand, as I am for Cube, it becomes a little trickier. The obvious thing to do is store each keyword as a key in a hash table. Then you can scan a token as if it were an identifier and check whether it exists as a key in the hash table: if it does, it is a keyword; otherwise it is an identifier. This is the approach I took for Ringo, and it worked nicely. However, that was Ruby, which has strong built-in support for hash tables.

Cube is being written in C, which means I'd either have to come up with another method to check for keywords or write a hash table myself.

My initial thought was to write a general hash table ADT that I could use in other places in the project as well, but the more I thought about it, the more that seemed like overkill. I probably will need to write a generic hash table at some point in this project, but for now I can get by with a simple keyword checker.

Fumbling Around

After a little research on hash functions, I was ready to take my first stab at this. I wasn't sure exactly what shape this little module would take, so I started as simply as possible. (I'll show shortened versions of some of the data structures to keep this post brief.) The keywords are stored in an array.

char *keywords[] = {
  "begin",
  "break",
  ...
  "until",
  "while",
};

The keyword hash values, which I calculated with a little utility function, are stored in another array.

int keyword_hashes[] = {
  2504,
  3748,
  ...
  1255,
  6530,
};
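
The hash function itself isn't shown in this post. As a rough illustration of the idea only, a simple position-weighted character sum like the hypothetical sketch below is one possible shape (it won't reproduce the exact values above).

// Hypothetical hash: weight each character by its position so that
// anagrams hash differently. Not the actual utility used for Cube.
static int hash(const char *word)
{
  int value = 0;
  for(int i = 0; word[i] != '\0'; i++)
    value += (i + 1) * word[i];
  return value;
}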

The is_keyword function returns true if the passed-in word is a keyword; otherwise it returns false.

bool is_keyword(const char *word)
{
  int value = hash(word);
  for(int i = 0; i < NUM_KEYWORDS; i++)
  {
    if(keyword_hashes[i] == value)
      return verify_keyword(word, i);
  }

  return false;
}
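
The verify_keyword helper isn't shown in this first version. Since is_keyword returns a bool, presumably it just compared the strings to rule out a hash collision, something like this hypothetical sketch (strcmp comes from <string.h>):

// Hypothetical first version: a matching hash isn't proof, so compare
// the actual strings before declaring the word a keyword.
static bool verify_keyword(const char *word, int index)
{
  return strcmp(word, keywords[index]) == 0;
}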

There are some obvious improvements that can be made here. Some parameter sanity checking, of course, like checking for a NULL pointer and a zero-length string. Also, when the passed-in word is not a keyword, it potentially checks every keyword hash. It would probably be a good idea to sort the keyword_hashes array and do a binary search rather than just looping through the array, as sketched below. In my defense, there are only 23 keywords, and it isn't doing slow string matching; it is comparing integers, which is much faster.
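
For example, with a sorted copy of the hashes and a parallel array mapping back to the original keyword indexes (both hypothetical names here), the loop could become a binary search.

// Hypothetical sketch: sorted_hashes[] holds keyword_hashes sorted
// ascending, and sorted_index[i] maps each sorted hash back to its
// position in the keywords array.
static int find_hash(int value)
{
  int lo = 0;
  int hi = NUM_KEYWORDS - 1;
  while(lo <= hi)
  {
    int mid = lo + (hi - lo) / 2;
    if(sorted_hashes[mid] == value)
      return sorted_index[mid];
    if(sorted_hashes[mid] < value)
      lo = mid + 1;
    else
      hi = mid - 1;
  }
  return -1;  // value is not a keyword hash
}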

The other major problem with this approach is that the is_keyword function only returns a Boolean value. What the scanner actually needs is the TokenType enum value for the keyword, if it is a keyword. To get around this problem, I initially wrote a function to convert the keyword to its TokenType value.

int keyword_type(const char *word)
{
  int value = hash(word);
  switch(value)
  {
    case 2504:
      return TOKEN_BEGIN;
    case 3748:
      return TOKEN_BREAK;
    ...
    case 1255:
      return TOKEN_UNTIL;
    case 6530:
      return TOKEN_WHILE;
    default:
      return TOKEN_ERROR;
  }
}

This worked well, but even at the time I felt it was not ideal. The hash value is now calculated twice, once in the is_keyword function and once in the keyword_type function. I checked in this code and decided to think a bit about how to improve it.

Improvements

I tend to lie in bed listening to music for a short time after my alarm goes off. A lot of the time, code that I'm working on will spin around in my head during this lazy morning ritual. That was how I realized how to fix my Keywords module: I just needed a structure to store all of the relevant keyword information. That would simplify much of the code and should allow me to remove that annoying keyword_type function.

The Keyword structure holds all the relevant data to identify a keyword, along with the keyword's token type.

typedef struct
{
  int token_type;
  int hash;
  char *word;
} Keyword;

An array of those structs is declared and initialized as follows.

Keyword keywords[] = {
  { TOKEN_BEGIN, 2504, "begin" },
  { TOKEN_BREAK, 3748, "break" },
  ...
  { TOKEN_UNTIL, 1255, "until" },
  { TOKEN_WHILE, 6530, "while" }
};

Nice, everything easily accessible in one place. Now I needed to change the is_keyword function to something else: it won't be returning a Boolean value any longer. Now it will return 0 if the passed-in word is not a keyword; otherwise it will return the token type of the keyword. I renamed the function to find_keyword.

int find_keyword(const char *word)
{
  // Error checking elided.

  int value = hash(word);
  for(int i = 0; i < NUM_KEYWORDS; i++)
  {
    if(keywords[i].hash == value)
      return verify_keyword(word, i);
  }

  return 0;
}

It looks pretty similar to the previous function. I added some error checking code to check for NULL and empty strings. Also, the keywords range from two to six characters in length, so if word falls outside that range we know it isn't a keyword.
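
Based on that description, the elided checks at the top of find_keyword might look something like this sketch (LEXEME_LEN is the scanner's lexeme buffer size, as used below).

// Sketch of the elided error checking: reject NULL, empty strings, and
// anything outside the two-to-six character range of Cube's keywords.
if(word == NULL)
  return 0;
size_t length = strnlen(word, LEXEME_LEN);
if(length < 2 || length > 6)
  return 0;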

The verify_keyword function was also updated to return the token type or zero.

static int verify_keyword(const char *word, int index)
{
  // Compare the full strings, not just a keyword-length prefix, so a
  // longer word that shares a hash and a prefix with a keyword is not
  // misidentified.
  if(strncmp(word, keywords[index].word, LEXEME_LEN) == 0)
    return keywords[index].token_type;
  return 0;
}

Again, we have to verify that the passed-in word actually matches the keyword, because there is a chance the hash function could generate the same hash value for a different word. If the passed-in word matches the word in the keywords array, we know it is a keyword and we return keywords[index].token_type.
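
For completeness, a call site in the scanner might look something like this (TOKEN_IDENTIFIER is an assumed token name; the actual scanner code isn't shown here).

// Hypothetical use: after scanning an identifier-shaped lexeme, treat
// it as a keyword only if find_keyword recognizes it.
int type = find_keyword(lexeme);
if(type == 0)
  type = TOKEN_IDENTIFIER;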

This is a big improvement over the first iteration. The data is all packed into a single struct, and it is obvious how the different elements relate to each other. Will it perform well when scanning large amounts of source code? I'm not sure yet. I'll definitely do some profiling of this code when the project is a little further along.

If you want to see the full code, here is the header and the source.