full-stack overflow

14 Feb 2018

Lightning Fast String Lookups With Tries

Live Demo

A trie is a search tree that is typically used to store string keys and facilitates fast lookup of values and partial matching. Tries can be used to store things besides words of course, like IP addresses in routers or numbers in algorithms, but for this tutorial, we’ll keep it simple and just use words.

For example, I have a large list of words. How many of those words match the prefix “ca+”, where + is any number of further letters? Or, say I want to match words that have letters existing in fixed positions but could have any letter in other positions. For instance, l*****s should match ‘letters’ and ‘ladders’ but not ‘levitations’.

You can think of a trie as an additional constraint added to a hash table. Unlike trees, the value of nodes is encoded directly into the tree structure itself rather than stored within the node’s properties. The position of a letter in a word is given by its distance away from the root. The second letter is at level 2, the third level at level 3, etc.

To store a tree with the words HALL, HALOES, HALO, HELL, and AIR, it’d look like this:

        *
       / \
      H   A
     / \   \
    A   E   I
   /     \   \
  L       L   R
 / \       \
L   O       L
     \
      E
       \
        S

You might be asking yourself: how do I even go about accessing these words (or possibly: how on earth is this structure advantageous)?

Well, consider that you’re looking for the word HALO. You break the word you’re searching for into individual letters. Then you search the Trie using each letter as a key. Tree['H'] brings you to

    A   
   /    
  L    
 / \   
L   O
     \
      E
       \
        S

and this Tree['A'] brings you to

  L    
 / \   
L   O
     \
      E
       \
        S

and this Tree['L'] brings you to

    O
     \
      E
       \
        S

then this Tree['O'] brings you to

      E
       \
        S

And we’re done.

(Not!) We actually ran into an edge case for storing data in tries: since multiple pieces of data (words) can be stored along the same path through the tree, how do we delineate individual words?

We can do this pretty easily: all we have to do is add a marker to the end of a word when we insert it to signify that it marks the “END” of a given word.

Then when we search through the tree, we’ll keep iterating through the characters, using each one as a key to move farther down the tree. If a key does not exist at a given level, the word is not in the tree, so we return false.

If we exhaust characters in the word we’re searching for and are still at a valid node, we check for the “END” marker. If it’s present, the word exists in the tree. If it’s not, the word doesn’t exist, even though its letters do.

         *
        / \
       H   A
      / \   \
     A   E   I
    /     \   \
   L       L   R^
  / \       \
L^   O^      L^
      \
       E
        \
         S^
Here’s the original diagram again, where an up-caret (^) signifies a word ending.

Note that this algorithm makes adding words to the tree whose paths already exist quite simple. Looking at the current tree, we could add the words “HE”, “AI” and “HA” simply by adding three end properties to existing nodes.

Let’s go ahead and code up insert, search, and delete.

Basic Implementation

var Trie = function () {
  this.nodes = {};
  this.length = 1;
};

Trie.prototype.getLength = function () {
  return this.length;
};

Our Trie is a simple object with nodes and a length. Each node object is in fact just a simple dictionary: it doesn’t need to be any more complicated than this.

Insert

To insert a word into the trie, we split it into individual characters and then convert them all to upper case. If we didn’t do this, we’d double the number of possible keys in the tree if our insertions could be in mixed case, as well as increase the number of possible ways the same letters could be stored by a factor of word.length!. Consider:

CAT vs Cat cAt caT CAt cAT CAT

I like cats, but that’s just ridiculous.

Trie.prototype.insert = function (word) {
  let node = this.nodes;
  word = word.split("").map((e) => e.toUpperCase());
  let idx = 0;
  let newWord = false;
  while (idx < word.length) {
    if (node[word[idx]]) {
      node = node[word[idx]];
    } else {
      // inserting a new word
      newWord = true;
      node[word[idx]] = {};
      node = node[word[idx]];
    }
    if (idx === word.length - 1) {
      node.end = true;
    }
    idx++;
  }
  if (newWord) {
    this.length++;
  }
};

We set the current node equal to the trie root and then check to see if the first letter of the word exists as a key. If it does, we set the current node equal to this next node. If it doesn’t, we create a node, and then we set the current node equal to the next node.

A new node is just an empty JS object {} as our nodes are just dictionaries. We don’t need anything fancy here. We also have a flag that we set to true when we create a new node so that we can increment the length of the trie accordingly.

We continue this process for every letter in the word, including the last.

When we get to the last letter in the word (index of word.length-1), we set the end property on the node to true to signify the word we’re inserting ends here.

Insertion is pretty simple.

Search and Delete

Delete and search have linked functionality, so let’s discuss them at the same time.

You might think that deletion is simply deleting the end property from a given node. That works as a form of “lazy deletion”. What makes it lazy? By toggling the end flag, we are potentially leaving extra letters in the tree that no longer belong to any word.

Consider our original tree.

         *
        / \
       H   A
      / \   \
     A   E   I
    /     \   \
   L       L   R^
  / \       \
L^   O^      L^
      \
       E
        \
         S^

To delete ‘HALO’, it is sufficient to just remove the flag from the O.

         *
        / \
       H   A
      / \   \
     A   E   I
    /     \   \
   L       L   R^
  / \       \
L^   O      L^
      \
       E
        \
         S^
Halo was deleted.

What about deleting ‘HALOES’? Here’s how we’re going to work through it. When we search for a node, if that node doesn’t exist in the tree, we return false. No big deal.

If the node does exist though, we’ll return information further up the tree. We can safely delete nodes from the tree at the end of the word until we reach either:

  1. A branch point
  2. The end of a previous word

Branch points are nodes with more than one key that are not located at the end of the word we are deleting.

A branch point is a node that has more than one key. The end of a previous word is also a node that has more than one key (we add an end key to the node). So we can consider both of these cases to be branch points.

If there are no branch points on the path to a word, we can delete the entire word without consequence. AIR for instance. Otherwise, we can delete all letters up to but not including the last encountered branch point. HALOES, for instance: delete O, E, S.

         *
        / \
       H   A
      / \   \
     A   E   I
    /     \   \
   L       L   R^
  /         \
L^           L^
Haloes was deleted.

We need a search function that will return:

  1. Whether or not a word exists in the trie
  2. If so, it should return:
  • the node at which the most recently-encountered branch occurred, as well as the next letter in the word after this branch
  • the node at which the word ends
Trie.prototype.search = function (word) {
  word = word.split("").map((e) => e.toUpperCase());
  let node = this.nodes;
  let lastEnd = null;
  for (var i = 0; i < word.length; i++) {
    if (!node[word[i]]) {
      return false;
    } else {
      node = node[word[i]];
      if (Object.keys(node).length > 1 && i !== word.length - 1) {
        lastEnd = { node: node, key: word[i + 1] };
      }
    }
  }
  return node.end ? { result: node, lastEnd: lastEnd } : false;
};

Trie.prototype.find = function (word) {
  return this.search(word) ? true : false;
};

We convert the query to uppercase, since that’s how we store our keys. Then we ensure that keys exist for every node in the tree. If the tree is missing a node, we return false, since the word doesn’t exist. Otherwise, we move on to the next node.

Note this line: if (Object.keys(node).length>1 && i!==word.length-1)

This block executes to record the last end or branch encountered as we walk down the tree.

We make sure that we haven’t already reached the end of the node before assigning lastEnd (otherwise, lastEnd will always just equal node).

When storing this last node, I include not only a reference to the node, but also the key from which to delete. It’s the letter one past the point at which the node branches.

We return both the result and the lastEnd if the word is found. We ensure that when we’ve reached the final character, its end flag is set to true.

Finally, I wrapped the search function in a find method that we can publicly expose: a find that returns true/false is a little less messy than a search that returns an object that other methods use internally.

Now for the delete function.

Trie.prototype.delete = function (word) {
  let findWord = this.search(word);
  if (!findWord) {
    return false;
  } // 1
  if (findWord.result.end) {
    delete findWord.result.end;
  } // 2
  if (findWord.lastEnd) {
    // 3
    const key = findWord.lastEnd.key;
    delete findWord.lastEnd.node[key];
  }
  this.length--;
  return true;
};

There are three cases when we try to delete a node.

  1. It doesn’t exist. Do nothing.
  2. It exists, and it is the last element in a path. Delete the end property.
  3. It exists, and it is not the last element in the path. We need to delete the end property as well as all other elements up to the end that will no longer be used.

These cases correspond to the three if statements above. Note the need to store the deletion key here. This saves us the work of having to store pointers between nodes so that we can traverse backwards through the tree to identify parents.

Printing out words in the trie

In order to print words in the trie, we’ll use a recursive method that implements pre-order depth-first traversal. We covered different tree traversal methods in a previous tutorial on binary search trees.

First we’ll define a getNode helper function. It works a little like search, but all it’s going to do is return a given node in the tree.

Trie.prototype.getNode = function (word) {
  word = word.split("").map((e) => e.toUpperCase());
  let node = this.nodes;
  for (var i = 0; i < word.length; i++) {
    if (!node[word[i]]) {
      return false;
    } else {
      node = node[word[i]];
    }
  }
  return node;
};

Now let’s write a function to print out all the words in the trie.

For each node, we get all the keys at that node. Then we iterate over all these keys. If one of the keys equals an endpoint, we push the word we’re currently building into the completed words list.

Otherwise, we put the partial word plus the current letter (word+keys[i]) back into recursePrint and move to the next node node[keys[i]].

Trie.prototype.printWords = function () {
  let wordList = [];
  let word = "";
  recursePrint(this.nodes, word, wordList);
  function recursePrint(node, word, list) {
    let keys = Object.keys(node);
    for (var i = 0; i < keys.length; i++) {
      if (keys[i] === "end") {
        list.push(word);
      } else {
        recursePrint(node[keys[i]], word + keys[i], list);
      }
    }
  }
  return wordList;
};

Find

We can use our printWords method as a basis for writing two more specific find methods.

By wildcard

The first will find words that have wildcards in them. For instance, calling the function with tr** would return both “tree” and “tree”.

Let’s think about what we’re doing here. It’s similar to a search, except at each level, we narrow our search space by the constraints given by the word that we’re looking to match.

Let’s go back to the example tree.

         *
        / \
       H   A
      / \   \
     A   E   I
    /     \   \
   L - T^  L   R^
  / \       \
L^   O       L^
      \
       E
        \
         S^

If we search for HA**, we’ll get HALL, HALT, and HALO.

We want to restrict the keys at the first level to H, and exclude A. We want to restrict the keys at the second level to A and exclude E. Then on the third level, we want to include all possible keys: L. On the fourth level, we want to include all possible keys: L, T, and O.

Thinking about this in terms of a function:

Trie.prototype.wildcardFind = function (query) {
  query = query
    .replace(/[^a-z*]/gi, "")
    .split("")
    .map((e) => e.toUpperCase());
  let wordList = [];
  let word = "";
  recursePrint(this.nodes, word, wordList);
  function recursePrint(node, word, list) {
    let idx = word.length;
    if (word.length > query.length) {
      return false;
    } else {
      if (Object.keys(node).includes("end")
           && word.length === query.length) {
        list.push(word);
      }
      let keys = query[idx] == "*" ? Object.keys(node) : [query[idx]];
      for (var i = 0; i < keys.length; i++) {
        recursePrint(node[keys[i]], word + keys[i], list);
      }
    }
  }
  return wordList;
};

It looks much like printWords. The only big differences are:

  1. We sanitize the user input, stripping out invalid (non-alphanumeric) keys.
  2. If the word we’re assembling is larger than the search term, we return false (matches must be the same number of characters).
  3. If the word we’re assembling matches the search term length and the node has an end flag set, we accept the word.
  4. We set the keys at each level dynamically based on the length of the word that we are assembling. For each call to recursePrint, we set idx equal to word.length. This value grows by one with each level we descend down the trie. This value also corresponds directly to the key we wish to choose at that level.
  • If the search term has an asterisk (wildcard) at that level, we iterate over all keys at that level (Object.keys(node)).
  • If it has a character at the level, we only iterate over the single key corresponding to that character.

You can think of this as a depth-first traversal that visits a desired subset of nodes based on a filtering of the keys.

By prefix

Searching by prefix is very similar to searching by wildcards. The only difference is that we already have part of the word in hand, so there’s no need to start at the root node of the tree (this.nodes) like we did in wildcardFind.

Instead, we use getNode to obtain the prefix root and begin traversing from there.

Trie.prototype.prefixFind = function (prefix, desiredLength = null) {
  let wordList = [];
  prefix = prefix.replace(/[^a-z]/gi, "").toUpperCase();
  recursePrint(this.getNode(prefix), prefix, wordList);
  function recursePrint(node, word, list) {
    let idx = word.length;
    if (desiredLength && word.length > desiredLength) {
      return false;
    } else {
      let keys = null;
      if (Object.keys(node).includes("end")) {
        if (!desiredLength) {
          list.push(word);
        } else if (word.length === desiredLength) {
          list.push(word);
        }
      }
      keys = idx < prefix.length ? [prefix[idx]] : Object.keys(node);
      for (var i = 0; i < keys.length; i++) {
        recursePrint(node[keys[i]], word + keys[i], list);
      }
    }
  }
  return wordList;
};

One key difference is that we allow the user to specify whether or not there is a desired word length, which is off by default.

We use this desiredLength as an optional check on the words we build up, quitting early if the word exceeds or does not meet the desiredLength.

Consider how you might modify this function slightly for autocomplete: maybe you want to suggest words that are 2 or 3 characters longer than the current word entered, so desiredLength could be set dynamically. This would also require some nuance: at a certain point, you’d want to stop increasing desiredLength, since you’ll overshoot the suggestions and display zero results, even when the user has typed in a suggested value.

Fun exercises:

  • Implement dynamic desiredLength for more responsive autocomplete
  • Implement wildcards in prefixFind so that, for example, prefixFind('*') returns everything in the trie, and prefixFind('*',4) returns all 4-letter words in the tree.

Using the Trie + CodePen Demo

We instantiate a tree object with let T = new Trie(); and insert words with T.insert('word'). There’s not much to it! The best way to see it in action is with the CodePen demo up top, where we put a ton of words in the trie and then use prefixFind to perform typeahead lookup in real-time. I added a few calls to Performance.now() (Docs) so you can see how snappy our trie functions.

Wrapping Up

Congratulations! You’ve gotten an introduction to another simple data structure, the trie.

There are many variations of tries. We can encode the data we store in a trie more compactly, as well as store nodes that are only used by single elements to make the tree shorter. We’ll discuss more later, but for now, consider our original tree:

        *
       / \
      H   A
     / \   \
    A   E   I
   /     \   \
  L       L   R
 / \       \
L   O       L
     \
      E
       \
        S

What if we consolidated non-shared nodes like this:

        *
       / \
      H   A
     / \   \
    AL  ELL IR
   / \    
  L   OES

4 levels tall instead of 7!

How would insertion and deletion change? (Hint: splitNodeKeys and joinNodeKeys). How would we perform lookup? How would we decide what lengths of keys to allow per level, and would we do this dynamically or fix the size? To be continued.