Trie

In this post will document about prefix trees also called as Trie. A Trie according to wikipedia is defined as A prefix trie is an ordered tree data structure used in the representation of a set of strings over a finite alphabet set, which allows efficient storage of words with common prefixes. In this definition important aspects of a Trie are:

  1. Ordered Tree data structure.

Let’s look at the first property Ordered Tree data structure. A trie is tree data structure, but rather than tree node having two children as in a binary tree, trie node has children equal to the size of the alphabets in the string. For example for lower case english string this would be [a-z] so 26 children maximum, for binary string this will be 0,1 so 2 children and for integer string this will be 0-9 so 10 children. Also as mentioned there is ordering in the tree data structure, this is similar to binary search trees, trie can be traversed as in-order traversal to order the strings in lexicographic order. For english lower case strings aa will be lower than a which is lower than ab.

  1. Represents strings with finite alphabet set.

As defined previously each tree node will have children which will be maximum to the size of the alphabet. It is important that the alphabets in the string are finite and well defined. For example lower case english letters, or binary strings. As the width of the tree is defined by the size of the alphabet a tree with high cardinality will be very expensive in terms of memory. As every level alphabet size gets exponential wide. For example for 26 characters if the height of the tree is d, then the number of children will be 26^d, so every level adds node which are exponential to the size of the alphabet. The height of the tree is defined as length of the longest string in the trie.

  1. Used for finding strings with common prefixes.

When a string is inserted in the trie, each character in the string is defined by a trie node at a separate level as we go down from the root to the leaf of the tree. Once the end of the string is reached, the node is marked with a marker, to identify a string ending at the node. This node although is the end node of the string but is prefix for many other strings which are in the sub tree of that node. This property can be used to find strings for applications like auto-complete where a prefixed is typed and a list of strings starting with the prefix are returned. Such list can be generated by visiting all the nodes in the prefix sub-tree of the trie. So trie supports search(key: str) -> bool operation on the string, but also supports startsWithSearch(prefix: str) -> List[str] operation.

Now lets see, how to define a TrieNode

class TrieNode:
    def __init__(self):
        self.children = {}
        self.isWord = False

I have defined the children in the node as a dictionary rather than fixed size array. This can help saving memory, in sparse tries, where only few characters are available at each level. Also a boolean is added isWord, which represents a node is the end character of the word inserted in the trie. If the isWord is true then the word is present in trie, else the node is just part of the prefix of another word.

Now let’s look at the basic template for traversing over the trie, with an example of insertion. Template can be modified a little for other functions on the trie.

# Time Complexity: O(len(word)) or O(d) => Where d is the length of the string.
# Space Complexity: O(26^d), In worst case if the trie is filled with all the strings in english dictionary, where `d` is average string length.

def trie_insert(root: TrieNode, word: str):
  node = root
  for ch in word:
    # Handle the case if the current character of the word doesn't have a node in the trie
    if ch not in node.children:
      # Actions can be: Create node for character, or return false in case searching.
      node.children[ch] = TrieNode()

    # Select the node for current character as the next node, basically moving down the tree.
    node = node.children[ch]

  # After the node representing the word is reached.
  # Take action: Example mark it is as a valid word, check if isWord is true for searching, or delete the word by marking isWord as False.
  node.isWord = True

root = TrieNoode()
trie_insert(root, "hello")

Now will discuss some of the problems of Trie and use the above template.

Based on the template given above, I have implemented the function of insert, search and startsWith as a part of this problem. The template has been modified a little based on requirements.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.isWord = False

class Trie:

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.isWord = True

    def search(self, word: str) -> bool:
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.isWord


    def startsWith(self, prefix: str) -> bool:
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True

This problem is an extension of the above problem, where rather than just telling if the word exists or not, the problem asks for count of the words and prefixes in trie. This can be done by updating the trieNode to add wordCount representing the count at the node where the end of the word exists and adding prefixCount with the count of all the words which under the sub tree of a particular node. These both counts can be updated during the append method, where wordCount is incremented when the end node is reached. And the prefixCount at every node in the path for every character of the word. So essentially the root node will have total count of all the words in the whole trie.

The problem also talks about erase method, which is just opposite of the append, where the values are decremented. Also, removing the keys from children in the path where there is only word which is to be deleted, is an optimization for removing the space after the word has removed.

class TrieNode:

    def __init__(self):
        self.children = {}
        self.wordCount = 0
        self.prefixCount = 0

class Trie:

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node.prefixCount += 1
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.wordCount += 1
        node.prefixCount += 1


    def countWordsEqualTo(self, word: str) -> int:
        node = self.root
        for ch in word:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.wordCount


    def countWordsStartingWith(self, prefix: str) -> int:
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.prefixCount

    # In problem, Erase assumes that word is always present in trie, otherwise we would need to search the word, before erasing.
    # Which could have been done by checking countWordsEqualTo(word) == 0
    def erase(self, word: str) -> None:
        #if self.countWordsEqualTo(word) == 0:
        #    return
        node = self.root
        for ch in word:
            node.prefixCount -= 1
            # Word will always be present, so no checking of ch children is needed.
            # If prefixCount of ch children is 1, then that is part of the prefix of which the word is getting erased.
            # In that case the ch itself can be just deleted from the children to remove sub tree and return, the memory will then be GC.
            if node.children[ch].prefixCount == 1:
                del node.children[ch]
                return
            node = node.children[ch]
        node.wordCount -= 1
        node.prefixCount -= 1

This problem, to find the suggestions for every prefix of the given in the dictionary of word and give results in lexicographic order.

As discussed earlier in the property of tries, an in-order traversal i.e. traversal in the increasing order of the alphabets will give the words in lexicographic order. Based on that prefix search method has been updated to in-order traversal by running dfs on every prefix of the given word.

ALPHA = "abcdefghijklmnopqrstuvwxyz"

class TrieNode:
    def __init__(self):
        self.children = {}
        self.isWord = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.isWord = True

    def inOrderWords(self, node, maxCount, preWord, sol):
        # If node isWord, One of the word is found which is stored in preWord as char array, join it and add to solution
        if node.isWord:
            sol.append("".join(preWord))
            if len(sol) == maxCount:
                return
        # Iterate over all the characters in alphabet, check which one is present as children of node and traverse it's sub tree.s
        for ch in ALPHA:
            if ch in node.children:
                # append the character which is child of the node.
                preWord.append(ch)
                # Run recursively on the sub tree of the current character
                self.inOrderWords(node.children[ch], maxCount, preWord, sol)
                # Pop the character out, after recursive call.
                preWord.pop()
                # If the solution already contains maxCount, preempt the search and return
                if len(sol) == maxCount:
                    return

    def searchPrefix(self, prefix, maxCount) -> List[List[str]]:
        res = []
        node = self.root
        preWord = []
        n = len(prefix)
        for i, ch in enumerate(prefix):
            if ch not in node.children:
                # Can't find the prefix any further, so there won't be any further words
                res.extend([[] for _ in range(n - i)])
                return res
            # Add ch character to the prefix found till now
            preWord.append(ch)
            sol = []
            # Run in order search on ch sub tree to get the maxCount words in lexicographic order.
            self.inOrderWords(node.children[ch], maxCount, preWord, sol)
            res.append(sol)
            # Make ch as the current node
            node = node.children[ch]
        return res

class Solution:
    def suggestedProducts(self, products: List[str], searchWord: str) -> List[List[str]]:
        trie = Trie()
        for product in products:
            trie.insert(product)
        maxCountWords = 3
        return trie.searchPrefix(searchWord, maxCountWords)

comments powered by Disqus