Title image

Strings Matching. Next level

Andrey Gein / 2024

Hello! Long time no see!

Let’s continue our discussion on strings and methods to find them in large texts. You should read this lecture after the first part, where we covered the problem of searching for a specific pattern in a specific text. Here are our main takeaways from that:

All of these algorithms perform well, and sometimes exceptionally well, when we only need to search for a pattern in the text once. But what if we want to search for multiple patterns, one after another, in the same text? For example, try again pressing Ctrl+F (or ⌘+F on macOS) and slowly search for the word “text” on this page by typing it letter by letter. You’ll notice the browser updates the count of occurrences after each keystroke. Does it really have to scan through the entire text each time?

Today, we’ll discuss how to save time when searching if the text doesn’t change. We’ll also explore other string-related tasks and how to solve them using more advanced data structures.

Today, we will be using the same terminology as in the previous lecture. If you’ve forgotten any of it, I suggest revisiting the previous material to refresh your memory.

Additionally, we’ll introduce negative indexing notation inspired by Python. If S is a string, then S[-1] refers to the last character of the string, S[-2] refers to the second-to-last character, and so on. In essence, S[-n] is shorthand for S[len(S)-n].

Similarly, we can use negative indices in slices. For instance, S[:-1] represents the longest proper prefix of the string, meaning all the characters except the last one.

Throughout this lecture, we will also touch on a few concepts from graph theory, such as trees and acyclic graphs. While prior familiarity with these topics may help, the material should still be understandable without it.

Fun zone!

I asked language model to turn the text of this lecture into a song lyrics. Then, I asked another model to transform it into a country song, and add a video to it. Enjoy this song when you start getting tired of the lecture.

Note: All the text, music, and even video were generated by algorithms.

Aho–Corasick algorithm

So, this time, we have multiple patterns — Pattern1, Pattern2, ..., PatternK — that we need to search for within a single large string, Text. For example, you might want to scan a document for inappropriate language or implement a basic spam filter by looking for specific keywords.

If we were to search for each pattern individually using the algorithms from the previous lecture, this would take, on average, O(K×len(Text) + Σlen(Patterni)) time. In other words, we’d be scanning through the entire text repeatedly for each pattern, which isn’t very efficient. Let’s see if we can avoid this by searching for all the patterns simultaneously.

To do this, let’s recall how the Knuth–Morris–Pratt algorithm worked with a single pattern. The algorithm shifted the pattern along the text, looking for matches. When a mismatch occurs, the algorithm shifts the pattern so that the longest possible prefix of the pattern is aligned with the text characters just seen. Now, imagine applying the same approach but with multiple patterns at once:

Shifts in case of multiple patterns

We compare the text with the patterns as long as at least one pattern has a matching prefix. For instance, if we’re looking at the text "strong opinion", and we have patterns "string", "star", "ion", "ingress", "ring", and "road", we can advance 3 characters forward because there’s a pattern starting with "str", which matches the beginning of the text. However, when the next character "o" doesn’t match any pattern, we need to shift all the patterns accordingly.

Instead of comparing each Text’s character against all patterns individually (which would be slow), we can perform some preprocessing to speed this up!

Imagine we are searching for a word in a dictionary, where each volume corresponds to a letter of the English alphabet. To find the word, we would grab the volume corresponding to the first letter, open the section for the second letter, and locate the page for the third letter, and so on.

Patterns organized as volumes

Let’s organize our patterns in a similar data structure. This will allow us to quickly determine if a word exists in our set, and if not, find the longest prefix of it that does exist. This data structure, which looks like a tree with characters on its edges, is called a trie (pronounced “try”). The trie contains all patterns (which can be read by following paths from the root downward), as well as all prefixes of all patterns. Moreover, any string read from the root downward is a prefix of some pattern, and it can be uniquely identified by the node we reach. Each prefix of a pattern corresponds to exactly one node, and if multiple patterns share the same prefix, they will also share the same node.

Trie build from patterns

Moreover, when we refer to a “node S”, we mean the node that corresponds to the string S. Every node, except for the root, has a “parent” node that is one level higher. The parent node corresponds to the string S[:-1].

Some terms of the trie

The concept of the trie is valuable on its own, even beyond the pattern-matching problem. For instance, tries are used for autocomplete when you start typing a word or command, and the program suggests possible completions in alphabetical or relevance order. Check out shivamMg/trie for an example of an implementation that additionally supports fast fuzzy searching thanks to trie.

So, we have a trie built from the patterns "string", "star", "ion", "ingress", "ring", and "road". Let’s return to our original task: given the text "strong opinion", we align all the patterns with the start of the text and aim to find the longest common prefix between the text and at least one of the patterns. To do this, we start from the root and follow the trie links based on the characters in the text: "s", "t", and "r". Now, we’re at the node representing the string "str", but we encounter the character "o" in the text, for which there is no transition in the trie. This means none of the patterns match at this point, and we need to shift the patterns:

Unknown shift after mismatch

But how far should we shift them? In the Knuth-Morris-Pratt algorithm, we used the prefix function 𝜋, which was equal to the length of the longest suffix of the current segment that was also a prefix of the pattern. Here, we can calculate a similar function for our trie! Think about how to generalize the definition of the prefix function for the case where we have a trie built from multiple patterns.

 

In this case, the suffix of the text just examined must be a prefix of at least one pattern (just one, not necessarily all of them)! Thus, we need to find a proper suffix of "str" that exists in our trie. In that case, it is a node "r", because we have patterns "road" and "ring", but don’t have any pattern starting with "tr"—another suffix of "str".

Such a transition from one node (corresponding to the currently examined part of the text, "str") to the next one (node "r" in our case) is called a suffix link. Let’s take a look at how suffix links appear in our trie; they are indicated by a dotted line:

Suffix links

Note that a suffix link is always defined. If there are no suitable suffixes in the trie, it leads to the root corresponding to the empty string. I skipped such arrows in the image above, otherwise it would become a mess :-)

Before we move on, let’s take a moment to explore suffix links further. Imagine we’ve already constructed them in some manner—perhaps using the simplest and slowest algorithm that checks all suffixes of each suffix. How could we find all suffixes of the string S that are present in the trie?

 

Indeed, it’s sufficient to follow the suffix links starting from the node S, which means looking at S, then the suffix link from S, the suffix link from the suffix link, and so on, until we reach the root. Suffix links always lead to less deep nodes, so we will gradually reach the root, which represents the empty suffix, and stop there.

Suffix path
As you can see, suffixes of "string" presented in our trie are "ring", "ing", and "".

Note that it’s advisable to build the suffix links only after all the strings have been added to the trie. Otherwise, the suffix links may change with the addition of a new string, which would require extra memory or time.

Exercise

Take a piece of paper or your favorite image editor and build a trie for strings "fish", "fisher", "fishing", "infinity", "he", and "she". Then, add suffix links to the trie.

How many nodes are in the suffix trie?

 

How many suffix links don’t lead to the root?

 

Which string corresponds to the node that the suffix link from the "fishin" node leads to?

 

How to build suffix links?

It’s clear that using a naive algorithm to build suffix links by checking all suffixes for membership in the trie won’t work, as it would be too slow. Let’s come up with something more efficient!

Generally, we will proceed similarly to how we compute the prefix function in the Knuth–Morris–Pratt algorithm. For instance, the suffix link from the root is undefined, just as 𝜋(0) was undefined. The suffix links from the nodes at the first level lead to the root since they correspond to single-character strings, whose only proper suffix can be empty. As we move to the second level, things become more complex, as the suffix link from the nodes at the second level may either point to the root or to a node in the first level.

Let’s assume that the suffix links for all previous levels have already been calculated, and we want to find the suffix link for the node representing string S. Suppose it points to the node for string T, meaning T is the longest suffix of S that exists in the trie. Additionally, let’s say T is not the root, which means the suffix is not empty. Then both nodes in the trie have parents, represented by S[:-1] and T[:-1]. Notice that as T is a suffix of S, then string T[:-1] must be a suffix of S[:-1].

Suffix links building

This means that the node T[:-1] can be accessed in the trie from S[:-1] via suffix links! So let’s find T that way. We will follow the suffix links from the node representing S[:-1] until we find an edge leading down with the last character of string S. As soon as we find this edge, it will lead us to the desired node T, which we will declare as the suffix link for the node S.

Suffix links building

Self-checking question

What should we do if we traverse the entire suffix chain and don’t find an edge leading down with the last character of string S? Show answer

How to store a trie

Let’s take a moment to consider how we will store the trie in our program. We need to:

  1. Start with an empty trie consisting of only one root node, which corresponds to the empty string:
  2. Trie root
  3. Add patterns to the trie one by one. When adding a string, part of the path may already exist, while we will need to extend the trie by creating new nodes:
  4. Adding a word to the trie
  5. Be able to quickly find the destination of an edge from a given node based on a character.
  6. Aim to use memory efficiently, even for a large alphabet.

Before reading further, think about how you would store the trie to meet all these requirements.

 

We can represent a single node using the following class:

class TrieNode {
    val children = mutableMapOf<Char, TrieNode>()   // Child nodes
    var suffixLink: TrieNode? = null                // Suffix link
}
                

In this way, each node stores references to all its children, as well as a suffix link, which is initially undefined.

Creating the trie is straightforward: we simply create a root node without any children or suffix links.

val root = TrieNode()
                

Adding a string to the trie is also simple. We can traverse from the root and add nodes if they don’t already exist:

class TrieNode {
    val children = mutableMapOf<Char, TrieNode>()   // Child nodes
    var suffixLink: TrieNode? = null                // Suffix link
}

//sampleStart
val patterns: List<String> = listOf("string", "star", "ion", "ingress", "ring", "road");

fun addPatternToTrie(root: TrieNode, pattern: String) {
    var currentNode = root
    for (char in pattern) {
        currentNode = currentNode.children.getOrPut(char) { TrieNode() }
    }
}
//sampleEnd

fun printTrie(node: TrieNode, indent: String = "") {
    print("◯︎")
    val characters = node.children.keys.toList()
    if (characters.isEmpty()) {
        return
    }

    val firstCharacter = characters[0]
    val lastCharacter = characters[characters.size - 1]
    for (character in characters) {
        val child = node.children[character]!!
        if (character != firstCharacter) print(indent + (if (character != lastCharacter) "├" else "└"))
        print("── $character ──")
        printTrie(child, indent + (if (character != lastCharacter) "│" else " ") + "       ")
        if (character != lastCharacter) {
            println()
            println(indent + "│")
        }
    }
}

fun main() {
    val root = TrieNode()

    for (pattern in patterns) {
        addPatternToTrie(root, pattern)
    }

    printTrie(root)
}
                

Looks easy enough, yeah?

The code above is ready to run. Try it out, explore, and experiment with it a bit. Can you modify it so that it outputs "◉︎" for nodes corresponding to the end of some pattern?

Use the "Open in Playground" link to open an editable version of the code.

How much memory will our trie use?

Self-checking question

In a trie, just like in any other tree, the number of edges is one less than the number of nodes. So, estimate how many nodes are there in the trie? Show answer

Answer: It’s not more that Σlen(Patterni).

The total number of objects in the children dictionaries for all nodes is equal to the number of edges. Since we used a hash table, the actual memory consumption will be slightly higher.

If you want to save memory but are willing to sacrifice some speed, you can use sortedMapOf instead of mutableMapOf. On the other hand, if you want to maximize speed and don’t mind using more memory, you can declare children as an array sized according to the alphabet, allowing for very fast transitions from a parent node to a child node.

So, we have a trie, and it’s already stored in memory. The variable root holds its root, and from the root, we can reach all other nodes. We’ve also discussed how to build suffix links: we need to traverse the trie from top to bottom. This can be done, for example, using a breadth-first search aka BFS.

class TrieNode {
    val children = mutableMapOf()   // Child nodes
    var suffixLink: TrieNode? = null                // Suffix link
}

//sampleStart
fun buildSuffixLinks(root: TrieNode) {
    val queue = ArrayDeque<TrieNode>() // BFS queue

    // Initialize the BFS with the direct children of the root
    for (childNode in root.children.values) {
        childNode.suffixLink = root // Direct children of root have suffix links pointing to the root
        queue.add(childNode)
    }

    // BFS to build suffix links for the rest of the trie
    while (queue.isNotEmpty()) {
        val currentNode = queue.removeFirst()

        // For each child of the current node, set its suffix link
        for ((char, childNode) in currentNode.children) {
            var suffixCandidate = currentNode.suffixLink

            // Traverse suffix links until we find a node that has a matching child
            while (suffixCandidate != null && !suffixCandidate.children.containsKey(char)) {
                suffixCandidate = suffixCandidate.suffixLink
            }

            // Set the suffix link for the child node
            childNode.suffixLink = if (suffixCandidate == null) root else suffixCandidate.children[char]

            // Add the child node to the queue for further processing
            queue.add(childNode)
        }
    }
}
//sampleEnd
                

Self-checking question

What is the time complexity of this code? Show hint Show answer

Hint: BFS operates in O(number of nodes + number of edges), which in our case translates to O(Σlen(Patterni)). However, there is a while loop in the suffix link calculation that can potentially take a long time.

Answer: In fact, the algorithm for constructing all suffix links has linear complexity relative to the sum of the lengths of the patterns. This is justified similarly to the linear complexity of calculating the prefix function in the Knuth–Morris–Pratt algorithm. Let’s consider one specific Patterni and the path along which it is read from the root downward. Each time we descend one level, the depth of the suffix link can increase by at most 1. In the while loop, the depth of the suffix link must increase by at least 1 (and never becomes negative!). Since the depth will increase over time only by the length of the pattern, it can’t decrease by more than that amount either.

Patterns search in the text

Finally, our trie is built: it contains all the patterns and has the suffix links established. Now let’s return to our task: we want to find occurrences of these patterns in a given Text.

As a reminder, the algorithm starts with us moving through the text from left to right, checking for any string in our set of patterns that shares a prefix with the current portion of the text. To do this, we will traverse down the trie following the corresponding characters.

Search pattern in the trie

If we find ourselves in a situation where we cannot descend by a character, it indicates that none of the patterns share such a prefix, meaning no occurrences were found, and we need to “shift the patterns”. However, we must shift them in such a way that the portion of text we have read still corresponds to the prefix of one of the patterns—that is, it must exist in our trie. In other words, we need to jump along the suffix link, as that’s precisely why we constructed them!

Search pattern in the trie: suffix links

If, after following a suffix link, we still cannot transition down by a character, we will follow the suffix link again. We’ll keep doing this until we either find a suitable transition or reach the root of the trie. Note that going by the suffix link from node "str" to the node "r" means that we shifted our patterns by two characters ahead and continued to compare them with the text:

Search pattern in the trie: shift equivalence

Self-checking question

What does it mean if, while following the suffix links, we arrive at the root of the trie? Show answer

Answer: It means that we shifted all patterns to the next position and begin comparing them from the first character.

Now, how do we know that we have found one of the patterns in the text? This will happen when, during our traversal of the trie, we find ourselves at a node corresponding to one of the patterns. To simplify this check, let’s add a corresponding marker to our class:

class TrieNode {
    val children = mutableMapOf<Char, TrieNode>()   // Child nodes
    var suffixLink: TrieNode? = null                // Suffix link
    var isEndOfPattern = false // Marks if this node represents the end of a pattern
}
                

If, during the algorithm, we arrive at a node where node.isEndOfPattern = true, it means we have found an occurrence of the pattern in the text, and the anti-spam system may react!

Problematic case

The section’s title hints that everything isn’t as perfect as it seems. Can you provide an example where we might miss finding an occurrence of a pattern in the text?

 

This situation can occur when one of the patterns is a substring of another. Let’s consider an example: the patterns "man" and "humanity" will create a trie that looks like this:

Problem in case of patterns man and humanity

If the Text is "humanism", we will follow the left branch of the trie and fail to notice that we encountered the word "man", as we are “counting” on the more promising and longer prefix "human" from the pattern "humanity".

Problem in case of patterns man and humanity

Self-checking question

How can we fix the issue of missing "man"?

Indeed, we need to recognize that we have found a pattern in the Text not only when we reach a node corresponding to that pattern, but also when that node is accessible from the current node via suffix links. To avoid checking whether any pattern node is accessible via suffix links every time (which would significantly degrade the algorithm’s performance), we can store this information during the construction of the suffix links.

To do this, we’ll add a field called matchLink to each node, which will represent the nearest node accessible via suffix links that corresponds to a complete pattern.

Additionally, we’ll introduce a field called patternIndex to keep track of the index of the pattern that corresponds to a given node. This field is relevant only for nodes where a pattern ends, so when we find a substring, we will know exactly which pattern was matched.

class TrieNode {
    val children = mutableMapOf<Char, TrieNode>()   // Child nodes
    var suffixLink: TrieNode? = null                // Suffix link
    var matchLink: TrieNode? = null  // Match link for pattern matching
    var patternIndex = Integer?      // ID of pattern which ends at this node
}
                

And we can modify the code of suffix link calculating to define matchLink after suffixLink

class TrieNode {
    val children = mutableMapOf<Char, TrieNode>()   // Child nodes
    var suffixLink: TrieNode? = null                // Suffix link
    var matchLink: TrieNode? = null  // Match link for pattern matching
    var patternIndex = Integer?      // ID of pattern which ends at this node
}

fun buildSuffixLinks(TrieNode root) {
    val queue = ArrayDeque<TrieNode>() // BFS queue

    // Initialize the BFS with the direct children of the root
    for (childNode in root.children.values) {
        childNode.suffixLink = root // Direct children of root have suffix links pointing to the root
        queue.add(childNode)
    }

    // BFS to build suffix links for the rest of the trie
    while (queue.isNotEmpty()) {
        val currentNode = queue.removeFirst()

        // For each child of the current node, set its suffix link
        for ((char, childNode) in currentNode.children) {
            var suffixCandidate = currentNode.suffixLink

            // Traverse suffix links until we find a node that has a matching child
            while (suffixCandidate != null && !suffixCandidate.children.containsKey(char)) {
                suffixCandidate = suffixCandidate.suffixLink
            }

            // Set the suffix link for the child node
            childNode.suffixLink = if (suffixCandidate == null) root else suffixCandidate.children[char]

            //sampleStart
            // This code is located inside of the BFS loop in buildSuffixLinks().
            // Expand to see more.

            // If a new suffix link points to the end of the pattern, then set matchLink to the same node.
            // Otherwise, copy the value of matchLink from the suffix link node to our node.
            if (childNode.suffixLink!!.patternIndex != null) {
                childNode.matchLink = childNode.suffixLink
            } else {
                childNode.matchLink = childNode.suffixLink!!.matchLink
            }
            //sampleEnd
            // Add the child node to the queue for further processing
            queue.add(childNode)
        }
    }
}
                

Exercise

Prove that the search itself works in O(len(Text)) time.

Let’s break it down

The Aho-Corasick algorithm is a great way to search multiple patterns within a single text. Here’s a breakdown of how it works and why it is efficient:

Trie Construction
The first step in the Aho-Corasick algorithm is to construct a Trie or a Prefix tree. Each node in the Trie represents a character, and each path from the root to a node forms a prefix of one or more patterns. As a result, the trie contains all the patterns to be searched. This structure allows us to efficiently track where in the text a match could occur and helps us quickly verify prefixes of patterns.

Suffix Links
Once the Trie is built, we add suffix links to each node. A suffix link from node S leads to the node representing the longest suffix of S that is also a prefix of some other pattern. Basically, the idea of suffix links comes from the Knuth–Morris–Pratt algorithm. When a mismatch occurs, instead of starting from scratch, we can “jump” to the longest matching suffix, reducing unnecessary comparisons. Suffix links help to maintain linear performance when patterns share common prefixes or are nested inside each other.

Pattern Matching Process
Once the Trie and suffix links are built, the algorithm starts traversing the text, checking character by character. For each character in the text:

This allows us to always move forward through the text and avoid re-checking characters we’ve already processed.

Handling Matches
When we reach a node that corresponds to the end of a pattern, we know that a match has been found. Additionally, using match links, we can trace which exact pattern has been found. This helps when multiple patterns match at overlapping positions, ensuring no pattern is missed.

Efficiency
The Aho-Corasick algorithm has a time complexity of O(len(Text) + Σlen(Patterni)). This is due to:

This makes it an efficient solution for tasks that involve searching for multiple patterns simultaneously in large bodies of texts, such as spam filtering, because we can build a trie once, and then use it for new texts a lot of times.

Additional materials

I recommend checking out the following videos if you want to listen more about the Aho–Corasick algorithm:

Besides that, you might like this trie visualiser. Check it out!

Suffix trie

The Aho-Corasick algorithm works well when we have many patterns, and all of them are known in advance. In such cases, we can build a trie for these patterns and check new texts for occurrences of those substrings in time independent of the number and length of the patterns. Great!

But what if the situation is different? What if we have a fixed text, for example, the text of this page, a book’s content, or a project in IDE, and we occasionally want to search for various strings in it? Since we don’t know in advance which strings we’ll want to search for, we can’t build an Aho–Corasick’s trie in advance. On the other hand, we do know that the text won’t change, so we can preprocess it in a way that allows us to work efficiently with it later.

Exercise

Suppose we are given a Text. Try to think of what strings should be inserted into the trie so that after that we could quickly answer whether the some Pattern is contained within Text.

Think about an answer before scrolling further.

 

The core idea here is that we can insert all the suffixes of the Text into the trie! If Pattern exists in our Text, it will match the beginning of one of the suffixes, meaning there will be a node in the trie that corresponds to this pattern. Conversely, if a node corresponding to the string P exists in the trie, it means that some suffix of the text begins with P, implying that P is present in the Text.

This is how this suffix trie might look:

Suffix trie

Self-checking question

For what string we’ve built a suffix trie above? Show hint Show answer

Hint: Every string is equal to the longest suffix of itself.

Answer: This string is "mississippi".

Self-checking question

Estimate the number of nodes in a suffix trie of the string Text in general case. Show answer

Answer: In a suffix trie for the string Text, there are O(len(Text)2) nodes.

Self-checking question

What should the text of length N look like for the suffix trie to have the maximum possible number of nodes? Show answer

Answer: If the size of the alphabet is greater than or equal to N, any string in which all the characters are distinct will work. In this case, all the suffixes will start with different characters, diverging immediately at the first level of the trie. For smaller alphabets, the situation becomes more complex, but the idea is the same.

As a reminder, in the Aho-Corasick trie, we stored information in the nodes about whether a node corresponds to the end of any pattern. Here, we need to do something similar and store information about whether a node represents the end of any suffix. And, if so, which suffix it corresponds to (otherwise you might notice in the picture above that suffix "i" is lost across "ippi", "issippi" and "ississippi"). While this isn’t necessary to answer the question “Does the substring appear in the text?”, it is essential if we want to determine the exact position of the match.

To avoid storing such markers, there’s a clever trick: add a unique symbol to the end of the original text that doesn’t appear anywhere else in the string. You might need to add it to the alphabet, but that’s fine! This symbol is typically denoted by $, implying that it’s not a regular letter but a new, previously unknown character. For example, from the string "mississippi", we get "mississippi$".

Self-checking question

How does this affect the suffix trie and the nodes corresponding to the ends of the suffixes? Show answer

Answer: All suffixes will now end with the $ symbol, which doesn’t appear elsewhere in the text. As a result, all suffixes will terminate at leaf nodes, and all leaves will mark the ends of the suffixes. Suffix trie with sentinel characeter

Building a suffix trie

Since a suffix trie contains O(len(Text)2) nodes, building it takes O(len(Text)2) time, and the trie itself requires O(len(Text)2) memory. Naturally, this complexity isn’t ideal—most of the time, it would be faster to simply run the Knuth–Morris–Pratt or Boyer–Moore algorithms for each pattern.

To speed up the construction of the suffix trie, we need to reduce the size of the trie itself, as it’s impossible to construct O(len(Text)2) nodes any faster. Fortunately, this can be done in two independent ways:

  1. Compressing the edges: Instead of storing individual edges for every character, we can compress consecutive edges that form a “chain” without branching. This chain of edges can be replaced by a single edge labeled with the entire substring. The resulting compressed structure is called a suffix tree: Suffix tree
  2. Merging equivalent nodes: Nodes that have identical sets of paths from them to the leaves can be merged. For example, nodes "ssip" and "ip" because if we reach one of these nodes, everything from that point on will be identical (the only path from these nodes to leaves is "pi$"): Suffix automaton: beginning

    Let’s merge these nodes into one!

    Suffix automaton: merging nodes

    Self-checking question

    What other nodes we can merge with these two? Show answer

    Answer: Plenty of them! Nodes "ip", "sip", "ssip", "issip", "sissip", "ssissip", "ississip", and "mississip" all can be merged into one node.

    Continuing merging all groups of nodes which we are able to merge, we’ll get a totally new structure:

    Suffix automaton

    The resulting structure is no longer a tree, but rather finite-state automaton, so this structure is called a suffix automaton.

Suffix tree

Let’s discuss how to build a suffix tree, which is essentially a edged-compressed suffix trie. We’ll use the string "mississippi$" as an example. We start by adding the shortest non-empty suffix "$", then add the two-character suffix "i$", and continue this process until we include the entire string.

The first suffix is easy to add:

Suffix tree: first character

The second and third suffixes are also straightforward:

Suffix tree: second character Suffix tree: third character
Keep in mind that our suffix tree is compressed. That’s why for each new suffix, we added just one edge so far, with the entire suffix written on it.

However, when we try to add the fourth suffix, "ppi$", it can’t be added as a single edge because, in the uncompressed trie, it shares a common prefix with the suffix "pi$". So, we need to split the edge labeled "pi$" into two parts, creating a separate edge for the letter p, and then adding another edge for "pi$".

Suffix tree: split the edge
In general, since the suffix tree is a compressed trie, each node will either be a leaf or have at least two children. There cannot be any nodes with only one child (the only exception is the suffix tree for one-character string "$").

When we try to add a suffix "ippi$" we’ll need to split an existing edge again. Then, "sippi$" is easily added as individual edges from the root, etc.

Suffix tree: split the edge

We can continue adding the remaining suffixes to the tree:

Suffix tree

Notice that splitting an edge and adding a new node does not always happen at the first level!

Self-checking question

What is the maximum number of edges that need to be split when adding a new suffix? Show answer

Answer: Only one. If we add a new node in the middle of an existing edge, the rest of the new suffix will hang off the new node as a single edge.

Self-checking question

Estimate the number of nodes in the suffix tree. Show answer

Answer: For a string Text, the number of non-empty suffixes is equal to len(Text). Since we add at most two new nodes per suffix, the total number of nodes in the suffix tree is at most 2 × len(Text) + 1. Actually, as the tree for two-characters string "a$" has only 3 nodes, the real limit is even less, it is 2 × len(Text) - 1.

Self-checking question

Can you provide an example of a string where the number of nodes in the suffix tree equals 2 × len(Text) - 1? Show answer

Answer: The suffix tree for the string "aaa...aaa$" looks like this: Suffix tree

Exercise

Take a piece of paper or your favorite image editor and build a suffix tree for the string "successlessness$".

How many nodes are in the suffix tree?

 

Building a suffix tree

Recall that an uncompressed suffix trie contains O(len(Text)2) nodes, while the compressed version, a suffix tree, has only O(len(Text)) nodes. That’s a significant improvement! This gives hope that we can build the suffix tree, in O(len(Text)) time.

And indeed, we can. The first linear-time algorithm for constructing a suffix tree was proposed by Weiner, the originator of the suffix tree concept. However, probably the most popular algorithm is Ukkonen’s algorithm. You can read more about it in Ukkonen’s original paper or check out a simplified explanation on GeeksforGeeks or StackOverflow.

There is also a simpler linear-time algorithm for building a suffix tree, but unfortunately, I couldn’t find a description of it in English. We might discuss it in the chat if you want to :-)

If you’re interested in learning more about suffix trees, I also recommend checking out the review article by Robert Giegerich and Stefan Kurtz, as well as the "Efficient implementation of lazy suffix trees" by the same authors.

Use a visualiser if you want to play with suffix trees.

Suffix automaton

A suffix automaton is another approach, distinct from the suffix tree, aimed at compressing a suffix trie into O(len(Text)) nodes, thereby reducing the time needed to build it. This time, we start again with the unsimplified suffix trie but merge nodes together that are equivalent in terms of the strings written on paths from that node to the leaves. As a result, the structure will no longer be a tree, but the resulting directed graph will be acyclic:

Suffix automaton

Since the graph is no longer a tree, the terms root and leaf are no longer appropriate. Let’s refer to the former root as the start node, and the former leaves as final nodes. Actually, these terms come from finite automata theory.

Self-checking exercise

Prove that the resulting graph is acyclic. Show answer

Answer: All suffixes of Text (and only them) are still read along the paths from the start node to the final nodes. If there were a cycle in the graph, it would be possible to find a path that traverses through this cycle. This would allow us to construct an arbitrarily long path from the start node to one of the final nodes, which contradicts the fact that these paths correspond only to suffixes of the text. Cycle in suffix automaton

Merging nodes

Let’s understand how we can determine which nodes can be merged and which cannot. The possibility of merging two nodes depends on the strings that can be read along the paths from these nodes to the final nodes. What is this set of strings? For instance, consider the string "ss" and its corresponding node in the suffix trie for the text "mississippi$":

Suffixes of ss

The strings that can be read along the paths from this node to the final nodes are "issippi$" and "ippi$". In other words, these are the endings that can be appended to "ss" to form a suffix of the original text. Since these endings themselves are also suffixes of the text, instead of the endings themselves, we can represent them by their lengths. In this case, we get the set {len("issippi$"), len("ippi$")} = {8, 5}. For convenience, let’s call this set endings("ss") = {8, 5}.

Now, think about how we can formulate the condition for merging nodes corresponding to strings S and T in our automaton using the endings() function.

 

The answer is simple: two nodes can be merged if endings(S) = endings(T). Otherwise, they cannot be merged because if the sets are not equal, it means that after string T, there is some ending that can lead to a final node that cannot occur after string S. In such a case, merging the nodes is strictly no-no!

After merging all possible sets of nodes, we’ll find that each node in our automaton corresponds to a certain value of the endings() function. Values of this function for different nodes will be distinct.

Another important question is: for which strings can the values of the endings() function coincide? For example, in the case of the string "mississippi$", the following values are equal: endings("ss") = endings("iss") = {8, 5}, but how can we know it?

A bit of math!

Please take a piece of paper and try to prove the following statements on your own.

Exercise 1

Prove that if endings(S) = endings(T), then either S is a suffix of T, or T is a suffix of S. Show answer

Answer: The equality of the values of endings() means that the strings S and T end at the same positions in the Text. This implies that at these positions, one of them must be a suffix of the other.

Exercise 2

Prove that if len(S) < len(T) and endings(S) ∩ endings(T) ≠ ∅, then endings(T) ⊂ endings(S), and S is a proper suffix of T. In other words, for any strings S and T, their endings(S) and endings(T) sets either do not intersect at all, or one is a subset of the other (in which case one of the strings is a suffix of the other). There are no other possibilities! It’s no chances that endings(S) = {8, 5} and endings(T) = {5, 3, 1}.

Exercise 3

If S1, S2, ..., SK are all strings with the same value for the endings() function, then they all have different lengths (this follows from Exercise 1). Let S1 be the longest, S2 be the next longest, and so on, with SK being the shortest. Prove that in this case S2 = S1[1:], S3 = S2[1:], and so on, meaning each subsequent string is the longest proper suffix of the previous one. Show hint

Hint: Strings with equal endings

We have proven some important statements about the endings() function. Now, let’s build the automaton using this knowledge. We will add characters from the string one by one, updating the automaton step by step. For an empty string, the automaton is trivial and consists of just one node:

Empty automaton

Adding the first letter doesn’t complicate things much:

First character automaton

Now, let’s jump ahead! Consider the situation where we have already built the automaton for the prefix Text[:r], and we want to extend it for the prefix Text[:r+1]. For example, we have an automaton for "missi":

Automaton for string missi
Check that all paths from the top to final nodes are suffixes of the string "missi" and only them.

We now want to add the next character from the string "mississippi$", which is s. The automaton should now accept new suffixes: "missis", "issis", "ssis", "sis", "is", and "s". How do we achieve this? We need to take all the current suffixes (which are represented by the paths from the start node to the final nodes) and “extend” them with the letter s. Let’s start with the lowest final node, as it is the easiest to extend:

Automaton for string missis, step 1

When the suffix is extended, the final node moves, and the newly created node becomes the new final node.

With the second final node, it’s more complicated since there is already an edge labeled "s" from it. We can’t just declare this node corresponding string "is" as final because it would incorrectly imply that "mis" is also a suffix of "missis", which isn’t true. How can we resolve this?

The solution is simple: let’s clone this node! The old copy will handle the substring "mis", and the new one will handle the suffixes starting with "is".

Automaton for string missis, step 2, clone

Now, the automaton accepts the following new suffixes: "missis", "issis", "ssis", "sis", "is". The only remaining task is to add the newest suffix, "s". To do this, we simply mark the corresponding node as final:

Automaton for string missis, last step

Let’s now extend the automaton by adding one more character from our text—another s. We add it at the bottom:

Automaton for string mississ, step 1

With the final node on the left, we again can’t simply move the final node along the s edge, because it would incorrectly imply that "miss" is a suffix of "mississ", which isn’t true. So, we clone the node again:

Automaton for string mississ, step 2, clone

With the final node on the right, we can move the edge labeled "s" to the new final node instead of the node we cloned it from:

Automaton for string mississ, step 3, transfer edge

Finally, we need to declare that the string "s" is a suffix, so the rightmost node becomes final again. At this point, we have the automaton for the string "mississ":

Automaton for string mississ

By continuing this process, adding one character at a time, we will eventually obtain the suffix automaton for the string "mississippi$". It can be proven that with each step, we add no more than two new nodes: one at the bottom and possibly one elsewhere as a clone. For the curious: it can be proven that if we clone a node once during the addition of a new character, then it is possible to draw edges from left terminal nodes to the new clone where it’s necessary, and this will be correct.

Therefore, the total number of nodes in the automaton will be O(len(Text)). Moreover, the number of edges will also be O(len(Text)), and the construction will run in O(len(Text)) time. You can read more about this process here.

By adding the $ symbol at the end of the string, we ensure that there will be exactly one final node in the automaton at the end.

Exercise

Prove it!

Exercise

Take a piece of paper (or your favorite image editor, yeah) and build a suffix automaton for the string "successlessness$".

How many nodes are in the suffix automaton?

 

What is the longest string reading from the start node to the node "l" (it has endings() value equals {8})?

 

Searching for a Pattern in a Text

Recall that the suffix automaton is just a modification of the suffix trie, which we needed to preprocess the text in O(len(Text)) time, allowing us to efficiently search for any patterns later. Imagine we’ve already built the suffix automaton for the Text.

To check if the string Pattern exists in the Text, we simply start at the automaton’s start node and attempt to “read” the Pattern by traversing the edges of the automaton. If we successfully read the entire string, it means the Pattern is present in the Text. If at any point we can’t follow an edge, then the Pattern does not exist in the Text.

Search in automaton
Pattern "miss" exists in our text, and pattern "sir" doesn’t.

But how do we find the exact position of Pattern in Text? For example, what if we want to find the first occurrence of Pattern in Text?

Exercise

How can we use the suffix automaton (or suffix tree, for that matter) to find the position of Pattern in Text? Show hint Show answer

Hint: Let’s say after reading the string Pattern in the automaton, we arrive at a node V. The maximum value in the set endings(V) corresponds to the length of the longest suffix that can follow Pattern in Text. In this case, the first occurrence of Pattern starts at the index max(endings(V)) - len(Pattern) + 1.

Answer: See the hint. After that, we can store max(endings(V)) in each node V. This can be done by traversing the automaton’s nodes from bottom to top (which is possible since the automaton is acyclic, refer to topological sorting).

Suffix links

In both suffix trees and suffix automata, we can maintain suffix links between nodes. Exactly same as in Aho–Corasick trie, a suffix link from node V points to another node U such that the (longest) string read from the root to U is the longest proper suffix of the (shortest) string read from the root to V.

Look how suffix links could be added to the suffix tree and suffix automaton.

Suffix links in suffix tree Suffix links in suffix automaton

It’s a mess, yeah, but try to investigate images carefully and understand why suffix links are drawn how they are drawn.

In fact, many algorithms that build suffix structures use suffix links during their construction, so they are created naturally as part of the process. Moreover, having these links helps solve certain problems more efficiently. For example, consider the task of finding the longest common substring of two strings.

Exercise

Design an algorithm to find the longest common substring of two strings using a suffix automaton with suffix links.

Break down and additional materials

A suffix trie stores all suffixes of a fixed text, allowing efficient substring searches. It can answer if a pattern exists in the text by matching it to the beginning of a suffix. A suffix trie has O(len(Text)2) nodes, requiring significant memory and time to build, making it less practical for large texts.

A suffix tree is a compressed version of a suffix trie, reducing complexity to O(len(Text)) by merging consecutive edges and compressing paths, leading to faster construction and smaller size.

A suffix automaton as an alternative reduces the size by merging equivalent nodes, creating a finite-state automaton. It retains the same functionality and as suffix tree also requires O(len(Text)) nodes and edges.

To check if a Pattern exists in the Text, the suffix tree or suffix automaton is traversed, and if the entire Pattern is read along the path, the Pattern exists in Text.

Suffix structures are useful for searching, pattern matching, and other complex string problems, and are built in O(len(Text)) time, ensuring scalability for large texts.

If you’d like to know more about suffix structures, I’d recommend to read:

This lecture by Andrey Gein is licensed under Creative Commons Attribution-ShareAlike 4.0 International

You can distribute, remix, adapt, and build upon the material in any medium or format, even for commercial purposes, but you must give appropriate credit, provide a link to the original content, and indicate if changes were made. You may do so in any reasonable manner.

If you remix, adapt, or build upon the material, you must license the modified material under identical terms.