String Matching

You’re likely reading this text in a browser. Press Ctrl+F (⌘+F on macOS) and search for the word "text" on this page. The browser will instantly show you how many times the word appears. Even in texts hundreds of times longer than this page, browsers can quickly find the desired substring. Today, we’ll look at the algorithms that make this possible.

Of course, these algorithms aren’t just used in browsers but also in text editors, in the grep utility, in standard libraries of programming languages and even in bioinformatics–at last, DNA is just a string, and genes are its substrings.

We’ll start with a simple naive algorithm, then gradually improve and refine it. But first, let’s agree on the terminology we’ll use. If you’re already familiar with basic string notations, alphabets, prefixes and suffixes, you can skip this section or return to it later when something will seem to be unclear.

Let’s start from the beginning. A string is a sequence of characters (also sometimes referred to as letters or symbols), and an alphabet is the set of all characters that can appear in strings. For example, if the alphabet consists only of lowercase Latin letters, then "hello" and "world" are strings (the quotation marks just frame the words and aren’t part of them), but "Hello world" isn’t because the alphabet doesn’t include uppercase letters or spaces. The alphabet, whatever it is, is typically denoted by the symbol Σ.

In everyday life, we usually deal with meaningful texts made up of sentences and words. However, for the purposes of this lecture, a string will refer to any sequence of characters, regardless of meaning. In computer science and programming, strings are often treated without trying to extract any semantic meaning from them. This approach introduces certain limitations but also opens up additional possibilities—for example, you can search not only for a word in a novel but also for mutations in a genome or a specific sequence of bytes in a file on a disk. So yes, "abacadabacaba" is also a string, and a very good string!

If S is a string, then len(S) is its length. For example, len("Big Ben") is 7. We’ll also use the notation S[5] to refer to the sixth character in the string, as indexing starts at zero in most programming languages. It’s also useful to borrow the concept of slicing from programming: S[10:15] refers to the substring of S starting at S[10] and ending at S[14].

You can concatenate two strings using the + symbol. For instance, "Einer ist " + "keiner" = "Einer ist keiner".

A special case is the empty string consisting of zero characters, which is denoted by ε.

A string S is a substring of another string T if some slice of T is equal to S. String S is a prefix of T if this slice starts at the first character. And it’s a suffix if it ends with the last character.

For example, the string "London" is a prefix of "London is the capital of United Kingdom", and "ted Kingdom" is a suffix of it. The empty string ε is both a prefix and a suffix of any string, including itself. In fact, any string is both its own prefix and suffix, so we distinguish proper prefixes and proper suffixes, meaning they don’t match the entire string. Sometimes, empty string is also not considered as a proper prefix or suffix, but for our purposes let’s count them.

For convenience, we’ll sometimes refer to the long string we’re searching in as the text.

We will also need the concept of lexicographical or alphabetical order. To define it, the alphabet must have a strict order across characters, like: a < b < ... < z < A < B < ... < Z < " ". A string S is lexicographically less than a string T if they share a common prefix (possibly empty), and at the first position where they differ, the character in S is less than the character in T. For example, with the order of characters above, the string "stay foolish" is lexicographically less than "stay hungry". Naturally, this is shortly denoted as "stay foolish" < "stay hungry". By the way, this relation is a strict order relation on a set of all strings.

In the special case where one string is a proper prefix of the other, we consider the prefix always less than the full string: "You shall not" < "You shall not pass".

Fun zone!

I asked language models to turn this lecture into a podcast where two people discuss today's topics. Play it during a walk — learning algorithm theory has never been this easy!

Note: All the text, sound, and even video were generated by algorithms. There are inaccuracies and even (gasp!) mistakes. Please treat it as entertainment. It's by no means a replacement for the full written material :-)

Let’s return back to our problem of finding occurrences of a given substring within a text. Fortunately, it has an obvious solution: we can simply go through the text and try "aligning" the target string at every position. At each alignment, we check if all the characters match. If even one character is different, we move the string over by one position and try again.

Here’s an example of how to implement this algorithm. Please, spend some time exploring this code snippet (and others you come across). To play with it, try editing the text or pattern, and then run the code. You can view the full code by clicking the "+" icon at the top of each snippet.

//sampleStart
val text = "Eighty percent of success is showing up"
val pattern = "success"

fun findFirstOccurrence(pattern: String, text: String): Int? {
    for (i in 0..(text.length - pattern.length)) {
        for (j in pattern.indices) {
            if (pattern[j] != text[i + j]) {
                break
            }
            if (j == pattern.length - 1) {
                return i
            }
        }
    }

    return null
}
//sampleEnd

fun main() {
    println("Found pattern \"${pattern}\" in text \"${text}\" at offset ${findFirstOccurrence(pattern, text)}")
}

To deepen your understanding, ask yourself a few questions about how the code works. For instance, how would you modify it to find the pattern case-insensitive? Or to allow for up to two errors in the pattern? Practice also coming up with your own questions!

How bad is this naive approach? It’s easy to see that this algorithm requires no extra memory (which is great!), but in the worst-case scenario, it takes O(len(Pattern) × len(Text)) time to execute. That said, for most real-world cases, this algorithm will perform much faster than O(len(Pattern) × len(Text)) because mismatches will often be detected in the first few characters at many positions.

Self-checking question

Provide examples of Pattern and Text strings where the naive algorithm actually performs O(len(Pattern) × len(Text)) operations. Show hint Show answer

Hint: Consider Text = "aaaa...aaa".

Answer: For instance, Text = "aaaa...aaa" and Pattern = "aaa...aab".

Self-checking question

What’s the minimum number of operations this algorithm could perform? And what’s the minimum number of operation this algorithm could perform if there are no occurences of Pattern in Text? Show answer

Answer: The best time is O(len(Pattern)). For the second question, the best time is O(len(Text)). It will be achieved with Pattern started with the letter which even doesn’t exist in the Text.

It might seem that the naive algorithm is impractical and not used anywhere. However, this isn’t true: it performs excellently on short strings with many occurrences (for example, searching for "\r\n" in a long text). It also doesn’t use any extra memory, which is why for many years it was the primary algorithm for string searching in glibc (GNU C Library), and by extension, in C and C++. Even now, the naive algorithm is still used under the hood in many JDK (Java Development Kit) versions when you call the String.indexOf() method in Java or String.contains() in Kotlin.

Self-checking question

Find the source code for String.indexOf() method from OpenJDK and verify that the naive algorithm is used. Show hint Show answer

Hint: Source codes of OpenJDK are available on https://github.com/openjdk/jdk

Answer: See StringLatin1.java

Additionally, this algorithm efficiently uses processor caches and can even be parallelized on a single core, thanks to special processor instructions like SSE or AVX:

While the naive algorithm works well in some cases, it’s not suitable for fast searches of large patterns or in large documents. It’s main issue is that after each mismatch, the substring is only shifted by one position to the right, and the entire process starts over. What can be changed here to make the algorithm more efficient? One of two things: we can either compare the pattern with the substring differently (for example, right-to-left instead of left-to-right), or, in case of a mismatch, shift the pattern by more than just one position to the right. Let’s give it a try!

...in case of a mismatch, shift the pattern by more than just one position to the right...

Let’s take a closer look at this. If a mismatch occurs at the very first character during the alignment of the pattern with the text, it’s not a big deal—we haven’t wasted much time, and it makes sense to simply shift the pattern by one position to the right for the next attempt.

However, if we have a significant match, say 3 characters, it’s likely that nothing will match at the next position, and we’d ideally want to shift the pattern as far as possible—ideally by those 3 characters!

Unfortunately, it’s easy to come up with an example where such a large shift would cause us to miss an actual occurrence of the substring.

So, we need to figure out how far we can safely shift the pattern without missing anything important. Assume on the previous attempt we matched r characters, but the (r+1)-th character didn’t match. In other words, Pattern[:r] = Text[i–r:i] and Pattern[r] ≠ Text[i].

Shifting the pattern

Let’s try shifting our pattern by a distance smaller than the length of the matching part, meaning that it overlaps with Text[i–r:i]. This only makes sense if overlapping substrings are identical (otherwise there will be no match 100%). Since Text[i–r:i] = Pattern[:r], this overlapping substring will inevitably be both a proper prefix and a proper suffix of Pattern[:r].

The key idea of the Knuth–Morris–Pratt algorithm is that we can determine this shift in advance, without even looking at the text we’re searching. For each possible r from 0 to len(Pattern), we can calculate π(r)—the length of the longest string that is both a proper prefix and a suffix of Pattern[:r+1].

	0	1	2	3	4	5	6	7	8	9	10	11
Pattern[r]	a	n	a	n	a	b	a	n	d	a	n	a
π(r)		0	1	2	3	0	1	2	0	1	2	3

Note that π(r) < r, and at π(0) is undefined.

Self-checking question

Why π(7) = 2 in the example above? Show answer

Answer: Because the longest proper prefix of the string "ananaban" which at the same time is the suffix of "ananaban" is "an".

Self-checking exercise

Fill the table of the prefix function π for the string "abracadabra".

	0	1	2	3	4	5	6	7	8	9	10
Pattern[r]	a	b	r	a	c	a	d	a	b	r	a
π(r)		0	0	1

Main algorithm

Knowning all the values of π(r), the Knuth–Morris–Pratt algorithm becomes straightforward. You only need to maintain two indices, i and r, such that Text[i–r:i] = Pattern[:r] (and r is the largest possible value).

If the next characters also match (Text[i] == Pattern[r]), you simply increment both i and r. If they don’t match, you set r = π(r) and leave i unchanged (unless r = 0, in which case you have to go to the next character by increasing i).

val text = "The lady checked in the luggage ananas, banana, bandana, ananabandana, and a little dog"
val pattern = "ananabandana"
val π_values = intArrayOf(-1, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3)

fun π(r: Int): Int {
    return π_values[r]
}

fun main() {
    //sampleStart
    var r = 0
    for (i in 0 until text.length) {
        while (r > 0 && pattern[r] != text[i]) {
            r = π(r)
        }
        if (pattern[r] == text[i]) {
            r += 1
        }
        if (r == pattern.length) {
            println("Found pattern \"${pattern}\" in text \"${text}\" at position ${i - pattern.length + 1}")
            r = π(r-1)
        }
    }
    //sampleEnd
}

At first glance, it might seem like the line r = π(r) could be executed len(Pattern) × len(Text) times in the worst case. However, that’s not true. In reality, the entire code runs in Θ(len(Text)) time.

Exercise

Assuming that all values of π(r) are known, prove that the entire code above runs in Θ(len(Text)) time. Show hint

Hint: Every time when r is updated in line 4 (r = π(r)), r is decreasing. Initially, r = 0 and it never becomes negative.

Precalculating π

Of course, you don’t need to call the π(r) function each time. Since the π values depend only on the pattern and not on the text, you can precompute them and store them in an array of size len(Pattern). How to do so?

Let’s start from the beginning: π[0] is undefined, π[1] = 0. Now let’s imagine we know all the values π[1], π[2], …, π[i–1]. According to its definition, we’d like to find the largest possible r such that Pattern[i–r:r] = Pattern[:r].

Hmm, “find the largest possible r such that Pattern[i–r:r] = Pattern[:r]”. Does this sound familiar?

Exactly! This is exactly what we were doing in the main algorithm, except there we were looking for the largest r such that Text[i–r:r] = Pattern[:r]. So, just like before, at each step of the algorithm, we maintain two indices, i and r, such that Pattern[i–r:i] = Pattern[:r], and r is the largest such value.

//sampleStart
val pattern = "to begin to toboggan first buy a toboggan, but don't buy too big a toboggan. too big a toboggan is too big a toboggan to buy to begin to toboggan"

fun calculate_π(pattern: String): IntArray {
    val π = IntArray(pattern.length)
    π[1] = 0
    var r = 0
    for (i in 1 until pattern.length) {
        while (r > 0 && pattern[r] != pattern[i]) {
            r = π[r]
        }
        if (pattern[r] == pattern[i]) {
            r += 1
        }
        π[i] = r
    }

    return π
}
//sampleEnd

fun main() {
    println("Prefix function for the string \"${pattern}\" is:")
    println(pattern.toCharArray().contentToString())
    println(calculate_π(pattern).contentToString())
}

Just like the main algorithm, this preprocessing step takes Θ(len(Pattern)) time. It also uses Θ(len(Pattern)) extra memory.

Let’s break this down

The naive algorithm checks every single position in the text, one by one, which results in significant inefficiency for large documents.
The Knuth–Morris–Pratt algorithm, however, avoids redundant comparisons by calculating the prefix function for the pattern beforehand. This allows it to shift the pattern intelligently after each mismatch, significantly reducing the time complexity of the search.
π(r) helps determine the longest prefix of the pattern that also matches a suffix, enabling faster realignment when searching for the next potential match.

Additional materials about Knuth–Morris–Pratt algorithm

Visualization of Knuth-Morris-Pratt Algorithm from "G Bhat" channel:
Some other application of the prefix function
Visualization by the University of British Columbia

The Knuth–Morris–Pratt algorithm runs in Θ(len(Pattern) + len(Text)) time, no more and no less, even at the best case, since it must calculate the prefix function of the Pattern and then look at every character of the Text. But can we do better? It turns out that sometimes we can!

Self-checking question

One might wonder why we’d want to improve beyond Θ(len(Pattern) + len(Text)) if we still have to read both the Pattern and Text into memory, which obviously takes Θ(len(Pattern) + len(Text)). What’d you answer? Show answer

Answer: Well, yes, as you might guess, sometimes there’s no need to load the entire text into memory. We can work directly with a file, reading only specific bytes, or use external memory that provides efficient access to its contents. So let’s try to do our best and achieve better time complexity in some cases.

As previously, we align the pattern with the text and compare the corresponding characters. If they don’t match, there’s no point in checking further and we can immediately shift the pattern to the right. We don’t want to skip a potential match, so we should choose the shift carefully.

In the Knuth-Morris-Pratt algorithm, we’ve been comparing characters of the pattern and the text from left to right, and upon finding a mismatch, we shift the pattern according to the prefix function. But what if we would still move our pattern from left to right, but would compare the characters in the opposite direction, from right to left? In that case, the mismatch situation would look similar but mirror-like: we would have matching suffixes (ba below) and two different characters (e and m).

Let’s look at the marked letters a. This time they’ve matched, and it’s great! We’d like to shift our pattern to the right, but only that way to line up a from the text with some other a from the pattern.

Self-checking question

How many positions should the pattern be shifted to the right so that the letter a from the text lines up again with an a in the pattern? Could we precalculate this shift before running the algorithm? What if the text was not "awesomebanana" but "awesomebmnana"? Could we precalculate the shift in this case either if we want to align the letter m with some m from the pattern again?

Please, take a moment to think about questions above before scrolling further.

Indeed, the answer to the question "how many positions should the pattern be shifted" depends only on the pattern and the marked letter a, so this shift can be precomputed before the search begins!

It’s obvious (isn’t it?) that to do the meaningful shift (when a will align with other a again), we need to find the rightmost occurrence of the letter a in the pattern (if a is also the last letter of the pattern, as in our case, then we look for the second rightmost occurrence, otherwise we won’t shift the pattern at all). All other shifts can be safely skipped because they will certainly not match:

Next time, we’ll try to match the pattern and the text at new position. Fortunately, they mismatch straight away because n ≠ a

Moreover, after that fail we can safely skip 4 next alignments, since the pattern has no letters n in it, and there is nothing to align with n in the text. As a result, we can shift the pattern by 5 positions.

As with Knuth–Morris–Pratt, the key point here is that the size of the shift doesn’t depend on the text but only on the pattern and the marked letter of the text (a and n in two examples above). Let’s precompute the shift size for all possible letters by finding the rightmost occurrence of each letter in the pattern:

fun main() {
    //sampleStart
    val pattern = "Banana anomaly"

    val shiftTable = IntArray(256) { pattern.length } // Default shift is the length of the pattern
    for (i in 0 until pattern.length - 1) {
        shiftTable[pattern[i].code] = pattern.length - 1 - i
    }
    //sampleEnd

    println("Shift table for pattern \"${pattern}\" (other values are equal to ${pattern.length}):")
    for (code in 0 until 256) {
        if (shiftTable[code] != pattern.length) {
            println("  ${Char(code)} → ${shiftTable[code]}")
        }
    }
}

Don’t forget to run the code above and to interpret the results

To precompute such an array, we need to know the size of the alphabet. In the snippet above, for instance, we assume that both the pattern and the text only contain characters with codes between 0 and 255. If we want to search across the entire Unicode range, the array would need to be hundreds of times larger. In such cases, using a hash table might be more efficient for smaller patterns.

And yes, that’s all! Just shift our pattern by shiftTable[Text[i].code] and check the corresponding substring of the text for matching again. This simple string-search algorithm is called the Boyer–Moore–Horspool algorithm. The best part here is that we ignored some characters of the Text entirely. Really, we haven’t even looked at them!

Self-checking question

What is the worst-case complexity of this algorithm? Show answer

Answer: O(len(Pattern) × len(Text)). Could you suggest corresponding Pattern and Text?

The advantage of this algorithm is that, in real-life scenarios with typical texts and patterns, most comparisons end very quickly, and the pattern is shifted by a significant number of characters—often by the full length of the pattern, or by half or a third of it. As a result, the algorithm’s complexity is close to its best case — O(len(Text) / len(Pattern)).

Self-checking question

Give an example of a long text and pattern where the best-case complexity is achieved. Show answer

Answer: For instance, when none of the characters in the pattern appear in the text. Or if you’re searching for a substring like ". A" in a normal text (where space and period often precede uppercase letters), the complexity will be O(len(Text) / len(Pattern)).

Actually, the Boyer–Moore–Horspool algorithm is a simplified version of the most advanced Boyer–Moore algorithm, which uses two heuristics to determine the shift size instead of one. The idea described above is called bad character rule, and the Boyer-Moore algorithm also uses so-called good suffix rule. What is it? Let’s imagine that we have matching suffixes ana up to the point where a mismatch occurs.

How many positions can we shift the pattern so that the characters ana appears under ana again?

Yeah, we just need to find the substring ana somewhere else in our pattern! Difference between indices of these positions will be our shift. In that case we should shift by two positions. And, indeed, this difference can be calculated before we even start searching, as it only depends on the pattern’s own suffixes, not on the text.

Exercise

Suggest an algorithm to precompute the shift table for this heuristic. For each suffix of the pattern, the table should indicate the position of the rightmost occurrence of an identical suffix elsewhere in the same pattern. Show hint

Hint: This task is similar to building the prefix function, but in reverse ;-) Also ideas from Z-function might help you.

Besides that, the Boyer–Moore algorithm applies a slightly different version of the bad character rule. More precisely, it looks not at the character aligned with the last symbol of the Pattern, rather than at the character where the mismatch happens. If mismatch happens at Text[i–r] (where i is the end position of the current alignment), then the algorithm considers shifting by shiftTable[Text[i–r].code] – r instead of shiftTable[Text[i].code] in the Boyer–Moore–Horspool algorithm. Yes, the result of the subtraction can be less than zero, but we don’t care a lot about it as in that case good suffix rule suggests a better shift.

When both heuristics, bad character rule and good suffix rule, are ready, the algorithm chooses the best of them on each iteration. Heuristic is better if it suggests the largest shift of the pattern.

The Boyer–Moore–Horspool and Boyer–Moore algorithms perform exceptionally well on real-world texts, such as those we encounter in daily life. That’s why they are used in tools like GNU grep, browsers page search, and similar applications.

For the curious: while it’s difficult to trace the entire chain of functions from pressing Ctrl+F to the actual algorithm, here’s the Boyer–Moore implementation in Chromium’s source code: src/strings/string-search.h.

Small exercise

As for the GNU grep source code, it’s much easier to find the Boyer–Moore implementation there—you can check it yourself. Could you find it? Show answer

Answer: See kwset.c.

By the way, here’s the message from the GNU grep author explaining why the tool is so fast.

... #1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE. #2 trick: GNU grep is fast because it EXECUTES VERY FEW INSTRUCTIONS FOR EACH BYTE that it *does* look at. GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character. GNU grep also unrolls the inner loop of Boyer-Moore, and sets up the Boyer-Moore delta table entries in such a way that it doesn't need to do the loop exit test at every unrolled step. The result of this is that, in the limit, GNU grep averages fewer than 3 x86 instructions executed for each input byte it actually looks at (and it skips many bytes entirely). ...

Additional materials about Boyer–Moore algorithm

An overview video by Ben Langmead:
Original paper by Robert S. Boyer and J. Strother Moore
Modification of the algorithm for small alphabets (e.g. for DNA)
Rationale behind Boyer–Moore algorithm by Andrei Gudkov

Unfortunately, both the Knuth–Morris–Pratt and Boyer–Moore algorithms share a common drawback: they require a non-constant amount of additional memory. That’s the reason why they aren’t implemented in standard libraries of many programming languages. But is it possible to avoid using extra memory altogether? It turns out, yes, it is!

As with all previous algorithms, we will overlay the pattern onto the text, compare it, and if any character doesn’t match, shift the pattern. However, this time, instead of comparing from left to right or from right to left, we will compare from the middle outwards!

To do so, we will cut the pattern into two parts. During each alignment we will start with matching the right part first. If it matches, let’s match the left part as well.

Matching the right part

We start by scanning the right side of the pattern to check if it matches the corresponding characters in the text, moving from left to right.

When a mismatch occurs, we skip ahead by the length of the matched portion plus 1 to minimize unnecessary comparisons. In that case, by 4, because substring "wor" matched.

"Wait, wait," you might say. "Can we really do that?". And I’ll ask it differently: under what conditions could we do that?

Exercise

Think about this question before reading further. Could you suggest conditions for the pattern and the cut which allows us to shift the pattern in that way?

It turns out that this approach works if suffixes of the left part hello are not equal to prefixes of the right part world:

If we can cut the pattern carefully enough to ensure the inequalities hold, we’ll be able to skip quite a few characters!

Skipped alignment thanks to inequality.

Let’s note that whether these inequalities hold or not depends solely on the pattern and how we split it.

This core idea hints at how the algorithm can achieve linear-time complexity. However, if the left half of the pattern is shorter than the right, then we could run into something like this:

Can we skip ahead by 4 characters? We actually can, so long as

The question mark ? here represents "wildcard" that always matches; it’s outside the limits of the pattern, so there’s no way for it to invalidate a match. To ensure that the inequalities above are always true, we need them to be true for all possible ? values. We thus need lo != or and lo != rl, etc.

Matching the left part

Once we have ensured the right part matches, we scan the left part (order doesn’t matter, but traditionally right-to-left), and if we find a mismatch, we jump ahead by max(len(left_part), len(right_part)) + 1. That we can jump by at least len(right_part) + 1 we have already seen above.

But we can also jump by at least len(left_part) + 1, in that case by len("hello") + 1 = 6.

This, again, requires the facts written on the right side of the illustration above ↑.

If we have this set of inequalities, then we can indeed jump by max(len(left_part), len(right_part)) + 1.

The problematic case

The sets of inequalities listed so far seem too good to be true in the general case.

Self-checking exercise

Provide a few examples of strings that can’t be split effectively. Can you come up with a general rule for the types of patterns that can’t be divided for satisfying all the inequalities?

Please, take a moment to think about example before scrolling further.

Indeed, they fail when a pattern is periodic: there’s no way to split Pattern = "aabaabaaba" in two such that (the k characters to the left of the split) ≠ (the k characters to the right of the split) for all k.

This is because no matter how you cut it, you’ll get Pattern[cut–3:cut] = Pattern[cut:cut+3]. So what do we do? We still cut the pattern in two so that k can be as big as possible. If we were to split it as aaba + abaaba then a = a around the split, so this is bad (we failed at length 1). But if we split it as aa + baabaaba, we at least have a ≠ b and aa ≠ ba, and we fail at length 3 since ?aa = baa. Note, that we already knew that a cut to make 3 mismatches was impossible due to the period, but we now see that the bound is sharp; we can get 1 and 2 to mismatch.

This is exactly the content of the critical factorization theorem: that no matter the period of the original pattern, you can cut it in such a way that (with the appropriate question marks) Pattern[cut–k:cut] ≠ Pattern[cut:cut+k] for all k less than the period.

Even non-periodic strings are periodic with a period equal to their length, so for such patterns, the critical factorization theorem already guarantees that the algorithm described so far will work, since we can cut the Pattern so that the chunks of length k on either side of the cut mismatch for all k < len(Pattern). Looking closer at the algorithm, we only actually require that k go up to max(len(left_part), len(right_part)). So long as the period exceeds that, we’re good.

The more general shorter-period case is a bit harder. The essentials are the same, except we use the periodicity to our advantage by "remembering" periods that we’ve already compared. In our running example, say we’re looking for "aabaabaaba" in "bbbabbaabaabaabbbaabaabaabaa".

We cut the pattern as aa + baabaaba, and then the algorithm runs as follows:

On first alignment, we mismatched at third position of the right part, so jump by len("ba") + 1 = 3. That requires that a != b and aa != ba.

On second alignment, the entire right part matched, so we started matching of the left part. We mismatched at the most right character of the left part, so jump forward by a period (which is 3), remembering the existing comparisons.

There’s a "memory" here: a bunch of characters already matched. There are two more characters on the right part which match beyond that. The eighth character of the right part mismatched, so jump by 8!

Self-checking question

Why could we do it?

The rule here is more complicated than usual: we don’t have the right inequalities for lengths 1 through 7, but we do have shifted copies of the inequalities a ≠ b and aa ≠ ba, along with knowledge of the mismatch. We can skip all of these alignments at once:

The skip by 8 in example above can be generalized: if the Pattern has a period of p, the theorem guarantees that we can maintain the inequality property for lengths 1 through p–1. Jumping by p will align both the matching and mismatching characters with their counterparts from one period earlier.

Inductively, this shows that we can skip by the number of matched characters in the right part, plus one, just as in the original algorithm.

To clarify, the "memory" is set whenever the entire right part matches, and this becomes the starting point for the next alignment. In such cases, the alignment jumps forward by one period, and the right part matches except for possibly the last period. Moreover, if we cut the Pattern so that the left part is shorter than the period (which is always possible), we can be sure that the left part already matches. The memory resets to zero whenever there’s a mismatch in the right part.

It’s quite complicated but if you want to understand it deeper, feel free to read the original paper.

Computing the cut

Our remaining tasks are now to compute a cut of the pattern with as many inequalities as possible. The computation is relatively simple, however proving its correctness is not.

To understand how to correctly split our Pattern, let’s imagine that we split it incorrectly (easy, yeah?), meaning two equal substrings appear on either side of the cut. This implies that the pattern can be represented as a + w + w + b, where w is the repeated substring. Now, let’s examine three suffixes of our pattern: w + w + b, w + b, and b. Notice that either:

w + w + b < w + b < b, or
b < w + b < w + w + b

There are no other possibilities—prove it!

In other words, the suffix w + b will always be lexicographically between the other two suffixes. If we split the pattern such that the right part is the largest of all possible suffixes, we will effectively protect ourselves from encountering such strings w, achieving the ideal cut.

While this reasoning above is generally correct, it doesn’t take into account certain nuances—like question marks, for instance. To address these details, Crochemore and Perrin, authors of this algorithm, introduced a clever trick: instead of just finding the largest suffix once, we need to do it twice—first for the normal order of characters (e.g., a < b < c < ... < z), and then for the reverse order (z < y < x < ... < a). How does it help? See Crochemore & Perrin’s Theorem 3.1 for details!

Essentially, for building a good cut (which is called a critical factorization in the corresponding theorem) we’re doing this:

// Get the lexicographically maximum suffix of the pattern
val suffix1 = (0 until Pattern.length).map { Pattern.substring(it) }.max()
val suffix2 = ... # the same as above, but invert the alphabet

val cut1 = Pattern.length - suffix1.length
val cut2 = Pattern.length - suffix2.length
val cut = maxOf(cut1, cut2)  // The later cut

Self-checking question

Could we calculate suffix2 by saying .min() instead of .max()? Show answer

Answer: No, "invert the alphabet" is different from saying .min() instead of .max(), since in lexicographic order, we have "he" < "hello" even if the alphabet is inverted.

Let’s put it all together now

The Two-Way Algorithm is an efficient pattern-matching algorithm that avoids using extra memory by cleverly leveraging the structure of the pattern. Unlike Knuth-Morris-Pratt or Boyer-Moore, it achieves this by comparing parts of the pattern starting from the middle and working outward, allowing for faster shifts when mismatches occur.

Key steps are the following:

Split the Pattern. The pattern is divided into two (typically unequal) parts based on a critical factorization. To find an optimal split, algorithms finds the lexicographically largest suffix of the pattern.
Aligning. As usually, we align our pattern to the text and try to shift it efficiently if mismatch occurs.
Match the Right Part First. On each alignment, we start with matching the right part of the pattern against the text, starting from left to right. If a mismatch occurs, shift the pattern by the number of matched characters plus one.
Match the Left Part. If the right part fully matches, proceed to check the left part. If a mismatch is found, shift by a distance of max(len(left_part), len(right_part)) + 1.
Handle Periodic Patterns. For periodic patterns, adjust the cut based on the critical factorization theorem, which guarantees that mismatches can still be leveraged to skip ahead by the pattern’s period.
Use "Memory" for Matching. When matching a periodic pattern, "remember" the results of previous alignments and use this memory to skip over already matched portions of the text.

The Two-Way algorithm is superior to the Knuth-Morris-Pratt and Boyer-Moore algorithms in that it uses a constant amount of additional memory. Because of this, it is often implemented in standard programming language libraries, such as Python or glibc, musl and others.

If you want to know more about this algorithm, feel free to look at the following additional materials:

Original paper by M. Crochemore and D. Perrin
Finding the Leftmost Critical Factorization on Unordered Alphabet by D. Kosolobov. Unfortunately, this algorithm is not in-place.

Throughout most of this lecture, we focused on algorithms that search for the first occurrence of a pattern in a text. However, in real-world scenarios, we often need to find all occurrences, not just the first one. It’s fairly straightforward to modify these algorithms to find multiple occurrences, but it’s important to remember that overlaps may exist, and the user may only want to find non-overlapping matches.

Self-checking question

Imagine you need to find the last occurrence of a pattern in the text. How would you approach solving this?

It’s also worth mentioning that working with strings goes far beyond just pattern matching. This is a vast area studied by thousands of researchers worldwide. For instance, there are interesting problems related to fuzzy search (or approximate string matching), fast matching and replacement based on regular expressions, string compression, and computing edit distance, to name a few. If you enjoy working with strings, there’s always plenty to explore!

Do you want to know more?

I have the part two! In the second part, we’ll be looking at how to search for multiple patterns in a single text. This is how fast search works in your favorite IDEs, some algorithms in bioinformatics, and other tasks are solved in similar ways.

Strings Matching. Beginning

Andrey Gein / 2024

Fun zone!

Naive algorithm

Knuth–Morris–Pratt Algorithm

Shifting the pattern

Main algorithm

Precalculating π

Let’s break this down

Additional materials about Knuth–Morris–Pratt algorithm

Boyer–Moore algorithm

Additional materials about Boyer–Moore algorithm

Two-Way Algorithm

Matching the right part

Matching the left part

The problematic case

Computing the cut

Let’s put it all together now

Final thoughts

Do you want to know more?